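arXiv Papers Today | 2025-04-10

This post lists the latest papers retrieved from arXiv.org on 2025-04-10. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and updated automatically at around 12:00 each day.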

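[Quick Read]: This paper addresses catastrophic forgetting in continual learning for large language models (LLMs), where adapting to new tasks significantly degrades performance on previously learned ones. Existing methods typically rely on low-rank, parameter-efficient updates, which limit the model's expressivity and introduce additional parameters per task, leading to scalability issues. The key to the solution is a novel continual full fine-tuning approach that uses adaptive singular value decomposition (SVD) to dynamically identify task-specific low-rank parameter subspaces and constrains updates to be orthogonal to the critical directions of prior tasks, reducing interference without adding parameters or storing previous task gradients. The method achieves state-of-the-art results: experiments on multiple benchmarks show up to 7% higher average accuracy than recent baselines such as O-LoRA, with notably less forgetting, while preserving the model's general linguistic capabilities, instruction-following accuracy, and safety.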

Link: https://arxiv.org/abs/2504.07097
Authors: Nikhil Shivakumar Nayak, Krishnateja Killamsetty, Ligong Han, Abhishek Bhandwaldar, Prateek Chanda, Kai Xu, Hao Wang, Aldo Pareja, Oleg Silkin, Mustafa Eyceoz, Akash Srivastava
Affiliations: Red Hat AI Innovation; IBM Research; IIT Bombay; MIT-IBM Watson AI Lab
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Probability (math.PR); Machine Learning (stat.ML)
Comments: 25 pages, 13 figures, 6 tables

Abstract:Continual learning in large language models (LLMs) is prone to catastrophic forgetting, where adapting to new tasks significantly degrades performance on previously learned ones. Existing methods typically rely on low-rank, parameter-efficient updates that limit the model’s expressivity and introduce additional parameters per task, leading to scalability issues. To address these limitations, we propose a novel continual full fine-tuning approach leveraging adaptive singular value decomposition (SVD). Our method dynamically identifies task-specific low-rank parameter subspaces and constrains updates to be orthogonal to critical directions associated with prior tasks, thus effectively minimizing interference without additional parameter overhead or storing previous task gradients. We evaluate our approach extensively on standard continual learning benchmarks using both encoder-decoder (T5-Large) and decoder-only (LLaMA-2 7B) models, spanning diverse tasks including classification, generation, and reasoning. Empirically, our method achieves state-of-the-art results, up to 7% higher average accuracy than recent baselines like O-LoRA, and notably maintains the model’s general linguistic capabilities, instruction-following accuracy, and safety throughout the continual learning process by reducing forgetting to near-negligible levels. Our adaptive SVD framework effectively balances model plasticity and knowledge retention, providing a practical, theoretically grounded, and computationally scalable solution for continual learning scenarios in large language models.
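
The orthogonality constraint described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `project_out_critical`, the fixed rank `k`, and the two-sided projection are assumptions made for illustration, and the paper's adaptive rank selection is not reproduced here.

```python
import numpy as np

def project_out_critical(grad, weight, k):
    """Remove from `grad` the components that lie in the top-k singular
    directions of `weight` (hypothetical helper; fixed k is an assumption)."""
    U, _, Vt = np.linalg.svd(weight, full_matrices=False)
    U_k = U[:, :k]        # top-k left singular vectors
    V_k = Vt[:k, :].T     # top-k right singular vectors
    # Projectors onto the orthogonal complements of the two subspaces.
    left = np.eye(weight.shape[0]) - U_k @ U_k.T
    right = np.eye(weight.shape[1]) - V_k @ V_k.T
    return left @ grad @ right

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))   # stand-in for a weight matrix after earlier tasks
G = rng.normal(size=(6, 4))   # stand-in for a gradient on the new task
G_proj = project_out_critical(G, W, k=2)
```

After projection, the update has no component along the directions treated as critical, so a gradient step leaves those directions of `W` untouched.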

[NLP-1] OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens ACL2025

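[Quick Read]: This paper addresses the traceability of language model outputs back to their large-scale training data. Conventional methods struggle to link model outputs to training data in real time, especially when the training data reaches trillions of tokens. The key to the solution is an extended version of infini-gram (Liu et al., 2024), which finds verbatim matches between segments of language model output and text in the training corpora within a few seconds and displays them visually. This lets users understand language model behavior through the lens of its training data, which is especially useful for exploring fact checking, hallucination, and model creativity.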

Link: https://arxiv.org/abs/2504.07096
Authors: Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, Cassidy Trier, Aaron Sarnat, Jenna James, Jon Borchardt, Bailey Kuehl, Evie Cheng, Karen Farley, Sruthi Sreeram, Taira Anderson, David Albright, Carissa Schoenick, Luca Soldaini, Dirk Groeneveld, Rock Yuren Pang, Pang Wei Koh, Noah A. Smith, Sophie Lebrecht, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi, Jesse Dodge
Affiliations: Allen Institute for AI; University of Washington; UC Berkeley; Stanford University
Categories: Computation and Language (cs.CL)
Comments: Under submission at ACL 2025 demo track

Abstract:We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
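
To make "verbatim matches between segments of output and training documents" concrete, here is a deliberately naive sketch. It is not infini-gram and not how OLMoTrace works internally; the function names, the whitespace tokenization, and the greedy left-to-right scan are all assumptions, and the brute-force search would not scale beyond toy inputs.

```python
def contains(doc, span):
    """True if the token list `span` occurs contiguously in `doc`."""
    n = len(span)
    return any(doc[s:s + n] == span for s in range(len(doc) - n + 1))

def longest_verbatim_spans(output, corpus, min_len=3):
    """Greedy scan: at each position, extend a span while it still appears
    verbatim in some corpus document. Returns (start, end) token offsets."""
    spans, i = [], 0
    while i < len(output):
        best, j = 0, i + min_len
        while j <= len(output) and any(contains(d, output[i:j]) for d in corpus):
            best = j
            j += 1
        if best:
            spans.append((i, best))
            i = best          # continue after the matched span
        else:
            i += 1
    return spans

corpus = ["the quick brown fox jumps over the lazy dog".split(),
          "to be or not to be".split()]
output = "she said the quick brown fox is to be or not".split()
spans = longest_verbatim_spans(output, corpus)
```

Each returned pair marks a run of output tokens that also appears word for word in at least one corpus document.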

[NLP-2] OmniCaptioner: One Captioner to Rule Them All

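[Quick Read]: This paper addresses fine-grained caption generation across diverse visual domains. Existing methods are mostly limited to specific image types (e.g., natural images or geometric visuals), whereas OmniCaptioner offers a unified solution for natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). The key idea is to convert low-level pixel information into semantically rich textual representations, bridging the gap between the visual and textual modalities. The framework shows three main advantages: (i) enhanced visual reasoning with large language models (LLMs); (ii) improved performance on image generation tasks; and (iii) efficient supervised fine-tuning (SFT), which converges faster with less data.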

Link: https://arxiv.org/abs/2504.07089
Authors: Yiting Lu, Jiakang Yuan, Zhen Li, Shitian Zhao, Qi Qin, Xinyue Li, Le Zhuo, Licheng Wen, Dongyang Liu, Yuewen Cao, Xiangchao Yan, Xin Li, Botian Shi, Tao Chen, Zhibo Chen, Lei Bai, Bo Zhang, Peng Gao
Affiliations: Shanghai Artificial Intelligence Laboratory; University of Science and Technology of China; Fudan University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: More visualizations on Homepage: this https URL and Official code: this https URL

Abstract:We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.

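[NLP-3] KG-LLM-Bench: A Scalable Benchmark for Evaluating LLM Reasoning on Textualized Knowledge Graphs NAACL

[Quick Read]: This paper addresses the lack of systematic study of how different encoding strategies affect model performance when a knowledge graph is fed to large language models (LLMs) as text. Its key contribution is KG-LLM-Bench, a comprehensive and extensible benchmark spanning five knowledge graph understanding tasks. Extensive experiments with seven language models and five textualization strategies analyze how the encoding affects base-model performance on knowledge graph reasoning, providing guidance for optimizing LLMs on KG-related tasks.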

Link: https://arxiv.org/abs/2504.07087
Authors: Elan Markowitz, Krupa Galiya, Greg Ver Steeg, Aram Galstyan
Affiliations: University of Southern California; University of California, Riverside; Independent Researcher
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: To be presented at NAACL-HLT, KnowledgeNLP Workshop (2025)

Abstract:Knowledge graphs have emerged as a popular method for injecting up-to-date, factual knowledge into large language models (LLMs). This is typically achieved by converting the knowledge graph into text that the LLM can process in context. While multiple methods of encoding knowledge graphs have been proposed, the impact of this textualization process on LLM performance remains under-explored. We introduce KG-LLM-Bench, a comprehensive and extensible benchmark spanning five knowledge graph understanding tasks, and evaluate how different encoding strategies affect performance across various base models. Our extensive experiments with seven language models and five textualization strategies provide insights for optimizing LLM performance on KG reasoning tasks.
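
As a rough illustration of what "textualizing" a knowledge graph means, the sketch below renders the same triples in two formats. The two formats, the function names, and the sample triples are hypothetical; the abstract does not name the paper's five textualization strategies, so these should not be read as the ones it evaluates.

```python
import json
from collections import defaultdict

# Hypothetical toy graph as (subject, relation, object) triples.
triples = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "Physics"),
    ("Warsaw", "capital_of", "Poland"),
]

def as_edge_list(triples):
    """One '(subject, relation, object)' line per edge."""
    return "\n".join(f"({s}, {r}, {o})" for s, r, o in triples)

def as_json(triples):
    """Nested JSON: subject -> relation -> list of objects."""
    grouped = defaultdict(lambda: defaultdict(list))
    for s, r, o in triples:
        grouped[s][r].append(o)
    return json.dumps(grouped, indent=2)

prompt_a = "Graph:\n" + as_edge_list(triples) + "\n\nQuestion: Where was Marie Curie born?"
prompt_b = "Graph:\n" + as_json(triples) + "\n\nQuestion: Where was Marie Curie born?"
```

Both prompts carry identical facts; only the encoding differs, which is the variable a benchmark of this kind holds up for comparison.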

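[NLP-4] A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility

[Quick Read]: This paper addresses two problems: current mathematical reasoning benchmarks are highly sensitive to implementation choices (decoding parameters, random seeds, prompt formatting, and hardware and software-framework configurations), and performance gains in existing work often hinge on unclear comparisons or unreported sources of variance. The key solution is a standardized evaluation framework with clearly defined best practices and reporting standards. Reassessing recent methods under this framework, the authors find that reinforcement learning (RL) approaches yield only modest improvements and are prone to overfitting, especially on small-scale benchmarks, whereas supervised fine-tuning (SFT) methods show consistently stronger generalization. The paper also releases all code, prompts, and model outputs to foster reproducibility and lay a more rigorous foundation for future work.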

Link: https://arxiv.org/abs/2504.07086
Authors: Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, Matthias Bethge
Affiliations: Tübingen AI Center, University of Tübingen; University of Cambridge
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Technical Report

Abstract:Reasoning has emerged as the next major frontier for language models (LMs), with rapid advances from both academic and industrial labs. However, this progress often outpaces methodological rigor, with many evaluations relying on benchmarking practices that lack transparency, robustness, or statistical grounding. In this work, we conduct a comprehensive empirical study and find that current mathematical reasoning benchmarks are highly sensitive to subtle implementation choices - including decoding parameters, random seeds, prompt formatting, and even hardware and software-framework configurations. Performance gains reported in recent studies frequently hinge on unclear comparisons or unreported sources of variance. To address these issues, we propose a standardized evaluation framework with clearly defined best practices and reporting standards. Using this framework, we reassess recent methods and find that reinforcement learning (RL) approaches yield only modest improvements - far below prior claims - and are prone to overfitting, especially on small-scale benchmarks like AIME24. In contrast, supervised finetuning (SFT) methods show consistently stronger generalization. To foster reproducibility, we release all code, prompts, and model outputs, for reasoning benchmarks, establishing more rigorous foundations for future work.
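
The sensitivity to random seeds can be made tangible with a few lines of standard-library Python. The accuracy numbers below are invented for illustration, and the comparison rule is a crude heuristic assumed here, not the paper's evaluation framework or a proper statistical test.

```python
import statistics

def summarize(acc_by_seed):
    """Mean and sample standard deviation of accuracy over seeds."""
    return statistics.mean(acc_by_seed), statistics.stdev(acc_by_seed)

def gain_within_noise(baseline, candidate):
    """Heuristic: treat a gain as noise if the gap in means is no larger
    than the larger of the two seed-to-seed standard deviations."""
    (mean_b, std_b), (mean_c, std_c) = summarize(baseline), summarize(candidate)
    return abs(mean_c - mean_b) <= max(std_b, std_c)

# Hypothetical accuracies (%) on a 30-question benchmark, three seeds each.
# With 30 questions, a single question is worth about 3.3 points.
baseline = [30.0, 36.7, 33.3]
candidate = [33.3, 40.0, 30.0]
noisy = gain_within_noise(baseline, candidate)
```

On a benchmark this small, one question flipping between seeds moves accuracy by several points, which is why a single-seed comparison can be misleading.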

[NLP-5] Self-Steering Language Models

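[Quick Read]: This paper addresses the fact that searching or planning in natural language during test-time reasoning is slow, costly, and error-prone for language models (LMs). Although LMs may struggle to emulate the precise reasoning steps needed to solve a problem, they are often good at describing its abstract structure, including how to verify solutions and how to search for them. The paper proposes DisCIPL, in which a Planner model generates a task-specific inference program that is executed by a population of Follower models. This lets LMs write recursive search procedures that guide their own inference, enabling verifiable and efficient reasoning. With a small Follower (e.g., Llama-3.2-1B), DisCIPL matches and sometimes outperforms much larger models, including GPT-4o and o1, on challenging constrained generation tasks. By decoupling planning from execution, the work opens up a design space of highly parallelized Monte Carlo inference strategies that outperform standard best-of-N sampling, require no fine-tuning, and can be implemented automatically by existing LMs.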

Link: https://arxiv.org/abs/2504.07081
Authors: Gabriel Grand, Joshua B. Tenenbaum, Vikash K. Mansinghka, Alexander K. Lew, Jacob Andreas
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:While test-time reasoning enables language models to tackle complex tasks, searching or planning in natural language can be slow, costly, and error-prone. But even when LMs struggle to emulate the precise reasoning steps needed to solve a problem, they often excel at describing its abstract structure–both how to verify solutions and how to search for them. This paper introduces DisCIPL, a method for “self-steering” LMs where a Planner model generates a task-specific inference program that is executed by a population of Follower models. Our approach equips LMs with the ability to write recursive search procedures that guide LM inference, enabling new forms of verifiable and efficient reasoning. When instantiated with a small Follower (e.g., Llama-3.2-1B), DisCIPL matches (and sometimes outperforms) much larger models, including GPT-4o and o1, on challenging constrained generation tasks. In decoupling planning from execution, our work opens up a design space of highly-parallelized Monte Carlo inference strategies that outperform standard best-of-N sampling, require no finetuning, and can be implemented automatically by existing LMs.

[NLP-6] DeduCE: Deductive Consistency as a Framework to Evaluate LLM Reasoning

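[Quick Read]: This paper addresses the observation that large language models (LLMs), despite strong performance on Olympiad-level reasoning, can still struggle with novel high school math problems outside standard benchmarks. Going beyond final accuracy as the sole metric, it proposes a deductive consistency metric for analyzing the chain-of-thought output of language models (LMs). The key idea is to decompose deductive reasoning into two subtasks, understanding the input premises and inferring conclusions, and to evaluate models on each in order to locate the source of reasoning errors on novel problems. The paper also develops a pipeline that evaluates deductive consistency on novel, perturbed versions of benchmark problems and finds that accuracy drops significantly as the number of reasoning hops grows, an error that the original benchmark masks. Shifts in language style and the natural propagation of early errors do not explain the trend; prediction over multiple reasoning hops remains the main source of error. The approach thus characterizes LM reasoning in a unified way, as computation over a window of input premises and a number of reasoning hops, offering a new perspective for evaluation across problem domains.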

Link: https://arxiv.org/abs/2504.07080
Authors: Atharva Pandey, Kshitij Dubey, Rahul Sharma, Amit Sharma
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Despite great performance on Olympiad-level reasoning problems, frontier large language models can still struggle on high school math when presented with novel problems outside standard benchmarks. Going beyond final accuracy, we propose a deductive consistency metric to analyze chain-of-thought output from language models (LMs). Formally, deductive reasoning involves two subtasks: understanding a set of input premises and inferring the conclusions that follow from them. The proposed metric studies LMs’ performance on these subtasks, with the goal of explaining LMs’ reasoning errors on novel problems: how well do LMs understand input premises with increasing context lengths, and how well can they infer conclusions over multiple reasoning hops? Since existing benchmarks may be memorized, we develop a pipeline to evaluate LMs’ deductive consistency on novel, perturbed versions of benchmark problems. On novel grade school math problems (GSM-8k), we find that LMs are fairly robust to an increasing number of input premises, but suffer significant accuracy decay as the number of reasoning hops is increased. Interestingly, these errors are masked in the original benchmark as all models achieve near 100% accuracy. As we increase the number of solution steps using a synthetic dataset, prediction over multiple hops still remains the major source of error compared to understanding input premises. Other factors, such as shifts in language style or natural propagation of early errors, do not explain the trends. Our analysis provides a new view to characterize LM reasoning – as computations over a window of input premises and reasoning hops – that can provide unified evaluation across problem domains.

[NLP-7] SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

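[Quick Read]: This paper addresses the lack of crucial self-improvement capabilities in autonomous web agents operating in complex environments, specifically their weakness in procedural knowledge abstraction, skill refinement, and skill composition. The proposed solution, SkillWeaver, is a skill-centric framework that lets agents self-improve by autonomously synthesizing reusable skills as APIs. At its core, the agent autonomously discovers skills, practices them, and distills the practice experience into robust APIs; iterative exploration continually expands a library of lightweight, plug-and-play APIs, significantly enhancing the agent's capabilities. Experiments show relative success-rate improvements of 31.8% on WebArena and 39.8% on real-world websites, and APIs synthesized by strong agents substantially improve weaker agents through skill transfer, by up to 54.3% on WebArena. These results support distilling diverse website interactions into APIs that can be shared seamlessly among web agents.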

Link: https://arxiv.org/abs/2504.07079
Authors: Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, Yu Su
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:To survive and thrive in complex environments, humans have evolved sophisticated self-improvement mechanisms through environment exploration, hierarchical abstraction of experiences into reusable skills, and collaborative construction of an ever-growing skill repertoire. Despite recent advancements, autonomous web agents still lack crucial self-improvement capabilities, struggling with procedural knowledge abstraction, refining skills, and skill composition. In this work, we introduce SkillWeaver, a skill-centric framework enabling agents to self-improve by autonomously synthesizing reusable skills as APIs. Given a new website, the agent autonomously discovers skills, executes them for practice, and distills practice experiences into robust APIs. Iterative exploration continually expands a library of lightweight, plug-and-play APIs, significantly enhancing the agent’s capabilities. Experiments on WebArena and real-world websites demonstrate the efficacy of SkillWeaver, achieving relative success rate improvements of 31.8% and 39.8%, respectively. Additionally, APIs synthesized by strong agents substantially enhance weaker agents through transferable skills, yielding improvements of up to 54.3% on WebArena. These results demonstrate the effectiveness of honing diverse website interactions into APIs, which can be seamlessly shared among various web agents.

[NLP-8] Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation

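[Quick Read]: This paper addresses insufficient multilingual and multicultural coverage in the evaluation of vision-language models (VLMs). Existing evaluation relies mainly on English benchmarks, and although multilingual benchmarks have grown in size and number of languages, many are still translations of English datasets and fail to capture cultural nuances. The paper proposes Kaleidoscope, described as the most comprehensive exam benchmark to date for multilingual VLM evaluation. It is a large-scale, in-language multimodal benchmark containing 20,911 multiple-choice questions across 18 languages and 14 subjects, built through an open science collaboration to ensure linguistic and cultural authenticity. Top-performing multilingual VLMs are found to perform poorly on low-resource languages and in complex multimodal scenarios, underscoring the need for culturally inclusive multimodal evaluation frameworks.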

Link: https://arxiv.org/abs/2504.07072
Authors: Israfel Salazar, Manuel Fernández Burda, Shayekh Bin Islam, Arshia Soltani Moakhar, Shivalika Singh, Fabian Farestam, Angelika Romanou, Danylo Boiko, Dipika Khullar, Mike Zhang, Dominik Krzemiński, Jekaterina Novikova, Luísa Shimabucoro, Joseph Marvin Imperial, Rishabh Maheshwary, Sharad Duwal, Alfonso Amayuelas, Swati Rajwal, Jebish Purbey, Ahmed Ruby, Nicholas Popovič, Marek Suppa, Azmine Toushik Wasi, Ram Mohan Rao Kadiyala, Olga Tsymboi, Maksim Kostritsya, Bardia Soltani Moakhar, Gabriel da Costa Merlin, Otávio Ferracioli Coletti, Maral Jabbari Shiviari, MohammadAmin farahani fard, Silvia Fernandez, María Grandury, Dmitry Abulkhanov, Drishti Sharma, Andre Guarnier De Mitri, Leticia Bossatto Marchezi, Johan Obando-Ceron, Nazar Kohut, Beyza Ermis, Desmond Elliott, Enzo Ferrante, Sara Hooker, Marzieh Fadaee
Affiliations: unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded, both in size and languages, many rely on translations of English datasets, failing to capture cultural nuances. In this work, we propose Kaleidoscope, as the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios. Our results highlight the need for progress on culturally inclusive multimodal evaluation frameworks.

[NLP-9] A Survey on Personalized and Pluralistic Preference Alignment in Large Language Models

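[Quick Read]: This survey studies personalized preference alignment for large language models (LLMs), that is, how to tailor LLMs to individual users' preferences, a direction spanning natural language processing (NLP) and personalization. Its key contribution is a taxonomy of personalized alignment and modeling techniques for LLMs, covering training-time, inference-time, and user-modeling-based methods, together with an analysis of the strengths and limitations of each group. It also covers evaluation methods, benchmarks, and open problems in the field.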

Link: https://arxiv.org/abs/2504.07070
Authors: Zhouhang Xie, Junda Wu, Yiran Shen, Yu Xia, Xintong Li, Aaron Chang, Ryan Rossi, Sachin Kumar, Bodhisattwa Prasad Majumder, Jingbo Shang, Prithviraj Ammanabrolu, Julian McAuley
Affiliations: University of California, San Diego; University of California, Los Angeles; Adobe Research; The Ohio State University; Allen Institute for AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Personalized preference alignment for large language models (LLMs), the process of tailoring LLMs to individual users’ preferences, is an emerging research direction spanning the area of NLP and personalization. In this survey, we present an analysis of works on personalized alignment and modeling for LLMs. We introduce a taxonomy of preference alignment techniques, including training time, inference time, and additionally, user-modeling based methods. We provide analysis and discussion on the strengths and limitations of each group of techniques and then cover evaluation, benchmarks, as well as open problems in the field.

[NLP-10] HalluciNot: Hallucination Detection Through Context and Common Knowledge Verification

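[Quick Read]: This paper addresses the detection of hallucinations in large language model (LLM) outputs in enterprise settings. The key to the solution is HDM-2, a comprehensive system built on a novel taxonomy of LLM responses that distinguishes context-based, common knowledge, enterprise-specific, and innocuous statements. HDM-2 validates LLM responses against both the context and generally known facts (common knowledge), and provides hallucination scores and word-level annotations for precise identification of problematic content. To evaluate it on context-based and common-knowledge hallucinations, the authors introduce a new dataset, HDMBench. Experiments show that HDM-2 outperforms existing approaches on RagTruth, TruthfulQA, and HDMBench. The work focuses on the specific challenges of enterprise deployment, including computational efficiency, domain specialization, and fine-grained error identification. The evaluation dataset, model weights, and inference code are publicly available.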

Link: https://arxiv.org/abs/2504.07069
Authors: Bibek Paudel, Alexander Lyzhov, Preetam Joshi, Puneet Anand
Affiliations: AIMon Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces a comprehensive system for detecting hallucinations in large language model (LLM) outputs in enterprise settings. We present a novel taxonomy of LLM responses specific to hallucination in enterprise applications, categorizing them into context-based, common knowledge, enterprise-specific, and innocuous statements. Our hallucination detection model HDM-2 validates LLM responses with respect to both context and generally known facts (common knowledge). It provides both hallucination scores and word-level annotations, enabling precise identification of problematic content. To evaluate it on context-based and common-knowledge hallucinations, we introduce a new dataset HDMBench. Experimental results demonstrate that HDM-2 out-performs existing approaches across RagTruth, TruthfulQA, and HDMBench datasets. This work addresses the specific challenges of enterprise deployment, including computational efficiency, domain specialization, and fine-grained error identification. Our evaluation dataset, model weights, and inference code are publicly available.

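[NLP-11] TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

[Quick Read]: This paper addresses the limitations of large language models (LLMs) in the speech modality, in particular the performance gap caused by the mismatch between speech and text. The goal is a spoken language model (SLM) that handles speech input and output for more natural human-LLM interaction. To address one significant mismatch, the difference in sequence length between speech and text tokens, the paper proposes Text-Aligned Speech Tokenization and Embedding (TASTE). Its key innovation is to align speech tokens with the corresponding text transcription during the tokenization stage, directly bridging the modality gap. The method uses a special aggregation mechanism and is trained with speech reconstruction as the objective, preserving essential paralinguistic information while dramatically shortening the token sequence. With TASTE, text-based LLMs can be adapted into effective SLMs using parameter-efficient fine-tuning such as Low-Rank Adaptation (LoRA). Experiments show that TASTE-based SLMs perform on par with previous full fine-tuning methods on benchmark tasks, and the authors describe TASTE as the first end-to-end approach that uses a reconstruction objective to automatically learn a text-aligned speech tokenization and embedding suitable for spoken language modeling.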

Link: https://arxiv.org/abs/2504.07053
Authors: Liang-Hsuan Tseng, Yi-Chang Chen, Kuan-Yi Lee, Da-Shan Shiu, Hung-yi Lee
Affiliations: MediaTek Research; National Taiwan University
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Preprint. Work in progress

Abstract:Large Language Models (LLMs) excel in text-based natural language processing tasks but remain constrained by their reliance on textual inputs and outputs. To enable more natural human-LLM interaction, recent progress has focused on deriving a spoken language model (SLM) that can not only listen but also generate speech. To achieve this, a promising direction is to conduct speech-text joint modeling. However, recent SLMs still lag behind text LLMs due to the modality mismatch. One significant mismatch can be the sequence lengths between speech and text tokens. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. We propose a method that can achieve this through a special aggregation mechanism and with speech reconstruction as the training objective. We conduct extensive experiments and show that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length. Furthermore, by leveraging TASTE, we can adapt text-based LLMs into effective SLMs with parameter-efficient fine-tuning techniques such as Low-Rank Adaptation (LoRA). Experimental results on benchmark tasks, including SALMON and StoryCloze, demonstrate that TASTE-based SLMs perform similarly to previous full-finetuning methods. To our knowledge, TASTE is the first end-to-end approach that utilizes a reconstruction objective to automatically learn a text-aligned speech tokenization and embedding suitable for spoken language modeling. Our demo, code, and models are publicly available at this https URL.

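[NLP-12] A Unified Agentic Framework for Evaluating Conditional Image Generation

[Quick Read]: This paper addresses the lack of task-agnostic, reliable, and explainable evaluation metrics for conditional image generation. It proposes CIGEval, a unified agentic evaluation framework. CIGEval uses large multimodal models (LMMs) as its core, integrates a multi-functional toolbox with a fine-grained evaluation framework, and synthesizes evaluation trajectories for fine-tuning so that smaller LMMs can autonomously select appropriate tools and perform nuanced analyses based on tool outputs. Experiments across multiple conditional image generation tasks show that CIGEval approaches the reliability of human assessment, and that its version built on 7B open-source LMMs, trained on only 2.3K trajectories, surpasses the previous GPT-4o-based state-of-the-art method.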

Link: https://arxiv.org/abs/2504.07046
Authors: Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Affiliations: Harbin Institute of Technology (Shenzhen); Meta; Stability.AI; Anthropic; Character.ai; Claude
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Work in progress. GitHub: this https URL

Abstract:Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Moreover, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. Case studies on GPT-4o image generation highlight CIGEval’s capability in identifying subtle issues related to subject consistency and adherence to control guidance, indicating its great potential for automating evaluation of image generation tasks with human-level reliability.

[NLP-13] Data Augmentation and Hyperparameter Tuning for Low-Resource MFA

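[Quick Read]: This paper addresses the lower accuracy of computational tools on endangered and under-resourced languages, which results from small amounts of data. The approach is to enlarge the corpus with data augmentation and compare this against hyperparameter tuning for multilingual forced alignment. The study finds that audio data augmentation does not substantially improve performance, whereas hyperparameter tuning yields substantial improvement without infeasible additional training time. For languages with small to medium amounts of training data, this is a workable alternative to adapting models from high-resource languages.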

Link: https://arxiv.org/abs/2504.07024
Authors: Alessio Tosolini, Claire Bowern
Affiliations: Yale University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:A continued issue for those working with computational tools and endangered and under-resourced languages is the lower accuracy of results for languages with smaller amounts of data. We attempt to ameliorate this issue by using data augmentation methods to increase corpus size, comparing augmentation to hyperparameter tuning for multilingual forced alignment. Unlike text augmentation methods, audio augmentation does not lead to substantially increased performance. Hyperparameter tuning, on the other hand, results in substantial improvement without (for this amount of data) infeasible additional training time. For languages with small to medium amounts of training data, this is a workable alternative to adapting models from high-resource languages.

[NLP-14] Evaluating Retrieval Augmented Generative Models for Document Queries in Transportation Safety

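[Quick Read]: This paper addresses the accuracy and reliability concerns that arise when generative large language models (LLMs) are applied in high-stakes domains such as hazardous materials transportation. Its key approach is to augment LLaMA models with retrieval-augmented generation (RAG), combining regulatory documents with generative capability to improve the retrieval and interpretation of regulatory information. Experiments show that the RAG-augmented LLaMA models significantly outperform ChatGPT and Vertex AI, providing more detailed and generally accurate information, and the paper emphasizes the need for domain-specific fine-tuning and rigorous evaluation to ensure reliability in high-stakes environments.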

Link: https://arxiv.org/abs/2504.07022
Authors: Chad Melton, Alex Sorokine, Steve Peterson
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 3 Figures, 3 tables

Abstract:Applications of generative Large Language Models (LLMs) are rapidly expanding across various domains, promising significant improvements in workflow efficiency and information retrieval. However, their implementation in specialized, high-stakes domains such as hazardous materials transportation is challenging due to accuracy and reliability concerns. This study evaluates the performance of three fine-tuned generative models, ChatGPT, Google’s Vertex AI, and ORNL Retrieval Augmented Generation augmented LLaMA 2 and LLaMA in retrieving regulatory information essential for hazardous material transportation compliance in the United States. Utilizing approximately 40 publicly available federal and state regulatory documents, we developed 100 realistic queries relevant to route planning and permitting requirements. Responses were qualitatively rated based on accuracy, detail, and relevance, complemented by quantitative assessments of semantic similarity between model outputs. Results demonstrated that the RAG-augmented LLaMA models significantly outperformed Vertex AI and ChatGPT, providing more detailed and generally accurate information, despite occasional inconsistencies. This research introduces the first known application of RAG in transportation safety, emphasizing the need for domain-specific fine-tuning and rigorous evaluation methodologies to ensure reliability and minimize the risk of inaccuracies in high-stakes environments.

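[NLP-15] Towards LLMs Robustness to Changes in Prompt Format Styles NAACL

[Quick Read]: This paper addresses prompt brittleness, the performance fluctuations that large language models (LLMs) show under non-semantic changes in prompt format. Prior prompt engineering work has focused mainly on finding the optimal prompt for a specific task, and studies of prompt brittleness have been limited to quantifying performance variation without offering a simple, effective remedy. The proposed solution, Mixture of Formats (MOF), mitigates prompt brittleness by diversifying the styles used in the prompt's few-shot examples. It is inspired by computer vision techniques that use datasets with diverse styles to prevent models from associating a specific style with the target variable. Experiments show that MOF reduces style-induced prompt brittleness and also improves overall performance across prompt formats and datasets.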

Link: https://arxiv.org/abs/2504.06969
Authors: Lilian Ngweta, Kiran Kate, Jason Tsay, Yara Rizk
Affiliations: Rensselaer Polytechnic Institute; IBM Research
Categories: Computation and Language (cs.CL)
Comments: NAACL Student Research Workshop (SRW) 2025

Abstract:Large language models (LLMs) have gained popularity in recent years for their utility in various applications. However, they are sensitive to non-semantic changes in prompt formats, where small changes in the prompt format can lead to significant performance fluctuations. In the literature, this problem is commonly referred to as prompt brittleness. Previous research on prompt engineering has focused mainly on developing techniques for identifying the optimal prompt for specific tasks. Some studies have also explored the issue of prompt brittleness and proposed methods to quantify performance variations; however, no simple solution has been found to address this challenge. We propose Mixture of Formats (MOF), a simple and efficient technique for addressing prompt brittleness in LLMs by diversifying the styles used in the prompt few-shot examples. MOF was inspired by computer vision techniques that utilize diverse style datasets to prevent models from associating specific styles with the target variable. Empirical results show that our proposed technique reduces style-induced prompt brittleness in various LLMs while also enhancing overall performance across prompt variations and different datasets.
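
The idea of diversifying the styles of the few-shot examples can be sketched as below. This is only one possible reading of the abstract: the three templates, the round-robin assignment, and the function name `build_mof_prompt` are assumptions for illustration and are not taken from the paper.

```python
# Hypothetical format styles; the paper's actual styles are not given in the abstract.
FORMATS = [
    "Question: {q}\nAnswer: {a}",
    "Q: {q}\nA: {a}",
    "INPUT: {q}\nOUTPUT: {a}",
]

def build_mof_prompt(examples, query, formats=FORMATS):
    """Render each few-shot example in a different style (round-robin),
    then append the query in the next style with the answer left open."""
    blocks = [
        formats[i % len(formats)].format(q=q, a=a)
        for i, (q, a) in enumerate(examples)
    ]
    query_block = formats[len(examples) % len(formats)].format(q=query, a="").rstrip()
    return "\n\n".join(blocks + [query_block])

examples = [("2+2?", "4"), ("Capital of France?", "Paris")]
prompt = build_mof_prompt(examples, "3+5?")
```

Because no single style is shared by every demonstration, the model has less opportunity to tie the answer to one particular surface format.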

[NLP-16] Adaptive Computation Pruning for the Forgetting Transformer

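【Quick read】: This paper targets redundant computation in the attention mechanism of the Forgetting Transformer (FoX). Many attention heads in FoX forget quickly, so their output at each timestep relies mainly on local context, and computing their long-range dependencies is wasted work. The proposed method, Adaptive Computation Pruning (ACP), dynamically prunes computations for input-output dependencies that the forget gate has strongly decayed, using a dynamically set pruning threshold that ensures the pruned attention weights stay negligible. Across model sizes and context lengths, ACP cuts the FLOPs of softmax attention by about 70% and raises training throughput by roughly 10% to 35% with no loss in performance. Longer context lengths yield greater computational savings.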

Link: https://arxiv.org/abs/2504.06949
Authors: Zhixuan Lin, Johan Obando-Ceron, Xu Owen He, Aaron Courville
Affiliations: Mila; Université de Montréal; MakerMaker AI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint. Under review

Abstract:The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently better or on-par performance compared to the standard RoPE-based Transformer. Notably, many attention heads in FoX tend to forget quickly, causing their output at each timestep to rely primarily on the local context. Based on this observation, we propose Adaptive Computation Pruning (ACP) for FoX, a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate. This is achieved using a dynamically set pruning threshold that ensures that the pruned attention weights remain negligible. We apply ACP to language model pretraining with FoX and show it consistently reduces the number of FLOPs in softmax attention by around 70% across different model sizes and context lengths, resulting in a roughly 10% to 35% improvement in training throughput. Furthermore, longer context lengths yield greater computational savings. All these speed improvements are achieved without any performance degradation. We also perform several analyses to provide deeper insights into our method, such as examining the pruning patterns and analyzing the distribution of FLOP savings across different attention heads. Our code is available at this https URL.
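
A rough sketch of the pruning criterion follows. It assumes the forget gate contributes an additive log-decay term to the attention logits, and it uses a fixed threshold for illustration. The paper sets its threshold dynamically, and that rule is not given in this listing. The function names are hypothetical.

```python
import numpy as np


def decay_bias(forget_gates):
    """Cumulative log-decay D[i, j] = sum of log f_l for l in (j, i]."""
    c = np.cumsum(np.log(forget_gates))
    D = c[:, None] - c[None, :]
    # Causal mask: a query at step i only sees keys at steps j <= i.
    D[np.triu_indices(len(c), k=1)] = -np.inf
    return D


def keep_mask(forget_gates, threshold):
    """True where the decay bias is at or above the (assumed fixed) threshold."""
    return decay_bias(forget_gates) >= threshold


f = np.full(8, 0.5)                      # a head that forgets quickly
mask = keep_mask(f, threshold=np.log(0.1))
kept_fraction = mask.sum() / np.tril(np.ones((8, 8))).sum()
```

For a fast-forgetting head the kept entries form a narrow band near the diagonal, so everything outside the band could be skipped. That is consistent with the abstract's remark that longer contexts yield greater savings.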

[NLP-17] RuOpinionNE-2024: Extraction of Opinion Tuples from Russian News Texts

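【Quick read】: This paper addresses the extraction of structured opinions from Russian news texts. The task is to extract opinion tuples for a given sentence, where each tuple consists of a sentiment holder, a target, an expression, and the holder's sentiment toward the target. Participants mainly explored large language models (LLMs) in zero-shot, few-shot, and fine-tuning settings, and the best test-set result came from a fine-tuned LLM. The authors also compared 30 prompts and 11 open-source language models with 3 to 32 billion parameters in 1-shot and 10-shot settings to identify the best models and prompts.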

Link: https://arxiv.org/abs/2504.06947
Authors: Natalia Loukachevitch, Natalia Tkachenko, Anna Lapanitsyna, Mikhail Tikhomirov, Nicolay Rusnachenko
Affiliations: Lomonosov Moscow State University; Bauman Moscow State Technical University
Categories: Computation and Language (cs.CL)
Comments: RuOpinionNE-2024 represent a proceeding of RuSentNE-2023. It contributes with extraction and evaluation of factual statements that support the assigned sentiment

Abstract:In this paper, we introduce the Dialogue Evaluation shared task on extraction of structured opinions from Russian news texts. The task of the contest is to extract opinion tuples for a given sentence; the tuples are composed of a sentiment holder, its target, an expression and sentiment from the holder to the target. In total, the task received more than 100 submissions. The participants experimented mainly with large language models in zero-shot, few-shot and fine-tuning formats. The best result on the test set was obtained with fine-tuning of a large language model. We also compared 30 prompts and 11 open source language models with 3-32 billion parameters in the 1-shot and 10-shot settings and found the best models and prompts.

[NLP-18] Data Augmentation for Fake Reviews Detection in Multiple Languages and Multiple Domains

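【Quick read】: This paper addresses fake review detection for online reviews, and in particular the difficulty of training high-performance detectors for low-resource languages or domains. The approach uses large language models to generate synthetic datasets and augments the training data with them, improving fake review detectors across domains (book, restaurant, and hotel reviews) and languages (English and Chinese). With the augmented datasets, accuracy improves by 0.3 percentage points on DeRev TEST, 10.9 on Amazon TEST, 8.3 on Yelp TEST, and 7.2 on DianPing TEST.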

Link: https://arxiv.org/abs/2504.06917
Authors: Ming Liu, Massimo Poesio
Affiliations: Queen Mary University of London
Categories: Computation and Language (cs.CL)
Comments: 32 pages, 15 figures

Abstract:With the growth of the Internet, buying habits have changed, and customers have become more dependent on the online opinions of other customers to guide their purchases. Identifying fake reviews thus became an important area for Natural Language Processing (NLP) research. However, developing high-performance NLP models depends on the availability of large amounts of training data, which are often not available for low-resource languages or domains. In this research, we used large language models to generate datasets to train fake review detectors. Our approach was used to generate fake reviews in different domains (book reviews, restaurant reviews, and hotel reviews) and different languages (English and Chinese). Our results demonstrate that our data augmentation techniques result in improved performance at fake review detection for all domains and languages. The accuracy of our fake review detection model can be improved by 0.3 percentage points on DeRev TEST, 10.9 percentage points on Amazon TEST, 8.3 percentage points on Yelp TEST and 7.2 percentage points on DianPing TEST using the augmented datasets.

[NLP-19] Identifying Aspects in Peer Reviews

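【Quick read】: This paper responds to the strain that a growing volume of submissions places on academic peer review by developing computational support for the process. Its core contribution is an operational definition of "aspect" and a data-driven schema for deriving fine-grained aspects from a corpus of peer reviews. Existing approaches usually derive aspect sets from the review forms and guidelines of major NLP venues, while data-driven aspect identification remains largely unexplored. Taking a bottom-up approach, the authors build a dataset of peer reviews augmented with aspects, show how it supports community-level review analysis, and examine how the choice of aspects affects downstream applications such as detecting LLM-generated reviews. The results lay a foundation for a principled, data-driven study of review aspects and for new NLP applications that support peer review.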

Link: https://arxiv.org/abs/2504.06910
Authors: Sheng Lu, Ilia Kuznetsov, Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab); Department of Computer Science and Hessian Center for AI (hessian.AI); Technical University of Darmstadt
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Peer review is central to academic publishing, but the growing volume of submissions is straining the process. This motivates the development of computational approaches to support peer review. While each review is tailored to a specific paper, reviewers often make assessments according to certain aspects such as Novelty, which reflect the values of the research community. This alignment creates opportunities for standardizing the reviewing process, improving quality control, and enabling computational support. While prior work has demonstrated the potential of aspect analysis for peer review assistance, the notion of aspect remains poorly formalized. Existing approaches often derive aspect sets from review forms and guidelines of major NLP venues, yet data-driven methods for aspect identification are largely underexplored. To address this gap, our work takes a bottom-up approach: we propose an operational definition of aspect and develop a data-driven schema for deriving fine-grained aspects from a corpus of peer reviews. We introduce a dataset of peer reviews augmented with aspects and show how it can be used for community-level review analysis. We further show how the choice of aspects can impact downstream applications, such as LLM-generated review detection. Our results lay a foundation for a principled and data-driven investigation of review aspects, and pave the path for new applications of NLP to support peer review.

[NLP-20] Persona Dynamics: Unveiling the Impact of Personality Traits on Agents in Text-Based Games

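【Quick read】: This paper studies how giving artificial agents human-like personality traits shapes their behavior and performance. It proposes PANDA (Personality-Adapted Neural Decision Agents), a method that projects human personality traits onto agents to guide their behavior. The method has two parts: (1) training a personality classifier to identify which personality type an agent's actions exhibit, and (2) integrating the personality profiles directly into the agent's policy-learning pipeline. Experiments show that an agent's actions can be steered toward specific personality profiles, and that certain personality types, such as those with higher Openness, perform markedly better. The findings point toward more aligned, effective, and human-centric decision-making in interactive environments.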

Link: https://arxiv.org/abs/2504.06868
Authors: Seungwon Lim, Seungbeen Lee, Dongjun Min, Youngjae Yu
Affiliations: Department of Artificial Intelligence, Yonsei University; Department of Computer Science, Yonsei University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Artificial agents are increasingly central to complex interactions and decision-making tasks, yet aligning their behaviors with desired human values remains an open challenge. In this work, we investigate how human-like personality traits influence agent behavior and performance within text-based interactive environments. We introduce PANDA: Personality-Adapted Neural Decision Agents, a novel method for projecting human personality traits onto agents to guide their behavior. To induce personality in a text-based game agent, (i) we train a personality classifier to identify what personality type the agent’s actions exhibit, and (ii) we integrate the personality profiles directly into the agent’s policy-learning pipeline. By deploying agents embodying 16 distinct personality types across 25 text-based games and analyzing their trajectories, we demonstrate that an agent’s action decisions can be guided toward specific personality profiles. Moreover, certain personality types, such as those characterized by higher levels of Openness, display marked advantages in performance. These findings underscore the promise of personality-adapted agents for fostering more aligned, effective, and human-centric decision-making in interactive environments.

[NLP-21] Integrating Cognitive Processing Signals into Language Models: A Review of Advances, Applications and Future Directions

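【Quick read】: This review addresses two main challenges facing Language Models (LMs) and Multimodal Large Language Models (MLLMs) in natural language processing (NLP): data scarcity and the environmental cost of training large-scale models. It surveys work that enhances these models with cognitive signals from cognitive neuroscience, particularly Eye-tracking (ET) signals. User-centric cognitive signals enable efficient data augmentation, faster convergence, and improved human alignment. The review also highlights the potential of ET data in Visual Question Answering (VQA) and in mitigating hallucinations in MLLMs, and discusses emerging challenges and research trends.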

Link: https://arxiv.org/abs/2504.06843
Authors: Angela Lopez-Cardona, Sebastian Idesis, Ioannis Arapakis
Affiliations: Telefónica Scientific Research; Universitat Politècnica de Catalunya
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recently, the integration of cognitive neuroscience in Natural Language Processing (NLP) has gained significant attention. This article provides a critical and timely overview of recent advancements in leveraging cognitive signals, particularly Eye-tracking (ET) signals, to enhance Language Models (LMs) and Multimodal Large Language Models (MLLMs). By incorporating user-centric cognitive signals, these approaches address key challenges, including data scarcity and the environmental costs of training large-scale models. Cognitive signals enable efficient data augmentation, faster convergence, and improved human alignment. The review emphasises the potential of ET data in tasks like Visual Question Answering (VQA) and mitigating hallucinations in MLLMs, and concludes by discussing emerging challenges and research trends.

[NLP-22] Open Problems and a Hypothetical Path Forward in LLM Knowledge Paradigms

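【Quick read】: This paper examines three open problems in the current knowledge paradigm of Large Language Models (LLMs) that limit model capabilities: (1) the difficulty of updating knowledge, (2) the failure of reverse knowledge generalization (the reversal curse), and (3) conflicts in internal knowledge. It reviews recent progress on these problems and discusses potential general solutions. Its main proposal is a hypothetical knowledge paradigm based on Contextual Knowledge Scaling, along with implementation pathways that remain feasible with contemporary techniques. The authors suggest that this approach could address current shortcomings and offer direction for next-generation model architectures.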

Link: https://arxiv.org/abs/2504.06823
Authors: Xiaotian Ye, Mengqi Zhang, Shu Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Blog post preprint, work in progress

Abstract:Knowledge is fundamental to the overall capabilities of Large Language Models (LLMs). The knowledge paradigm of a model, which dictates how it encodes and utilizes knowledge, significantly affects its performance. Despite the continuous development of LLMs under existing knowledge paradigms, issues within these frameworks continue to constrain model potential. This blog post highlights three critical open problems limiting model capabilities: (1) challenges in knowledge updating for LLMs, (2) the failure of reverse knowledge generalization (the reversal curse), and (3) conflicts in internal knowledge. We review recent progress made in addressing these issues and discuss potential general solutions. Based on observations in these areas, we propose a hypothetical paradigm based on Contextual Knowledge Scaling, and further outline implementation pathways that remain feasible within contemporary techniques. Evidence suggests this approach holds potential to address current shortcomings, serving as our vision for future model paradigms. This blog post aims to provide researchers with a brief overview of progress in LLM knowledge systems, while providing inspiration for the development of next-generation model architectures.

[NLP-23] Inducing Programmatic Skills for Agentic Tasks

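【Quick read】: This paper addresses how agents can learn and adapt to varied tasks in web environments. The proposed approach, Agent Skill Induction (ASI), treats programs as an effective representation for skills: agents induce, verify, and use program-based skills on the fly to handle specialized tasks such as searching for products or planning a travel route. On the WebArena benchmark, ASI outperforms the static baseline agent by 23.5% and its text-skill counterpart by 11.3% in success rate, mainly because of the programmatic verification guarantee during induction. It also reduces the number of steps by 10.7-15.3% by composing primitive actions (e.g., click) into higher-level skills. ASI remains efficient and accurate under scaled-up web activities, and when transferring between websites it reuses common skills while updating incompatible ones.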

Link: https://arxiv.org/abs/2504.06821
Authors: Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, Daniel Fried
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:To succeed in common digital tasks such as web navigation, agents must carry out a variety of specialized tasks such as searching for products or planning a travel route. To tackle these tasks, agents can bootstrap themselves by learning task-specific skills online through interaction with the web environment. In this work, we demonstrate that programs are an effective representation for skills. We propose agent skill induction (ASI), which allows agents to adapt themselves by inducing, verifying, and utilizing program-based skills on the fly. We start with an evaluation on the WebArena agent benchmark and show that ASI outperforms the static baseline agent and its text-skill counterpart by 23.5% and 11.3% in success rate, mainly thanks to the programmatic verification guarantee during the induction phase. ASI also improves efficiency by reducing 10.7-15.3% of the steps over baselines, by composing primitive actions (e.g., click) into higher-level skills (e.g., search product). We then highlight the efficacy of ASI in remaining efficient and accurate under scaled-up web activities. Finally, we examine the generalizability of induced skills when transferring between websites, and find that ASI can effectively reuse common skills, while also updating incompatible skills to versatile website changes.

[NLP-24] A Graph Diffusion Algorithm for Lexical Similarity Evaluation

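【Quick read】: This paper evaluates the lexical similarity between a given language and several reference language clusters. The algorithm computes, for each concept, the distances between its translations across all considered languages, builds a weighted directed graph from those distances, and takes the solution of a graph diffusion equation on that graph as the resulting map. Distances between translations are computed from phonetic transcriptions using a modified Damerau-Levenshtein distance. The resulting coordinates can be read as probabilities of belonging to each cluster, or as a lexical similarity distribution with respect to the reference clusters, which makes the method useful for analyzing languages spoken in multilingual territories with strong mutual influence.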

Link: https://arxiv.org/abs/2504.06816
Authors: Karol Mikula, Mariana Sarkociová Remešíková
Affiliations: Slovak University of Technology
Categories: Computation and Language (cs.CL)
Comments: 28 pages

Abstract:In this paper, we present an algorithm for evaluating lexical similarity between a given language and several reference language clusters. As an input, we have a list of concepts and the corresponding translations in all considered languages. Moreover, each reference language is assigned to one of c language clusters. For each of the concepts, the algorithm computes the distance between each pair of translations. Based on these distances, it constructs a weighted directed graph, where every vertex represents a language. Afterwards, it solves a graph diffusion equation with a Dirichlet boundary condition, where the unknown is a map from the vertex set to \mathbb{R}^c. The resulting coordinates are values from the interval [0,1] and they can be interpreted as probabilities of belonging to each of the clusters or as a lexical similarity distribution with respect to the reference clusters. The distances between translations are calculated using phonetic transcriptions and a modification of the Damerau-Levenshtein distance. The algorithm can be useful in analyzing relationships between languages spoken in multilingual territories with a lot of mutual influences. We demonstrate this by presenting a case study regarding various European languages.
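
The diffusion step can be sketched as a small linear solve. This version assumes the steady-state form of the problem, in which reference languages carry fixed one-hot cluster vectors as the Dirichlet condition and unknown languages satisfy the graph Laplace equation. The edge weights are made up, and the paper's exact equation and its distance-to-weight mapping are not given in this listing.

```python
import numpy as np


def diffuse_labels(W, boundary):
    """Solve the steady-state graph diffusion (Laplace) problem.

    W[i, j] is the weight of edge i -> j; `boundary` maps a vertex index
    to its fixed cluster membership vector (the Dirichlet condition).
    """
    n = W.shape[0]
    c = len(next(iter(boundary.values())))
    fixed = sorted(boundary)
    free = [i for i in range(n) if i not in boundary]
    L = np.diag(W.sum(axis=1)) - W          # out-degree graph Laplacian
    U = np.zeros((n, c))
    for i in fixed:
        U[i] = boundary[i]
    # Interior rows of L @ U = 0, rearranged for the unknown vertices.
    rhs = -L[np.ix_(free, fixed)] @ U[fixed]
    U[free] = np.linalg.solve(L[np.ix_(free, free)], rhs)
    return U


# Vertices 0 and 1 are reference languages in clusters A and B;
# vertex 2 is the evaluated language (weights are illustrative only).
W = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [3.0, 1.0, 0.0]])
U = diffuse_labels(W, {0: [1.0, 0.0], 1: [0.0, 1.0]})
```

In this toy case the evaluated language ends up with a weight-proportional mix of the two clusters, and each row of `U` sums to one, which matches the abstract's reading of the coordinates as membership probabilities.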

[NLP-25] Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations

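【Quick read】: This paper addresses the memory overhead of storing every expert in large Mixture-of-Experts (MoE) models, a major limitation for very large models such as DeepSeek-R1 (671B parameters). Studying domain specialization and expert redundancy, the authors find a consistent behavior they call few-shot expert localization: given only a few demonstrations, the model consistently activates a sparse and stable subset of experts. Building on this, they propose EASY-EP, a simple pruning framework that uses a few domain-specific demonstrations to identify and retain only the most relevant experts. EASY-EP has two components. Output-aware expert importance assessment scores each expert using the gating scores and output magnitudes of the activated experts. Expert-level token contribution estimation weighs each token by the similarity between its representations before and after the routed experts. Experiments show that, under the same memory budget, the method achieves performance comparable to the full DeepSeek-R1 and 2.99× throughput while keeping only half the experts.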

Link: https://arxiv.org/abs/2504.06792
Authors: Zican Dong, Han Peng, Peiyu Liu, Wayne Xin Zhao, Dong Wu, Feng Xiao, Zhifeng Wang
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; University of International Business and Economics; YanTron Technology Co. Ltd; EBTech Co. Ltd
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Mixture-of-Experts (MoE) models achieve a favorable trade-off between performance and inference efficiency by activating only a subset of experts. However, the memory overhead of storing all experts remains a major limitation, especially in large-scale MoE models such as DeepSeek-R1 (671B). In this study, we investigate domain specialization and expert redundancy in large-scale MoE models and uncover a consistent behavior we term few-shot expert localization: with only a few demonstrations, the model consistently activates a sparse and stable subset of experts. Building on this observation, we propose a simple yet effective pruning framework, EASY-EP, that leverages a few domain-specific demonstrations to identify and retain only the most relevant experts. EASY-EP comprises two key components: output-aware expert importance assessment and expert-level token contribution estimation. The former evaluates the importance of each expert for the current token by considering the gating scores and magnitudes of the outputs of activated experts, while the latter assesses the contribution of tokens based on representation similarities after and before routed experts. Experiments show that our method can achieve comparable performance and 2.99× throughput under the same memory budget as the full DeepSeek-R1 with only half the experts. Our code is available at this https URL.
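
The importance assessment can be illustrated with a minimal sketch. It assumes an expert's score is its gating value times the norm of its output, accumulated over demonstration tokens, with an optional per-token weight standing in for the token contribution estimate. The exact formulas are not given in this listing, so the functions below are hypothetical.

```python
import numpy as np


def expert_importance(gates, outputs, token_weights=None):
    """Score experts by gate value times output norm, summed over tokens.

    gates:   (tokens, experts) routing scores, zero for inactive experts.
    outputs: (tokens, experts, hidden) per-expert outputs.
    """
    norms = np.linalg.norm(outputs, axis=-1)          # (tokens, experts)
    per_token = gates * norms
    if token_weights is None:
        token_weights = np.ones(gates.shape[0])
    return token_weights @ per_token                  # (experts,)


def experts_to_keep(scores, keep_ratio=0.5):
    """Indices of the top-scoring experts, in ascending index order."""
    k = max(1, int(len(scores) * keep_ratio))
    return np.sort(np.argsort(scores)[::-1][:k])


gates = np.array([[0.7, 0.3, 0.0, 0.0],
                  [0.6, 0.0, 0.4, 0.0]])
outputs = np.ones((2, 4, 2))
scores = expert_importance(gates, outputs)
kept = experts_to_keep(scores)
```

With `keep_ratio=0.5` the sketch retains half the experts, mirroring the setting the abstract reports, and an expert that is never routed to scores zero.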

[NLP-26] FamilyTool: A Multi-hop Personalized Tool Use Benchmark

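【Quick read】: This paper addresses the weak performance of Large Language Models (LLMs) on tool-use tasks that require multi-hop reasoning and inductive knowledge adaptation in personalized, dynamic environments, scenarios that existing tool-learning benchmarks cover poorly. It introduces FamilyTool, a benchmark grounded in a family-based knowledge graph (KG) that simulates personalized, multi-hop tool use and includes an inductive KG setting in which models must adapt to unseen user preferences and relationships without re-training. It also proposes KGETool, a simple KG-augmented evaluation pipeline for systematically assessing LLMs' tool-use ability in these settings. Experiments show significant performance gaps in state-of-the-art LLMs: accuracy drops sharply as hop complexity increases, and the inductive setting exposes severe generalization deficits.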

Link: https://arxiv.org/abs/2504.06766
Authors: Yuxin Wang, Yiran Guo, Yining Zheng, Zhangyue Yin, Shuo Chen, Jie Yang, Jiajun Chen, Xuanjing Huang, Xipeng Qiu
Affiliations: Fudan University; Institute of Modern Languages and Linguistics, Fudan University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The integration of tool learning with Large Language Models (LLMs) has expanded their capabilities in handling complex tasks by leveraging external tools. However, existing benchmarks for tool learning inadequately address critical real-world personalized scenarios, particularly those requiring multi-hop reasoning and inductive knowledge adaptation in dynamic environments. To bridge this gap, we introduce FamilyTool, a novel benchmark grounded in a family-based knowledge graph (KG) that simulates personalized, multi-hop tool use scenarios. FamilyTool challenges LLMs with queries spanning 1 to 3 relational hops (e.g., inferring familial connections and preferences) and incorporates an inductive KG setting where models must adapt to unseen user preferences and relationships without re-training, a common limitation in prior approaches that compromises generalization. We further propose KGETool: a simple KG-augmented evaluation pipeline to systematically assess LLMs’ tool use ability in these settings. Experiments reveal significant performance gaps in state-of-the-art LLMs, with accuracy dropping sharply as hop complexity increases and inductive scenarios exposing severe generalization deficits. These findings underscore the limitations of current LLMs in handling personalized, evolving real-world contexts and highlight the urgent need for advancements in tool-learning frameworks. FamilyTool serves as a critical resource for evaluating and advancing LLM agents’ reasoning, adaptability, and scalability in complex, dynamic environments. Code and dataset are available at Github.

[NLP-27] CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers

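【Quick read】: This paper addresses the scalability limit that the O(N²) complexity of standard attention imposes on Transformers for long sequences. It introduces Circular-convolutional ATtention (CAT), a Fourier-based approach that applies circular convolutions efficiently to reduce the computation to O(N log N) without sacrificing representational power. CAT needs fewer learnable parameters because it streamlines the fully-connected layers, and it introduces no heavier operations. On large-scale benchmarks such as ImageNet-1k and WikiText-103 it yields consistent accuracy improvements and about a 10% speedup in naive PyTorch implementations. The design is grounded in an engineering-isomorphism framework, and ablation studies identify the key conditions behind CAT's success.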

Link: https://arxiv.org/abs/2504.06704
Authors: Yoshihiro Yamada
Affiliations: Preferred Networks
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Transformers have driven remarkable breakthroughs in natural language processing and computer vision, yet their standard attention mechanism still imposes O(N^2) complexity, hindering scalability to longer sequences. We introduce Circular-convolutional ATtention (CAT), a Fourier-based approach that efficiently applies circular convolutions to reduce complexity without sacrificing representational power. CAT achieves O(NlogN) computations, requires fewer learnable parameters by streamlining fully-connected layers, and introduces no heavier operations, resulting in consistent accuracy improvements and about a 10% speedup in naive PyTorch implementations on large-scale benchmarks such as ImageNet-1k and WikiText-103. Grounded in an engineering-isomorphism framework, CAT’s design not only offers practical efficiency and ease of implementation but also provides insights to guide the development of next-generation, high-performance Transformer architectures. Finally, our ablation studies highlight the key conditions underlying CAT’s success, shedding light on broader principles for scalable attention mechanisms.
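
The complexity reduction rests on the convolution theorem, which the following sketch checks numerically. It shows only that identity on 1-D vectors, with hypothetical function names, and is not CAT's architecture. How CAT produces the weights and values is not described in this listing.

```python
import numpy as np


def circular_conv_direct(w, v):
    """O(N^2) reference: out[i] = sum_j w[j] * v[(i - j) mod N]."""
    n = len(w)
    return np.array([
        sum(w[j] * v[(i - j) % n] for j in range(n)) for i in range(n)
    ])


def circular_conv_fft(w, v):
    """O(N log N) version: multiply the spectra, then invert."""
    return np.fft.irfft(np.fft.rfft(w) * np.fft.rfft(v), n=len(w))


rng = np.random.default_rng(0)
w = rng.normal(size=16)   # stand-in for one row of mixing weights
v = rng.normal(size=16)   # stand-in for one value channel
direct = circular_conv_direct(w, v)
fast = circular_conv_fft(w, v)
```

Passing `n=len(w)` to `irfft` keeps the output length correct for odd sequence lengths as well.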

[NLP-28] NLP Security and Ethics in the Wild ACL

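【Quick read】: This paper examines the ethical challenges of NLP Security (NLPSec) research, which assesses the vulnerability of models to malicious attacks and develops countermeasures against them. The authors survey contemporary NLPSec work and its engagement with cybersecurity's ethical norms, and find alarming gaps on topics such as harm minimization and responsible disclosure. To bridge traditional cybersecurity ethics and NLP ethics, they offer concrete recommendations, framed as "white hat NLP", to help researchers work more responsibly and to cultivate an intentional culture of ethical research in NLP Security.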

Link: https://arxiv.org/abs/2504.06669
Authors: Heather Lent, Erick Galinkin, Yiyi Chen, Jens Myrup Pedersen, Leon Derczynski, Johannes Bjerva
Affiliations: Aalborg University (Denmark); NVIDIA Corporation; IT University of Copenhagen (Denmark)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to TACL

Abstract:As NLP models are used by a growing number of end-users, an area of increasing importance is NLP Security (NLPSec): assessing the vulnerability of models to malicious attacks and developing comprehensive countermeasures against them. While work at the intersection of NLP and cybersecurity has the potential to create safer NLP for all, accidental oversights can result in tangible harm (e.g., breaches of privacy or proliferation of malicious models). In this emerging field, however, the research ethics of NLP have not yet faced many of the long-standing conundrums pertinent to cybersecurity, until now. We thus examine contemporary works across NLPSec, and explore their engagement with cybersecurity’s ethical norms. We identify trends across the literature, ultimately finding alarming gaps on topics like harm minimization and responsible disclosure. To alleviate these concerns, we provide concrete recommendations to help NLP researchers navigate this space more ethically, bridging the gap between traditional cybersecurity and NLP ethics, which we frame as “white hat NLP”. The goal of this work is to help cultivate an intentional culture of ethical research for those working in NLP Security.

[NLP-29] SEE: Continual Fine-tuning with Sequential Ensemble of Experts

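【Quick read】: This paper addresses catastrophic forgetting in the continual fine-tuning of Large Language Models (LLMs), along with the performance loss and task-assignment difficulties of existing mitigations. It introduces the Sequential Ensemble of Experts (SEE) framework. SEE removes the need for an additional router: through distributed routing, each expert independently decides whether it should handle a query. During continual fine-tuning, SEE trains only new experts for incoming tasks instead of retraining the entire system. Experiments show that SEE outperforms prior approaches, including multi-task learning, and generalizes well, since the experts can identify out-of-distribution queries and pass them to a more general model.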

Link: https://arxiv.org/abs/2504.06664
Authors: Zhilin Wang, Yafu Li, Xiaoye Qu, Yu Cheng
Affiliations: Jilin University; Shanghai AI Laboratory; Chinese University of Hong Kong
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages

Abstract:Continual fine-tuning of large language models (LLMs) suffers from catastrophic forgetting. Rehearsal-based methods mitigate this problem by retaining a small set of old data. Nevertheless, they still suffer inevitable performance loss. Although training separate experts for each task can help prevent forgetting, effectively assembling them remains a challenge. Some approaches use routers to assign tasks to experts, but in continual learning, they often require retraining for optimal performance. To address these challenges, we introduce the Sequential Ensemble of Experts (SEE) framework. SEE removes the need for an additional router, allowing each expert to independently decide whether a query should be handled. The framework employs distributed routing, and during continual fine-tuning, SEE only requires the training of new experts for incoming tasks rather than retraining the entire system. Experiments reveal that the SEE outperforms prior approaches, including multi-task learning, in continual fine-tuning. It also demonstrates remarkable generalization ability, as the expert can effectively identify out-of-distribution queries, which can then be directed to a more generalized model for resolution. This work highlights the promising potential of integrating routing and response mechanisms within each expert, paving the way for the future of distributed model ensembling.
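
The router-free control flow can be sketched as follows. The `Expert` class, the rule-based `accepts` predicates, and the query order are all assumptions made for illustration. In SEE each expert is a fine-tuned model that makes the routing decision itself, and the listing does not say how.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Expert:
    name: str
    accepts: Callable[[str], bool]   # stand-in for the expert's own routing decision
    answer: Callable[[str], str]


def route(query: str, experts: List[Expert], fallback: Callable[[str], str]) -> str:
    """Ask each expert in turn; the first one that accepts the query answers it."""
    for expert in experts:
        if expert.accepts(query):
            return expert.answer(query)
    return fallback(query)           # out-of-distribution queries go to a general model


experts = [
    Expert("math", lambda q: any(ch.isdigit() for ch in q), lambda q: "math:" + q),
]
# Learning a new task only appends an expert; existing experts are untouched.
experts.append(
    Expert("translate", lambda q: q.startswith("translate"), lambda q: "tr:" + q)
)
general = lambda q: "general:" + q
```

For example, `route("translate hello", experts, general)` is handled by the second expert, while a query that no expert accepts falls through to `general`.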

[NLP-30] Bridging the Gap Between Preference Alignment and Machine Unlearning

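【Quick read】: This paper addresses the limits of mainstream Preference Alignment (PA) methods in low-resource scenarios. Methods such as Reinforcement Learning with Human Feedback (RLHF) need high-quality datasets of positive preference examples, which are costly to obtain, and their training instability makes them computationally intensive. LLM unlearning, which directly removes the influence of negative examples, is a promising alternative, but existing work is largely empirical and lacks systematic quantitative analysis. The paper proposes a framework for studying the relationship between PA and LLM unlearning. A bi-level optimization-based method quantifies how unlearning specific negative examples affects PA performance, and the analysis shows that negative examples do not contribute equally to alignment improvement. Building on this, the authors propose Unlearning to Align (U2A), which uses bi-level optimization to select and unlearn examples efficiently for the best PA performance, and validate it through extensive experiments.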

Link: https://arxiv.org/abs/2504.06659
Authors: Xiaohua Feng, Yuyuan Li, Huwei Ji, Jiaming Zhang, Li Zhang, Tianyu Du, Chaochao Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages

Abstract:Despite advances in Preference Alignment (PA) for Large Language Models (LLMs), mainstream methods like Reinforcement Learning with Human Feedback (RLHF) face notable challenges. These approaches require high-quality datasets of positive preference examples, which are costly to obtain and computationally intensive due to training instability, limiting their use in low-resource scenarios. LLM unlearning technique presents a promising alternative, by directly removing the influence of negative examples. However, current research has primarily focused on empirical validation, lacking systematic quantitative analysis. To bridge this gap, we propose a framework to explore the relationship between PA and LLM unlearning. Specifically, we introduce a bi-level optimization-based method to quantify the impact of unlearning specific negative examples on PA performance. Our analysis reveals that not all negative examples contribute equally to alignment improvement when unlearned, and the effect varies significantly across examples. Building on this insight, we pose a crucial question: how can we optimally select and weight negative examples for unlearning to maximize PA performance? To answer this, we propose a framework called Unlearning to Align (U2A), which leverages bi-level optimization to efficiently select and unlearn examples for optimal PA performance. We validate the proposed method through extensive experiments, with results confirming its effectiveness.

[NLP-31] A Neuro-inspired Interpretation of Unlearning in Large Language Models through Sample-level Unlearning Difficulty

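【Quick read】: This paper addresses the lack of interpretability in Large Language Model (LLM) unlearning, specifically regarding sample-level unlearning difficulty. Existing studies typically assume that all samples are equally hard to unlearn. That simplification risks attributing an unlearning algorithm's performance to sample selection rather than to the algorithm's design, which could steer LLM unlearning research in the wrong direction. Drawing on neuroscience, the paper proposes a Memory Removal Difficulty (MRD) metric to quantify sample-level unlearning difficulty and uses it to analyze how hard-to-unlearn and easy-to-unlearn samples differ. It also proposes an MRD-based weighted sampling method that optimizes existing unlearning algorithms by prioritizing easily forgettable samples, improving unlearning efficiency and effectiveness. Public benchmarks and datasets confirm the effectiveness of the metric and the method.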

Link: https://arxiv.org/abs/2504.06658
Authors: Xiaohua Feng, Yuyuan Li, Chengye Wang, Junlin Liu, Li Zhang, Chaochao Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages

Abstract:Driven by privacy protection laws and regulations, unlearning in Large Language Models (LLMs) is gaining increasing attention. However, current research often neglects the interpretability of the unlearning process, particularly concerning sample-level unlearning difficulty. Existing studies typically assume a uniform unlearning difficulty across samples. This simplification risks attributing the performance of unlearning algorithms to sample selection rather than the algorithm’s design, potentially steering the development of LLM unlearning in the wrong direction. Thus, we investigate the relationship between LLM unlearning and sample characteristics, with a focus on unlearning difficulty. Drawing inspiration from neuroscience, we propose a Memory Removal Difficulty (\mathrm{MRD}) metric to quantify sample-level unlearning difficulty. Using \mathrm{MRD}, we analyze the characteristics of hard-to-unlearn versus easy-to-unlearn samples. Furthermore, we propose an \mathrm{MRD}-based weighted sampling method to optimize existing unlearning algorithms, which prioritizes easily forgettable samples, thereby improving unlearning efficiency and effectiveness. We validate the proposed metric and method using public benchmarks and datasets, with results confirming its effectiveness.
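
The weighted sampling idea can be sketched as follows. The sketch assumes MRD scores are already computed (their definition is not given in this listing) and turns them into probabilities with a softmax over negative difficulty. That weighting rule, the `temperature` parameter, and the scores themselves are illustrative assumptions.

```python
import numpy as np


def sampling_probs(mrd, temperature=1.0):
    """Softmax over negative MRD: lower difficulty gives higher probability."""
    z = -np.asarray(mrd, dtype=float) / temperature
    z -= z.max()                      # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum()


mrd_scores = [0.2, 1.5, 0.9]          # made-up difficulty values for three samples
probs = sampling_probs(mrd_scores)
rng = np.random.default_rng(0)
batch = rng.choice(len(mrd_scores), size=4, p=probs)   # indices for one unlearning batch
```

Lowering `temperature` concentrates sampling on the easiest samples, and raising it approaches uniform sampling.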

[NLP-32] ThoughtProbe: Classifier-Guided Thought Space Exploration Leveraging LLM Intrinsic Reasoning

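【Quick read】: This paper addresses the limited understanding of the neural representation mechanisms behind the intrinsic reasoning capabilities of pre-trained Large Language Models (LLMs), and of how best to use those capabilities. Its key finding is that a simple linear classifier can detect intrinsic reasoning in an LLM's activation space, particularly within specific representation types and network layers. Based on this, the authors propose a classifier-guided search framework that explores a tree-structured response space. At each node expansion, the classifier scores and ranks candidates so that computation goes to the more thoughtful reasoning directions. After the tree is expanded, answers from all branches form a candidate pool, and a branch-aggregation selection method picks the final answer by aggregating the thoughtfulness scores of all branches that support each candidate. Experiments show that the framework covers valid reasoning chains, identifies them effectively, and achieves significant improvements on multiple arithmetic reasoning benchmarks.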

链接: https://arxiv.org/abs/2504.06650
作者: Zijian Wang,Chang Xu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Pre-trained large language models (LLMs) have been demonstrated to possess intrinsic reasoning capabilities that can emerge naturally when expanding the response space. However, the neural representation mechanisms underlying these intrinsic capabilities and approaches for their optimal utilization remain inadequately understood. In this work, we make the key discovery that a simple linear classifier can effectively detect intrinsic reasoning capabilities in LLMs’ activation space, particularly within specific representation types and network layers. Based on this finding, we propose a classifier-guided search framework that strategically explores a tree-structured response space. In each node expansion, the classifier serves as a scoring and ranking mechanism that efficiently allocates computational resources by identifying and prioritizing more thoughtful reasoning directions for continuation. After completing the tree expansion, we collect answers from all branches to form a candidate answer pool. We propose a branch-aggregation selection method that marginalizes over all supporting branches by aggregating their thoughtfulness scores, thereby identifying the optimal answer from the pool. Experimental results show that our framework’s comprehensive exploration not only covers valid reasoning chains but also effectively identifies them, achieving significant improvements across multiple arithmetic reasoning benchmarks.
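
The branch-aggregation step described above can be sketched in a few lines. This is an illustration under assumptions: each finished branch is reduced to an `(answer, score)` pair, the aggregation is a plain sum, and the function name is hypothetical. The abstract says only that scores of supporting branches are aggregated, not how.

```python
from collections import defaultdict


def select_answer(branches):
    """Pick the answer whose supporting branches have the largest total score.

    `branches` is an iterable of (answer, thoughtfulness_score) pairs,
    one per completed branch of the search tree. Summation is an assumed
    aggregation rule chosen for illustration.
    """
    totals = defaultdict(float)
    for answer, score in branches:
        totals[answer] += score
    if not totals:
        raise ValueError("no branches to aggregate")
    return max(totals, key=totals.get)
```

With summation, an answer reached by several moderately scored branches can beat an answer reached by a single high-scoring branch, which is the practical difference from simply taking the top-scoring branch.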

[NLP-33] Wanting to be Understood

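【Quick read】: This paper asks why humans have an intrinsic motivation to understand and to be understood even without extrinsic rewards, and how that motivation can foster cooperation. Using simulations of the perceptual crossing paradigm, it studies reinforcement learning agents under different internal reward functions. The drive to understand is implemented as an active-inference-style artificial curiosity reward, while the drive to be understood is implemented through intrinsic rewards for imitation, influence/impressionability, and sub-reaction-time anticipation of the other agent. Results indicate that artificial curiosity alone does not lead agents to prefer social interaction, whereas rewards emphasizing reciprocal understanding drive agents to prioritize interaction, which in turn can facilitate cooperation.

Link: https://arxiv.org/abs/2504.06611
Authors: Chrisantha Fernando,Dylan Banarse,Simon Osindero
Affiliations: Google DeepMind
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: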

Abstract:This paper explores an intrinsic motivation for mutual awareness, hypothesizing that humans possess a fundamental drive to understand and to be understood even in the absence of extrinsic rewards. Through simulations of the perceptual crossing paradigm, we explore the effect of various internal reward functions in reinforcement learning agents. The drive to understand is implemented as an active inference type artificial curiosity reward, whereas the drive to be understood is implemented through intrinsic rewards for imitation, influence/impressionability, and sub-reaction time anticipation of the other. Results indicate that while artificial curiosity alone does not lead to a preference for social interaction, rewards emphasizing reciprocal understanding successfully drive agents to prioritize interaction. We demonstrate that this intrinsic motivation can facilitate cooperation in tasks where only one agent receives extrinsic reward for the behaviour of the other.

[NLP-34] Automated Business Process Analysis: An LLM-Based Approach to Value Assessment

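【Quick read】: This paper tackles a bottleneck in business process optimization: manual process analysis is slow and inefficient. Value-added analysis, a technique for identifying steps that do not deliver value, is still largely manual, time-consuming, and subjective. The paper proposes an automated approach based on large language models (LLMs) that works in two phases: first, high-level activities are decomposed into detailed steps to enable granular analysis; second, each step is classified through value-added analysis according to Lean principles. This supports systematic identification of waste while retaining the semantic understanding that qualitative analysis requires. The approach is developed on 50 business process models, for which the authors collect and publish manual ground-truth labels. The evaluation shows a consistent benefit of structured prompting over zero-shot baselines and promising performance on both tasks. The paper also discusses how LLMs can augment human expertise in qualitative process analysis while reducing the time and subjectivity of manual approaches.

Link: https://arxiv.org/abs/2504.06600
Authors: William De Michele,Abel Armas Cervantes,Lea Frermann
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: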

Abstract:Business processes are fundamental to organizational operations, yet their optimization remains challenging due to the time-consuming nature of manual process analysis. Our paper harnesses Large Language Models (LLMs) to automate value-added analysis, a qualitative process analysis technique that aims to identify steps in the process that do not deliver value. To date, this technique is predominantly manual, time-consuming, and subjective. Our method offers a more principled approach which operates in two phases: first, decomposing high-level activities into detailed steps to enable granular analysis, and second, performing a value-added analysis to classify each step according to Lean principles. This approach enables systematic identification of waste while maintaining the semantic understanding necessary for qualitative analysis. We develop our approach using 50 business process models, for which we collect and publish manual ground-truth labels. Our evaluation, comparing zero-shot baselines with more structured prompts, reveals (a) a consistent benefit of structured prompting and (b) promising performance for both tasks. We discuss the potential for LLMs to augment human expertise in qualitative process analysis while reducing the time and subjectivity inherent in manual approaches.

[NLP-35] Bypassing Safety Guardrails in LLMs Using Humor

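【Quick read】: This paper studies how a humorous prompt can bypass the safety guardrails of large language models (LLMs). The key is a method that leaves the unsafe request unedited and follows a fixed template; it is simple to implement and needs no additional LLMs to craft prompts. Experiments show that the method is effective across different LLMs. The study also finds that both removing humor and adding more of it reduce effectiveness, possibly because excessive humor distracts the LLM from the unsafe request. The paper therefore argues that LLM jailbreaking occurs when focus on the unsafe request and the presence of humor are properly balanced.

Link: https://arxiv.org/abs/2504.06577
Authors: Pedro Cisneros-Velarde
Affiliations: VMware Research
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: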

Abstract:In this paper, we show it is possible to bypass the safety guardrails of large language models (LLMs) through a humorous prompt including the unsafe request. In particular, our method does not edit the unsafe request and follows a fixed template – it is simple to implement and does not need additional LLMs to craft prompts. Extensive experiments show the effectiveness of our method across different LLMs. We also show that both removing and adding more humor to our method can reduce its effectiveness – excessive humor possibly distracts the LLM from fulfilling its unsafe request. Thus, we argue that LLM jailbreaking occurs when there is a proper balance between focus on the unsafe request and presence of humor.

[NLP-36] Defending LLM Watermarking Against Spoofing Attacks with Contrastive Representation Learning

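【Quick read】: This paper addresses the security of watermarking for detecting text generated by large language models (LLMs), specifically its weakness against spoofing attacks. Existing research focuses on the quality of watermarked text, high detectability, and robustness against removal attacks, and pays little attention to spoofing. For example, watermarked text can be maliciously altered to contain hate speech while the original watermark is preserved, damaging the reputation of the LLM provider. The paper identifies two core challenges: (1) the watermark must be sensitive to semantic-distorting changes yet insensitive to semantic-preserving edits, and (2) detecting global semantic shifts conflicts with the local, auto-regressive nature of most watermarking schemes. To address them, it proposes a semantic-aware algorithm that embeds watermarks post hoc into a given text while preserving its meaning. A contrastively trained semantic mapping model guides the generation of the green-red token list. Experiments show strong robustness against removal attacks and security against spoofing attacks (such as sentiment reversal and toxic content insertion) while maintaining high detectability.

Link: https://arxiv.org/abs/2504.06575
Authors: Li An,Yujian Liu,Yepeng Liu,Yang Zhang,Yuheng Bu,Shiyu Chang
Affiliations: UC Santa Barbara; University of Florida; MIT-IBM Watson AI Lab
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: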

Abstract:Watermarking has emerged as a promising technique for detecting texts generated by LLMs. Current research has primarily focused on three design criteria: high quality of the watermarked text, high detectability, and robustness against removal attack. However, the security against spoofing attacks remains relatively understudied. For example, a piggyback attack can maliciously alter the meaning of watermarked text, transforming it into hate speech, while preserving the original watermark, thereby damaging the reputation of the LLM provider. We identify two core challenges that make defending against spoofing difficult: (1) the need for watermarks to be both sensitive to semantic-distorting changes and insensitive to semantic-preserving edits, and (2) the contradiction between the need to detect global semantic shifts and the local, auto-regressive nature of most watermarking schemes. To address these challenges, we propose a semantic-aware watermarking algorithm that post-hoc embeds watermarks into a given target text while preserving its original meaning. Our method introduces a semantic mapping model, which guides the generation of a green-red token list, contrastively trained to be sensitive to semantic-distorting changes and insensitive to semantic-preserving changes. Experiments on two standard benchmarks demonstrate strong robustness against removal attacks and security against spoofing attacks, including sentiment reversal and toxic content insertion, while maintaining high watermark detectability. Our approach offers a significant step toward more secure and semantically aware watermarking for LLMs. Our code is available at this https URL.
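
For readers unfamiliar with green-red token lists, the sketch below shows the generic detection statistic used by that family of schemes: count how many tokens fall in the green list and compare against what chance would give. It is not the paper's method. The hash-based list assignment seeded by the previous token stands in for the semantic mapping model, and every function name, the `gamma` parameter, and the toy embedder are assumptions made for illustration.

```python
import hashlib
import math


def is_green(prev_token, token, gamma=0.5):
    """Pseudo-randomly place `token` in the green list, seeded by `prev_token`.

    Assumption: a hash of the previous token seeds the split. The paper
    instead derives the list from a learned semantic mapping.
    """
    digest = hashlib.sha256(f"{prev_token}|{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def green_z_score(tokens, gamma=0.5):
    """z-score of the green-token count against the no-watermark expectation."""
    n = len(tokens) - 1  # number of (previous, current) token pairs
    if n < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, gamma) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


def toy_watermarked_tokens(length, gamma=0.5):
    """Toy embedder: always append the first candidate that is green."""
    tokens = ["<s>"]
    for _ in range(length):
        for i in range(10_000):
            candidate = f"w{i}"
            if is_green(tokens[-1], candidate, gamma):
                tokens.append(candidate)
                break
        else:
            raise RuntimeError("no green candidate found")
    return tokens
```

A text whose tokens are all green gives a z-score of about the square root of the number of token pairs when `gamma` is 0.5, which is why even short watermarked passages are detectable. The spoofing problem the paper targets arises because edits that change meaning can leave most of those green tokens in place.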

[NLP-37] Do Reasoning Models Show Better Verbalized Calibration?

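【Quick read】: This paper asks whether large reasoning models (LRMs) are better calibrated than instruction-tuned models, particularly in their verbalized confidence. It examines, across diverse domains, the calibration of LRMs trained via supervised fine-tuning distillation on long reasoning traces (SFT reasoning models) and via outcome-based reinforcement learning (RL reasoning models).

The key to the study is comparing LRMs trained in these two ways against instruction-tuned models, with particular attention to calibration on factuality tasks. LRMs significantly outperform instruction-tuned models on complex reasoning tasks in both accuracy and confidence calibration, but factuality tasks show a different picture: SFT reasoning models in particular display worse calibration (greater overconfidence) than instruct models. The results suggest that reasoning-oriented RL training may play a critical role in improving the ability of LLMs to generate trustworthy, self-aware outputs.

Link: https://arxiv.org/abs/2504.06564
Authors: Qingcheng Zeng,Weihao Xuan,Leyang Cui,Rob Voigt
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Work in Progress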

Abstract:Large reasoning models (LRMs) have recently shown impressive capabilities in complex reasoning by leveraging increased test-time computation and exhibiting behaviors akin to human-like deliberation. Despite these advances, it remains an open question whether LRMs are better calibrated - particularly in their verbalized confidence - compared to instruction-tuned counterparts. In this paper, we investigate the calibration properties of LRMs trained via supervised fine-tuning distillation on long reasoning traces (henceforth SFT reasoning models) and outcome-based reinforcement learning for reasoning (henceforth RL reasoning models) across diverse domains. Our findings reveal that LRMs significantly outperform instruction-tuned models on complex reasoning tasks in both accuracy and confidence calibration. In contrast, we find surprising trends in the domain of factuality in particular. On factuality tasks, while Deepseek-R1 shows strong calibration behavior, smaller QwQ-32B shows no improvement over instruct models; moreover, SFT reasoning models display worse calibration (greater overconfidence) compared to instruct models. Our results provide evidence for a potentially critical role of reasoning-oriented RL training in improving LLMs’ capacity for generating trustworthy, self-aware outputs.
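
As a concrete picture of what "calibration of verbalized confidence" means, the sketch below computes expected calibration error (ECE), one common calibration metric. The abstract does not say which metric the authors use, so ECE, the equal-width binning, and the function name are assumptions. Confidences are the probabilities a model states for its own answers, rescaled to the range 0 to 1.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences: stated confidence per answer, each in [0, 1]
    correct:     1 if the answer was right, 0 otherwise
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece
```

An overconfident model, such as one that states 90% confidence but is right half the time, shows up as a large gap between average confidence and accuracy in the high-confidence bins.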

[NLP-38] FuseRL: Dense Preference Optimization for Heterogeneous Model Fusion

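【Quick read】: This paper addresses a limitation of existing heterogeneous model fusion methods: they rely solely on selecting the best output from the source models for each prompt, which underuses source knowledge and yields sparse optimization signals. It proposes FuseRL, a two-stage framework consisting of FuseSFT and FusePO. FuseSFT builds a robust initialization through weighted supervised fine-tuning (SFT) on diverse source-model outputs for each prompt; FusePO then optimizes weighted preferences based on the outputs of multiple source models to improve alignment. Experiments show the framework is effective across several preference alignment methods, and it achieves state-of-the-art performance among 8B LLMs on the AlpacaEval-2 and Arena-Hard benchmarks. Further analysis suggests that FuseSFT regularizes training to reduce overfitting, while FusePO introduces dense and diverse preference optimization signals.

Link: https://arxiv.org/abs/2504.06562
Authors: Longguang Zhong,Fanqi Wan,Ziyi Yang,Guosheng Liang,Tianyuan Shi,Xiaojun Quan
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: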

Abstract:Heterogeneous model fusion enhances the performance of LLMs by integrating the knowledge and capabilities of multiple structurally diverse models. However, existing approaches often rely solely on selecting the best output for each prompt from source models, which underutilizes their full potential due to limited source knowledge and results in sparse optimization signals. To address this limitation, we propose FuseRL, a novel two-stage framework comprising FuseSFT and FusePO to maximize the utilization of source LLMs. FuseSFT establishes a robust initialization by integrating the strengths of heterogeneous source models through weighted supervised fine-tuning (SFT) on diverse outputs for each prompt. FusePO optimizes weighted preferences based on the outputs of multiple source models to enable superior alignment performance. Extensive experiments demonstrate the effectiveness of our framework across various preference alignment methods, including RLOO, DPO, and SimPO. Using Llama-3.1-8B-Instruct as the target model, our approach achieves state-of-the-art performance among 8B LLMs on the AlpacaEval-2 and Arena-Hard benchmarks. Further analysis suggests that FuseSFT regularizes the training process to reduce overfitting, while FusePO introduces dense and diverse signals for preference optimization.

[NLP-39] NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables

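【Quick read】: This paper addresses the lack of robustness of large language models (LLMs) on long structured tables; existing long-context benchmarks focus mainly on unstructured text and overlook the challenges of long, complex tables. To fill this gap, it introduces NeedleInATable (NIAT), a new task that treats each table cell as a "needle" and requires the model to extract the target cell under different queries. On the solution side, the key contribution is a data synthesis method: the synthesized training data significantly improves LLM performance on NIAT, outperforming both long-context LLMs and long-table agent methods. The work advances the evaluation of LLMs' genuine long-table comprehension and lays groundwork for progress in long-context and table understanding applications.

Link: https://arxiv.org/abs/2504.06560
Authors: Lanrui Wang,Mingyu Zheng,Hongyin Tang,Zheng Lin,Yanan Cao,Jingang Wang,Xunliang Cai,Weiping Wang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; Meituan Inc.
Categories: Computation and Language (cs.CL)
Comments: Work in Progress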

Abstract:Processing structured tabular data, particularly lengthy tables, constitutes a fundamental yet challenging task for large language models (LLMs). However, existing long-context benchmarks primarily focus on unstructured text, neglecting the challenges of long and complex structured tables. To address this gap, we introduce NeedleInATable (NIAT), a novel task that treats each table cell as a “needle” and requires the model to extract the target cell under different queries. Evaluation results of mainstream LLMs on this benchmark show they lack robust long-table comprehension, often relying on superficial correlations or shortcuts for complex table understanding tasks, revealing significant limitations in processing intricate tabular data. To this end, we propose a data synthesis method to enhance models’ long-table comprehension capabilities. Experimental results show that our synthesized training data significantly enhances LLMs’ performance on the NIAT task, outperforming both long-context LLMs and long-table agent methods. This work advances the evaluation of LLMs’ genuine long-structured table comprehension capabilities and paves the way for progress in long-context and table understanding applications.
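
A cell-as-needle probe of the kind described above might be built as in the sketch below. This is a guess at the general shape of the task, not the benchmark's actual format: the Markdown serialization, the question template, the use of the first column as a row key, exact-match scoring, and all function names are assumptions.

```python
def table_to_markdown(header, rows):
    """Serialize a table as Markdown text to place in the model's context."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def make_cell_query(header, rows, row_idx, col_idx):
    """Build a question targeting one cell; return (question, gold answer).

    Assumes the first column uniquely identifies each row.
    """
    key_name, key_value = header[0], rows[row_idx][0]
    question = (
        f"In the row where {key_name} is {key_value}, "
        f"what is the value of {header[col_idx]}?"
    )
    return question, rows[row_idx][col_idx]


def exact_match(prediction, gold):
    """Score a model answer by exact string match after trimming whitespace."""
    return prediction.strip() == gold.strip()
```

Sweeping `row_idx` and `col_idx` over every position of a long table would show whether accuracy depends on where the target cell sits, which is the kind of robustness the summary says current models lack.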

[NLP-40] Lugha-Llama: Adapting Large Language Models for African Languages

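【Quick read】: This paper addresses the difficulty large language models (LLMs) have with low-resource African languages, which are poorly represented in large training corpora. The key solution is a training mix that combines curated African-language data with high-quality English educational texts, which substantially improves performance on these languages, notably on knowledge-intensive multiple-choice questions (AfriMMLU) and on the cross-lingual question answering benchmark AfriQA. To understand the role of the English data, the authors translate a 200M-token subset into Swahili; the analysis indicates that the content of these data is primarily responsible for the strong performance. Models and data are released to encourage future research on African languages.

Link: https://arxiv.org/abs/2504.06536
Authors: Happy Buzaaba,Alexander Wettig,David Ifeoluwa Adelani,Christiane Fellbaum
Affiliations: Princeton University; Mila, McGill University & Canada CIFAR AI Chair
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: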

Abstract:Large language models (LLMs) have achieved impressive results in a wide range of natural language applications. However, they often struggle to recognize low-resource languages, in particular African languages, which are not well represented in large training corpora. In this paper, we consider how to adapt LLMs to low-resource African languages. We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model’s performance on these languages. On the challenging IrokoBench dataset, our models consistently achieve the best performance amongst similarly sized baselines, particularly on knowledge-intensive multiple-choice questions (AfriMMLU). Additionally, on the cross-lingual question answering benchmark AfriQA, our models outperform the base model by over 10%. To better understand the role of English data during training, we translate a subset of 200M tokens into Swahili language and perform an analysis which reveals that the content of these data is primarily responsible for the strong performance. We release our models and data to encourage future research on African languages.

[NLP-41] CDER: Collaborative Evidence Retrieval for Document-level Relation Extraction

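【Quick read】: This paper addresses a weakness of existing evidence retrieval systems for document-level relation extraction (DocRE): they fail to exploit collaborative patterns among semantically similar entity pairs within the same document, which limits retrieval effectiveness. It proposes CDER, a new evidence retrieval framework whose key elements are an attentional graph-based architecture that captures these collaborative patterns and a dynamic sub-structure that adds robustness to evidence retrieval.

Link: https://arxiv.org/abs/2504.06529
Authors: Khai Phan Tran,Xue Li
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Published at ACIIDS 2024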

Abstract:Document-level Relation Extraction (DocRE) involves identifying relations between entities across multiple sentences in a document. Evidence sentences, crucial for precisely identifying relationships between entity pairs, enhance focus on essential text segments, improving DocRE performance. However, existing evidence retrieval systems often overlook the collaborative nature among semantically similar entity pairs in the same document, hindering the effectiveness of the evidence retrieval task. To address this, we propose a novel evidence retrieval framework, namely CDER. CDER employs an attentional graph-based architecture to capture collaborative patterns and incorporates a dynamic sub-structure for additional robustness in evidence retrieval. Experimental results on the benchmark DocRE dataset show that CDER not only excels in the evidence retrieval task but also enhances overall performance of existing DocRE systems.

[NLP-42] Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?

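【Quick read】: This paper addresses the overthinking that reasoning large language models (LLMs) exhibit on ill-posed questions with missing premises (MiP). Such overthinking produces long, redundant, and ineffective responses, runs against the "test-time scaling law", and points to a critical flaw in current training recipes for reasoning LLMs: they do not adequately encourage efficient thinking, leading to abuse of thinking patterns. To investigate the causes, the paper conducts fine-grained analyses of reasoning length, overthinking patterns, and the location of critical thinking across different types of LLMs, and finds that overthinking is contagious: it spreads through distillation of reasoning models' responses. These results offer new insight into understanding and mitigating overthinking.

Link: https://arxiv.org/abs/2504.06514
Authors: Chenrui Fan,Ming Li,Lichao Sun,Tianyi Zhou
Affiliations: University of Maryland; Lehigh University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: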

Abstract:We find that the response length of reasoning LLMs, whether trained by reinforcement learning or supervised learning, drastically increases for ill-posed questions with missing premises (MiP), ending up with redundant and ineffective thinking. This newly introduced scenario exacerbates the general overthinking issue to a large extent, which we name MiP-Overthinking. Such failures are against the “test-time scaling law” but have been widely observed on multiple datasets we curated with MiP, indicating the harm of cheap overthinking and a lack of critical thinking. Surprisingly, LLMs not specifically trained for reasoning exhibit much better performance on the MiP scenario, producing much shorter responses that quickly identify ill-posed queries. This implies a critical flaw of the current training recipe for reasoning LLMs, which does not encourage efficient thinking adequately, leading to the abuse of thinking patterns. To further investigate the reasons behind such failures, we conduct fine-grained analyses of the reasoning length, overthinking patterns, and location of critical thinking on different types of LLMs. Moreover, our extended ablation study reveals that the overthinking is contagious through the distillation of reasoning models’ responses. These results improve the understanding of overthinking and shed novel insights into mitigating the problem.

[NLP-43] Analyzing Examinee Comments using DistilBERT and Machine Learning to Ensure Quality Control in Exam Content

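【Quick read】: This paper explores using natural language processing (NLP) to analyze examinee comments in order to identify problematic test items. The key solution is to develop and validate machine learning models that automatically identify relevant negative feedback, and to evaluate whether incorporating psychometric features improves model performance. The paper also compares NLP-flagged items with traditionally flagged items, showing that examinee feedback provides valuable complementary information to statistical methods and may improve test validity while reducing the manual review burden. The work gives testing organizations an efficient mechanism for bringing candidates' direct experience into quality assurance.

Link: https://arxiv.org/abs/2504.06465
Authors: Ye (Cheryl) Ma
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: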

Abstract:This study explores using Natural Language Processing (NLP) to analyze candidate comments for identifying problematic test items. We developed and validated machine learning models that automatically identify relevant negative feedback, evaluated whether incorporating psychometric features enhances model performance, and compared NLP-flagged items with traditionally flagged items. Results demonstrate that candidate feedback provides valuable complementary information to statistical methods, potentially improving test validity while reducing manual review burden. This research offers testing organizations an efficient mechanism to incorporate direct candidate experience into quality assurance processes.

[NLP-44] Can LLMs Simulate Personas with Reversed Performance? A Benchmark for Counterfactual Instruction Following

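【Quick read】: This paper addresses the inability of large language models (LLMs) to simulate personas with reversed performance (for example, low-proficiency student personas), which reduces simulation diversity and limits practical use of simulated environments. It proposes the first benchmark dataset for evaluating this capability, which the authors call "counterfactual instruction following". Using mathematical reasoning as the scenario, the experiments find that both open-weight and closed-source LLMs, including the OpenAI o1 reasoning model, struggle with the task, and that simulating both the performance level and the race population of a persona worsens the effect further. The results highlight the challenge of counterfactual instruction following and the need for further research.

Link: https://arxiv.org/abs/2504.06460
Authors: Sai Adith Senthil Kumar,Hao Yan,Saipavan Perepa,Murong Yue,Ziyu Yao
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: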

Abstract:Large Language Models (LLMs) are now increasingly widely used to simulate personas in virtual environments, leveraging their instruction-following capability. However, we discovered that even state-of-the-art LLMs cannot simulate personas with reversed performance (e.g., student personas with low proficiency in educational settings), which impairs the simulation diversity and limits the practical applications of the simulated environments. In this work, using mathematical reasoning as a representative scenario, we propose the first benchmark dataset for evaluating LLMs on simulating personas with reversed performance, a capability that we dub “counterfactual instruction following”. We evaluate both open-weight and closed-source LLMs on this task and find that LLMs, including the OpenAI o1 reasoning model, all struggle to follow counterfactual instructions for simulating reversedly performing personas. Intersectionally simulating both the performance level and the race population of a persona worsens the effect even further. These results highlight the challenges of counterfactual instruction following and the need for further research.

[NLP-45] Don't Let It Hallucinate: Premise Verification via Retrieval-Augmented Logical Reasoning

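【Quick read】: This paper addresses hallucinations in large language models (LLMs), especially when a user query contains false premises. A false premise contradicts established facts and can mislead the model into producing fabricated or misleading information. Existing approaches rely on pretraining, fine-tuning, or inference-time techniques; they tend to be computationally expensive, require extensive training data, or lack a proactive mechanism for preventing hallucination before generation, which limits their efficiency in real-time applications.

The key solution is a retrieval-based framework that identifies and addresses false premises before generation. The method first transforms the user's query into a logical representation, then applies retrieval-augmented generation (RAG) to assess the validity of each premise against factual sources. Finally, the verification results are incorporated into the LLM's prompt to keep the final output factually consistent. The approach reduces hallucinations and improves factual accuracy without requiring access to model logits or large-scale fine-tuning.

Link: https://arxiv.org/abs/2504.06438
Authors: Yuehan Qin,Shawn Li,Yi Nian,Xinyan Velocity Yu,Yue Zhao,Xuezhe Ma
Affiliations: University of Southern California
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: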

Abstract:Large language models (LLMs) have shown substantial capacity for generating fluent, contextually appropriate responses. However, they can produce hallucinated outputs, especially when a user query includes one or more false premises, that is, claims that contradict established facts. Such premises can mislead LLMs into offering fabricated or misleading details. Existing approaches include pretraining, fine-tuning, and inference-time techniques that often rely on access to logits or address hallucinations after they occur. These methods tend to be computationally expensive, require extensive training data, or lack proactive mechanisms to prevent hallucination before generation, limiting their efficiency in real-time applications. We propose a retrieval-based framework that identifies and addresses false premises before generation. Our method first transforms a user’s query into a logical representation, then applies retrieval-augmented generation (RAG) to assess the validity of each premise using factual sources. Finally, we incorporate the verification results into the LLM’s prompt to maintain factual consistency in the final output. Experiments show that this approach effectively reduces hallucinations, improves factual accuracy, and does not require access to model logits or large-scale fine-tuning.

[NLP-46] Language-Dependent Political Bias in AI: A Study of ChatGPT and Gemini

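【Quick read】: This paper examines political tendencies in large language models (LLMs) and how they vary with the query language. It asks whether leading LLMs such as ChatGPT and Gemini are in fact politically neutral and unbiased, as they claim. Using a political axis test administered in 14 languages, the study finds that both models show liberal and left-leaning tendencies, more pronounced in Gemini than in ChatGPT, and that these biases vary with the language of the query. It discusses contributing factors, including the sources and scope of educational data, structural and grammatical features of languages, cultural and political contexts, and the models' responses to linguistic features. From an ethical standpoint, the paper proposes that AI tools should refrain from claiming to have no political tendencies and should instead strive for political neutrality, taking these tendencies into account when executing user queries.

Link: https://arxiv.org/abs/2504.06436
Authors: Dogus Yuksel,Mehmet Cem Catalbas,Bora Oc
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Applications (stat.AP)
Comments: 26 pages, 10 figures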

Abstract:As leading examples of large language models, ChatGPT and Gemini claim to provide accurate and unbiased information, emphasizing their commitment to political neutrality and avoidance of personal bias. This research investigates the political tendency of large language models and the existence of differentiation according to the query language. For this purpose, ChatGPT and Gemini were subjected to a political axis test using 14 different languages. The findings of the study suggest that these large language models do exhibit political tendencies, with both models demonstrating liberal and leftist biases. A comparative analysis revealed that Gemini exhibited a more pronounced liberal and left-wing tendency compared to ChatGPT. The study also found that these political biases varied depending on the language used for inquiry. The study delves into the factors that constitute political tendencies and linguistic differentiation, exploring differences in the sources and scope of educational data, structural and grammatical features of languages, cultural and political contexts, and the model’s response to linguistic features. From this standpoint, and an ethical perspective, it is proposed that artificial intelligence tools should refrain from asserting a lack of political tendencies and neutrality, instead striving for political neutrality and executing user queries by incorporating these tendencies.

[NLP-47] S'MoRE: Structural Mixture of Residual Experts for LLM Fine-tuning

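【Quick read】: This paper addresses the dual challenge of balancing parameter efficiency and model capacity when fine-tuning pre-trained large language models (LLMs). Low-rank adaptation (LoRA) is efficient but lacks flexibility, while Mixture-of-Experts (MoE) architectures increase capacity at the cost of more under-utilized parameters. The paper proposes Structural Mixture of Residual Experts (S'MoRE), a framework that combines the efficiency of LoRA with the flexibility of MoE. S'MoRE applies a hierarchical low-rank decomposition to expert weights, yielding residuals of varying orders that are interconnected in a multi-layer structure. By routing input tokens through sub-trees of residuals, it emulates the capacity of many experts while instantiating and assembling only a few low-rank matrices. The inter-layer propagation of residuals is formulated as a special type of Graph Neural Network (GNN), and the authors prove that, under a similar parameter budget, S'MoRE improves the "structural flexibility" of traditional MoE (or Mixture-of-LoRA) by an exponential order. Theoretical analysis and empirical results indicate superior fine-tuning performance.

Link: https://arxiv.org/abs/2504.06426
Authors: Hanqing Zeng,Yinglong Xia,Zhuokai Zhao,Gilbert Jiang,Qiang Zhang,Jiayi Liu,Lizhu Zhang,Xiangjun Fan,Benyu Zhang
Affiliations: Meta
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: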

Abstract:Fine-tuning pre-trained large language models (LLMs) presents a dual challenge of balancing parameter efficiency and model capacity. Existing methods like low-rank adaptations (LoRA) are efficient but lack flexibility, while Mixture-of-Experts (MoE) architectures enhance model capacity at the cost of more under-utilized parameters. To address these limitations, we propose Structural Mixture of Residual Experts (S’MoRE), a novel framework that seamlessly integrates the efficiency of LoRA with the flexibility of MoE. Specifically, S’MoRE employs hierarchical low-rank decomposition of expert weights, yielding residuals of varying orders interconnected in a multi-layer structure. By routing input tokens through sub-trees of residuals, S’MoRE emulates the capacity of many experts by instantiating and assembling just a few low-rank matrices. We craft the inter-layer propagation of S’MoRE’s residuals as a special type of Graph Neural Network (GNN), and prove that under similar parameter budget, S’MoRE improves “structural flexibility” of traditional MoE (or Mixture-of-LoRA) by exponential order. Comprehensive theoretical analysis and empirical results demonstrate that S’MoRE achieves superior fine-tuning performance, offering a transformative approach for efficient LLM adaptation.

[NLP-48] Understanding Machine Unlearning Through the Lens of Mode Connectivity

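【Quick read】: This paper studies machine unlearning, which aims to remove undesired information from a trained model without full retraining from scratch. Its key idea is to analyze unlearning through the lens of mode connectivity, examining connectivity under a range of conditions: between different unlearning methods, between models trained with and without curriculum learning, and between models optimized with first-order and second-order techniques. The analysis reveals distinct patterns in how different evaluation metrics fluctuate along the connecting curve, as well as mechanistic similarities and differences between unlearning methods. According to the authors, this is the first study of mode connectivity in the context of machine unlearning.

Link: https://arxiv.org/abs/2504.06407
Authors: Jiali Cheng,Hadi Amiri
Affiliations: University of Massachusetts Lowell
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: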

Abstract:Machine Unlearning aims to remove undesired information from trained models without requiring full retraining from scratch. Despite recent advancements, their underlying loss landscapes and optimization dynamics have received less attention. In this paper, we investigate and analyze machine unlearning through the lens of mode connectivity - the phenomenon where independently trained models can be connected by smooth low-loss paths in the parameter space. We define and study mode connectivity in unlearning across a range of overlooked conditions, including connections between different unlearning methods, models trained with and without curriculum learning, and models optimized with first-order and second-order techniques. Our findings show distinct patterns of fluctuation of different evaluation metrics along the curve, as well as the mechanistic (dis)similarity between unlearning methods. To the best of our knowledge, this is the first study on mode connectivity in the context of machine unlearning.
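
The basic measurement behind mode connectivity can be sketched as below: evaluate a loss at points along a path between two parameter vectors and report how far it rises above the endpoints. This is a simplified illustration, assuming a straight-line path and flat lists of floats as parameters. The abstract speaks of smooth low-loss curves without naming the curve family, and all function names here are hypothetical.

```python
def interpolate(theta_a, theta_b, t):
    """Point on the straight line between two parameter vectors, t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(theta_a, theta_b)]


def loss_along_path(theta_a, theta_b, loss_fn, steps=11):
    """Evaluate `loss_fn` at `steps` evenly spaced points from theta_a to theta_b."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        loss_fn(interpolate(theta_a, theta_b, i / (steps - 1)))
        for i in range(steps)
    ]


def loss_barrier(losses):
    """Height of the worst point on the path above the worse endpoint."""
    return max(losses) - max(losses[0], losses[-1])
```

For example, with the toy double-well loss `lambda th: (th[0] ** 2 - 1) ** 2` and endpoints `[-1.0]` and `[1.0]`, both endpoints have zero loss while the midpoint does not, so the straight path shows a barrier. Two unlearned models would count as linearly connected when the barrier is close to zero; in the unlearning setting, `loss_fn` could be any of the evaluation metrics tracked along the curve.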

[NLP-49] The Zero Body Problem: Probing LLM Use of Sensory Language

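【Quick read】: This paper asks whether language models can approximate human use of sensory language, which expresses embodied experiences ranging from taste and sound to emotional states. Language models are not embodied, yet it remains of interest whether they capture how humans use such language.

The key to the study is extending an existing corpus of parallel human and model responses to short story prompts with an additional 18,000 stories generated by 18 popular models. All models generate stories whose use of sensory language differs significantly from human usage, but the direction of the difference varies considerably between model families: Gemini models use significantly more sensory language than humans along most axes, whereas most models from the remaining five families use significantly less. Linear probes run on five models suggest that the models can identify sensory language, but preliminary evidence suggests that instruction tuning may discourage its use. The expanded story dataset is released to support further work.

Link: https://arxiv.org/abs/2504.06393
Authors: Rebecca M. M. Hicke,Sil Hamilton,David Mimno
Affiliations: Department of Computer Science; Cornell University; Department of Information Science
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: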

Abstract:Sensory language expresses embodied experiences ranging from taste and sound to excitement and stomachache. This language is of interest to scholars from a wide range of domains including robotics, narratology, linguistics, and cognitive science. In this work, we explore whether language models, which are not embodied, can approximate human use of embodied language. We extend an existing corpus of parallel human and model responses to short story prompts with an additional 18,000 stories generated by 18 popular models. We find that all models generate stories that differ significantly from human usage of sensory language, but the direction of these differences varies considerably between model families. Namely, Gemini models use significantly more sensory language than humans along most axes whereas most models from the remaining five families use significantly less. Linear probes run on five models suggest that they are capable of identifying sensory language. However, we find preliminary evidence suggesting that instruction tuning may discourage usage of sensory language. Finally, to support further work, we release our expanded story dataset.

[NLP-50] Query Understanding in LLM-based Conversational Information Seeking WWW’25

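[Quick Read]: This tutorial paper addresses query understanding in Conversational Information Seeking (CIS): accurately interpreting user intent through context-aware interaction, including resolving ambiguities, refining queries, and adapting to evolving information needs. The key idea is to use Large Language Models (LLMs) to strengthen query understanding by interpreting nuanced language and adapting dynamically, which improves the relevance and precision of search results in real time. The tutorial covers advanced techniques for LLM-based CIS systems, such as LLM-driven methods for building robust evaluation metrics that assess query understanding in multi-turn interactions, strategies for building more interactive systems, and applications like proactive query management and query reformulation. It also discusses key challenges in integrating LLMs for query understanding in conversational search systems and outlines future research directions.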

Link: https://arxiv.org/abs/2504.06356
Authors: Yifei Yuan, Zahra Abbasiantaeb, Yang Deng, Mohammad Aliannejadi
Affiliations: University of Copenhagen (Denmark); University of Amsterdam (The Netherlands); Singapore Management University (Singapore)
Categories: Computation and Language (cs.CL)
Comments: WWW’25 Tutorial

Abstract:Query understanding in Conversational Information Seeking (CIS) involves accurately interpreting user intent through context-aware interactions. This includes resolving ambiguities, refining queries, and adapting to evolving information needs. Large Language Models (LLMs) enhance this process by interpreting nuanced language and adapting dynamically, improving the relevance and precision of search results in real-time. In this tutorial, we explore advanced techniques to enhance query understanding in LLM-based CIS systems. We delve into LLM-driven methods for developing robust evaluation metrics to assess query understanding quality in multi-turn interactions, strategies for building more interactive systems, and applications like proactive query management and query reformulation. We also discuss key challenges in integrating LLMs for query understanding in conversational search systems and outline future research directions. Our goal is to deepen the audience’s understanding of LLM-based conversational query understanding and inspire discussions to drive ongoing advancements in this field.

[NLP-51] On the Effectiveness and Generalization of Race Representations for Debiasing High-Stakes Decisions

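[Quick Read]: This paper aims to understand and mitigate racial bias in large language models (LLMs) used for high-stakes decisions. It introduces two decision tasks, Admissions and Hiring, built on hypothetical applicant profiles in which race can be inferred from the applicant's name, and uses them as simplified test beds. Gemma 2B Instruct and LLaMA 3.2 3B Instruct both show strong bias: Gemma grants admission to 26% more White than Black applicants, and LLaMA hires 60% more Asian than White applicants. These biases are resistant to prompt engineering; none of the prompting strategies tested promotes fairness.

The key to the solution is distributed alignment search, which identifies "race subspaces" within model activations so that interventions on those subspaces can debias the model's decisions. Averaging the representation across all races within the subspaces reduces Gemma's bias by 37-57%. The study also finds that Gemma's race subspaces generalize only to a limited extent, since changing the prompt format can alter the race representation. Overall, mechanistic approaches look like a promising direction for improving LLM fairness, but a universal race representation remains elusive.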

Link: https://arxiv.org/abs/2504.06303
Authors: Dang Nguyen, Chenhao Tan
Affiliations: Department of Computer Science, University of Chicago
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 21 pages, 15 figures, 14 tables

Abstract:Understanding and mitigating biases is critical for the adoption of large language models (LLMs) in high-stakes decision-making. We introduce Admissions and Hiring, decision tasks with hypothetical applicant profiles where a person’s race can be inferred from their name, as simplified test beds for racial bias. We show that Gemma 2B Instruct and LLaMA 3.2 3B Instruct exhibit strong biases. Gemma grants admission to 26% more White than Black applicants, and LLaMA hires 60% more Asian than White applicants. We demonstrate that these biases are resistant to prompt engineering: multiple prompting strategies all fail to promote fairness. In contrast, using distributed alignment search, we can identify “race subspaces” within model activations and intervene on them to debias model decisions. Averaging the representation across all races within the subspaces reduces Gemma’s bias by 37-57%. Finally, we examine the generalizability of Gemma’s race subspaces, and find limited evidence for generalization, where changing the prompt format can affect the race representation. Our work suggests mechanistic approaches may provide a promising venue for improving the fairness of LLMs, but a universal race representation remains elusive.
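The averaging intervention described in the abstract can be pictured with a small Python sketch. Everything concrete here is an assumption made for illustration: the 3-D activations, the group names, and the one-dimensional orthonormal `basis`. In the paper the subspace is found with distributed alignment search, which this sketch does not implement.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def subspace_coords(h, basis):
    """Coordinates of activation h along each orthonormal basis vector."""
    return [dot(h, b) for b in basis]


def set_subspace_coords(h, basis, target):
    """Return a copy of h whose subspace coordinates equal `target`.

    The component of h orthogonal to the subspace is left untouched.
    """
    out = list(h)
    for b, current, wanted in zip(basis, subspace_coords(h, basis), target):
        out = [o + (wanted - current) * b_i for o, b_i in zip(out, b)]
    return out


# Hypothetical activations for the same profile under names from different groups.
activations = {
    "group_a": [2.0, 1.0, 5.0],
    "group_b": [-2.0, 1.0, 5.0],
    "group_c": [3.0, 1.0, 5.0],
}
basis = [[1.0, 0.0, 0.0]]  # assumed, already-identified subspace

coords = [subspace_coords(h, basis) for h in activations.values()]
mean_coords = [
    sum(c[i] for c in coords) / len(coords) for i in range(len(basis))
]
debiased = {
    group: set_subspace_coords(h, basis, mean_coords)
    for group, h in activations.items()
}
print(debiased)
```

After the intervention every group shares the same coordinates inside the subspace, so downstream layers can no longer read group membership from it, while all other directions of the activation are preserved.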

[NLP-52] Reducing Formal Context Extraction: A Newly Proposed Framework from Big Corpora

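[Quick Read]: This paper addresses the computational inefficiency and potential ambiguity caused by very large formal contexts (FC) when concept hierarchies are extracted from free text. The key contribution is a framework that shrinks the formal context by combining a WordNet-based method with a frequency-based technique, so that the concept lattice and concept hierarchy can be extracted faster. Using concept lattice invariants, the study checks that the lattice built from the reduced context keeps the structural relationships of the standard lattice and retains up to 98% of its quality. In running-time tests on random datasets of varying density, the new framework outperforms five baseline methods.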

Link: https://arxiv.org/abs/2504.06285
Authors: Bryar A. Hassan, Shko M. Qader, Alla A. Hassan, Joan Lu, Aram M. Ahmed, Jafar Majidpour, Tarik A. Rashid
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Automating the extraction of concept hierarchies from free text is advantageous because manual generation is frequently labor- and resource-intensive. The whole procedure for concept hierarchy learning from free text entails several phases, including sentence-level text processing, sentence splitting, and tokenization. Lemmatization is followed by formal context analysis (FCA) to derive the pairings. Nevertheless, there could be a few uninteresting and incorrect pairings in the formal context. It may take a while to generate the formal context; thus, reducing its size is necessary to weed out irrelevant and incorrect pairings and to extract the concept lattice and hierarchies more quickly. This study proposes a framework for reducing the formal context when extracting concept hierarchies from free text, in order to reduce the ambiguity of the formal context. We achieve this by reducing the size of the formal context using a hybrid of a WordNet-based method and a frequency-based technique. Using 385 samples from the Wikipedia corpus and the suggested framework, tests are carried out to examine the reduced formal context and the resulting concept lattice and concept hierarchy. With the help of concept lattice invariants, the lattice generated from the reduced formal context is compared to the standard one. The homomorphism between the resulting lattices retains up to 98% of the quality of the generated concept hierarchies, and the reduced concept lattice preserves the structural connections of the standard one. Additionally, the new framework is compared to five baseline techniques in terms of running time on random datasets with various densities. The findings demonstrate that, across various fill ratios, the hybrid approach of the proposed method outperforms the competing strategies in concept lattice performance.
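As a rough illustration of the frequency-based half of this reduction, the Python sketch below drops (object, attribute) pairs that occur too rarely before any lattice is built. The word pairs, the `min_count` threshold, and the function names are hypothetical, and the WordNet-based part of the paper's hybrid method is not shown.

```python
from collections import Counter


def reduce_formal_context(pairs, min_count=2):
    """Keep only (object, attribute) pairs seen at least `min_count` times."""
    counts = Counter(pairs)
    return {pair for pair, n in counts.items() if n >= min_count}


def shared_attributes(objects, context):
    """Attributes common to every listed object (the FCA derivation operator)."""
    attribute_sets = [
        {attr for obj, attr in context if obj == wanted} for wanted in objects
    ]
    return set.intersection(*attribute_sets) if attribute_sets else set()


# Hypothetical pairs extracted from text; ("cat", "bark") is a one-off error.
extracted = [
    ("dog", "bark"), ("dog", "bark"),
    ("dog", "run"), ("dog", "run"),
    ("cat", "run"), ("cat", "run"),
    ("cat", "bark"),
]

context = reduce_formal_context(extracted, min_count=2)
print(sorted(context))
print(shared_attributes(["dog", "cat"], context))
```

With the noisy pair removed, the context is smaller and the concepts derived from it are no longer polluted by the spurious attribute.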

[NLP-53] A Diverse and Effective Retrieval-Based Debt Collection System with Expert Knowledge NAACL2025

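[Quick Read]: This paper tackles the design of debt collection systems for the financial industry, in particular how to maintain script diversity, contextual relevance, and coherence while improving operational efficiency and reducing cost. The key steps are building a script library from real debtor-collector conversations and proposing a two-stage retrieval-based response system that improves contextual relevance. Knowledge distillation is then used to make the system efficient enough to deploy. Together these components form a scalable, automated solution that offers useful guidance for debt collection practice in real-world applications.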

Link: https://arxiv.org/abs/2504.06273
Authors: Jiaming Luo, Weiyi Luo, Guoqing Sun, Mengchen Zhu, Haifeng Tang, Kunyao Lan, Mengyue Wu, Kenny Q. Zhu
Affiliations: X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; China Merchants Bank Credit Card Center, Shanghai, China; University of Texas at Arlington, Arlington, Texas, USA
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by NAACL 2025, Industry Track

Abstract:Designing effective debt collection systems is crucial for improving operational efficiency and reducing costs in the financial industry. However, the challenges of maintaining script diversity, contextual relevance, and coherence make this task particularly difficult. This paper presents a debt collection system based on real debtor-collector data from a major commercial bank. We construct a script library from real-world debt collection conversations, and propose a two-stage retrieval based response system for contextual relevance. Experimental results show that our system improves script diversity, enhances response relevance, and achieves practical deployment efficiency through knowledge distillation. This work offers a scalable and automated solution, providing valuable insights for advancing debt collection practices in real-world applications.

[NLP-54] ER-RAG: Enhance RAG with ER-Based Unified Modeling of Heterogeneous Data Sources

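[Quick Read]: This paper addresses two weaknesses of existing retrieval-augmented generation (RAG) methods: poor adaptability in low-resource or black-box environments, and complicated operations when evidence must be integrated across heterogeneous data sources. Both stem from reliance on strategies tailored to individual data sources, which are hard to apply efficiently when evidence is fragmented across sources. The proposed framework, ER-RAG, unifies evidence integration across heterogeneous sources using the Entity-Relationship (ER) model. It standardizes entity retrieval and relationship querying through ER-based APIs and uses a two-stage generation process: first, a preference optimization module selects the best data sources; second, another module constructs API chains based on the source schemas. This unified design supports efficient fine-tuning and seamless integration across data sources. ER-RAG won all three tracks of the 2024 KDD Cup CRAG Challenge, performing on par with commercial RAG pipelines while using an 8B LLM backbone; it beat hybrid competitors by 3.1% in LLM score and sped up retrieval by 5.5x.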

Link: https://arxiv.org/abs/2504.06271
Authors: Yikuan Xia, Jiazun Chen, Yirui Zhan, Suifeng Zhao, Weipeng Jiang, Chaorui Zhang, Wei Han, Bo Bai, Jun Gao
Affiliations: Key Laboratory of High Confidence Software Technologies, CS, Peking University; Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd, Huawei Research
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: Large language models (LLMs) excel in question-answering (QA) tasks, and retrieval-augmented generation (RAG) enhances their precision by incorporating external evidence from diverse sources like web pages, databases, and knowledge graphs. However, current RAG methods rely on agent-specific strategies for individual data sources, posing challenges in low-resource or black-box environments and complicating operations when evidence is fragmented across sources. To address these limitations, we propose ER-RAG, a framework that unifies evidence integration across heterogeneous data sources using the Entity-Relationship (ER) model. ER-RAG standardizes entity retrieval and relationship querying through ER-based APIs with GET and JOIN operations. It employs a two-stage generation process: first, a preference optimization module selects optimal sources; second, another module constructs API chains based on source schemas. This unified approach allows efficient fine-tuning and seamless integration across diverse data sources. ER-RAG demonstrated its effectiveness by winning all three tracks of the 2024 KDDCup CRAG Challenge, achieving performance on par with commercial RAG pipelines using an 8B LLM backbone. It outperformed hybrid competitors by 3.1% in LLM score and accelerated retrieval by 5.5X.
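The abstract mentions GET and JOIN operations but not their signatures, so the Python sketch below is only a guess at what a unified entity-relationship view over two sources might look like. The function names `get` and `join`, their parameters, and the sample data are all hypothetical and are not taken from ER-RAG.

```python
def get(source, entity, **conditions):
    """GET: return rows of `entity` in `source` that match every condition."""
    return [
        row for row in source.get(entity, [])
        if all(row.get(key) == value for key, value in conditions.items())
    ]


def join(left_rows, right_rows, left_key, right_key):
    """JOIN: pair rows whose key values match and merge their fields."""
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)
    return [
        {**left, **right}
        for left in left_rows
        for right in index.get(left[left_key], [])
    ]


# Two hypothetical sources exposed through the same entity/row view.
database = {
    "movie": [
        {"movie_id": 1, "title": "Alpha"},
        {"movie_id": 2, "title": "Beta"},
    ]
}
knowledge_graph = {
    "award": [{"film": 1, "award": "Best Picture", "year": 2020}]
}

# An "API chain": GET from each source, then JOIN the evidence.
movies = get(database, "movie", title="Alpha")
awards = get(knowledge_graph, "award")
chain = join(movies, awards, left_key="movie_id", right_key="film")
print(chain)
```

Because both sources answer the same two calls, the component that plans the chain does not need source-specific logic, which is the property the abstract attributes to the ER-based design.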

[NLP-55] EXCLAIM: An Explainable Cross-Modal Agentic System for Misinformation Detection with Hierarchical Retrieval

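[Quick Read]: This paper addresses the detection of Out-of-Context (OOC) misinformation, which distorts meaning by pairing authentic images with misleading textual narratives. Existing methods rely mainly on coarse-grained image-text similarity metrics, which often miss subtle inconsistencies and offer little explainability. Multi-modal large language models (MLLMs) are strong at visual reasoning and explanation generation, but they have not yet shown the ability to handle the complex, fine-grained, cross-modal distinctions needed for robust OOC detection. The paper therefore proposes EXCLAIM, a retrieval-based framework whose key idea is to bring in external knowledge through a multi-granularity index of multi-modal events and entities, and to combine multi-granularity contextual analysis with a multi-agent reasoning architecture that systematically evaluates the consistency and integrity of multi-modal news content. Experiments show that EXCLAIM detects OOC misinformation with 4.3% higher accuracy than state-of-the-art approaches while offering explainable and actionable insights.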

Link: https://arxiv.org/abs/2504.06269
Authors: Yin Wu, Zhengxuan Zhang, Fuling Wang, Yuyu Luo, Hui Xiong, Nan Tang
Affiliations: The Hong Kong University of Science and Technology; The Hong Kong University of Science and Technology (Guangzhou)
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 2 figures

Abstract:Misinformation continues to pose a significant challenge in today’s information ecosystem, profoundly shaping public perception and behavior. Among its various manifestations, Out-of-Context (OOC) misinformation is particularly obscure, as it distorts meaning by pairing authentic images with misleading textual narratives. Existing methods for detecting OOC misinformation predominantly rely on coarse-grained similarity metrics between image-text pairs, which often fail to capture subtle inconsistencies or provide meaningful explainability. While multi-modal large language models (MLLMs) demonstrate remarkable capabilities in visual reasoning and explanation generation, they have not yet demonstrated the capacity to address complex, fine-grained, and cross-modal distinctions necessary for robust OOC detection. To overcome these limitations, we introduce EXCLAIM, a retrieval-based framework designed to leverage external knowledge through multi-granularity index of multi-modal events and entities. Our approach integrates multi-granularity contextual analysis with a multi-agent reasoning architecture to systematically evaluate the consistency and integrity of multi-modal news content. Comprehensive experiments validate the effectiveness and resilience of EXCLAIM, demonstrating its ability to detect OOC misinformation with 4.3% higher accuracy compared to state-of-the-art approaches, while offering explainable and actionable insights.

[NLP-56] Information-Theoretic Reward Decomposition for Generalizable RLHF

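[Quick Read]: This paper addresses the limited generalization of existing reward models in Reinforcement Learning from Human Feedback (RLHF). Reward models are typically trained by widening the reward gap between chosen and rejected responses while overlooking the prompts those responses are conditioned on, so they may generalize poorly when evaluated on prompt-response pairs outside the training distribution. The key idea is to decompose the reward value into two independent components: a prompt-free reward, determined by the response alone, and a prompt-related reward, which reflects both the prompt and the response. The decomposition is derived from an information-theoretic perspective and needs no extra models. The paper then proposes a new reward learning algorithm that prioritizes data samples according to their prompt-free reward values. Experiments show that the decomposition characterizes the two parts of the reward model well and that the method improves both the alignment performance and the generalization capability of the reward model.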

Link: https://arxiv.org/abs/2504.06020
Authors: Liyuan Mao, Haoran Xu, Amy Zhang, Weinan Zhang, Chenjia Bai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Work done during internships at Institute of Artificial Intelligence (TeleAI), China Telecom

Abstract:A generalizable reward model is crucial in Reinforcement Learning from Human Feedback (RLHF) as it enables correctly evaluating unseen prompt-response pairs. However, existing reward models lack this ability, as they are typically trained by increasing the reward gap between chosen and rejected responses, while overlooking the prompts that the responses are conditioned on. Consequently, when the trained reward model is evaluated on prompt-response pairs that lie outside the data distribution, neglecting the effect of prompts may result in poor generalization of the reward model. To address this issue, we decompose the reward value into two independent components: prompt-free reward and prompt-related reward. Prompt-free reward represents the evaluation that is determined only by responses, while the prompt-related reward reflects the reward that derives from both the prompt and the response. We extract these two components from an information-theoretic perspective, which requires no extra models. Subsequently, we propose a new reward learning algorithm by prioritizing data samples based on their prompt-free reward values. Through toy examples, we demonstrate that the extracted prompt-free and prompt-related rewards effectively characterize two parts of the reward model. Further, standard evaluations show that our method improves both the alignment performance and the generalization capability of the reward model.
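The phrase "increasing the reward gap between chosen and rejected responses" usually refers to a Bradley-Terry style pairwise loss. The Python sketch below shows that standard objective, not the paper's new algorithm, and the reward values passed in are made-up numbers.

```python
import math


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected))."""
    gap = reward_chosen - reward_rejected
    # -log(sigmoid(gap)) is algebraically equal to log(1 + exp(-gap)).
    return math.log1p(math.exp(-gap))


# A larger gap gives a smaller loss.
print(pairwise_reward_loss(2.0, 0.0))
print(pairwise_reward_loss(0.5, 0.0))

# Only the gap matters: shifting both rewards leaves the loss unchanged.
print(pairwise_reward_loss(1.0, 0.0) == pairwise_reward_loss(3.0, 2.0))
```

Since the loss sees only the difference between two rewards, it cannot tell whether a gap comes from how well a response fits its prompt or from properties of the response alone, which is the gap in the objective that the decomposition into prompt-free and prompt-related rewards is meant to expose.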

[NLP-57] StealthRank: LLM Ranking Manipulation via Stealthy Prompt Optimization

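[Quick Read]: This paper examines the new attack surface created when large language models (LLMs) are integrated into information retrieval systems, specifically adversarial manipulation of LLM-generated rankings. It presents StealthRank, an adversarial ranking attack that manipulates LLM-driven product recommendation systems while keeping the text fluent and the attack stealthy. The key technique is an energy-based optimization framework combined with Langevin dynamics to generate StealthRank Prompts (SRPs), adversarial text sequences embedded in product descriptions that subtly yet effectively influence the LLM's ranking mechanism. Experiments show that StealthRank outperforms state-of-the-art adversarial ranking baselines in both effectiveness and stealth, exposing critical vulnerabilities in LLM-driven recommendation systems.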

Link: https://arxiv.org/abs/2504.05804
Authors: Yiming Tang, Yi Fan, Chenxiao Yu, Tiankai Yang, Yue Zhao, Xiyang Hu
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract: The integration of large language models (LLMs) into information retrieval systems introduces new attack surfaces, particularly for adversarial ranking manipulations. We present StealthRank, a novel adversarial ranking attack that manipulates LLM-driven product recommendation systems while maintaining textual fluency and stealth. Unlike existing methods that often introduce detectable anomalies, StealthRank employs an energy-based optimization framework combined with Langevin dynamics to generate StealthRank Prompts (SRPs), adversarial text sequences embedded within product descriptions that subtly yet effectively influence LLM ranking mechanisms. We evaluate StealthRank across multiple LLMs, demonstrating its ability to covertly boost the ranking of target products while avoiding explicit manipulation traces that can be easily detected. Our results show that StealthRank consistently outperforms state-of-the-art adversarial ranking baselines in both effectiveness and stealth, highlighting critical vulnerabilities in LLM-driven recommendation systems.

[NLP-58] CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models

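[Quick Read]: This paper addresses the heavy dependence of open large language models (LLMs) on human input in practical use, particularly for guiding dialogue flow and optimizing responses. Although LLMs have advanced natural language processing considerably, running them effectively still takes substantial human intervention to adapt the model to specific task requirements. The paper proposes two main innovations. First, it trains a lightweight model called TinyAgent on a carefully curated, high-quality dataset. Second, it introduces the Collaborative Multi-Agent Tuning (CMAT) framework, which strengthens language agents through adaptive weight updates driven by environmental feedback. CMAT promotes collaborative learning and real-time adaptation among multiple agents and improves their context awareness and long-term memory. The paper also designs a new communication agent framework that combines multi-agent systems with environmental feedback mechanisms, providing a scalable way to explore cooperative behavior. Experiments show that TinyAgent-7B performs on par with GPT-3.5 despite having fewer parameters, indicating a meaningful gain in the efficiency and effectiveness of LLMs.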

Link: https://arxiv.org/abs/2404.01663
Authors: Xuechen Liang, Meiling Tao, Yinghui Xia, Tianyu Shi, Jun Wang, JingSong Yang
Affiliations: East China Jiaotong University; University of Minnesota - Twin Cities; Guangdong University of Technology; Autoagents.ai; University of Electronic Science and Technology of China; University of Toronto; East China Normal University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
Comments:

Abstract: Open large language models (LLMs) have significantly advanced the field of natural language processing, showcasing impressive performance across various tasks. Despite the significant advancements in LLMs, their effective operation still relies heavily on human input to accurately guide the dialogue flow, with agent tuning being a crucial optimization technique that involves human adjustments to the model for better response to such guidance. Addressing this dependency, our work introduces the TinyAgent model, trained on a meticulously curated high-quality dataset. We also present the Collaborative Multi-Agent Tuning (CMAT) framework, an innovative system designed to augment language agent capabilities through adaptive weight updates based on environmental feedback. This framework fosters collaborative learning and real-time adaptation among multiple intelligent agents, enhancing their context-awareness and long-term memory. In this research, we propose a new communication agent framework that integrates multi-agent systems with environmental feedback mechanisms, offering a scalable method to explore cooperative behaviors. Notably, our TinyAgent-7B model exhibits performance on par with GPT-3.5, despite having fewer parameters, signifying a substantial improvement in the efficiency and effectiveness of LLMs.

[NLP-59] MultiDelete for Multimodal Machine Unlearning ECCV2024

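[Quick Read]: This paper addresses machine unlearning for multimodal data and models. Removing knowledge about specific training samples is already hard for conventional models, and the multimodal setting is harder still because of complex dependencies between data modalities and the high cost of training on large multimodal datasets and architectures. The paper proposes MultiDelete, a multimodal unlearning method whose key idea is to decouple the associations between the unimodal data points marked for deletion without weakening the overall representation strength of the trained model. MultiDelete emphasizes three properties: (a) modality decoupling, which effectively severs the association between the unimodal data points marked for deletion; (b) multimodal knowledge retention, which preserves the multimodal representation after unlearning; and (c) unimodal knowledge retention, which preserves the unimodal representation after unlearning. MultiDelete is also efficient to train and is not restricted to strongly convex losses, a common constraint among existing methods. Experiments show that MultiDelete improves on the best baseline by an average of 17.6 points when unlearning multimodal samples, while retaining the original model's multimodal and unimodal knowledge and giving unlearned data better protection against adversarial attacks.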

Link: https://arxiv.org/abs/2311.12047
Authors: Jiali Cheng, Hadi Amiri
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ECCV 2024

Abstract: Machine Unlearning removes specific knowledge about training data samples from an already trained model. It has significant practical benefits, such as purging private, inaccurate, or outdated information from trained models without the need for complete re-training. Unlearning within a multimodal setting presents unique challenges due to the complex dependencies between different data modalities and the expensive cost of training on large multimodal datasets and architectures. This paper presents the first machine unlearning approach for multimodal data and models, titled MultiDelete, which is designed to decouple associations between unimodal data points during unlearning without losing the overall representation strength of the trained model. MultiDelete advocates for three key properties for effective multimodal unlearning: (a) modality decoupling, which effectively decouples the association between individual unimodal data points marked for deletion, rendering them as unrelated data points, (b) multimodal knowledge retention, which retains the multimodal representation post-unlearning, and (c) unimodal knowledge retention, which retains the unimodal representation post-unlearning. MultiDelete is efficient to train and is not constrained by using a strongly convex loss, a common restriction among existing baselines. Experiments on two architectures and four datasets, including image-text and graph-text datasets, show that MultiDelete gains an average improvement of 17.6 points over the best performing baseline in unlearning multimodal samples, can maintain the multimodal and unimodal knowledge of the original model post-unlearning, and can provide better protection to unlearned data against adversarial attacks.

[NLP-60] RNN-Transducer-based Losses for Speech Recognition on Noisy Targets

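[Quick Read]: This paper addresses training speech recognition systems on noisy transcripts in industrial pipelines, where datasets are so large that accurate transcription of every instance is hard to guarantee. The key contribution is a set of new loss functions that reduce the impact of transcription errors in RNN-Transducer models. The Star-Transducer loss handles deletion errors by adding "skip frame" transitions to the loss lattice, restoring over 90% of the performance of a model trained on accurate transcripts. The Bypass-Transducer loss uses "skip token" transitions to handle insertion errors, recovering more than 60% of the quality. The Target-Robust Transducer loss combines the two approaches to give robustness against arbitrary errors. Experiments show that the Target-Robust Transducer loss significantly improves RNN-T performance on noisy data, restoring over 70% of the quality obtained with well-transcribed data.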

Link: https://arxiv.org/abs/2504.06963
Authors: Vladimir Bataev
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Final Project Report, Bachelor’s Degree in Computer Science, University of London, March 2024

Abstract:Training speech recognition systems on noisy transcripts is a significant challenge in industrial pipelines, where datasets are enormous and ensuring accurate transcription for every instance is difficult. In this work, we introduce novel loss functions to mitigate the impact of transcription errors in RNN-Transducer models. Our Star-Transducer loss addresses deletion errors by incorporating “skip frame” transitions in the loss lattice, restoring over 90% of the system’s performance compared to models trained with accurate transcripts. The Bypass-Transducer loss uses “skip token” transitions to tackle insertion errors, recovering more than 60% of the quality. Finally, the Target-Robust Transducer loss merges these approaches, offering robust performance against arbitrary errors. Experimental results demonstrate that the Target-Robust Transducer loss significantly improves RNN-T performance on noisy data by restoring over 70% of the quality compared to well-transcribed data.

Computer Vision

[CV-0] FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution

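[Quick Read]: This paper addresses three key requirements in video depth estimation: (1) accurate and consistent depth across consecutive frames, (2) high-resolution depth maps, and (3) support for real-time streaming. It proposes FlashDepth, which makes careful modifications to pretrained single-image depth models and performs depth estimation on 2044x1148 streaming video at 24 FPS. The key to the solution is adapting and optimizing an existing single-image depth model so that it meets all three requirements with relatively little data and training, while offering sharper boundaries and higher speed than other state-of-the-art models at competitive accuracy.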

Link: https://arxiv.org/abs/2504.07093
Authors: Gene Chou, Wenqi Xian, Guandao Yang, Mohamed Abdelfattah, Bharath Hariharan, Noah Snavely, Ning Yu, Paul Debevec
Affiliations: Netflix Eyeline Studios; Cornell University; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A versatile video depth estimation model should (1) be accurate and consistent across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044x1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data and training. We evaluate our approach across multiple unseen datasets against state-of-the-art depth models, and find that ours outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable various applications that require high-resolution depth, such as video editing, and online decision-making, such as robotics.

[CV-1] Are We Done with Object-Centric Learning?

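[Quick Read]: This paper examines how effective object-centric learning (OCL) methods are for out-of-distribution (OOD) generalization, in particular for the challenge caused by spurious background cues. The key contribution is a novel, training-free probe called Object-Centric Classification with Applied Masks (OCCAM), which shows that segmentation-based encoding of individual objects significantly outperforms slot-based OCL methods; separating objects in pixel space with recent segmentation models already gives strong zero-shot performance on OOD object discovery benchmarks. Despite this progress in separating objects, how that ability contributes to broader OCL goals such as OOD generalization remains an open question. The paper also provides a toolbox for scalable object-centric representations and discusses practical applications and fundamental questions such as understanding object perception in human cognition.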

Link: https://arxiv.org/abs/2504.07092
Authors: Alexander Rubinstein, Ameya Prabhu, Matthias Bethge, Seong Joon Oh
Affiliations: Tübingen AI Center, University of Tübingen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Object-centric learning (OCL) seeks to learn representations that only encode an object, isolated from other objects or background cues in a scene. This approach underpins various aims, including out-of-distribution (OOD) generalization, sample-efficient composition, and modeling of structured environments. Most research has focused on developing unsupervised mechanisms that separate objects into discrete slots in the representation space, evaluated using unsupervised object discovery. However, with recent sample-efficient segmentation models, we can separate objects in the pixel space and encode them independently. This achieves remarkable zero-shot performance on OOD object discovery benchmarks, is scalable to foundation models, and can handle a variable number of slots out-of-the-box. Hence, the goal of OCL methods to obtain object-centric representations has been largely achieved. Despite this progress, a key question remains: How does the ability to separate objects within a scene contribute to broader OCL objectives, such as OOD generalization? We address this by investigating the OOD generalization challenge caused by spurious background cues through the lens of OCL. We propose a novel, training-free probe called Object-Centric Classification with Applied Masks (OCCAM), demonstrating that segmentation-based encoding of individual objects significantly outperforms slot-based OCL methods. However, challenges in real-world applications remain. We provide the toolbox for the OCL community to use scalable object-centric representations, and focus on practical applications and fundamental questions, such as understanding object perception in human cognition. Our code is available at this https URL.
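The core move behind a masks-then-classify probe can be shown with a toy Python example. This is only a sketch under strong assumptions: the 3x3 "image", the hand-written mask, and the deliberately naive `toy_classifier` are all hypothetical. A real pipeline would obtain the mask from a segmentation model and feed the masked image to a pretrained encoder.

```python
def apply_mask(image, mask, fill=0):
    """Keep pixels where the mask is 1 and replace everything else with `fill`."""
    return [
        [pixel if keep else fill for pixel, keep in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


def toy_classifier(image):
    """Hypothetical classifier that is swayed by bright background pixels."""
    total = sum(sum(row) for row in image)
    return "bright-scene" if total > 10 else "dark-object"


# A dark object (value 1) surrounded by a bright, spurious background (value 9).
image = [
    [9, 9, 9],
    [9, 1, 9],
    [9, 9, 9],
]
object_mask = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

masked = apply_mask(image, object_mask)
print(toy_classifier(image))
print(toy_classifier(masked))
```

Masking out the background before classification removes the cue that misled the classifier, which is the intuition behind encoding each segmented object on its own.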

[CV-2] GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

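[Quick Read]: This paper addresses the limitations of existing camera trajectory generation methods: traditional approaches rely on geometric optimization or handcrafted procedural systems, while learning-based methods often inherit structural biases or lack textual alignment, all of which constrain creative synthesis. The key contribution is GenDoP, an auto-regressive model inspired by the expertise of Directors of Photography. It is trained on DataDoP, a large-scale multi-modal dataset of 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions. Combining text guidance with RGBD inputs, GenDoP generates high-quality, context-aware camera movements with better controllability, finer-grained trajectory adjustments, and higher motion stability. The authors argue that the approach sets a new standard for learning-based cinematography and opens the way for future advances in camera control and filmmaking.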

Link: https://arxiv.org/abs/2504.07083
Authors: Mengchen Zhang, Tong Wu, Jing Tan, Ziwei Liu, Gordon Wetzstein, Dahua Lin
Affiliations: Zhejiang University; Shanghai Artificial Intelligence Laboratory; Stanford University; The Chinese University of Hong Kong; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Camera trajectory design plays a crucial role in video production, serving as a fundamental tool for conveying directorial intent and enhancing visual storytelling. In cinematography, Directors of Photography meticulously craft camera movements to achieve expressive and intentional framing. However, existing methods for camera trajectory generation remain limited: Traditional approaches rely on geometric optimization or handcrafted procedural systems, while recent learning-based methods often inherit structural biases or lack textual alignment, constraining creative synthesis. In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. We first introduce DataDoP, a large-scale multi-modal dataset containing 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions in specific movements, interaction with the scene, and directorial intent. Thanks to the comprehensive and diverse database, we further train an auto-regressive, decoder-only Transformer for high-quality, context-aware camera movement generation based on text guidance and RGBD inputs, named GenDoP. Extensive experiments demonstrate that compared to existing methods, GenDoP offers better controllability, finer-grained trajectory adjustments, and higher motion stability. We believe our approach establishes a new standard for learning-based cinematography, paving the way for future advancements in camera control and filmmaking. Our project website: this https URL.

[CV-3] Detecting AI-generated Artwork

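[Quick Read]: This paper addresses the problem of telling AI-generated artwork apart from human-created artwork. The approach is to evaluate how well a range of Machine Learning (ML) and Deep Learning (DL) models perform on this task, focusing on three challenging artistic styles: baroque, cubism, and expressionism. The models tested are Logistic Regression (LR), Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Convolutional Neural Network (CNN). The best results are a multiclass accuracy of 0.8208 over six classes and a binary classification accuracy of 0.9758. According to the summary, these results point to a clear advantage for deep learning-based methods on this kind of style-sensitive discrimination task.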

Link: https://arxiv.org/abs/2504.07078
Authors: Meien Li, Mark Stamp
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The high efficiency and quality of artwork generated by Artificial Intelligence (AI) has created new concerns and challenges for human artists. In particular, recent improvements in generative AI have made it difficult for people to distinguish between human-generated and AI-generated art. In this research, we consider the potential utility of various types of Machine Learning (ML) and Deep Learning (DL) models in distinguishing AI-generated artwork from human-generated artwork. We focus on three challenging artistic styles, namely, baroque, cubism, and expressionism. The learning models we test are Logistic Regression (LR), Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Convolutional Neural Network (CNN). Our best experimental results yield a multiclass accuracy of 0.8208 over six classes, and an impressive accuracy of 0.9758 for the binary classification problem of distinguishing AI-generated from human-generated art.

[CV-4] Teaching pathology foundation models to accurately predict gene expression with parameter efficient knowledge transfer

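[Quick Read]: This paper addresses the prediction of gene expression directly from digital pathology images, a task on which current image foundation models still perform poorly. The key contribution is a new framework called Parameter Efficient Knowledge trAnsfer (PEKA), which combines Block-Affine Adaptation with knowledge distillation and structure alignment losses to transfer knowledge across modalities. This design reduces the cost of domain adaptation and substantially improves gene expression prediction. In tests on multiple spatial transcriptomics datasets, PEKA improves on baseline foundation models by at least 5% and also outperforms other parameter-efficient fine-tuning strategies.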

链接: https://arxiv.org/abs/2504.07061
作者: Shi Pan,Jianan Chen,Maria Secrier
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Gene expression profiling provides critical insights into cellular heterogeneity, biological processes and disease mechanisms. There has been an increasing interest in computational approaches that can predict gene expression directly from digitized histopathology images. While image foundation models have shown promise in a variety of downstream pathology analyses, their performance on gene-expression prediction is still limited. Explicitly incorporating information from transcriptomic models can help image models to address domain shift, yet the fine-tuning and alignment of foundation models can be expensive. In this work, we propose Parameter Efficient Knowledge trAnsfer (PEKA), a novel framework that leverages Block-Affine Adaptation and integrates knowledge distillation and structure alignment losses for cross-modal knowledge transfer. We evaluated PEKA for gene expression prediction using multiple spatial transcriptomics datasets (comprising 206,123 image tiles with matched gene expression profiles) that encompassed various types of tissue. PEKA achieved at least 5% performance improvement over baseline foundation models while also outperforming alternative parameter-efficient fine-tuning strategies. We will release the code, datasets and aligned models after peer-review to facilitate broader adoption and further development for parameter efficient model alignment.

[CV-5] Generalized Semantic Contrastive Learning via Embedding Side Information for Few-Shot Object Detection

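Quick read: This paper addresses two core challenges in Few-Shot Object Detection (FSOD): (1) how to build a generalized feature space for novel categories from limited data so that the detector adapts to unknown scenarios, and (2) the feature confusion and overfitting caused by too few novel-category samples. The authors propose a generalized feature representation learning method based on side information. The key steps are to use embedded side information to construct a knowledge matrix that quantifies the semantic relationship between base and novel categories, to strengthen discrimination between semantically similar categories with contextual semantic supervised contrastive learning, and to add a side-information-guided region-aware masked module that uses counterfactual explanation to find and discard biased information that discriminates between similar categories, further refining the discriminative representation space.

Link: https://arxiv.org/abs/2504.07060
Authors: Ruoyu Chen, Hua Zhang, Jingzhi Li, Li Liu, Zhen Huang, Xiaochun Cao
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; College of Electronic Science and Technology, National University of Defense Technology; College of Computer, National University of Defense Technology; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by T-PAMI (IEEE Transactions on Pattern Analysis and Machine Intelligence)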

Abstract:The objective of few-shot object detection (FSOD) is to detect novel objects with few training samples. The core challenge of this task is how to construct a generalized feature space for novel categories with limited data on the basis of the base category space, which could adapt the learned detection model to unknown scenarios. However, limited by insufficient samples for novel categories, two issues still exist: (1) the features of the novel category are easily implicitly represented by the features of the base category, leading to inseparable classifier boundaries, (2) novel categories with fewer data are not enough to fully represent the distribution, where the model fine-tuning is prone to overfitting. To address these issues, we introduce the side information to alleviate the negative influences derived from the feature space and sample viewpoints and formulate a novel generalized feature representation learning method for FSOD. Specifically, we first utilize embedding side information to construct a knowledge matrix to quantify the semantic relationship between the base and novel categories. Then, to strengthen the discrimination between semantically similar categories, we further develop contextual semantic supervised contrastive learning which embeds side information. Furthermore, to prevent overfitting problems caused by sparse samples, a side-information guided region-aware masked module is introduced to augment the diversity of samples, which finds and abandons biased information that discriminates between similar categories via counterfactual explanation, and refines the discriminative representation space further. Extensive experiments using ResNet and ViT backbones on PASCAL VOC, MS COCO, LVIS V1, FSOD-1K, and FSVOD-500 benchmarks demonstrate that our model outperforms the previous state-of-the-art methods, significantly improving the ability of FSOD in most shots/splits.

[CV-6] Distilling Textual Priors from LLM to Efficient Image Fusion

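Quick read: This paper targets multi-modality image fusion, where traditional approaches such as CNNs and GANs struggle with low-quality or complex inputs, while existing text-guided methods depend on large model priors and incur significant computational overhead. The key contribution is a framework that distills large model priors, so that no text guidance is needed at inference and the model is much smaller. The core is a teacher-student architecture: the teacher network incorporates large model priors and passes this knowledge to a smaller student network through a tailored distillation process. A spatial-channel cross-fusion module further improves the model's use of textual priors across spatial and channel dimensions. The method strikes a favorable balance between computational efficiency and fusion quality: the distilled network needs only 10% of the teacher's parameters and inference time, retains 90% of its performance, and outperforms existing state-of-the-art methods.

Link: https://arxiv.org/abs/2504.07029
Authors: Ran Zhang, Xuanhua He, Ke Cao, Liu Liu, Li Zhang, Man Zhou, Jie Zhang
Affiliations: Hefei University of Technology; University of Science and Technology of China; Hefei Institutes of Physical Science, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: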

Abstract:Multi-modality image fusion aims to synthesize a single, comprehensive image from multiple source inputs. Traditional approaches, such as CNNs and GANs, offer efficiency but struggle to handle low-quality or complex inputs. Recent advances in text-guided methods leverage large model priors to overcome these limitations, but at the cost of significant computational overhead, both in memory and inference time. To address this challenge, we propose a novel framework for distilling large model priors, eliminating the need for text guidance during inference while dramatically reducing model size. Our framework utilizes a teacher-student architecture, where the teacher network incorporates large model priors and transfers this knowledge to a smaller student network via a tailored distillation process. Additionally, we introduce a spatial-channel cross-fusion module to enhance the model’s ability to leverage textual priors across both spatial and channel dimensions. Our method achieves a favorable trade-off between computational efficiency and fusion quality. The distilled network, requiring only 10% of the parameters and inference time of the teacher network, retains 90% of its performance and outperforms existing SOTA methods. Extensive experiments demonstrate the effectiveness of our approach. The implementation will be made publicly available as an open-source resource.

[CV-7] Glossy Object Reconstruction with Cost-effective Polarized Acquisition CVPR2025

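Quick read: This paper addresses image-based 3D reconstruction of glossy objects. The core difficulty is separating the diffuse and specular components of glossy surfaces in captured images, which is complicated by the ambiguity between lighting conditions and material properties when only RGB data is available. State-of-the-art methods typically rely on tailored or high-end acquisition equipment, which is cumbersome and time-consuming. The paper proposes a scalable polarization-aided approach built on cost-effective acquisition tools. Attaching a linear polarizer to an off-the-shelf RGB camera allows multi-view polarization images to be captured without advance calibration or precise measurement of the polarizer angle, which substantially lowers system construction cost. The key idea is to represent the polarimetric BRDF (bidirectional reflectance distribution function), Stokes vectors, and polarization states of the surface as neural implicit fields, and to recover these fields together with the polarizer angle by optimizing a rendering loss on the input polarized images. By grounding the implicit representation in the physics of polarization rendering, the method outperforms existing techniques on public datasets and real captured images for both reconstruction and novel view synthesis.

Link: https://arxiv.org/abs/2504.07025
Authors: Bojian Wu, Yifan Peng, Ruizhen Hu, Xiaowei Zhou
Affiliations: Zhejiang University; The University of Hong Kong; Shenzhen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025 as highlight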

Abstract:The challenge of image-based 3D reconstruction for glossy objects lies in separating diffuse and specular components on glossy surfaces from captured images, a task complicated by the ambiguity in discerning lighting conditions and material properties using RGB data alone. While state-of-the-art methods rely on tailored and/or high-end equipment for data acquisition, which can be cumbersome and time-consuming, this work introduces a scalable polarization-aided approach that employs cost-effective acquisition tools. By attaching a linear polarizer to readily available RGB cameras, multi-view polarization images can be captured without the need for advance calibration or precise measurements of the polarizer angle, substantially reducing system construction costs. The proposed approach represents polarimetric BRDF, Stokes vectors, and polarization states of object surfaces as neural implicit fields. These fields, combined with the polarizer angle, are retrieved by optimizing the rendering loss of input polarized images. By leveraging fundamental physical principles for the implicit representation of polarization rendering, our method demonstrates superiority over existing techniques through experiments in public datasets and real captured images on both reconstruction and novel view synthesis.

[CV-8] Latent Diffusion U-Net Representations Contain Positional Embeddings and Anomalies ICLR2025

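Quick read: This paper examines whether the representations of diffusion models, which are known for synthesizing realistic images, are robust enough for downstream tasks. It analyzes popular Stable Diffusion models using representational similarity and norms and reports three phenomena: (1) a learned positional embedding in intermediate representations, (2) high-similarity corner artifacts, and (3) anomalous high-norm artifacts. These findings indicate that the properties of diffusion model representations need further study before they are used for downstream tasks that require robust features. The paper's main contribution is a systematic representational analysis that exposes these potential issues and gives follow-up work a clear direction.

Link: https://arxiv.org/abs/2504.07008
Authors: Jonas Loos, Lorenz Linhardt
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025 Workshop on Deep Generative Models: Theory, Principle, and Efficacy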

Abstract:Diffusion models have demonstrated remarkable capabilities in synthesizing realistic images, spurring interest in using their representations for various downstream tasks. To better understand the robustness of these representations, we analyze popular Stable Diffusion models using representational similarity and norms. Our findings reveal three phenomena: (1) the presence of a learned positional embedding in intermediate representations, (2) high-similarity corner artifacts, and (3) anomalous high-norm artifacts. These findings underscore the need to further investigate the properties of diffusion model representations before considering them for downstream tasks that require robust features. Project page: this https URL

[CV-9] RayFronts: Open-Set Semantic Ray Frontiers for Online Scene Understanding and Exploration

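Quick read: This paper addresses open-set semantic mapping for open-world robots. Current approaches are either limited by depth range or map beyond-range entities only in constrained settings, so they fail to combine within-range and beyond-range observations; they also trade off fine-grained semantics against efficiency. The proposed solution is RayFronts, a unified representation that enables semantic mapping that is both dense and efficient, within and beyond range. RayFronts encodes task-agnostic open-set semantics into in-range voxels and into beyond-range rays at map boundaries, which helps the robot make informed decisions within and beyond sensing range and significantly reduces search volume, while running at 8.84 Hz on an NVIDIA Orin AGX. For within-range semantics, RayFronts delivers 1.34x the zero-shot 3D semantic segmentation performance with 16.5x higher throughput. The paper also proposes a planner-agnostic evaluation framework that measures utility for online beyond-range search and exploration, and shows that RayFronts reduces search volume 2.2x more efficiently than the closest online baselines.

Link: https://arxiv.org/abs/2504.06994
Authors: Omar Alama, Avigyan Bhattacharya, Haoyang He, Seungchan Kim, Yuheng Qiu, Wenshan Wang, Cherie Ho, Nikhil Keetha, Sebastian Scherer
Affiliations: Carnegie Mellon University
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: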

Abstract:Open-set semantic mapping is crucial for open-world robots. Current mapping approaches are either limited by the depth range or only map beyond-range entities in constrained settings, and overall they fail to combine within-range and beyond-range observations. Furthermore, these methods make a trade-off between fine-grained semantics and efficiency. We introduce RayFronts, a unified representation that enables both dense and beyond-range efficient semantic mapping. RayFronts encodes task-agnostic open-set semantics to both in-range voxels and beyond-range rays encoded at map boundaries, empowering the robot to reduce search volumes significantly and make informed decisions both within and beyond sensory range, while running at 8.84 Hz on an Orin AGX. Benchmarking the within-range semantics shows that RayFronts’s fine-grained image encoding provides 1.34x zero-shot 3D semantic segmentation performance while improving throughput by 16.5x. Traditionally, online mapping performance is entangled with other system components, complicating evaluation. We propose a planner-agnostic evaluation framework that captures the utility for online beyond-range search and exploration, and show RayFronts reduces search volume 2.2x more efficiently than the closest online baselines.

[CV-10] SIGMAN:Scaling 3D Human Gaussian Generation with Millions of Assets

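Quick read: This paper addresses high-quality generation in 3D human digitization, in particular the limitations of existing methods in speed, quality, and occlusion handling, and the scarcity of 3D human assets. The key contribution is a latent space generation paradigm: a UV-structured variational autoencoder (UV-structured VAE) compresses multi-view images into Gaussian representations, and a DiT-based conditional generation model turns the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift, with support for end-to-end inference. In addition, multi-view optimization combined with synthetic data is used to build the HGS-1M dataset of 1 million 3D Gaussian assets for large-scale training. Experiments show that the method generates high-quality 3D human Gaussians with intricate textures, facial details, and loose clothing deformation.

Link: https://arxiv.org/abs/2504.06982
Authors: Yuhang Yang, Fengqi Liu, Yixing Lu, Qin Zhao, Pingyu Wu, Wei Zhai, Ran Yi, Yang Cao, Lizhuang Ma, Zheng-Jun Zha, Junting Dong
Affiliations: USTC; Shanghai AI Lab; SJTU; CMU
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL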

Abstract:3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain primarily constrained by current paradigms and the scarcity of 3D human assets. Specifically, recent approaches fall into several paradigms: optimization-based and feed-forward (both single-view regression and multi-view generation with reconstruction). However, they are limited by slow speed, low quality, cascade reasoning, and ambiguity in mapping low-dimensional planes to high-dimensional space due to occlusion and invisibility, respectively. Furthermore, existing 3D human assets remain small-scale, insufficient for large-scale training. To address these challenges, we propose a latent space generation paradigm for 3D human digitization, which involves compressing multi-view images into Gaussians via a UV-structured VAE, along with DiT-based conditional generation. This transforms the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift and also supports end-to-end inference. In addition, we employ the multi-view optimization approach combined with synthetic data to construct the HGS-1M dataset, which contains 1 million 3D Gaussian assets to support the large-scale training. Experimental results demonstrate that our paradigm, powered by large-scale training, produces high-quality 3D human Gaussians with intricate textures, facial details, and loose clothing deformation.

[CV-11] Wheat3DGS: In-field 3D Reconstruction Instance Segmentation and Phenotyping of Wheat Heads with Gaussian Splatting CVPR

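Quick read: This paper addresses automated extraction of morphological traits of crops such as wheat through High-Throughput Field Phenotyping (HTFP), in particular accurate 3D instance segmentation and measurement of complex structures such as individual wheat heads. Existing methods such as those based on Neural Radiance Fields (NeRF) show promise but have been applied to only a few plants or organs, and they struggle with the occlusions and dense canopies found in field conditions. The key contribution is Wheat3DGS, which combines 3D Gaussian Splatting (3DGS) with the Segment Anything Model (SAM). It exploits the high-quality reconstructions and explicit point-based representation of 3DGS to segment and measure hundreds of wheat heads automatically, and is the first application of 3DGS to HTFP. Experiments show that its extraction of length, width, and volume is more accurate than traditional Multi-View Stereo (MVS) and NeRF-based methods.

Link: https://arxiv.org/abs/2504.06978
Authors: Daiwei Zhang, Joaquin Gajardo, Tomislav Medic, Isinsu Katircioglu, Mike Boss, Norbert Kirchgessner, Achim Walter, Lukas Roth
Affiliations: ETH Zürich; Swiss Data Science Center
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Copyright 2025 IEEE. This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive version is published in the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)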

Abstract:Automated extraction of plant morphological traits is crucial for supporting crop breeding and agricultural management through high-throughput field phenotyping (HTFP). Solutions based on multi-view RGB images are attractive due to their scalability and affordability, enabling volumetric measurements that 2D approaches cannot directly capture. While advanced methods like Neural Radiance Fields (NeRFs) have shown promise, their application has been limited to counting or extracting traits from only a few plants or organs. Furthermore, accurately measuring complex structures like individual wheat heads, which is essential for studying crop yields, remains particularly challenging due to occlusions and the dense arrangement of crop canopies in field conditions. The recent development of 3D Gaussian Splatting (3DGS) offers a promising alternative for HTFP due to its high-quality reconstructions and explicit point-based representation. In this paper, we present Wheat3DGS, a novel approach that leverages 3DGS and the Segment Anything Model (SAM) for precise 3D instance segmentation and morphological measurement of hundreds of wheat heads automatically, representing the first application of 3DGS to HTFP. We validate the accuracy of wheat head extraction against high-resolution laser scan data, obtaining per-instance mean absolute percentage errors of 15.1%, 18.3%, and 40.2% for length, width, and volume. We provide additional comparisons to NeRF-based approaches and traditional Multi-View Stereo (MVS), demonstrating superior results. Our approach enables rapid, non-destructive measurements of key yield-related traits at scale, with significant implications for accelerating crop breeding and improving our understanding of wheat development.
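
The per-instance mean absolute percentage error (MAPE) reported in this abstract can be illustrated with a short sketch. The formula is the standard MAPE definition; the function name and the sample measurements are hypothetical and are not taken from the paper.

```python
def mean_absolute_percentage_error(reference, estimate):
    """Return the MAPE in percent between reference and estimated values.

    Each reference value must be non-zero, because the error of every
    instance is expressed relative to its own reference measurement.
    """
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ref, est in zip(reference, estimate):
        total += abs(est - ref) / abs(ref)
    return 100.0 * total / len(reference)


# Hypothetical wheat-head lengths in cm: laser scan vs. 3D reconstruction.
laser_lengths_cm = [10.0, 8.0, 12.5]
recon_lengths_cm = [11.0, 7.0, 12.5]

mape = mean_absolute_percentage_error(laser_lengths_cm, recon_lengths_cm)
print(f"length MAPE: {mape:.1f}%")  # relative errors 10%, 12.5%, 0% average to about 7.5%
```

Because each error is divided by its own reference value, small structures contribute as much as large ones, which is one reason a volume error can be much larger than a length or width error for the same reconstruction.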

[CV-12] A Deep Single Image Rectification Approach for Pan-Tilt-Zoom Cameras ICME2025

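Quick read: This paper addresses image rectification for wide-angle Pan-Tilt-Zoom (PTZ) cameras, whose lenses introduce inherent nonlinear distortion; existing deep learning methods typically fail to preserve fine-grained geometric detail and therefore rectify inaccurately. The paper proposes the Forward Distortion and Backward Warping Network (FDBW-Net). Its key elements are a forward distortion model that synthesizes barrel-distorted images to reduce pixel redundancy and prevent blur, a pyramid context encoder with attention mechanisms that generates backward warping flows containing geometric detail, and a multi-scale decoder that restores distorted features and outputs the rectified image. Validation on several datasets shows that FDBW-Net achieves state-of-the-art (SOTA) rectification performance and makes PTZ cameras more adaptable to practical vision applications.

Link: https://arxiv.org/abs/2504.06965
Authors: Teng Xiao, Qi Hu, Qingsong Yan, Wei Liu, Zhiwei Ye, Fei Deng
Affiliations: School of Computer Science, Hubei University of Technology, Wuhan, China; Hubei Key Laboratory of Green Intelligent Computing Power Network, Wuhan, China; School of Surveying and Mapping, Wuhan University, Wuhan, China; Wuhan Tianjihang Information Technology Co., Ltd., Wuhan, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICME 2025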

Abstract:Pan-Tilt-Zoom (PTZ) cameras with wide-angle lenses are widely used in surveillance but often require image rectification due to their inherent nonlinear distortions. Current deep learning approaches typically struggle to maintain fine-grained geometric details, resulting in inaccurate rectification. This paper presents a Forward Distortion and Backward Warping Network (FDBW-Net), a novel framework for wide-angle image rectification. It begins by using a forward distortion model to synthesize barrel-distorted images, reducing pixel redundancy and preventing blur. The network employs a pyramid context encoder with attention mechanisms to generate backward warping flows containing geometric details. Then, a multi-scale decoder is used to restore distorted features and output rectified images. FDBW-Net’s performance is validated on diverse datasets: public benchmarks, AirSim-rendered PTZ camera imagery, and real-scene PTZ camera datasets. It demonstrates that FDBW-Net achieves SOTA performance in distortion rectification, boosting the adaptability of PTZ cameras for practical visual applications.
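
To make the forward-distortion and backward-warping idea concrete, here is a minimal sketch assuming a one-parameter polynomial radial model, r_d = r_u (1 + k1 r_u^2), in normalized image coordinates. The abstract does not state which distortion model FDBW-Net uses, so the model, the function names, and the value of k1 are assumptions for illustration only.

```python
def distort_point(x, y, k1):
    """Forward distortion: map an undistorted point to its distorted position.

    Coordinates are normalized so that the image center is (0, 0).
    A negative k1 pulls points toward the center (barrel distortion).
    """
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2
    return x * scale, y * scale


def backward_flow(x, y, k1):
    """Offset from a rectified pixel to the location it samples in the distorted image."""
    xd, yd = distort_point(x, y, k1)
    return xd - x, yd - y


k1 = -0.2  # hypothetical barrel-distortion coefficient

corner = distort_point(0.5, 0.5, k1)  # r^2 = 0.5, so the scale factor is 0.9
center = distort_point(0.0, 0.0, k1)  # the image center does not move
flow = backward_flow(0.5, 0.5, k1)    # points inward, toward the center

print(corner, center, flow)
```

In this sketch a closed-form model produces the flow for a known k1. In the setting the abstract describes, a network predicts the backward warping flow from the distorted image itself, and the rectified image is obtained by sampling the input at the flowed positions.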

[CV-13] Efficient Self-Supervised Learning for Earth Observation via Dynamic Dataset Curation CVPR

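Quick read: This paper addresses the under-explored role of balancing and diversifying pre-training datasets for Self-Supervised Learning (SSL) in Earth Observation (EO). The redundancy and heavy-tailed distributions common in satellite imagery can lead to biased representations and inefficient training. The authors propose a dynamic dataset pruning strategy that improves SSL pre-training by maximizing dataset diversity and balance. The key property of the method is that it iteratively refines the training set without requiring a pre-existing feature extractor, which suits domains where curated datasets are limited or unavailable. Experiments show that the method improves both computational efficiency and representation quality, leading to stronger transferability across tasks.

Link: https://arxiv.org/abs/2504.06962
Authors: Thomas Kerdreux, Alexandre Tuel, Quentin Febvre, Alexis Mouche, Bertrand Chapron
Affiliations: Galeio; Ifremer, UMR CNRS LOPS
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at CVPR Workshop : The First Workshop on Foundation and Large Vision Models in Remote Sensing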

Abstract:Self-supervised learning (SSL) has enabled the development of vision foundation models for Earth Observation (EO), demonstrating strong transferability across diverse remote sensing tasks. While prior work has focused on network architectures and training strategies, the role of dataset curation, especially in balancing and diversifying pre-training datasets, remains underexplored. In EO, this challenge is amplified by the redundancy and heavy-tailed distributions common in satellite imagery, which can lead to biased representations and inefficient training. In this work, we propose a dynamic dataset pruning strategy designed to improve SSL pre-training by maximizing dataset diversity and balance. Our method iteratively refines the training set without requiring a pre-existing feature extractor, making it well-suited for domains where curated datasets are limited or unavailable. We demonstrate our approach on the Sentinel-1 Wave Mode (WV) Synthetic Aperture Radar (SAR) archive, a challenging dataset dominated by ocean observations. We train models from scratch on the entire Sentinel-1 WV archive spanning 10 years. Across three downstream tasks, our results show that dynamic pruning improves both computational efficiency and representation quality, leading to stronger transferability. We also release the weights of Nereus-SAR-1, the first model in the Nereus family, a series of foundation models for ocean observation and analysis using SAR imagery, at this http URL.

[CV-14] Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation CVPR2025

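Quick read: Existing benchmarks and datasets focus mainly on assembling geometric fragments or factory parts and do not capture the complexity of everyday object interaction and assembly. This paper introduces 2BY2, a large-scale annotated dataset covering 18 fine-grained tasks that reflect real-life scenarios (such as plugging into sockets, arranging flowers in vases, and inserting bread into toasters), with 1,034 instances and 517 pairwise objects annotated with pose and symmetry. To handle these tasks, the paper proposes a two-step SE(3) pose estimation method with equivariant features to satisfy assembly constraints. The approach aligns geometric shapes while accounting for functional and spatial relationships between objects; it achieves state-of-the-art performance on all 18 tasks of the 2BY2 dataset, and robot experiments further validate its reliability and its generalization to complex 3D assembly tasks.

Link: https://arxiv.org/abs/2504.06961
Authors: Yu Qi, Yuanchen Ju, Tianming Wei, Chi Chu, Lawson L.S. Wong, Huazhe Xu
Affiliations: Shanghai Qi Zhi Institute; Northeastern University; IIIS, Tsinghua University; Shanghai Jiao Tong University; Shanghai AI Laboratory
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025 (Conference on Computer Vision and Pattern Recognition)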

Abstract:3D assembly tasks, such as furniture assembly and component fitting, play a crucial role in daily life and represent essential capabilities for future home robots. Existing benchmarks and datasets predominantly focus on assembling geometric fragments or factory parts, which fall short in addressing the complexities of everyday object interactions and assemblies. To bridge this gap, we present 2BY2, a large-scale annotated dataset for daily pairwise objects assembly, covering 18 fine-grained tasks that reflect real-life scenarios, such as plugging into sockets, arranging flowers in vases, and inserting bread into toasters. 2BY2 dataset includes 1,034 instances and 517 pairwise objects with pose and symmetry annotations, requiring approaches that align geometric shapes while accounting for functional and spatial relationships between objects. Leveraging the 2BY2 dataset, we propose a two-step SE(3) pose estimation method with equivariant features for assembly constraints. Compared to previous shape assembly methods, our approach achieves state-of-the-art performance across all 18 tasks in the 2BY2 dataset. Additionally, robot experiments further validate the reliability and generalization ability of our method for complex 3D assembly tasks.

[CV-15] VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

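Quick read: This paper aims to improve the spatio-temporal perception of video multimodal large language models (video MLLMs) while preserving their general chat ability. Approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms work well in the text and image domains, but their application to video understanding has been limited. The key contribution is a systematic study of Reinforcement Fine-Tuning (RFT) with GRPO, which uses task-specific data efficiently to strengthen spatio-temporal perception in video MLLMs. Experiments show that RFT yields clear gains on spatio-temporal perception tasks from a small number of samples. The resulting VideoChat-R1 model reaches state-of-the-art performance on spatio-temporal perception tasks and also does well on QA benchmarks, demonstrating the potential of the approach for specialized task enhancement.

Link: https://arxiv.org/abs/2504.06958
Authors: Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, Limin Wang
Affiliations: Shanghai AI Laboratory; Nanjing University; Zhejiang University; University of Science and Technology of China; Shanghai Innovation Institute; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; OpenGVLab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: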

Abstract:Recent advancements in reinforcement learning have significantly advanced the reasoning capabilities of multimodal large language models (MLLMs). While approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms demonstrate promise in text and image domains, their application to video understanding remains limited. This paper presents a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, aiming to enhance spatio-temporal perception while maintaining general capabilities. Our experiments reveal that RFT is highly data-efficient for task-specific improvements. Through multi-task RFT on spatio-temporal perception objectives with limited samples, we develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal perception tasks without sacrificing chat ability, while exhibiting emerging spatio-temporal reasoning abilities. Compared to Qwen2.5-VL-7B, VideoChat-R1 boosts performance several-fold in tasks like temporal grounding (+31.8) and object tracking (+31.2). Additionally, it significantly improves on general QA benchmarks such as VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9). Our findings underscore the potential of RFT for specialized task enhancement of Video MLLMs. We hope our work offers valuable insights for future RL research in video MLLMs.
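
The group-relative part of GRPO can be sketched in a few lines: several responses are sampled for the same prompt, each is scored by a rule-based reward, and each reward is normalized against the mean and standard deviation of its own group. The temporal IoU reward below is an assumed example of a rule-based reward for temporal grounding; the abstract does not specify the reward functions, and all names and numbers here are hypothetical.

```python
import statistics


def temporal_iou(pred, target):
    """Intersection over union of two (start, end) time intervals in seconds."""
    intersection = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - intersection
    return intersection / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and population std of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group: three sampled answers to one temporal grounding query.
target_span = (10.0, 20.0)
sampled_spans = [(10.0, 20.0), (15.0, 25.0), (30.0, 40.0)]

rewards = [temporal_iou(span, target_span) for span in sampled_spans]
advantages = group_relative_advantages(rewards)

for span, reward, advantage in zip(sampled_spans, rewards, advantages):
    print(span, round(reward, 3), round(advantage, 3))
```

Responses that score above the group average receive a positive advantage and are reinforced, and those below it are discouraged. No separate value network is needed, since the group itself serves as the baseline. The full GRPO objective also includes a clipped policy ratio and a KL term, which this sketch omits.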

[CV-16] A Comparison of Deep Learning Methods for Cell Detection in Digital Cytology

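Quick read: This paper addresses accurate and efficient cell detection in Papanicolaou-stained cytological Whole Slide Images (WSIs). It evaluates several deep learning methods on detection accuracy and computational efficiency, with a focus on comparing centroid-based and segmentation-based methods. The main finding is that the centroid-based Improved Fully Convolutional Regression Network (IFCRN) combines higher detection accuracy with lower computational requirements, which is a clear advantage in resource-limited environments. The study also examines how dataset size and data augmentation affect model performance, and proposes an evaluation metric based on the distance between predicted and ground-truth positions to measure detection accuracy more precisely.

Link: https://arxiv.org/abs/2504.06957
Authors: Marco Acerbis, Nataša Sladoje, Joakim Lindblad
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 6 figures, SCIA2025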

Abstract:Accurate and efficient cell detection is crucial in many biomedical image analysis tasks. We evaluate the performance of several Deep Learning (DL) methods for cell detection in Papanicolaou-stained cytological Whole Slide Images (WSIs), focusing on accuracy of predictions and computational efficiency. We examine recent off-the-shelf algorithms as well as custom-designed detectors, applying them to two datasets: the CNSeg Dataset and the Oral Cancer (OC) Dataset. Our comparison includes well-established segmentation methods such as StarDist, Cellpose, and the Segment Anything Model 2 (SAM2), alongside centroid-based Fully Convolutional Regression Network (FCRN) approaches. We introduce a suitable evaluation metric to assess the accuracy of predictions based on the distance from ground truth positions. We also explore the impact of dataset size and data augmentation techniques on model performance. Results show that centroid-based methods, particularly the Improved Fully Convolutional Regression Network (IFCRN) method, outperform segmentation-based methods in terms of both detection accuracy and computational efficiency. This study highlights the potential of centroid-based detectors as a preferred option for cell detection in resource-limited environments, offering faster processing times and lower GPU memory usage without compromising accuracy.
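
A distance-based detection metric of the kind mentioned in this abstract can be sketched as follows. This is one common formulation, assumed here for illustration: predicted centroids are matched one-to-one to ground-truth centroids within a distance threshold, and precision, recall, and F1 are computed from the matches. The paper's actual metric is not given in the abstract, and the function name, threshold, and coordinates are hypothetical.

```python
import math


def evaluate_centroids(predicted, ground_truth, max_distance):
    """Greedy one-to-one matching of centroids, closest pairs first.

    Returns (precision, recall, f1). A prediction counts as a true positive
    only if it is matched to a ground-truth point within max_distance.
    """
    candidates = []
    for i, p in enumerate(predicted):
        for j, g in enumerate(ground_truth):
            distance = math.dist(p, g)
            if distance <= max_distance:
                candidates.append((distance, i, j))
    candidates.sort()

    used_pred, used_gt = set(), set()
    for _, i, j in candidates:
        if i not in used_pred and j not in used_gt:
            used_pred.add(i)
            used_gt.add(j)

    true_pos = len(used_pred)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical cell centroids in pixels.
ground_truth = [(10, 10), (50, 50), (90, 20)]
predicted = [(12, 11), (48, 53), (200, 200), (13, 9)]

precision, recall, f1 = evaluate_centroids(predicted, ground_truth, max_distance=5)
print(precision, recall, f1)
```

In this example two predictions compete for the cell at (10, 10); only the closer one is matched, so the duplicate counts as a false positive, and the cell at (90, 20) is missed. Greedy matching is simple but not guaranteed to be optimal; an optimal assignment (for example, the Hungarian algorithm) is an alternative.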

[CV-17] PathSegDiff: Pathology Segmentation using Diffusion model representations

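Quick read: This paper addresses the limited performance of traditional methods for histopathology image segmentation. Those methods train a lightweight prediction model from a pre-trained feature extractor and paired labeled data (images and masks), so performance depends heavily on the choice of feature extractor, and existing work has concentrated on finding better pre-training tasks for the extractor. The paper proposes PathSegDiff, whose key idea is to use a domain-specific Latent Diffusion Model (LDM) as the pre-trained feature extractor. A pathology-specific LDM guided by a self-supervised encoder extracts rich semantic information from H&E-stained histopathology images, and a simple fully convolutional network processes these features to produce segmentation masks. Experiments show significant improvements over traditional methods on the BCSS and GlaS datasets, demonstrating that domain-specific diffusion pre-training is effective at capturing intricate tissue structures and improving segmentation accuracy.

Link: https://arxiv.org/abs/2504.06950
Authors: Sachin Kumar Danisetty, Alexandros Graikos, Srikar Yellapragada, Dimitris Samaras
Affiliations: Stony Brook University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: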

Abstract:Image segmentation is crucial in many computational pathology pipelines, including accurate disease diagnosis, subtyping, outcome, and survivability prediction. The common approach for training a segmentation model relies on a pre-trained feature extractor and a dataset of paired image and mask annotations. These are used to train a lightweight prediction model that translates features into per-pixel classes. The choice of the feature extractor is central to the performance of the final segmentation model, and recent literature has focused on finding tasks to pre-train the feature extractor. In this paper, we propose PathSegDiff, a novel approach for histopathology image segmentation that leverages Latent Diffusion Models (LDMs) as pre-trained feature extractors. Our method utilizes a pathology-specific LDM, guided by a self-supervised encoder, to extract rich semantic information from H&E-stained histopathology images. We employ a simple, fully convolutional network to process the features extracted from the LDM and generate segmentation masks. Our experiments demonstrate significant improvements over traditional methods on the BCSS and GlaS datasets, highlighting the effectiveness of domain-specific diffusion pre-training in capturing intricate tissue structures and enhancing segmentation accuracy in histopathology images.

[CV-18] Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition CVPR

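Quick read: This paper addresses automatic dietary assessment from food images, which requires precise food detection, segmentation, and classification. It evaluates how well Vision-Language Models (VLMs), which integrate visual and textual reasoning, recognize food at different levels, using six state-of-the-art closed-source and open-source VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA). To support the evaluation, the study introduces FoodNExTDB, a food image database of 9,263 expert-labeled images across 10 categories (e.g., “protein source”), 62 subcategories (e.g., “poultry”), and 9 cooking styles (e.g., “grilled”), with 50k nutritional labels produced manually by seven experts. It also proposes a new evaluation metric, Expert-Weighted Recall (EWR), that accounts for inter-annotator variability. Closed-source models outperform open-source ones and exceed 90% EWR on images containing a single product, but current VLMs still struggle with fine-grained food recognition, especially subtle differences in cooking styles and visually similar foods, which limits their reliability for automatic dietary assessment.

Link: https://arxiv.org/abs/2504.06925
Authors: Sergio Romero-Tapiador, Ruben Tolosana, Blanca Lacruz-Pleguezuelos, Laura Judith Marcos Zambrano, Guadalupe X.Bazán, Isabel Espinosa-Salinas, Julian Fierrez, Javier Ortega-Garcia, Enrique Carrillo de Santa Pau, Aythami Morales
Affiliations: Biometrics and Data Pattern Analytics Lab, Universidad Autonoma de Madrid, Madrid, Spain; IMDEA Food, CEI UAM+CSIC, Madrid, Spain
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE/CVF Computer Vision and Pattern Recognition Conference workshops 2025 (CVPRw) 10 pages, 4 figures, 2 tables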

Abstract:Automatic dietary assessment based on food images remains a challenge, requiring precise food detection, segmentation, and classification. Vision-Language Models (VLMs) offer new possibilities by integrating visual and textual reasoning. In this study, we evaluate six state-of-the-art VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA), analyzing their capabilities in food recognition at different levels. For the experimental framework, we introduce the FoodNExTDB, a unique food image database that contains 9,263 expert-labeled images across 10 categories (e.g., “protein source”), 62 subcategories (e.g., “poultry”), and 9 cooking styles (e.g., “grilled”). In total, FoodNExTDB includes 50k nutritional labels generated by seven experts who manually annotated all images in the database. Also, we propose a novel evaluation metric, Expert-Weighted Recall (EWR), that accounts for the inter-annotator variability. Results show that closed-source models outperform open-source ones, achieving over 90% EWR in recognizing food products in images containing a single product. Despite their potential, current VLMs face challenges in fine-grained food recognition, particularly in distinguishing subtle differences in cooking styles and visually similar food items, which limits their reliability for automatic dietary assessment. The FoodNExTDB database is publicly available at this https URL.

[CV-19] S-EO: A Large-Scale Dataset for Geometry-Aware Shadow Detection in Remote Sensing Applications CVPR

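Quick read: This paper tackles geometry-aware shadow detection in remote sensing imagery and builds a large-scale, high-resolution dataset (S-EO) as the foundation for it. The key contribution is a dataset that combines multi-date, multi-angle WorldView-3 pansharpened RGB images, panchromatic images, and ground-truth DSMs obtained from LiDAR scans, together with shadow masks derived from geometry and sun position, vegetation masks based on the NDVI index, and bundle-adjusted RPC models. Using this dataset, the authors show that a trained shadow detector generalizes well, and they apply its predictions to improve NeRF-based 3D reconstruction, advancing geometry-aware shadow detection and its applications in remote sensing.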

Link: https://arxiv.org/abs/2504.06920
Authors: Masquil Elías, Marí Roger, Ehret Thibaud, Meinhardt-Llopis Enric, Musé Pablo, Facciolo Gabriele
Affiliations: IIE, Facultad de Ingeniería, Universidad de la República; Digital Sense; Eurecat, Centre Tecnològic de Catalunya, Multimedia Technologies; AMIAD, Pôle Recherche; Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, 91190, Gif-sur-Yvette, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Earthvision 2025 (CVPR Workshop)

Abstract:We introduce the S-EO dataset: a large-scale, high-resolution dataset, designed to advance geometry-aware shadow detection. Collected from diverse public-domain sources, including challenge datasets and government providers such as USGS, our dataset comprises 702 georeferenced tiles across the USA, each covering 500x500 m. Each tile includes multi-date, multi-angle WorldView-3 pansharpened RGB images, panchromatic images, and a ground-truth DSM of the area obtained from LiDAR scans. For each image, we provide a shadow mask derived from geometry and sun position, a vegetation mask based on the NDVI index, and a bundle-adjusted RPC model. With approximately 20,000 images, the S-EO dataset establishes a new public resource for shadow detection in remote sensing imagery and its applications to 3D reconstruction. To demonstrate the dataset’s impact, we train and evaluate a shadow detector, showcasing its ability to generalize, even to aerial images. Finally, we extend EO-NeRF - a state-of-the-art NeRF approach for satellite imagery - to leverage our shadow predictions for improved 3D reconstructions.
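
The vegetation masks mentioned in the abstract are based on NDVI, which is conventionally computed from the near-infrared and red bands as (NIR - Red) / (NIR + Red). A minimal sketch of such a mask follows. The 0.3 threshold, the function names, and the sample reflectance values are assumptions for illustration; the abstract does not say how the S-EO masks were thresholded.

```python
def ndvi(nir, red, eps=1e-9):
    """Normalized Difference Vegetation Index for a single pixel."""
    # eps avoids division by zero on pixels where both bands are 0.
    return (nir - red) / (nir + red + eps)


def vegetation_mask(nir_band, red_band, threshold=0.3):
    """Return a 2D boolean mask that is True where NDVI exceeds the threshold.

    The threshold is a hypothetical value chosen for this example.
    """
    return [
        [ndvi(n, r) > threshold for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Tiny 2x2 example with made-up reflectance values.
nir = [[0.60, 0.20], [0.55, 0.10]]
red = [[0.10, 0.18], [0.08, 0.09]]
mask = vegetation_mask(nir, red)
print(mask)
```

Pixels with high near-infrared and low red reflectance (the left column here) are flagged as vegetation, while pixels where the two bands are nearly equal are not.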

[CV-20] An Analysis of Temporal Dropout in Earth Observation Time Series for Regression Tasks

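Quick read: This paper addresses the significant challenge that missing instances in time series data pose to deep learning models, particularly in regression tasks. In Earth Observation, satellite failures or cloud occlusion often cause missing time-steps, which introduces uncertainty and degrades predictive performance. Although many studies handle missing time-steps through data augmentation to improve robustness, the uncertainty arising at the input level is commonly overlooked. To fill this gap, the paper proposes Monte Carlo Temporal Dropout (MC-TD), which explicitly accounts for input-level uncertainty by randomly dropping time-steps during inference at a predefined dropout ratio, simulating the effect of missing data. To avoid a costly search for the optimal dropout ratio, it further introduces Monte Carlo Concrete Temporal Dropout (MC-ConcTD), which learns the optimal dropout distribution directly. Both methods use Monte Carlo sampling at inference time for uncertainty quantification. Experiments show that MC-ConcTD improves predictive performance and uncertainty calibration over existing approaches, and highlight the advantage of adaptive dropout tuning over manual selection, making uncertainty quantification more robust and accessible for Earth Observation applications.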

Link: https://arxiv.org/abs/2504.06915
Authors: Miro Miranda, Francisco Mena, Andreas Dengel
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Symposium on Intelligent Data Analysis (IDA 2025)

Abstract:Missing instances in time series data pose a significant challenge to deep learning models, particularly in regression tasks. In the Earth Observation field, satellite failure or cloud occlusion frequently results in missing time-steps, introducing uncertainties in the predicted output and causing a decline in predictive performance. While many studies address missing time-steps through data augmentation to improve model robustness, the uncertainty arising at the input level is commonly overlooked. To address this gap, we introduce Monte Carlo Temporal Dropout (MC-TD), a method that explicitly accounts for input-level uncertainty by randomly dropping time-steps during inference using a predefined dropout ratio, thereby simulating the effect of missing data. To bypass the need for costly searches for the optimal dropout ratio, we extend this approach with Monte Carlo Concrete Temporal Dropout (MC-ConcTD), a method that learns the optimal dropout distribution directly. Both MC-TD and MC-ConcTD are applied during inference, leveraging Monte Carlo sampling for uncertainty quantification. Experiments on three EO time-series datasets demonstrate that MC-ConcTD improves predictive performance and uncertainty calibration compared to existing approaches. Additionally, we highlight the advantages of adaptive dropout tuning over manual selection, making uncertainty quantification more robust and accessible for EO applications.
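
The MC-TD idea described above (drop random time-steps at inference, repeat, and read uncertainty off the spread of predictions) can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a trained regressor, and the dropout ratio, sample count, and input series are made up. It is not the authors' implementation, and it does not cover MC-ConcTD, which learns the dropout distribution.

```python
import random
import statistics


def toy_model(series, mask):
    """Hypothetical regressor: the mean of the time-steps that were kept."""
    kept = [x for x, keep in zip(series, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


def mc_temporal_dropout(series, drop_ratio=0.3, n_samples=200, seed=0):
    """Run n_samples stochastic forward passes with random time-step dropout.

    Returns the mean prediction and its standard deviation, the latter
    serving as a simple uncertainty estimate.
    """
    rng = random.Random(seed)
    preds = []
    for _ in range(n_samples):
        # Each time-step is dropped independently with probability drop_ratio.
        mask = [rng.random() >= drop_ratio for _ in series]
        preds.append(toy_model(series, mask))
    return statistics.mean(preds), statistics.pstdev(preds)


series = [0.2, 0.4, 0.5, 0.7, 0.9, 0.6, 0.3]
mean_pred, uncertainty = mc_temporal_dropout(series)
print(mean_pred, uncertainty)
```

With `drop_ratio=0.0` every pass sees the full series, so the spread collapses to zero; larger ratios widen it, which is why the choice of ratio matters and why learning it is attractive.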

[CV-21] UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation

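Quick read: This paper addresses the difficulty of collecting large-scale labeled data in medical imaging, which stems from privacy concerns, logistics, and high labeling costs. Its key contribution is UK Biobank Organs and Bones (UKBOB), a large-scale labeled dataset comprising 51,761 MRI 3D samples (equivalent to 17.9 million 2D images) and more than 1.37 billion 2D segmentation masks of 72 organs. To ensure label quality, the authors use automatic labeling, design an automated label cleaning pipeline with organ-specific filters, and manually annotate a subset of 300 MRIs (called UKBOB-manual) to validate the labels. To further mitigate the effect of noisy labels, they propose Entropy Test-time Adaptation (ETTA) to refine the segmentation output. Swin-BOB, a model trained on UKBOB, achieves state-of-the-art results on several 3D medical imaging benchmarks, including the BRATS brain MRI tumor challenge and the BTCV abdominal CT scan benchmark. The core contribution is thus the construction of a large, high-quality dataset through automated labeling and label cleaning, and its effective use for medical image segmentation.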

Link: https://arxiv.org/abs/2504.06908
Authors: Emmanuelle Bourigault, Amir Jamaludin, Abdullah Hamdi
Affiliations: Visual Geometry Group, University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: preprint

Abstract:In medical imaging, the primary challenge is collecting large-scale labeled data due to privacy concerns, logistics, and high labeling costs. In this work, we present the UK Biobank Organs and Bones (UKBOB), the largest labeled dataset of body organs, comprising 51,761 MRI 3D samples (equivalent to 17.9 million 2D images) and more than 1.37 billion 2D segmentation masks of 72 organs, all based on the UK Biobank MRI dataset. We utilize automatic labeling, introduce an automated label cleaning pipeline with organ-specific filters, and manually annotate a subset of 300 MRIs with 11 abdominal classes to validate the quality (referred to as UKBOB-manual). This approach allows for scaling up the dataset collection while maintaining confidence in the labels. We further confirm the validity of the labels by demonstrating zero-shot generalization of trained models on the filtered UKBOB to other small labeled datasets from similar domains (e.g., abdominal MRI). To further mitigate the effect of noisy labels, we propose a novel method called Entropy Test-time Adaptation (ETTA) to refine the segmentation output. We use UKBOB to train a foundation model, Swin-BOB, for 3D medical image segmentation based on the Swin-UNetr architecture, achieving state-of-the-art results in several benchmarks in 3D medical imaging, including the BRATS brain MRI tumor challenge (with a 0.4% improvement) and the BTCV abdominal CT scan benchmark (with a 1.3% improvement). The pre-trained models and the code are available at this https URL, and the filtered labels will be made available with the UK Biobank.
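
The abstract does not describe how ETTA works internally. As background only, entropy-based test-time adaptation methods in general treat the Shannon entropy of the predicted class distribution as a confidence signal: confident voxels have low entropy, ambiguous ones have high entropy. The sketch below computes that quantity for a single voxel; the logits are made up, and nothing here should be read as the authors' procedure.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical per-voxel logits over three classes.
confident = entropy(softmax([8.0, 0.5, 0.1]))  # one class dominates
ambiguous = entropy(softmax([1.0, 1.0, 1.0]))  # uniform, equals ln(3)
print(confident, ambiguous)
```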

[CV-22] MedSegFactory: Text-Guided Generation of Medical Image-Mask Pairs

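Quick read: This paper addresses data scarcity in medical imaging and the difficulty of obtaining paired image-mask annotations. It proposes MedSegFactory, a versatile medical synthesis framework that uses a dual-stream diffusion model to generate high-quality medical images and their corresponding segmentation masks across modalities and tasks. The key component is Joint Cross-Attention (JCA), which aligns each image-mask pair precisely through dynamic cross-conditioning between the two streams, with the bidirectional interaction improving the consistency of generated pairs. MedSegFactory also supports on-demand generation from user-defined prompts that specify target labels, imaging modalities, anatomical regions, and pathological conditions, enabling scalable, high-quality data generation. Experiments show that the method achieves competitive or state-of-the-art performance on 2D and 3D segmentation tasks while addressing data scarcity and regulatory constraints.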

Link: https://arxiv.org/abs/2504.06897
Authors: Jiawei Mao, Yuhan Wang, Yucheng Tang, Daguang Xu, Kang Wang, Yang Yang, Zongwei Zhou, Yuyin Zhou
Affiliations: UC Santa Cruz; NVIDIA; UC San Francisco; Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 8 figures. The project page can be accessed via this https URL

Abstract:This paper presents MedSegFactory, a versatile medical synthesis framework that generates high-quality paired medical images and segmentation masks across modalities and tasks. It aims to serve as an unlimited data repository, supplying image-mask pairs to enhance existing segmentation tools. The core of MedSegFactory is a dual-stream diffusion model, where one stream synthesizes medical images and the other generates corresponding segmentation masks. To ensure precise alignment between image-mask pairs, we introduce Joint Cross-Attention (JCA), enabling a collaborative denoising paradigm by dynamic cross-conditioning between streams. This bidirectional interaction allows both representations to guide each other’s generation, enhancing consistency between generated pairs. MedSegFactory unlocks on-demand generation of paired medical images and segmentation masks through user-defined prompts that specify the target labels, imaging modalities, anatomical regions, and pathological conditions, facilitating scalable and high-quality data generation. This new paradigm of medical image synthesis enables seamless integration into diverse medical imaging workflows, enhancing both efficiency and accuracy. Extensive experiments show that MedSegFactory generates data of superior quality and usability, achieving competitive or state-of-the-art performance in 2D and 3D segmentation tasks while addressing data scarcity and regulatory constraints.

[CV-23] ColorizeDiffusion v2: Enhancing Reference-based Sketch Colorization Through Separating Utilities

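Quick read: This paper addresses a limitation of reference-based sketch colorization methods in practice, namely the distribution mismatch between training data and inference. Existing methods are typically trained on triplets of sketch, reference, and ground truth that are semantically and spatially well-aligned, whereas real-world sketches and references often exhibit substantial spatial and semantic misalignment. This mismatch leads to overfitting, produces spatial artifacts, and significantly degrades overall colorization quality, which limits the general-purpose use of current methods.

The key solution is a novel workflow that dynamically adapts the "carrier", defined as the latent representation that facilitates information transfer from reference to sketch, to optimize distinct aspects of colorization. For artifacts caused by spatial misalignment, the paper introduces a split cross-attention mechanism with spatial masks, which enables region-specific reference injection within the diffusion process. To mitigate semantic neglect of sketches, it uses dedicated background and style encoders to transfer detailed reference information in the latent feature space, improving spatial control and the richness of synthesized detail. It also proposes character-mask merging and background bleaching as preprocessing steps to improve foreground-background integration and background generation. Experiments show that the method outperforms existing approaches in qualitative and quantitative evaluations, including a user study, and an ablation study validates the contribution of each component.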

Link: https://arxiv.org/abs/2504.06895
Authors: Dingkun Yan, Xinrui Wang, Yusuke Iwasawa, Yutaka Matsuo, Suguru Saito, Jiaxian Guo
Affiliations: Institute of Science Tokyo School of Computing; University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reference-based sketch colorization methods have garnered significant attention due to their potential applications in the animation production industry. However, most existing methods are trained with image triplets of sketch, reference, and ground truth that are semantically and spatially well-aligned, while real-world references and sketches often exhibit substantial misalignment. This mismatch in data distribution between training and inference leads to overfitting, consequently resulting in spatial artifacts and significant degradation in overall colorization quality, limiting potential applications of current methods for general purposes. To address this limitation, we conduct an in-depth analysis of the carrier, defined as the latent representation facilitating information transfer from reference to sketch. Based on this analysis, we propose a novel workflow that dynamically adapts the carrier to optimize distinct aspects of colorization. Specifically, for spatially misaligned artifacts, we introduce a split cross-attention mechanism with spatial masks, enabling region-specific reference injection within the diffusion process. To mitigate semantic neglect of sketches, we employ dedicated background and style encoders to transfer detailed reference information in the latent feature space, achieving enhanced spatial control and richer detail synthesis. Furthermore, we propose character-mask merging and background bleaching as preprocessing steps to improve foreground-background integration and background generation. Extensive qualitative and quantitative evaluations, including a user study, demonstrate the superior performance of our proposed method compared to existing approaches. An ablation study further validates the efficacy of each proposed component.

[CV-24] Audio-visual Event Localization on Portrait Mode Short Videos

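Quick read: This paper addresses the specific challenges of Audio-Visual Event Localization (AVEL) on portrait mode short videos. Existing AVEL datasets consist mainly of landscape-oriented long videos, whereas portrait mode short videos, with their portrait-oriented framing and layered audio compositions (e.g., overlapping sound effects, voiceovers, and background music), raise problems that conventional methods do not address. The paper's key contributions are AVE-PM, the first AVEL dataset designed specifically for portrait mode short videos (25,335 clips spanning 86 fine-grained categories), and an empirical analysis showing that state-of-the-art AVEL methods suffer an average 18.66% performance drop in cross-mode evaluation. Further analysis identifies two core challenges: 1) spatial bias from portrait-oriented framing introduces distinct domain priors, and 2) noisy audio composition compromises the reliability of the audio modality. To address them, the paper investigates optimal preprocessing recipes for portrait mode videos and the impact of background music; experiments show that tailored preprocessing and specialized model design improve performance. The work thus provides both a benchmark dataset and practical insights for AVEL research in the era of mobile-centric video content.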

Link: https://arxiv.org/abs/2504.06884
Authors: Wuyang Liu, Yi Chai, Yongpeng Yan, Yanzhen Ren
Affiliations: Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Audio-visual event localization (AVEL) plays a critical role in multimodal scene understanding. While existing datasets for AVEL predominantly comprise landscape-oriented long videos with clean and simple audio context, short videos have become the primary format of online video content due to the proliferation of smartphones. Short videos are characterized by portrait-oriented framing and layered audio compositions (e.g., overlapping sound effects, voiceovers, and music), which bring unique challenges unaddressed by conventional methods. To this end, we introduce AVE-PM, the first AVEL dataset specifically designed for portrait mode short videos, comprising 25,335 clips that span 86 fine-grained categories with frame-level annotations. Beyond dataset creation, our empirical analysis shows that state-of-the-art AVEL methods suffer an average 18.66% performance drop during cross-mode evaluation. Further analysis reveals two key challenges of different video formats: 1) spatial bias from portrait-oriented framing introduces distinct domain priors, and 2) noisy audio composition compromises the reliability of the audio modality. To address these issues, we investigate optimal preprocessing recipes and the impact of background music for AVEL on portrait mode videos. Experiments show that these methods can still benefit from tailored preprocessing and specialized model design, thus achieving improved performance. This work provides both a foundational benchmark and actionable insights for advancing AVEL research in the era of mobile-centric video content. Dataset and code will be released.

[CV-25] Compound and Parallel Modes of Tropical Convolutional Neural Networks

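Quick read: This paper addresses the high computational cost of deep Convolutional Neural Networks (CNNs) and the weaker performance of Tropical Convolutional Neural Networks (TCNNs). It proposes two new TCNN variants, compound TCNN (cTCNN) and parallel TCNN (pTCNN), which replace traditional convolution kernels with combinations of tropical min-plus and max-plus kernels, reducing multiplications while balancing efficiency and performance. The key is this kernel-combination strategy, which lowers computational cost and matches or exceeds the performance of other CNN methods.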

Link: https://arxiv.org/abs/2504.06881
Authors: Mingbo Li, Liying Liu, Ye Luo
Affiliations: Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 28 pages, 5 figures

Abstract:Convolutional neural networks have become increasingly deep and complex, leading to higher computational costs. While tropical convolutional neural networks (TCNNs) reduce multiplications, they underperform compared to standard CNNs. To address this, we propose two new variants, compound TCNN (cTCNN) and parallel TCNN (pTCNN), that use combinations of tropical min-plus and max-plus kernels to replace traditional convolution kernels. This reduces multiplications and balances efficiency with performance. Experiments on various datasets show that cTCNN and pTCNN match or exceed the performance of other CNN methods. Combining these with conventional CNNs in deeper architectures also improves performance. We are further exploring simplified TCNN architectures that reduce parameters and multiplications with minimal accuracy loss, aiming for efficient and effective models.
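
To make the min-plus and max-plus kernels concrete: in tropical arithmetic the product inside an ordinary convolution becomes an addition, and the sum over the window becomes a min or a max, so no multiplications are needed. The sketch below is a 1D, single-channel illustration with made-up inputs. How cTCNN and pTCNN combine the two kernel types is not specified in the abstract and is not modeled here.

```python
def tropical_conv1d(x, w, mode="min"):
    """1D 'valid' tropical convolution.

    Ordinary convolution: y[i] = sum_j x[i + j] * w[j]
    Min-plus version:     y[i] = min_j (x[i + j] + w[j])
    Max-plus version:     y[i] = max_j (x[i + j] + w[j])
    """
    reduce_fn = min if mode == "min" else max
    k = len(w)
    return [
        reduce_fn(x[i + j] + w[j] for j in range(k))
        for i in range(len(x) - k + 1)
    ]


x = [1.0, 4.0, 2.0, 5.0]
w = [0.0, -1.0]
min_plus = tropical_conv1d(x, w, "min")
max_plus = tropical_conv1d(x, w, "max")
print(min_plus)  # [1.0, 1.0, 2.0]
print(max_plus)  # [3.0, 4.0, 4.0]
```

The inner loop uses only additions and comparisons, which is where the reduction in multiplications comes from.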

[CV-26] GraspClutter6D: A Large-scale Real-world Dataset for Robust Perception and Grasping in Cluttered Scenes

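Quick read: This paper addresses robust robotic grasping in cluttered environments, which remains an open challenge in robotics. Existing benchmark datasets have advanced deep learning methods, but they focus mainly on simple scenes with light occlusion and limited diversity, which restricts their applicability to practical scenarios. The paper presents GraspClutter6D, a large-scale real-world grasping dataset featuring 1,000 highly cluttered scenes (14.1 objects per scene on average, 62.6% occlusion), coverage of 200 objects in 75 environment configurations, RGB-D images captured from multiple viewpoints, and rich annotations (736K 6D object poses and 9.3B feasible robotic grasps). This added complexity and diversity make it an effective resource for tackling cluttered environments. Experiments show that grasping networks trained on GraspClutter6D significantly outperform those trained on existing datasets in both simulation and real-world experiments.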

Link: https://arxiv.org/abs/2504.06866
Authors: Seunghyeok Back, Joosoon Lee, Kangmin Kim, Heeseon Rho, Geonhyup Lee, Raeyoung Kang, Sangbeom Lee, Sangjun Noh, Youngjin Lee, Taeyeop Lee, Kyoobin Lee
Affiliations: Korea Institute of Machinery & Materials (KIMM); Gwangju Institute of Science and Technology (GIST); Korea Advanced Institute of Science and Technology (KAIST)
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Robust grasping in cluttered environments remains an open challenge in robotics. While benchmark datasets have significantly advanced deep learning methods, they mainly focus on simplistic scenes with light occlusion and insufficient diversity, limiting their applicability to practical scenarios. We present GraspClutter6D, a large-scale real-world grasping dataset featuring: (1) 1,000 highly cluttered scenes with dense arrangements (14.1 objects/scene, 62.6% occlusion), (2) comprehensive coverage across 200 objects in 75 environment configurations (bins, shelves, and tables) captured using four RGB-D cameras from multiple viewpoints, and (3) rich annotations including 736K 6D object poses and 9.3B feasible robotic grasps for 52K RGB-D images. We benchmark state-of-the-art segmentation, object pose estimation, and grasping detection methods to provide key insights into challenges in cluttered environments. Additionally, we validate the dataset’s effectiveness as a training resource, demonstrating that grasping networks trained on GraspClutter6D significantly outperform those trained on existing datasets in both simulation and real-world experiments. The dataset, toolkit, and annotation tools are publicly available on our project website: this https URL.

[CV-27] MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking

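Quick read: This paper addresses single-image Moving Object Segmentation (MOS), which is challenging for existing methods. Conventional methods rely on multi-frame image sequences to identify moving objects, yet single-image MOS is critical for applications such as motion intention prediction and handling camera frame drops. Without temporal cues, segmenting moving objects from a single image remains difficult.

To address this, the paper proposes MovSAM, a new framework whose key idea is to use a Multimodal Large Language Model (MLLM) enhanced with Chain-of-Thought (CoT) prompting to search for moving objects and to generate text prompts for segmentation based on deep thinking. These prompts are cross-fused with visual features from the Segment Anything Model (SAM) and a Vision-Language Model (VLM), enabling logic-driven moving object segmentation. The segmentation results then pass through a deep thinking refinement loop, allowing MovSAM to iteratively improve its understanding of scene context and inter-object relationships. This lets MovSAM segment moving objects in single images by taking scene understanding into account. Experiments show that, despite the inherent advantage of multi-frame methods in using temporal information, MovSAM achieves state-of-the-art performance on public MOS benchmarks, reaching 92.5% on the J&F metric.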

Link: https://arxiv.org/abs/2504.06863
Authors: Chang Nie, Yiqing Xu, Guangming Wang, Zhe Liu, Yanzi Miao, Hesheng Wang
Affiliations: Department of Automation, Shanghai Jiao Tong University; The Advanced Robotics Research Center, Artificial Intelligence Research Institute and School of Information and Control Engineering, China University of Mining and Technology; Department of Engineering, University of Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Moving object segmentation plays a vital role in understanding dynamic visual environments. While existing methods rely on multi-frame image sequences to identify moving objects, single-image MOS is critical for applications like motion intention prediction and handling camera frame drops. However, segmenting moving objects from a single image remains challenging for existing methods due to the absence of temporal cues. To address this gap, we propose MovSAM, the first framework for single-image moving object segmentation. MovSAM leverages a Multimodal Large Language Model (MLLM) enhanced with Chain-of-Thought (CoT) prompting to search for the moving object and generate text prompts based on deep thinking for segmentation. These prompts are cross-fused with visual features from the Segment Anything Model (SAM) and a Vision-Language Model (VLM), enabling logic-driven moving object segmentation. The segmentation results then undergo a deep thinking refinement loop, allowing MovSAM to iteratively improve its understanding of the scene context and inter-object relationships with logical reasoning. This innovative approach enables MovSAM to segment moving objects in single images by considering scene understanding. We implement MovSAM in the real world to validate its practical application and effectiveness for autonomous driving scenarios where the multi-frame methods fail. Furthermore, despite the inherent advantage of multi-frame methods in utilizing temporal information, MovSAM achieves state-of-the-art performance across public MOS benchmarks, reaching 92.5% on J&F. Our implementation will be available at this https URL.

[CV-28] EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic Zero-Shot Training-Free Text-to-Video Generation CVPR

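Quick read: This paper addresses zero-shot, training-free, image-based text-to-video generation. Current methods require specific architectural changes to image generation models, which limits their adaptability and scalability. The paper proposes a model-agnostic approach that uses intersections in diffusion trajectories and works only with the latent values. Because trajectory intersections alone do not yield localized frame-wise coherence and diversity, the authors adopt a grid-based approach. An in-context trained LLM generates coherent frame-wise prompts, and another identifies differences between frames; from these, a CLIP-based attention mask is obtained that controls when the prompt is switched for each grid cell. Earlier switching yields higher variance, while later switching yields more coherence, so the method offers appropriate control over the balance between coherence and variance. The key is the combination of the grid-based strategy, LLM-generated frame-wise prompts, and the CLIP-based attention mask, which delivers state-of-the-art training-free, image-based text-to-video generation while improving compatibility with diverse image generation models. Quantitative metrics and user studies confirm the model's superior temporal consistency, visual fidelity, and user satisfaction.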

Link: https://arxiv.org/abs/2504.06861
Authors: Diljeet Jagpal, Xi Chen, Vinay P. Namboodiri
Affiliations: University of Bath; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

Abstract:Zero-shot, training-free, image-based text-to-video generation is an emerging area that aims to generate videos using existing image-based diffusion models. Current methods in this space require specific architectural changes to image generation models, which limit their adaptability and scalability. In contrast to such methods, we provide a model-agnostic approach. We use intersections in diffusion trajectories, working only with the latent values. We could not obtain localized frame-wise coherence and diversity using only the intersection of trajectories. Thus, we instead use a grid-based approach. An in-context trained LLM is used to generate coherent frame-wise prompts; another is used to identify differences between frames. Based on these, we obtain a CLIP-based attention mask that controls the timing of switching the prompts for each grid cell. Earlier switching results in higher variance, while later switching results in more coherence. Therefore, our approach can ensure appropriate control between coherence and variance for the frames. Our approach results in state-of-the-art performance while being more flexible when working with diverse image-generation models. The empirical analysis using quantitative metrics and user studies confirms our model’s superior temporal consistency, visual fidelity and user satisfaction, thus providing a novel way to obtain training-free, image-based text-to-video generation.

[CV-29] CasTex: Cascaded Text-to-Texture Synthesis via Explicit Texture Maps and Physically-Based Shading

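Quick read: This paper studies text-to-texture synthesis with diffusion models, aiming to generate physically-based texture maps that give models a realistic appearance under varying lighting conditions. A prominent solution for this task is Score Distillation Sampling (SDS), which recovers complex textures using gradient guidance through a differentiable rasterization and shading pipeline; however, when combined with the widely used latent diffusion models, it produces severe visual artifacts and requires additional regularization such as implicit texture parameterization. To address this, the paper proposes CasTex, a texture synthesis approach based on cascaded diffusion models. In this setup, score distillation sampling yields high-quality textures directly, and implicit texture parameterization can be replaced with an explicit parameterization, which improves the procedure. Experiments show that the approach significantly outperforms state-of-the-art optimization-based solutions on public texture synthesis benchmarks.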

Link: https://arxiv.org/abs/2504.06856
Authors: Mishan Aliev, Dmitry Baranchuk, Kirill Struminsky
Affiliations: HSE University; Yandex Research; Yandex Research, HSE University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint, work in progress

Abstract:This work investigates text-to-texture synthesis using diffusion models to generate physically-based texture maps. We aim to achieve realistic model appearances under varying lighting conditions. A prominent solution for the task is score distillation sampling. It allows recovering a complex texture using gradient guidance given a differentiable rasterization and shading pipeline. However, in practice, the aforementioned solution in conjunction with the widespread latent diffusion models produces severe visual artifacts and requires additional regularization such as implicit texture parameterization. As a more direct alternative, we propose an approach using cascaded diffusion models for texture synthesis (CasTex). In our setup, score distillation sampling yields high-quality textures out of the box. In particular, we were able to omit implicit texture parameterization in favor of an explicit parameterization to improve the procedure. In the experiments, we show that our approach significantly outperforms state-of-the-art optimization-based solutions on public texture synthesis benchmarks.

[CV-30] Classifying the Unknown: In-Context Learning for Open-Vocabulary Text and Symbol Recognition ICDAR2025

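Quick read: This paper addresses how to classify sequences of novel script patterns in documents from only a few examples, without explicit model retraining. The key is Rosetta, a multimodal model that achieves this through Multimodal In-Context Learning (MICL). In addition, a dataset generation process that ensures varying degrees of contextual informativeness, together with a Context-Aware Tokenizer (CAT), enables open-vocabulary classification: the model can classify patterns beyond the alphabet of patterns seen during training, which supports applications such as recognizing new alphabets and languages. Experiments on synthetic datasets show that Rosetta can classify out-of-distribution visual patterns and diverse alphabets and scripts, including Chinese, Greek, Russian, French, Spanish, and Japanese.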

Link: https://arxiv.org/abs/2504.06841
Authors: Tom Simon, William Mocaer, Pierrick Tranouez, Clement Chatelain, Thierry Paquet
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ICDAR 2025

Abstract:We introduce Rosetta, a multimodal model that leverages Multimodal In-Context Learning (MICL) to classify sequences of novel script patterns in documents by leveraging minimal examples, thus eliminating the need for explicit retraining. To enhance contextual learning, we designed a dataset generation process that ensures varying degrees of contextual informativeness, improving the model’s adaptability in leveraging context across different scenarios. A key strength of our method is the use of a Context-Aware Tokenizer (CAT), which enables open-vocabulary classification. This allows the model to classify text and symbol patterns across an unlimited range of classes, extending its classification capabilities beyond the scope of its training alphabet of patterns. As a result, it unlocks applications such as the recognition of new alphabets and languages. Experiments on synthetic datasets demonstrate the potential of Rosetta to successfully classify Out-Of-Distribution visual patterns and diverse sets of alphabets and scripts, including but not limited to Chinese, Greek, Russian, French, Spanish, and Japanese.

[CV-31] ZIP: An Efficient Zeroth-order Prompt Tuning for Black-box Vision-Language Models ICLR2025

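Quick read: This paper addresses the inefficiency of black-box prompt-tuning (BBPT) methods, which often require an excessive number of queries and therefore struggle in real-world settings where the query budget is limited. It proposes Zeroth-order Intrinsic-dimensional Prompt-tuning (ZIP). ZIP's key innovation is to re-parameterize prompts in low-rank representations and to design intrinsic-dimensional clipping of estimated gradients, which reduces the problem dimensionality and the variance of zeroth-order gradient estimates, so that training is faster and needs far fewer queries. Experiments show that ZIP achieves an average improvement of approximately 6% in few-shot accuracy and 48% in query efficiency over the best-performing alternative BBPT methods on vision-language tasks, and that its clipping mechanism is robust and requires no manual threshold selection.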

Link: https://arxiv.org/abs/2504.06838
Authors: Seonghwan Park, Jaehyeon Jeong, Yongjun Kim, Jaeho Lee, Namhoon Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICLR 2025

Abstract:Recent studies have introduced various approaches for prompt-tuning black-box vision-language models, referred to as black-box prompt-tuning (BBPT). While BBPT has demonstrated considerable potential, it is often found that many existing methods require an excessive number of queries (i.e., function evaluations), which poses a significant challenge in real-world scenarios where the number of allowed queries is limited. To tackle this issue, we propose Zeroth-order Intrinsic-dimensional Prompt-tuning (ZIP), a novel approach that enables efficient and robust prompt optimization in a purely black-box setting. The key idea of ZIP is to reduce the problem dimensionality and the variance of zeroth-order gradient estimates, such that the training is done fast with far fewer queries. We achieve this by re-parameterizing prompts in low-rank representations and designing intrinsic-dimensional clipping of estimated gradients. We evaluate ZIP on 13+ vision-language tasks in standard benchmarks and show that it achieves an average improvement of approximately 6% in few-shot accuracy and 48% in query efficiency compared to the best-performing alternative BBPT methods, establishing a new state of the art. Our ablation analysis further shows that the proposed clipping mechanism is robust and nearly optimal, without the need to manually select the clipping threshold, matching the result of expensive hyperparameter search.
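
For readers unfamiliar with zeroth-order optimization, the sketch below shows the general mechanism the abstract builds on: estimate a gradient from loss queries alone, clip the estimate, and take a step. It is a generic two-point estimator on a toy quadratic, not ZIP itself. The loss function, the four-dimensional parameter vector standing in for a low-dimensional prompt parameterization, the learning rate, and the clipping threshold (the square root of the dimension) are all assumptions made for illustration; the low-rank re-parameterization is not modeled.

```python
import math
import random


def black_box_loss(z, target=(1.0, -2.0, 0.5, 3.0)):
    """Stand-in for one query to a black-box model: only a scalar comes back."""
    return sum((zi - ti) ** 2 for zi, ti in zip(z, target))


def zo_gradient(loss_fn, z, rng, mu=1e-3):
    """Two-point zeroth-order gradient estimate along one random direction.

    Costs exactly two queries to loss_fn.
    """
    d = len(z)
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(ui * ui for ui in u))
    u = [ui * math.sqrt(d) / norm for ui in u]  # rescale so ||u||^2 = d
    z_plus = [zi + mu * ui for zi, ui in zip(z, u)]
    z_minus = [zi - mu * ui for zi, ui in zip(z, u)]
    slope = (loss_fn(z_plus) - loss_fn(z_minus)) / (2.0 * mu)
    return [slope * ui for ui in u]


def clip_by_norm(g, max_norm):
    """Scale g down so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [gi * scale for gi in g]


def optimize(steps=300, lr=0.05, seed=0):
    rng = random.Random(seed)
    z = [0.0, 0.0, 0.0, 0.0]
    threshold = math.sqrt(len(z))  # assumed threshold tied to the dimension
    for _ in range(steps):
        g = clip_by_norm(zo_gradient(black_box_loss, z, rng), threshold)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z


initial_loss = black_box_loss([0.0] * 4)
final_loss = black_box_loss(optimize())
print(initial_loss, final_loss)
```

Each step spends two queries, so the query budget is simply twice the number of steps. The variance of the estimate grows with the number of parameters being perturbed, which is the motivation the abstract gives for working in a lower-dimensional space and clipping the estimates.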

[CV-32] Determining Fetal Orientations From Blind Sweep Ultrasound Video

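Quick read: This paper aims to ease the high cognitive demands that fetal ultrasound examinations place on clinicians by developing an automated pipeline that predicts fetal orientation from ultrasound videos, as an assistive tool. The key is to build on a pre-trained head detection and segmentation model: the pipeline first determines the fetal presentation (cephalic or breech) with a template matching approach, then determines the fetal lie (facing left or right) by analyzing the spatial distribution of segmented brain anatomies. Evaluation on a dataset of third-trimester ultrasound scans shows promising accuracy. The work is distinguished by introducing automated fetal lie prediction and by proposing an assistive paradigm that augments sonographer expertise rather than replacing it. Future research will focus on improving acquisition efficiency and on exploring real-time clinical integration to improve workflow and support obstetric clinicians.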

Link: https://arxiv.org/abs/2504.06836
Authors: Jakub Maciej Wiśniewski, Anders Nymark Christensen, Mary Le Ngo, Martin Grønnebæk Tolsgaard, Chun Kit Wong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:Cognitive demands of fetal ultrasound examinations pose unique challenges for clinicians. With the goal of providing an assistive tool, we developed an automated pipeline for predicting fetal orientation from ultrasound videos acquired following a simple blind sweep protocol. Leveraging a pre-trained head detection and segmentation model, this is achieved by first determining the fetal presentation (cephalic or breech) with a template matching approach, followed by the fetal lie (facing left or right) by analyzing the spatial distribution of segmented brain anatomies. Evaluation on a dataset of third-trimester ultrasound scans demonstrated the promising accuracy of our pipeline. This work distinguishes itself by introducing automated fetal lie prediction and by proposing an assistive paradigm that augments sonographer expertise rather than replacing it. Future research will focus on enhancing acquisition efficiency, and exploring real-time clinical integration to improve workflow and support for obstetric clinicians.

[CV-33] LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding

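[Quick Read]: This paper addresses the information loss that Vision-Language Models (VLMs) suffer in long video understanding because of sparse frame sampling, while working around the scarcity of high-quality video-text datasets that limits Video Large Language Models (Video-LLMs). The key solution is Lightweight Video Compression (LVC), a new method built around the Query-Attention Video Compression mechanism. By training only the alignment layer on 10,000 short video-text pairs, LVC significantly improves the temporal reasoning of VLMs and delivers consistent gains across multiple models, including the InternVL2 series and Phi-3.5-Vision.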

Link: https://arxiv.org/abs/2504.06835
Authors: Ziyi Wang, Haoran Wu, Yiming Rong, Deyang Jiang, Yixin Zhang, Yunlong Zhao, Shuang Xu, Bo XU
Affiliations: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences (自动化研究所，中国科学院认知与智能决策重点实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Long video understanding is a complex task that requires both spatial detail and temporal awareness. While Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input, they suffer from information loss due to the sparse sampling strategy. In contrast, Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are limited by the scarcity of high-quality video-text datasets. To transfer long video understanding capabilities to VLMs with minimal data and computational cost, we propose Lightweight Video Compression (LVC), a novel method featuring the Query-Attention Video Compression mechanism, which effectively tackles the sparse sampling problem in VLMs. By training only the alignment layer with 10k short video-text pairs, LVC significantly enhances the temporal reasoning abilities of VLMs. Extensive experiments show that LVC provides consistent performance improvements across various models, including the InternVL2 series and Phi-3.5-Vision. Notably, the InternVL2-40B-LVC achieves scores of 68.2 and 65.9 on the long video understanding benchmarks MLVU and Video-MME, respectively, with relative improvements of 14.6% and 7.7%. The enhanced models and code will be publicly available soon.

[CV-34] IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments

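[Quick Read]: This paper addresses the challenge intelligent agents face in understanding articulated objects in their environment, specifically building a 3D understanding of such objects through interaction. Whereas prior methods rely on task-specific networks and assumptions about movable parts, the proposed IAAO framework uses large foundation models to estimate interactive affordances and part articulations in three stages. First, it uses 3D Gaussian Splatting (3DGS) to build hierarchical features and label fields for each object state from multi-view images. Next, it runs object-level and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations, local articulation parameters, and affordances together. Finally, it merges and refines the scenes from different states based on the estimated transformations, enabling robust affordance-based interaction and manipulation. Experimental results demonstrate the effectiveness of the method.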

Link: https://arxiv.org/abs/2504.06827
Authors: Can Zhang, Gim Hee Lee
Affiliations: Department of Computer Science, National University of Singapore (新加坡国立大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This work presents IAAO, a novel framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. Unlike prior methods that rely on task-specific networks and assumptions about movable parts, our IAAO leverages large foundation models to estimate interactive affordances and part articulations in three stages. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances. Finally, scenes from different states are merged and refined based on the estimated transformations, enabling robust affordance-based interaction and manipulation of objects. Experimental results demonstrate the effectiveness of our method.

[CV-35] SVG-IR: Spatially-Varying Gaussian Splatting for Inverse Rendering

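[Quick Read]: This paper addresses the difficulty of reconstructing 3D assets from images, known as inverse rendering (IR), which is challenging because the problem is ill-posed. 3D Gaussian Splatting (3DGS) performs well on novel view synthesis (NVS), but methods that apply it to relighting produce low-quality results with artifacts and unnatural indirect illumination, because each Gaussian has constant material parameters and a constant normal and because indirect lighting lacks physical constraints. To address this, the paper proposes Spatially-varying Gaussian Inverse Rendering (SVG-IR). Its key innovation is a new representation, the Spatially-varying Gaussian (SVG), which allows each Gaussian to carry spatially varying parameters, combined with an SVG splatting scheme akin to vertex/fragment shading in traditional graphics pipelines and a physically based indirect lighting model. The framework improves both NVS and relighting quality, outperforming state-of-the-art NeRF-based methods by 2.5 dB in peak signal-to-noise ratio (PSNR) and existing Gaussian-based methods by 3.5 dB on relighting, while maintaining real-time rendering speed.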

Link: https://arxiv.org/abs/2504.06815
Authors: Hanxiao Sun, YuPeng Gao, Jin Xie, Jian Yang, Beibei Wang
Affiliations: Nankai University (南开大学); Nanjing University (南京大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Reconstructing 3D assets from images, known as inverse rendering (IR), remains a challenging task due to its ill-posed nature. 3D Gaussian Splatting (3DGS) has demonstrated impressive capabilities for novel view synthesis (NVS) tasks. Existing methods apply it to relighting by separating radiance into BRDF parameters and lighting, yet produce inferior relighting quality with artifacts and unnatural indirect illumination. This is due to the limited capability of each Gaussian, which has constant material parameters and a constant normal, and to the absence of physical constraints for indirect lighting. In this paper, we present a novel framework called Spatially-varying Gaussian Inverse Rendering (SVG-IR), aimed at enhancing both NVS and relighting quality. To this end, we propose a new representation, Spatially-varying Gaussian (SVG), that allows per-Gaussian spatially varying parameters. This enhanced representation is complemented by an SVG splatting scheme akin to vertex/fragment shading in traditional graphics pipelines. Furthermore, we integrate a physically-based indirect lighting model, enabling more realistic relighting. The proposed SVG-IR framework significantly improves rendering quality, outperforming state-of-the-art NeRF-based methods by 2.5 dB in peak signal-to-noise ratio (PSNR) and surpassing existing Gaussian-based techniques by 3.5 dB in relighting tasks, all while maintaining a real-time rendering speed.
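
To put the reported PSNR gains in perspective, the short sketch below uses the standard PSNR definition to convert a gain in dB into the corresponding reduction in mean squared error (MSE). The helper names are hypothetical, and the calculation says nothing about how the paper measured its results.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by `gain_db` dB."""
    return 10.0 ** (gain_db / 10.0)


# A 2.5 dB gain corresponds to roughly 1.78x lower MSE,
# and a 3.5 dB gain to roughly 2.24x lower MSE.
ratio_nvs = mse_ratio_for_gain(2.5)
ratio_relight = mse_ratio_for_gain(3.5)
```

Because PSNR is logarithmic, a gain of a few dB reflects a substantial drop in pixel-level error.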

[CV-36] Hybrid CNN with Chebyshev Polynomial Expansion for Medical Image Analysis

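[Quick Read]: This paper addresses the difficulty of automatically detecting pulmonary nodules in computed tomography (CT) scans, which stems from variability in nodule size, shape, texture, and location. Traditional Convolutional Neural Networks (CNNs) show promise in medical image analysis, but their limited ability to capture fine-grained spatial-spectral variations restricts performance in complex diagnostic scenarios. The key solution is a new hybrid deep learning architecture, Chebyshev-CNN, which incorporates Chebyshev polynomial expansions into CNN layers to increase expressive power and better represent the underlying anatomical structures. The architecture exploits the orthogonality and recursive properties of Chebyshev polynomials to extract high-frequency features and approximate complex nonlinear functions with greater fidelity, significantly improving the accuracy, sensitivity, and specificity of classifying pulmonary nodules as benign or malignant.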

Link: https://arxiv.org/abs/2504.06811
Authors: Abhinav Roy, Bhavesh Gyanchandani, Aditya Oza
Affiliations: IIIT, Naya Raipur (印度国际信息技术学院，纳亚拉普尔)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Lung cancer remains one of the leading causes of cancer-related mortality worldwide, with early and accurate diagnosis playing a pivotal role in improving patient outcomes. Automated detection of pulmonary nodules in computed tomography (CT) scans is a challenging task due to variability in nodule size, shape, texture, and location. Traditional Convolutional Neural Networks (CNNs) have shown considerable promise in medical image analysis; however, their limited ability to capture fine-grained spatial-spectral variations restricts their performance in complex diagnostic scenarios. In this study, we propose a novel hybrid deep learning architecture that incorporates Chebyshev polynomial expansions into CNN layers to enhance expressive power and improve the representation of underlying anatomical structures. The proposed Chebyshev-CNN leverages the orthogonality and recursive properties of Chebyshev polynomials to extract high-frequency features and approximate complex nonlinear functions with greater fidelity. The model is trained and evaluated on benchmark lung cancer imaging datasets, including LUNA16 and LIDC-IDRI, achieving superior performance in classifying pulmonary nodules as benign or malignant. Quantitative results demonstrate significant improvements in accuracy, sensitivity, and specificity compared to traditional CNN-based approaches. This integration of polynomial-based spectral approximation within deep learning provides a robust framework for enhancing automated medical diagnostics and holds potential for broader applications in clinical decision support systems.
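
The recursive property mentioned in the abstract is the three-term recurrence T_{n+1}(x) = 2x·T_n(x) − T_{n−1}(x). The sketch below evaluates a Chebyshev expansion of a scalar input with that recurrence. How the paper wires such expansions into CNN layers is not described in the abstract, so the function names and the scalar setting here are illustrative assumptions only.

```python
def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] using the three-term recurrence."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


def chebyshev_expand(x, coefficients):
    """Evaluate sum_k coefficients[k] * T_k(x) (hypothetical helper)."""
    basis = chebyshev_basis(x, len(coefficients) - 1)
    return sum(c * t for c, t in zip(coefficients, basis))
```

Inputs are usually rescaled to [-1, 1] before the expansion, since that is the interval on which the polynomials are orthogonal and bounded.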

[CV-37] DyDiT: Dynamic Diffusion Transformers for Efficient Visual Generation ICLR

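[Quick Read]: This paper addresses the high computational cost of diffusion-based visual generation. Conventional methods use a static inference paradigm, which inevitably introduces redundant computation at certain diffusion timesteps and in certain spatial regions. To overcome this, the paper proposes the Dynamic Diffusion Transformer (DyDiT), whose core idea is to adjust computation dynamically. Specifically, DyDiT introduces Timestep-wise Dynamic Width (TDW), which adapts model width to the generation timestep, and a Spatial-wise Dynamic Token (SDT) strategy, which avoids redundant computation at unnecessary spatial locations. Both integrate seamlessly into the Diffusion Transformer (DiT) and significantly accelerate generation. The paper further extends DyDiT in three ways: integrating it with flow matching-based generation for greater versatility, extending it to more complex tasks such as video generation and text-to-image generation, and reducing the cost of full fine-tuning through parameter-efficient training with timestep-based dynamic LoRA (TD-LoRA), which broadens access to the technology.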

Link: https://arxiv.org/abs/2504.06803
Authors: Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Hao Luo, Yibing Song, Gao Huang, Fan Wang, Yang You
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Extended journal version for ICLR. arXiv admin note: substantial text overlap with arXiv:2410.03456

Abstract: Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerate the generation process. Building on these designs, we further enhance DyDiT in three key aspects. First, DyDiT is integrated seamlessly with flow matching-based generation, enhancing its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT.

[CV-38] MonoPlace3D: Learning 3D-Aware Object Placement for 3D Monocular Detection

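[Quick Read]: This paper addresses the fact that current monocular 3D object detectors are held back by the limited diversity and scale of real-world datasets, and in particular the difficulty of generating realistic, scene-aware augmented data. Data augmentation helps, but producing realistic augmentations for outdoor scenes is especially hard. Most current synthetic data generation methods focus on realistic object appearance through improved rendering, yet the paper shows that where and how objects are placed in the scene matters just as much. The key obstacle is automatically determining realistic placement parameters (position, dimensions, and directional alignment) when introducing synthetic objects into real scenes. To address this, the paper proposes MonoPlace3D, which uses the 3D scene content to create realistic augmentations: it first learns a distribution over plausible 3D bounding boxes from the background scene, then renders objects and places them at locations sampled from that distribution. Experiments show that MonoPlace3D significantly improves the accuracy of multiple existing monocular 3D detectors while being highly data efficient.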

Link: https://arxiv.org/abs/2504.06801
Authors: Rishubh Parihar, Srinjay Sarkar, Sarthak Vora, Jogendra Kundu, R. Venkatesh Babu
Affiliations: IISc Bangalore (印度科学学院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract: Current monocular 3D detectors are held back by the limited diversity and scale of real-world datasets. While data augmentation certainly helps, it’s particularly difficult to generate realistic scene-aware augmented data for outdoor settings. Most current approaches to synthetic data generation focus on realistic object appearance through improved rendering techniques. However, we show that where and how objects are positioned is just as crucial for training effective 3D monocular detectors. The key obstacle lies in automatically determining realistic object placement parameters, including position, dimensions, and directional alignment, when introducing synthetic objects into actual scenes. To address this, we introduce MonoPlace3D, a novel system that considers the 3D scene content to create realistic augmentations. Specifically, given a background scene, MonoPlace3D learns a distribution over plausible 3D bounding boxes. Subsequently, we render realistic objects and place them according to the locations sampled from the learned distribution. Our comprehensive evaluation on two standard datasets, KITTI and NuScenes, demonstrates that MonoPlace3D significantly improves the accuracy of multiple existing monocular 3D detectors while being highly data efficient.

[CV-39] A Meaningful Perturbation Metric for Evaluating Explainability Methods

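[Quick Read]: This paper addresses the opaque decision-making of deep neural networks (DNNs), which hinders their wide adoption. Attribution methods have been proposed to assign relevance values to each part of the input, but different methods often produce entirely different relevance maps, so standardized metrics are needed to evaluate them. Such evaluation is typically done through perturbation: high- or low-relevance regions of the input image are manipulated and the change in prediction is observed. The key contribution is a new approach that uses image generation models to perform targeted perturbation. Specifically, it inpaints only the high-relevance pixels of the input image, changing the model's prediction while preserving image fidelity. Existing approaches, in contrast, often produce out-of-distribution modifications that lead to unreliable results. Extensive experiments show that the approach produces meaningful rankings and that these rankings correlate significantly better with human preferences than those of existing approaches, highlighting its potential for improving DNN interpretability.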

Link: https://arxiv.org/abs/2504.06800
Authors: Danielle Cohen, Hila Chefer, Lior Wolf
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep neural networks (DNNs) have demonstrated remarkable success, yet their wide adoption is often hindered by their opaque decision-making. To address this, attribution methods have been proposed to assign relevance values to each part of the input. However, different methods often produce entirely different relevance maps, necessitating the development of standardized metrics to evaluate them. Typically, such evaluation is performed through perturbation, wherein high- or low-relevance regions of the input image are manipulated to examine the change in prediction. In this work, we introduce a novel approach, which harnesses image generation models to perform targeted perturbation. Specifically, we focus on inpainting only the high-relevance pixels of an input image to modify the model’s predictions while preserving image fidelity. This is in contrast to existing approaches, which often produce out-of-distribution modifications, leading to unreliable results. Through extensive experiments, we demonstrate the effectiveness of our approach in generating meaningful rankings across a wide range of models and attribution methods. Crucially, we establish that the ranking produced by our metric exhibits significantly higher correlation with human preferences compared to existing approaches, underscoring its potential for enhancing interpretability in DNNs.

[CV-40] Zero-Shot Image-Based Large Language Model Approach to Road Pavement Monitoring

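[Quick Read]: This paper addresses the subjectivity of conventional manual pavement surface inspection and the dependence of existing machine learning methods on large, high-quality labeled datasets, which are costly to obtain and limit adaptability across varied road conditions. To meet these challenges, it proposes a zero-shot learning approach that uses the image recognition and natural language understanding capabilities of Large Language Models (LLMs) to assess road conditions. The key is to develop multiple LLM-based assessment models using prompt engineering strategies aligned with the Pavement Surface Condition Index (PSCI) standards. The optimized model, which uses comprehensive and structured prompt engineering, achieves high accuracy and consistency, even surpassing expert evaluations, and its successful application to Google Street View images points to potential city-scale deployment.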

Link: https://arxiv.org/abs/2504.06785
Authors: Shuoshuo Xu, Kai Zhao, James Loney, Zili Li, Andrea Visentin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Effective and rapid evaluation of pavement surface condition is critical for prioritizing maintenance, ensuring transportation safety, and minimizing vehicle wear and tear. While conventional manual inspections suffer from subjectivity, existing machine learning-based methods are constrained by their reliance on large and high-quality labeled datasets, which require significant resources and limit adaptability across varied road conditions. The revolutionary advancements in Large Language Models (LLMs) present significant potential for overcoming these challenges. In this study, we propose an innovative automated zero-shot learning approach that leverages the image recognition and natural language understanding capabilities of LLMs to assess road conditions effectively. Multiple LLM-based assessment models were developed, employing prompt engineering strategies aligned with the Pavement Surface Condition Index (PSCI) standards. These models’ accuracy and reliability were evaluated against official PSCI results, with an optimized model ultimately selected. Extensive tests benchmarked the optimized model against evaluations from experts of various levels using Google Street View road images. The results reveal that the LLM-based approach can effectively assess road conditions, with the optimized model, which employs comprehensive and structured prompt engineering strategies, outperforming simpler configurations by achieving high accuracy and consistency, even surpassing expert evaluations. Moreover, successfully applying the optimized model to Google Street View images demonstrates its potential for future city-scale deployments. These findings highlight the transformative potential of LLMs in automating road damage evaluations and underscore the pivotal role of detailed prompt engineering in achieving reliable assessments.

[CV-41] Domain Generalization through Attenuation of Domain-Specific Information CVPR2025

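[Quick Read]: This paper addresses the over-reliance of models on domain-specific information in domain-generalized semantic segmentation of automotive images. Conventional methods are constrained during training by domain-specific characteristics such as sensor parameters or lighting conditions, so they perform poorly on unseen domains. The paper proposes a new evaluation metric, Domain Independence (DI), which quantifies how strongly a model depends on domain-specific information, and a method, Attenuation of Domain-Specific Information (ADSI), which suppresses the influence of those domain-specific features. ADSI applies a Butterworth filter to the low-frequency components and multiplies them by a scaling factor, so that important information such as color is retained while unnecessary domain-specific information is attenuated, yielding more robust domain-independent feature extraction. Experiments show that the method outperforms conventional approaches when trained on GTA5 and evaluated on real-world data, and that a model trained on Cityscapes is robust under nighttime conditions.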

Link: https://arxiv.org/abs/2504.06781
Authors: Reiji Saito, Kazuhiro Hotta
Affiliations: Meijo University (名城大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025 Workshops

Abstract: In this paper, we propose a new evaluation metric called Domain Independence (DI) and a method called Attenuation of Domain-Specific Information (ADSI), both designed specifically for domain-generalized semantic segmentation in automotive images. DI measures the presence of domain-specific information: a lower DI value indicates strong domain dependence, while a higher DI value suggests greater domain independence. This makes it possible to identify roughly where domain-specific information exists and up to which frequency range it is present. As a result, it becomes possible to effectively suppress only the regions in the image that contain domain-specific information, enabling feature extraction independent of the domain. ADSI uses a Butterworth filter to remove the low-frequency components of images that contain inherent domain-specific information such as sensor characteristics and lighting conditions. However, since low-frequency components also contain important information such as color, we should not remove them completely. Thus, the low-frequency components are multiplied by a scalar value (ranging from 0 to 1) to retain essential information. This helps the model learn more domain-independent features. In experiments where GTA5 (a synthetic dataset) was used for training and a real-world dataset was used for evaluation, the proposed method outperformed conventional approaches. Similarly, in experiments where Cityscapes (a real-world dataset) was used for training and datasets covering various environments such as rain and nighttime were used for evaluation, the proposed method demonstrated its robustness under nighttime conditions.
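
The attenuation step described in the abstract can be pictured as a per-frequency gain: low frequencies are scaled by a scalar in [0, 1] and high frequencies pass through unchanged. The sketch below is one plausible reading, not the authors' implementation. It assumes the image-processing form of the Butterworth low-pass response, H(f) = 1 / (1 + (f / cutoff)^(2·order)), and the names `cutoff`, `order`, and `alpha` are hypothetical.

```python
def butterworth_lowpass(freq, cutoff, order=2):
    """Butterworth low-pass response (image-processing convention, assumed)."""
    return 1.0 / (1.0 + (freq / cutoff) ** (2 * order))


def adsi_gain(freq, cutoff, alpha, order=2):
    """Gain applied to a frequency component in this sketch.

    The low-pass part of the spectrum is scaled by `alpha` (0 removes it,
    1 keeps it unchanged), while the complementary high-pass part is kept.
    """
    low = butterworth_lowpass(freq, cutoff, order)
    return alpha * low + (1.0 - low)
```

In practice the gain would be evaluated on the radial frequency of each coefficient of a 2D Fourier transform of the image and multiplied into the spectrum before inverting the transform.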

[CV-42] End2end-ALARA: Approaching the ALARA Law in CT Imaging with End-to-end Learning

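[Quick Read]: This paper addresses how to lower radiation dose in computed tomography (CT) imaging while preserving image quality, that is, how to realize the "as low as reasonably achievable" (ALARA) principle. The proposed solution is an end-to-end learning framework, End2end-ALARA. Its key idea is to jointly optimize a dose modulation module and an image reconstruction module, connect them through a differentiable simulation function, and optimize them with a constrained hinge loss. The objective is to minimize radiation dose subject to a prescribed image quality index. The method can preset personalized dose levels for each patient, giving stable image quality across patients, and it reaches the same image quality at a lower dose than fixed-dose or conventional dose modulation strategies.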

Link: https://arxiv.org/abs/2504.06777
Authors: Xi Tao, Liyan Lin
Affiliations: Guizhou University (贵州大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Computed tomography (CT) examination poses a risk of radiation injury to patients. A consensus when performing CT imaging is to make the radiation dose as low as reasonably achievable, i.e. the ALARA law. In this paper, we propose an end-to-end learning framework, named End2end-ALARA, that jointly optimizes dose modulation and image reconstruction to meet the goal of ALARA in CT imaging. End2end-ALARA works by building a dose modulation module and an image reconstruction module, connecting these modules with a differentiable simulation function, and optimizing them with a constrained hinge loss function. The objective is to minimize radiation dose subject to a prescribed image quality (IQ) index. The results show that End2end-ALARA is able to preset personalized dose levels to gain a stable IQ level across patients, which may facilitate image-based diagnosis and downstream model training. Moreover, compared to fixed-dose and conventional dose modulation strategies, End2end-ALARA consumes lower dose to reach the same IQ level. Our study sheds light on a way of realizing the ALARA law in CT imaging.
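
The abstract does not spell out the constrained hinge loss, so the following is only a guess at its general shape: the dose is minimized directly, and a hinge term penalizes any shortfall of the image quality (IQ) index below the prescribed target. The function name, the argument names, and the penalty weight are assumptions.

```python
def alara_hinge_loss(dose, image_quality, target_quality, penalty=10.0):
    """Dose plus a hinge penalty for missing the prescribed IQ target.

    The hinge term is zero whenever `image_quality` meets or exceeds
    `target_quality`, so only the dose is minimized in that regime.
    """
    shortfall = max(0.0, target_quality - image_quality)
    return dose + penalty * shortfall
```

With this shape, lowering the dose is rewarded until the image quality drops below the target, at which point the penalty pushes the dose back up.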

[CV-43] FANeRV: Frequency Separation and Augmentation based Neural Representation for Video

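[Quick Read]: This paper addresses the difficulty that existing neural representations for video (NeRV) have in capturing fine spatial details, which leads to blurry reconstructions. It proposes a Frequency Separation and Augmentation based Neural Representation for video (FANeRV). Its core is a Wavelet Frequency Upgrade Block, which uses the discrete wavelet transform to explicitly split input frames into high- and low-frequency components and then enhances them with specialized modules. A gated network then fuses the frequency components for optimal reconstruction. In addition, convolutional residual enhancement blocks are integrated into the later stages of the network to balance the parameter distribution and further improve the recovery of high-frequency details. Experiments show that FANeRV significantly improves reconstruction performance and excels at video compression, inpainting, and interpolation, outperforming existing NeRV methods.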

Link: https://arxiv.org/abs/2504.06755
Authors: Li Yu, Zhihui Li, Jimin Xiao, Moncef Gabbouj
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Neural representations for video (NeRV) have gained considerable attention for their strong performance across various video tasks. However, existing NeRV methods often struggle to capture fine spatial details, resulting in vague reconstructions. In this paper, we present a Frequency Separation and Augmentation based Neural Representation for video (FANeRV), which addresses these limitations with its core Wavelet Frequency Upgrade Block. This block explicitly separates input frames into high- and low-frequency components using discrete wavelet transform, followed by targeted enhancement using specialized modules. Finally, a specially designed gated network effectively fuses these frequency components for optimal reconstruction. Additionally, convolutional residual enhancement blocks are integrated into the later stages of the network to balance parameter distribution and improve the restoration of high-frequency details. Experimental results demonstrate that FANeRV significantly improves reconstruction performance and excels in multiple tasks, including video compression, inpainting, and interpolation, outperforming existing NeRV methods.
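
The frequency separation that the block performs can be illustrated with a single level of the discrete wavelet transform on a 1D signal. The sketch below uses the Haar wavelet, which is an assumption (the abstract does not name the wavelet), and it works on a list of numbers rather than on video frames.

```python
import math


def haar_dwt(signal):
    """Split an even-length signal into low- and high-frequency halves."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    evens, odds = signal[0::2], signal[1::2]
    low = [(a + b) / s for a, b in zip(evens, odds)]   # local averages
    high = [(a - b) / s for a, b in zip(evens, odds)]  # local differences
    return low, high


def haar_idwt(low, high):
    """Invert haar_dwt, recovering the original signal."""
    s = math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.extend([(l + h) / s, (l - h) / s])
    return out
```

Fine detail ends up in the `high` band, which is the part a network can enhance separately before the bands are fused and the transform is inverted.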

[CV-44] Compass Control: Multi Object Orientation Control for Text-to-Image Generation

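[Quick Read]: This paper addresses multi-object orientation control in text-to-image diffusion models, with the goal of precisely controlling the orientation of each object and thereby generating diverse multi-object scenes with controllable orientations. The key is a set of orientation-aware compass tokens, one per object, that condition the diffusion model together with the text tokens. A lightweight encoder network predicts these compass tokens from the object orientations. Training this framework directly, however, gives poor orientation control and entanglement among objects. To mitigate this, the authors intervene in the generation process and constrain the cross-attention map of each compass token to its corresponding object region. The trained model achieves precise orientation control for complex objects not seen during training and for multi-object scenes with more than two objects, indicating strong generalization. Combined with personalization methods, it also precisely controls the orientation of a new object in diverse contexts. Extensive evaluations and a user study show state-of-the-art orientation control and text alignment.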

Link: https://arxiv.org/abs/2504.06752
Authors: Rishubh Parihar, Vaibhav Agrawal, Sachidanand VS, R. Venkatesh Babu
Affiliations: IISc Bangalore (印度科学学院班加罗尔分校); IIIT Hyderabad (海得拉巴国际信息技术学院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract: Existing approaches for controlling text-to-image diffusion models, while powerful, do not allow for explicit 3D object-centric control, such as precise control of object orientation. In this work, we address the problem of multi-object orientation control in text-to-image diffusion models. This enables the generation of diverse multi-object scenes with precise orientation control for each object. The key idea is to condition the diffusion model with a set of orientation-aware compass tokens, one for each object, along with text tokens. A light-weight encoder network predicts these compass tokens taking object orientation as the input. The model is trained on a synthetic dataset of procedurally generated scenes, each containing one or two 3D assets on a plain background. However, directly training this framework results in poor orientation control and entanglement among objects. To mitigate this, we intervene in the generation process and constrain the cross-attention maps of each compass token to its corresponding object regions. The trained model is able to achieve precise orientation control for a) complex objects not seen during training and b) multi-object scenes with more than two objects, indicating strong generalization capabilities. Further, when combined with personalization methods, our method precisely controls the orientation of the new object in diverse contexts. Our method achieves state-of-the-art orientation control and text alignment, quantified with extensive evaluations and a user study.

[CV-45] Visualisation of a multidimensional point cloud as a 3D swarm of avatars

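[Quick Read]: This paper addresses the visualization of multidimensional data, in particular how to analyze and understand high-dimensional data more intuitively when the data structure is complex. Its key innovation is to combine classical projection techniques with the mapping of particular data dimensions to facial features, exploiting the human brain's natural ability to interpret facial expressions. Drawing on the concept of Chernoff faces, the authors propose an interactive visualization that presents multidimensional data as a swarm of "totems" whose position in hyperspace and facial features represent different aspects of the data. The core of the method is the integration of data dimensions with visual elements, which improves insight into complex data patterns.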

Link: https://arxiv.org/abs/2504.06751
Authors: Leszek Luchowski, Dariusz Pojda
Affiliations: Institute of Theoretical and Applied Informatics, Polish Academy of Sciences (波兰科学院理论与应用研究所)
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

Abstract: The article presents an innovative approach to the visualisation of multidimensional data, using icons inspired by Chernoff faces. The approach merges classical projection techniques with the assignment of particular data dimensions to facial-expression features, capitalizing on the natural ability of the human brain to interpret facial expressions. The technique is implemented as a plugin to the dpVision open-source image handling platform. The plugin allows the data to be interactively explored in the form of a swarm of “totems” whose position in hyperspace as well as facial features represent various aspects of the data. Sample visualisations, based on synthetic test data as well as the vinhoverde 15-dimensional database on Portuguese wines, confirm the usefulness of our approach to the analysis of complex data structures.

[CV-46] nnLandmark: A Self-Configuring Method for 3D Medical Landmark Detection

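[Quick Read]: This paper addresses the challenges of 3D landmark detection in medical imaging: manual annotation is labor-intensive and requires expert knowledge, and existing deep learning methods are held back by limited public datasets, inconsistent benchmarks, and non-standardized baselines, which restrict reproducibility, fair comparison, and model performance. It proposes nnLandmark, a self-configuring deep learning framework whose key idea is to adapt nnU-Net to heatmap-based regression. By relying on nnU-Net's automated configuration, nnLandmark removes the need for manual parameter tuning and works out of the box. It achieves state-of-the-art accuracy on two public datasets, with a mean radial error (MRE) of 1.5 mm on the Mandibular Molar Landmark (MML) dental CT dataset and 1.2 mm on the AFIDs brain MRI dataset, where it is in line with the inter-rater variability of 1.5 mm, providing a reliable baseline for 3D landmark detection.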

Link: https://arxiv.org/abs/2504.06742
Authors: Alexandra Ertl, Shuhan Xiao, Stefan Denner, Robin Peretzke, David Zimmerer, Peter Neher, Fabian Isensee, Klaus Maier-Hein
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Landmark detection plays a crucial role in medical imaging tasks that rely on precise spatial localization, including specific applications in diagnosis, treatment planning, image registration, and surgical navigation. However, manual annotation is labor-intensive and requires expert knowledge. While deep learning shows promise in automating this task, progress is hindered by limited public datasets, inconsistent benchmarks, and non-standardized baselines, restricting reproducibility, fair comparisons, and model [...]. This work introduces nnLandmark, a self-configuring deep learning framework for 3D medical landmark detection, adapting nnU-Net to perform heatmap-based regression. By leveraging nnU-Net’s automated configuration, nnLandmark eliminates the need for manual parameter tuning, offering out-of-the-box usability. It achieves state-of-the-art accuracy across two public datasets, with a mean radial error (MRE) of 1.5 mm on the Mandibular Molar Landmark (MML) dental CT dataset and 1.2 mm for anatomical fiducials on a brain MRI dataset (AFIDs), where nnLandmark aligns with the inter-rater variability of 1.5 mm. With its strong generalization, reproducibility, and ease of deployment, nnLandmark establishes a reliable baseline for 3D landmark detection, supporting research in anatomical localization and clinical workflows that depend on precise landmark identification. The code will be available soon.
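
The mean radial error (MRE) reported above is the average Euclidean distance between predicted and reference landmarks. A minimal sketch of the metric follows. The function name and the `spacing` argument, which converts voxel offsets to millimetres, are assumptions and not part of the nnLandmark code.

```python
import math


def mean_radial_error(predicted, reference, spacing=(1.0, 1.0, 1.0)):
    """Average Euclidean distance between paired 3D landmarks.

    `predicted` and `reference` are equal-length lists of (x, y, z) voxel
    coordinates; `spacing` is the voxel size along each axis in millimetres.
    """
    errors = []
    for p, r in zip(predicted, reference):
        squared = sum(((pi - ri) * s) ** 2 for pi, ri, s in zip(p, r, spacing))
        errors.append(math.sqrt(squared))
    return sum(errors) / len(errors)
```

An MRE of 1.5 mm therefore means that predictions land, on average, 1.5 mm from the reference annotation, which the abstract compares with the 1.5 mm inter-rater variability on AFIDs.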

[CV-47] Large Scale Supervised Pretraining For Traumatic Brain Injury Segmentation

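[Quick Read]: This paper addresses lesion segmentation in moderate to severe traumatic brain injury (msTBI), a challenge in neuroimaging because the lesions vary in size, shape, and distribution. This heterogeneity complicates traditional image processing and causes critical errors in tasks such as image registration and brain parcellation. The paper proposes large-scale multi-dataset supervised pretraining to improve segmentation of T1-weighted MRI data. The key solution, inspired by the MultiTalent method, first pretrains a ResEnc L network on a large collection of datasets covering various anatomical and pathological structures, giving the model a robust understanding of brain anatomy and pathology. The model is then fine-tuned on msTBI-specific data to adapt it to the characteristics of T1-weighted MRI scans, outperforming the baseline without pretraining by up to 2 Dice points.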

Link: https://arxiv.org/abs/2504.06741
Authors: Constantin Ulrich, Tassilo Wald, Fabian Isensee, Klaus H. Maier-Hein
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: The segmentation of lesions in Moderate to Severe Traumatic Brain Injury (msTBI) presents a significant challenge in neuroimaging due to the diverse characteristics of these lesions, which vary in size, shape, and distribution across brain regions and tissue types. This heterogeneity complicates traditional image processing techniques, resulting in critical errors in tasks such as image registration and brain parcellation. To address these challenges, the AIMS-TBI Segmentation Challenge 2024 aims to advance innovative segmentation algorithms specifically designed for T1-weighted MRI data, the most widely utilized imaging modality in clinical practice. Our proposed solution leverages a large-scale multi-dataset supervised pretraining approach inspired by the MultiTalent method. We train a ResEnc L network on a comprehensive collection of datasets covering various anatomical and pathological structures, which equips the model with a robust understanding of brain anatomy and pathology. Following this, the model is fine-tuned on msTBI-specific data to optimize its performance for the unique characteristics of T1-weighted MRI scans, and it outperforms the baseline without pretraining by up to 2 Dice points.
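
For readers unfamiliar with the metric behind "Dice points", here is a minimal sketch of the Dice coefficient for two binary masks, each represented as a collection of foreground voxel indices. The representation and the function name are assumptions. If a Dice point denotes one percentage point, as is common, a gain of 2 points corresponds to 0.02 on the 0 to 1 scale.

```python
def dice_score(predicted, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two sets of voxel indices."""
    pred_set, target_set = set(predicted), set(target)
    denominator = len(pred_set) + len(target_set)
    if denominator == 0:
        # Both masks are empty: treat as perfect agreement (a convention).
        return 1.0
    return 2.0 * len(pred_set & target_set) / denominator
```

Because small lesions contain few voxels, a handful of misclassified voxels can move the score noticeably, which is part of why heterogeneous msTBI lesions are hard to segment.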

[CV-48] MultiADS: Defect-aware Supervision for Multi-type Anomaly Detection and Segmentation in Zero-Shot Learning

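[Quick Read]: This paper addresses the problem of precisely identifying product defect types in optical inspection for industrial applications. Current methods only decide whether a product is anomalous; they give no insight into the defect type and cannot detect and identify multiple defects. The paper proposes MultiADS, a zero-shot learning approach for multi-type anomaly detection and segmentation. Its key idea is to use CLIP with extra linear layers to align visual and textual representations in a joint feature space, which lets it generate a specific anomaly mask for each defect type, distinguish defect types, and simultaneously identify multiple defect types present in an anomalous product. Experiments show that MultiADS outperforms existing zero-shot and few-shot methods on image-level and pixel-level anomaly detection and segmentation.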

Link: https://arxiv.org/abs/2504.06740
Authors: Ylli Sadikaj, Hongkuan Zhou, Lavdim Halilaj, Stefan Schmid, Steffen Staab, Claudia Plant
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Precise optical inspection in industrial applications is crucial for minimizing scrap rates and reducing the associated costs. Besides merely detecting if a product is anomalous or not, it is crucial to know the distinct type of defect, such as a bend, cut, or scratch. The ability to recognize the “exact” defect type enables automated treatments of the anomalies in modern production lines. Current methods are limited to solely detecting whether a product is defective or not without providing any insights on the defect type, let alone detecting and identifying multiple defects. We propose MultiADS, a zero-shot learning approach, able to perform Multi-type Anomaly Detection and Segmentation. The architecture of MultiADS comprises CLIP and extra linear layers to align the visual and textual representations in a joint feature space. To the best of our knowledge, our proposal is the first approach to perform a multi-type anomaly segmentation task in zero-shot learning. Contrary to the other baselines, our approach i) generates specific anomaly masks for each distinct defect type, ii) learns to distinguish defect types, and iii) simultaneously identifies multiple defect types present in an anomalous product. Additionally, our approach outperforms zero/few-shot learning SoTA methods on image-level and pixel-level anomaly detection and segmentation tasks on five commonly used datasets: MVTec-AD, VisA, MPDD, MAD and Real-IAD.

[CV-49] EDIT: Enhancing Vision Transformers by Mitigating Attention Sink through an Encoder-Decoder Architecture

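[Quick Read]: This paper targets the attention sink phenomenon observed in Vision Transformer (ViT) models, in which an excessive share of attention is allocated to the [CLS] token and distorts the model's ability to process image patches. To address it, the paper proposes EDIT (Encoder-Decoder Image Transformer), whose key idea is a layer-aligned encoder-decoder structure: the encoder processes image patches with self-attention, while the decoder uses cross-attention to focus on the [CLS] token. Unlike traditional encoder-decoder frameworks, EDIT lets the decoder extract information starting from low-level features and refine the representation layer by layer. On ImageNet-1k and ImageNet-21k, as well as on transfer learning tasks, EDIT shows consistent improvements over DeiT3 models, supporting its effectiveness in mitigating attention sink and improving visual feature extraction.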

Link: https://arxiv.org/abs/2504.06738
Authors: Wenfeng Feng,Guoying Sun
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we propose EDIT (Encoder-Decoder Image Transformer), a novel architecture designed to mitigate the attention sink phenomenon observed in Vision Transformer models. Attention sink occurs when an excessive amount of attention is allocated to the [CLS] token, distorting the model’s ability to effectively process image patches. To address this, we introduce a layer-aligned encoder-decoder architecture, where the encoder utilizes self-attention to process image patches, while the decoder uses cross-attention to focus on the [CLS] token. Unlike traditional encoder-decoder framework, where the decoder depends solely on high-level encoder representations, EDIT allows the decoder to extract information starting from low-level features, progressively refining the representation layer by layer. EDIT is naturally interpretable demonstrated through sequential attention maps, illustrating the refined, layer-by-layer focus on key image features. Experiments on ImageNet-1k and ImageNet-21k, along with transfer learning tasks, show that EDIT achieves consistent performance improvements over DeiT3 models. These results highlight the effectiveness of EDIT’s design in addressing attention sink and improving visual feature extraction.

[CV-50] Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding CVPR2025

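[Quick Read]: The problem this paper tackles is that, in 3D scene understanding, self-supervised methods are typically used only as a weight-initialization step for task-specific fine-tuning, which limits their usefulness for general-purpose feature extraction. The key contributions are a robust evaluation protocol designed specifically to assess the quality of self-supervised features for 3D scene understanding, and a new self-supervised model. The model is trained natively in 3D with a novel Masked Scene Modeling objective that reconstructs deep features of masked patches in a bottom-up manner and is tailored to hierarchical 3D models. The approach performs competitively with supervised models and improves substantially over existing self-supervised methods.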

Link: https://arxiv.org/abs/2504.06719
Authors: Pedro Hermosilla,Christian Stippel,Leon Sick
Affiliations: TU Wien; Ulm University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at CVPR 2025

Abstract:Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically only used as a weight initialization step for task-specific fine-tuning, limiting their utility for general-purpose feature extraction. This paper addresses this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and, hence, are suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments not only demonstrate that our method achieves competitive performance to supervised models, but also surpasses existing self-supervised approaches by a large margin. The model and training code can be found at our Github repository (this https URL).

[CV-51] GSta: Efficient Training Scheme with Siestaed Gaussians for Monocular 3D Scene Reconstruction

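[Quick Read]: This paper addresses the large storage and memory requirements and the relatively slow training of Gaussian Splatting (GS) for 3D reconstruction, which make it hard to deploy in robotics scenarios. The key solution is GSta, a method that dynamically identifies Gaussians that have converged well during training, based on their positional and color gradient norms, and freezes them so that they stop being updated (the "siesta" mechanism). This speeds up training while keeping accuracy competitive with the state of the art. GSta also incorporates other improvements such as a learning rate scheduler, and adds an early stopping mechanism based on PSNR values computed on a subset of the training images, giving a better trade-off among convergence speed, memory, and storage while preserving reconstruction quality. Experiments show that GSta not only improves training efficiency on its own but also combines with other methods: together with Trick-GS it achieves up to 5x faster training and 16x smaller disk size than vanilla GS, with comparable accuracy and only half the peak memory.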

Link: https://arxiv.org/abs/2504.06716
Authors: Anil Armagan,Albert Saà-Garriga,Bruno Manganelli,Kyuwon Kim,M. Kerim Yucel
Affiliations: Samsung Electronics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages. In submission to an IEEE conference

Abstract:Gaussian Splatting (GS) is a popular approach for 3D reconstruction, mostly due to its ability to converge reasonably fast, faithfully represent the scene and render (novel) views in a fast fashion. However, it suffers from large storage and memory requirements, and its training speed still lags behind the hash-grid based radiance field approaches (e.g. Instant-NGP), which makes it especially difficult to deploy them in robotics scenarios, where 3D reconstruction is crucial for accurate operation. In this paper, we propose GSta that dynamically identifies Gaussians that have converged well during training, based on their positional and color gradient norms. By forcing such Gaussians into a siesta and stopping their updates (freezing) during training, we improve training speed with competitive accuracy compared to state of the art. We also propose an early stopping mechanism based on the PSNR values computed on a subset of training images. Combined with other improvements, such as integrating a learning rate scheduler, GSta achieves an improved Pareto front in convergence speed, memory and storage requirements, while preserving quality. We also show that GSta can improve other methods and complement orthogonal approaches in efficiency improvement; once combined with Trick-GS, GSta achieves up to 5x faster training, 16x smaller disk size compared to vanilla GS, while having comparable accuracy and consuming only half the peak memory. More visualisations are available at this https URL.
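
The freezing and early-stopping logic described above can be sketched roughly as below. This is an illustrative approximation, not the GSta code: the tolerances, the plain gradient step, the patience rule, and the function names are all assumptions, and whether a frozen Gaussian can later wake up is not stated in the abstract (here it stays frozen).

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def update_siesta_mask(pos_grads, color_grads, frozen, pos_tol=1e-3, color_tol=1e-3):
    """Freeze Gaussians whose positional AND color gradient norms are both below tolerance.
    The tolerance values are hypothetical."""
    for i, (gp, gc) in enumerate(zip(pos_grads, color_grads)):
        if not frozen[i] and l2_norm(gp) < pos_tol and l2_norm(gc) < color_tol:
            frozen[i] = True
    return frozen

def sgd_step(params, grads, frozen, lr=0.1):
    """Plain gradient step that leaves frozen Gaussians untouched."""
    return [
        p if is_frozen else [x - lr * g for x, g in zip(p, grad)]
        for p, grad, is_frozen in zip(params, grads, frozen)
    ]

def should_stop_early(psnr_history, patience=3, min_gain=0.05):
    """Stop once PSNR on the image subset has not improved by min_gain over the last `patience` checks."""
    if len(psnr_history) <= patience:
        return False
    best_before = max(psnr_history[:-patience])
    return max(psnr_history[-patience:]) < best_before + min_gain

# Two toy Gaussians: the first has essentially converged, the second has not.
positions = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
pos_grads = [[1e-5, 0.0, 0.0], [0.5, 0.0, 0.0]]
color_grads = [[1e-5, 0.0, 0.0], [0.2, 0.0, 0.0]]

frozen = update_siesta_mask(pos_grads, color_grads, [False, False])
new_positions = sgd_step(positions, pos_grads, frozen)
stop_now = should_stop_early([20.0, 25.0, 27.0, 27.01, 27.02, 27.03])
print(frozen, new_positions, stop_now)
```

Skipping updates for frozen Gaussians is where the time saving would come from in this sketch, since their gradients no longer need to be applied.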

[CV-52] Deep Learning for Cardiovascular Risk Assessment: Proxy Features from Carotid Sonography as Predictors of Arterial Damage

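[Quick Read]: This study uses hypertension as an indicator of individual vascular damage and applies machine learning to identify that damage, providing an early risk marker for major cardiovascular events and insight into the overall condition of an individual's arteries. The key step is fine-tuning VideoMAE, a deep learning model originally developed for video classification, for carotid ultrasound analysis. Using a large prospective cohort dataset from the Gutenberg Health Study (more than 31,000 carotid sonography videos), the model classifies individuals as hypertensive or non-hypertensive with 75.7% validation accuracy, which serves as a proxy for visually detecting arterial damage. The core contribution is showing that the model captures visual features that are informative about an individual's cardiovascular health.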

Link: https://arxiv.org/abs/2504.06680
Authors: Christoph Balada,Aida Romano-Martinez,Vincent ten Cate,Katharina Geschke,Jonas Tesarz,Paul Claßen,Alexander K. Schuster,Dativa Tibyampansha,Karl-Patrik Kresoja,Philipp S. Wild,Sheraz Ahmed,Andreas Dengel
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this study, hypertension is utilized as an indicator of individual vascular damage. This damage can be identified through machine learning techniques, providing an early risk marker for potential major cardiovascular events and offering valuable insights into the overall arterial condition of individual patients. To this end, the VideoMAE deep learning model, originally developed for video classification, was adapted by finetuning for application in the domain of ultrasound imaging. The model was trained and tested using a dataset comprising over 31,000 carotid sonography videos sourced from the Gutenberg Health Study (15,010 participants), one of the largest prospective population health studies. This adaptation facilitates the classification of individuals as hypertensive or non-hypertensive (75.7% validation accuracy), functioning as a proxy for detecting visual arterial damage. We demonstrate that our machine learning model effectively captures visual features that provide valuable insights into an individual’s overall cardiovascular health.

[CV-53] Setup-Invariant Augmented Reality for Teaching by Demonstration with Surgical Robots

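[Quick Read]: This paper addresses the fact that existing augmented reality (AR) systems for robotic surgery education cannot accommodate differences between mentor and mentee robot configurations, and aims to let novices train outside the operating room with expert-informed guidance but without real-time expert supervision. The key solution is dV-STEAR, an open-source system that plays back task-aligned expert demonstrations without assuming identical setup joint positions between expert and novice. Pose estimation was rigorously quantified, with a registration error of 3.86 mm (SD = 2.01 mm), so that guidance remains accurate without requiring identical hardware setups. A user study shows that dV-STEAR improves novice performance: it increased completion speed and reduced collision time in a single-handed ring-over-wire task, improved success rates in a pick-and-place task, and lowered reported frustration, which points to its potential in robot-assisted surgery education.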

Link: https://arxiv.org/abs/2504.06677
Authors: Alexandre Banks,Richard Cook,Septimiu E. Salcudean
Affiliations: University of British Columbia (UBC), Vancouver, BC V6T 1Z4, Canada
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
Comments: 12 pages, 10 figures; Open-source code, see this https URL Supplementary movies, see this https URL

Abstract:Augmented reality (AR) is an effective tool in robotic surgery education as it combines exploratory learning with three-dimensional guidance. However, existing AR systems require expert supervision and do not account for differences in the mentor and mentee robot configurations. To enable novices to train outside the operating room while receiving expert-informed guidance, we present dV-STEAR: an open-source system that plays back task-aligned expert demonstrations without assuming identical setup joint positions between expert and novice. Pose estimation was rigorously quantified, showing a registration error of 3.86 (SD=2.01)mm. In a user study (N=24), dV-STEAR significantly improved novice performance on tasks from the Fundamentals of Laparoscopic Surgery. In a single-handed ring-over-wire task, dV-STEAR increased completion speed (p=0.03) and reduced collision time (p=0.01) compared to dry-lab training alone. During a pick-and-place task, it improved success rates (p=0.004). Across both tasks, participants using dV-STEAR exhibited significantly more balanced hand use and reported lower frustration levels. This work presents a novel educational tool implemented on the da Vinci Research Kit, demonstrates its effectiveness in teaching novices, and builds the foundation for further AR integration into robot-assisted surgery.

[CV-54] Probability Density Geodesics in Image Diffusion Latent Space CVPR2025

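[Quick Read]: This paper addresses the problem of computing geodesics in the latent space of diffusion models. The key is a formulation in which the norm induced by a spatially varying inner product is inversely proportional to the data probability density, so a path through a high-probability region is shorter than the equivalent path through a low-probability region. The authors present algorithms for solving the associated initial and boundary value problems, and show how to compute the probability density along a path and the geodesic distance between two points. With these tools, the paper analyzes how closely video clips approximate geodesics in a pre-trained image diffusion space, and demonstrates training-free image sequence interpolation and extrapolation.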

Link: https://arxiv.org/abs/2504.06675
Authors: Qingtao Yu,Jaskirat Singh,Zhaoyuan Yang,Peter Henry Tu,Jing Zhang,Hongdong Li,Richard Hartley,Dylan Campbell
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025

Abstract:Diffusion models indirectly estimate the probability density over a data space, which can be used to study its structure. In this work, we show that geodesics can be computed in diffusion latent space, where the norm induced by the spatially-varying inner product is inversely proportional to the probability density. In this formulation, a path that traverses a high density (that is, probable) region of image latent space is shorter than the equivalent path through a low density region. We present algorithms for solving the associated initial and boundary value problems and show how to compute the probability density along the path and the geodesic distance between two points. Using these techniques, we analyze how closely video clips approximate geodesics in a pre-trained image diffusion space. Finally, we demonstrate how these techniques can be applied to training-free image sequence interpolation and extrapolation, given a pre-trained image diffusion model.
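
To make the "high-density paths are shorter" statement concrete, here is a small numerical sketch. It only evaluates the length of given piecewise-linear paths under a local cost of 1/p(x); it does not solve the initial or boundary value problems from the paper, and the 2D toy density is a hypothetical stand-in for a diffusion model's density estimate.

```python
import math

def density(point):
    """Toy unnormalized density with a high-probability ridge along y = 1 (hypothetical)."""
    _, y = point
    return 0.05 + math.exp(-8.0 * (y - 1.0) ** 2)

def path_length(waypoints, density_fn, steps_per_segment=200):
    """Midpoint-rule approximation of the integral of ds / p(x) along a piecewise-linear path."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        ds = math.hypot(x1 - x0, y1 - y0) / steps_per_segment
        for k in range(steps_per_segment):
            t = (k + 0.5) / steps_per_segment
            mid = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
            total += ds / density_fn(mid)
    return total

# Straight line through the low-density region versus a detour along the ridge.
straight = path_length([(0.0, 0.0), (4.0, 0.0)], density)
detour = path_length([(0.0, 0.0), (0.0, 1.0), (4.0, 1.0), (4.0, 0.0)], density)
print(f"straight: {straight:.2f}, detour via ridge: {detour:.2f}")
```

In this toy setting the detour is longer in Euclidean terms (6 units against 4) yet shorter under the density-weighted length, which is the behavior the formulation relies on.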

[CV-55] RAG ME: Retrieval Augmented Video Generation for Enhanced Motion Realism

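[Quick Read]: This paper tackles the shortcomings of generated videos in motion complexity and physical plausibility. Current generated videos have improved in visual quality but remain limited in motion detail and physical realism, with motion that is often too static or unrealistic. The proposed framework's key idea is to introduce a retrieval mechanism during the generation phase: retrieved videos relevant to the target generation act as grounding signals that show the model how objects move, improving the realism of motion in the output. The method can be applied to any text-to-video diffusion model and conditions a pretrained model on the retrieved samples with minimal fine-tuning.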

Link: https://arxiv.org/abs/2504.06672
Authors: Elia Peruzzo,Dejia Xu,Xingqian Xu,Humphrey Shi,Nicu Sebe
Affiliations: University of Trento; UT Austin; SHI Labs @ Georgia Tech & UIUC; Picsart AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code available at: this https URL

Abstract:Video generation is experiencing rapid growth, driven by advances in diffusion models and the development of better and larger datasets. However, producing high-quality videos remains challenging due to the high-dimensional data and the complexity of the task. Recent efforts have primarily focused on enhancing visual quality and addressing temporal inconsistencies, such as flickering. Despite progress in these areas, the generated videos often fall short in terms of motion complexity and physical plausibility, with many outputs either appearing static or exhibiting unrealistic motion. In this work, we propose a framework to improve the realism of motion in generated videos, exploring a complementary direction to much of the existing literature. Specifically, we advocate for the incorporation of a retrieval mechanism during the generation phase. The retrieved videos act as grounding signals, providing the model with demonstrations of how the objects move. Our pipeline is designed to apply to any text-to-video diffusion model, conditioning a pretrained model on the retrieved samples with minimal fine-tuning. We demonstrate the superiority of our approach through established metrics, recently proposed benchmarks, and qualitative results, and we highlight additional applications of the framework.
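
The retrieval step can be pictured with a generic nearest-neighbor lookup. The abstract does not say how retrieval is implemented, so everything below is an assumption for illustration: the embeddings are toy vectors, cosine similarity is one plausible scoring choice, and `retrieve_top_k` is a hypothetical helper. Conditioning the diffusion model on the retrieved clips is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve_top_k(query_embedding, video_bank, k=2):
    """Return the names of the k videos whose embeddings are most similar to the query."""
    ranked = sorted(
        video_bank.items(),
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Toy embedding bank; in practice these would come from a text-video encoder.
video_bank = {
    "dog_running": [0.9, 0.1, 0.0],
    "ball_bouncing": [0.1, 0.9, 0.1],
    "static_landscape": [0.0, 0.1, 0.9],
}
prompt_embedding = [0.8, 0.3, 0.0]  # stand-in for an embedded text prompt

grounding_clips = retrieve_top_k(prompt_embedding, video_bank, k=2)
print(grounding_clips)
```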

[CV-56] Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception

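[Quick Read]: This paper addresses the lack of fine-grained detail, hallucinations, and inconsistencies in captions produced by current multimodal large language models (MLLMs). The key is a divide-then-aggregate strategy: the image is divided into semantic and spatial patches to extract fine-grained details, strengthening the model's local perception, and these local details are then hierarchically aggregated into a global description. To reduce hallucinations and inconsistencies, a semantic-level filtering step is applied during hierarchical aggregation. The training-free pipeline works with both open-source and closed-source models, and experiments show that it produces more detailed and reliable captions without any model retraining.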

Link: https://arxiv.org/abs/2504.06666
Authors: Ruotian Peng,Haiying He,Yake Wei,Yandong Wen,Di Hu
Affiliations: School of Future Technology, South China University of Technology; College of Science, China Agricultural University; Gaoling School of Artificial Intelligence, Renmin University of China; School of Engineering, Westlake University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations, a challenge that persists in both open-source and closed-source models. Inspired by Feature-Integration theory, which suggests that attention must focus on specific regions to integrate visual information effectively, we propose a divide-then-aggregate strategy. Our method first divides the image into semantic and spatial patches to extract fine-grained details, enhancing the model’s local perception of the image. These local details are then hierarchically aggregated to generate a comprehensive global description. To address hallucinations and inconsistencies in the generated captions, we apply a semantic-level filtering process during hierarchical aggregation. This training-free pipeline can be applied to both open-source models (LLaVA-1.5, LLaVA-1.6, Mini-Gemini) and closed-source models (Claude-3.5-Sonnet, GPT-4o, GLM-4V-Plus). Extensive experiments demonstrate that our method generates more detailed, reliable captions, advancing multimodal description generation without requiring model retraining. The source code is available at this https URL

[CV-57] Uni-PrevPredMap: Extending PrevPredMap to a Unified Framework of Prior-Informed Modeling for Online Vectorized HD Map Construction

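[Quick Read]: This paper addresses online vectorized HD map construction in map-free scenarios and explores how to effectively use previous predictions and simulated outdated HD maps as complementary priors. The key innovation is Uni-PrevPredMap, a unified prior-informed framework built on two components: 1) a tile-indexed 3D vectorized global map processor for efficient refreshment, storage, and retrieval of 3D vectorized priors; and 2) a tri-mode operational optimization paradigm that ensures consistency across prior-free, map-absent, and map-prior scenarios while reducing reliance on idealized map fidelity assumptions. The framework achieves state-of-the-art performance on online vectorized HD map construction benchmarks in map-free scenarios and confirms the complementarity between previous predictions and outdated maps.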

Link: https://arxiv.org/abs/2504.06647
Authors: Nan Peng,Xun Zhou,Mingming Wang,Guisong Chen,Songming Chen
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Safety constitutes a foundational imperative for autonomous driving systems, necessitating the maximal incorporation of accessible external prior information. This study establishes that temporal perception buffers and cost-efficient maps inherently form complementary prior sources for online vectorized high-definition (HD) map construction. We present Uni-PrevPredMap, a unified prior-informed framework that systematically integrates two synergistic information sources: previous predictions and simulated outdated HD maps. The framework introduces two core innovations: a tile-indexed 3D vectorized global map processor enabling efficient refreshment, storage, and retrieval of 3D vectorized priors; a tri-mode operational optimization paradigm ensuring consistency across prior-free, map-absent, and map-prior scenarios while mitigating reliance on idealized map fidelity assumptions. Uni-PrevPredMap achieves state-of-the-art performance in map-free scenarios across established online vectorized HD map construction benchmarks. When provided with simulated outdated HD maps, the framework exhibits robust capabilities in error-resilient prior fusion, empirically confirming the synergistic complementarity between previous predictions and simulated outdated HD maps. Code will be available at this https URL.

[CV-58] HGMamba: Enhancing 3D Human Pose Estimation with a HyperGCN-Mamba Network IJCNN2025

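[Quick Read]: This paper addresses accurate reconstruction of 3D human pose from ground-truth 2D pose data. Existing methods mostly focus on improving performance with estimated 2D poses and often do poorly when given ground-truth 2D poses. The paper observes that accurate 3D pose reconstruction requires precise modeling of local pose structure together with the ability to extract robust global spatio-temporal features. It therefore proposes a novel Hyper-GCN and Shuffle Mamba (HGMamba) block that processes the input through two parallel streams: the Hyper-GCN stream models the human body structure as hypergraphs at varying levels of granularity to capture local joint dependencies, while the Shuffle Mamba stream uses a state space model to scan spatio-temporally across all joints and establish global dependencies. By adaptively fusing the two representations, HGMamba retains strong global feature modeling while excelling at local structure modeling. The key is the dual-stream design that combines local and global features, plus the flexibility to trade speed for accuracy by stacking multiple HGMamba blocks.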

Link: https://arxiv.org/abs/2504.06638
Authors: Hu Cui,Tessai Hayama
Affiliations: Information and Management Systems Engineering, Nagaoka University of Technology, Nagaoka-shi, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by IJCNN2025

Abstract:3D human pose lifting is a promising research area that leverages estimated and ground-truth 2D human pose data for training. While existing approaches primarily aim to enhance the performance of estimated 2D poses, they often struggle when applied to ground-truth 2D pose data. We observe that achieving accurate 3D pose reconstruction from ground-truth 2D poses requires precise modeling of local pose structures, alongside the ability to extract robust global spatio-temporal features. To address these challenges, we propose a novel Hyper-GCN and Shuffle Mamba (HGMamba) block, which processes input data through two parallel streams: Hyper-GCN and Shuffle-Mamba. The Hyper-GCN stream models the human body structure as hypergraphs with varying levels of granularity to effectively capture local joint dependencies. Meanwhile, the Shuffle Mamba stream leverages a state space model to perform spatio-temporal scanning across all joints, enabling the establishment of global dependencies. By adaptively fusing these two representations, HGMamba achieves strong global feature modeling while excelling at local structure modeling. We stack multiple HGMamba blocks to create three variants of our model, allowing users to select the most suitable configuration based on the desired speed-accuracy trade-off. Extensive evaluations on the Human3.6M and MPI-INF-3DHP benchmark datasets demonstrate the effectiveness of our approach. HGMamba-B achieves state-of-the-art results, with P1 errors of 38.65 mm and 14.33 mm on the respective datasets. Code and models are available: this https URL

[CV-59] Crafting Query-Aware Selective Attention for Single Image Super-Resolution

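[Quick Read]: This paper addresses two issues with Vision Transformer (ViT)-based models for single image super-resolution (SISR): high computational cost, and selective attention mechanisms that do not explicitly focus on query-relevant regions. The key is SSCAN, which dynamically selects the most relevant key-value windows based on query similarity, keeping feature extraction focused while remaining efficient. Unlike earlier methods that apply attention globally or heuristically, SSCAN introduces a query-aware window selection strategy that better concentrates attention computation on important image regions. By using fixed-size windows, SSCAN also reduces memory usage and achieves linear token-to-token complexity, making it more scalable to large images. Experiments report up to 0.14 dB PSNR improvement on urban datasets.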

Link: https://arxiv.org/abs/2504.06634
Authors: Junyoung Kim,Youngrok Kim,Siyeol Jung,Donghyun Min
Affiliations: POSTECH; Kyunghee University; UNIST; Sogang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 4 tables

Abstract:Single Image Super-Resolution (SISR) reconstructs high-resolution images from low-resolution inputs, enhancing image details. While Vision Transformer (ViT)-based models improve SISR by capturing long-range dependencies, they suffer from quadratic computational costs or employ selective attention mechanisms that do not explicitly focus on query-relevant regions. Despite these advancements, prior work has overlooked how selective attention mechanisms should be effectively designed for SISR. We propose SSCAN, which dynamically selects the most relevant key-value windows based on query similarity, ensuring focused feature extraction while maintaining efficiency. In contrast to prior approaches that apply attention globally or heuristically, our method introduces a query-aware window selection strategy that better aligns attention computation with important image regions. By incorporating fixed-sized windows, SSCAN reduces memory usage and enforces linear token-to-token complexity, making it scalable for large images. Our experiments demonstrate that SSCAN outperforms existing attention-based SISR methods, achieving up to 0.14 dB PSNR improvement on urban datasets, guaranteeing both computational efficiency and reconstruction quality in SISR.
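
A rough sketch of query-aware window selection follows. It is not the SSCAN implementation: scoring windows by the dot product of mean-pooled features, the `top_k` value, and the function names are assumptions chosen to keep the example small. What it illustrates is that attention is then computed only over tokens in the selected fixed-size windows.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_pool(tokens):
    """Average a window's tokens into one descriptor vector."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]

def select_windows(query_window, key_windows, top_k=2):
    """Pick the top_k key windows most similar to the query window (pooled dot product)."""
    q = mean_pool(query_window)
    scores = [dot(q, mean_pool(win)) for win in key_windows]
    ranked = sorted(range(len(key_windows)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def windowed_attention(query_token, key_windows, value_windows, selected):
    """Scaled dot-product attention restricted to tokens in the selected windows."""
    keys = [k for i in selected for k in key_windows[i]]
    values = [v for i in selected for v in value_windows[i]]
    scale = math.sqrt(len(query_token))
    logits = [dot(query_token, k) / scale for k in keys]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(len(values[0]))]

# Four key windows of two 2-D tokens each; windows 0 and 2 resemble the query.
key_windows = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.9, 0.1], [1.0, 0.2]],
    [[-1.0, 0.0], [-1.0, 0.0]],
]
query_window = [[1.0, 0.0], [0.8, 0.1]]

selected = select_windows(query_window, key_windows, top_k=2)
attended = windowed_attention(query_window[0], key_windows, key_windows, selected)
print(selected, attended)
```

With a fixed window size and a fixed `top_k`, each query attends to a constant number of tokens, which is the intuition behind the linear token-to-token complexity mentioned in the abstract.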

[CV-60] PosterMaker: Towards High-Quality Product Poster Generation with Accurate Text Rendering CVPR2025

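[Quick Read]: This paper addresses accurate text rendering in product poster generation (especially for complex writing systems such as Chinese) and preserving the fidelity of user-specific products. The key is a robust character-wise representation, built from character-discriminative visual features, that serves as the control signal; based on it, the authors develop TextRenderNet, which reaches over 90% text rendering accuracy. Product fidelity is further improved by SceneGenNet, an inpainting-based model, together with subject fidelity feedback learning. The two are combined into PosterMaker, an end-to-end product poster generation framework trained with a two-stage strategy.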

Link: https://arxiv.org/abs/2504.06632
Authors: Yifan Gao,Zihang Lin,Chuanbin Liu,Min Zhou,Tiezheng Ge,Bo Zheng,Hongtao Xie
Affiliations: University of Science and Technology of China; Taobao & Tmall Group of Alibaba
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025. Project Page: this https URL

Abstract:Product posters, which integrate subject, scene, and text, are crucial promotional tools for attracting customers. Creating such posters using modern image generation methods is valuable, while the main challenge lies in accurately rendering text, especially for complex writing systems like Chinese, which contains over 10,000 individual characters. In this work, we identify the key to precise text rendering as constructing a character-discriminative visual feature as a control signal. Based on this insight, we propose a robust character-wise representation as control and we develop TextRenderNet, which achieves a high text rendering accuracy of over 90%. Another challenge in poster generation is maintaining the fidelity of user-specific products. We address this by introducing SceneGenNet, an inpainting-based model, and propose subject fidelity feedback learning to further enhance fidelity. Based on TextRenderNet and SceneGenNet, we present PosterMaker, an end-to-end generation framework. To optimize PosterMaker efficiently, we implement a two-stage training strategy that decouples text rendering and background generation learning. Experimental results show that PosterMaker outperforms existing baselines by a remarkable margin, which demonstrates its effectiveness.

[CV-61] Rethinking LayerNorm in Image Restoration Transformers

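[Quick Read]: This paper addresses abnormal feature behavior observed in image restoration (IR) Transformers, specifically feature entropy becoming excessively small and feature magnitudes diverging by up to a million-fold. It traces the root cause to the per-token normalization of conventional LayerNorm, which disrupts essential spatial correlations and internal feature statistics. As a remedy, the paper proposes a simple normalization strategy tailored to IR Transformers: it normalizes across the entire spatio-channel dimension, which preserves spatial correlations, and adds an input-adaptive rescaling method that aligns feature statistics with the requirements of each input. Experiments confirm that the combined strategy resolves feature divergence and significantly improves the stability and performance of IR Transformers across a range of IR tasks.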

Link: https://arxiv.org/abs/2504.06629
Authors: MinKyu Lee,Sangeek Hyun,Woojin Jun,Hyunjun Kim,Jiwoo Chung,Jae-Pil Heo
Affiliations: Sungkyunkwan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This work investigates abnormal feature behaviors observed in image restoration (IR) Transformers. Specifically, we identify two critical issues: feature entropy becoming excessively small and feature magnitudes diverging up to a million-fold scale. We pinpoint the root cause to the per-token normalization aspect of conventional LayerNorm, which disrupts essential spatial correlations and internal feature statistics. To address this, we propose a simple normalization strategy tailored for IR Transformers. Our approach applies normalization across the entire spatio-channel dimension, effectively preserving spatial correlations. Additionally, we introduce an input-adaptive rescaling method that aligns feature statistics to the unique statistical requirements of each input. Experimental results verify that this combined strategy effectively resolves feature divergence, significantly enhancing both the stability and performance of IR Transformers across various IR tasks.
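
The difference between per-token and spatio-channel normalization can be seen on a tiny feature map. This is a simplified sketch: learnable scale and shift are left out, the input-adaptive rescaling from the paper is not modeled, and `spatio_channel_norm` is a hypothetical name for "normalize over all tokens and channels jointly".

```python
import math

def per_token_layernorm(tokens, eps=1e-5):
    """Conventional LayerNorm without affine parameters: each token is normalized on its own."""
    out = []
    for tok in tokens:
        mean = sum(tok) / len(tok)
        var = sum((x - mean) ** 2 for x in tok) / len(tok)
        out.append([(x - mean) / math.sqrt(var + eps) for x in tok])
    return out

def spatio_channel_norm(tokens, eps=1e-5):
    """Normalize with one mean and variance computed over all tokens and channels."""
    flat = [x for tok in tokens for x in tok]
    mean = sum(flat) / len(flat)
    var = sum((x - mean) ** 2 for x in flat) / len(flat)
    std = math.sqrt(var + eps)
    return [[(x - mean) / std for x in tok] for tok in tokens]

# Two spatial positions with 3 channels; the second is 10x larger in magnitude.
tokens = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]

pt = per_token_layernorm(tokens)
sc = spatio_channel_norm(tokens)

# Per-token normalization makes both positions look (almost) identical,
# while joint normalization keeps the 10x contrast between them.
contrast_ratio = (sc[1][2] - sc[1][0]) / (sc[0][2] - sc[0][0])
print(pt, sc, contrast_ratio)
```

In the per-token output the relative magnitude between the two positions is gone, which is one simple way to picture the loss of spatial correlation the paper attributes to LayerNorm.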

[CV-62] FACT: Multinomial Misalignment Classification for Point Cloud Registration

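[Quick Read]: This paper addresses predicting the quality of point cloud registration (that is, estimating registration error), which matters for quality assurance of large, automatically registered 3D models. Conventional choices such as standard point-cloud-quality metrics or registration residuals predict misalignment poorly. The paper proposes FACT, which extracts local features from a registered point cloud pair, processes them with a point transformer-based network, and classifies the pair into one of several misalignment classes (multinomial misalignment classification) rather than the binary classification used in prior work. To this end, the authors introduce a custom regression-by-classification loss that combines cross-entropy and Wasserstein losses, and show that it outperforms both direct regression and prior binary classification. FACT also handles point cloud pairs registered with different algorithms, including classical ICP and GeoTransformer.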

Link: https://arxiv.org/abs/2504.06627
Authors: Ludvig Dillén,Per-Erik Forssén,Johan Edstedt
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at SCIA 2025 (the Scandinavian Conference on Image Analysis 2025)

Abstract:We present FACT, a method for predicting alignment quality (i.e., registration error) of registered lidar point cloud pairs. This is useful e.g. for quality assurance of large, automatically registered 3D models. FACT extracts local features from a registered pair and processes them with a point transformer-based network to predict a misalignment class. We generalize prior work that study binary alignment classification of registration errors, by recasting it as multinomial misalignment classification. To achieve this, we introduce a custom regression-by-classification loss function that combines the cross-entropy and Wasserstein losses, and demonstrate that it outperforms both direct regression and prior binary classification. FACT successfully classifies point-cloud pairs registered with both the classical ICP and GeoTransformer, while other choices, such as standard point-cloud-quality metrics and registration residuals are shown to be poor choices for predicting misalignment. On a synthetically perturbed point-cloud task introduced by the CorAl method, we show that FACT achieves substantially better performance than CorAl. Finally, we demonstrate how FACT can assist experts in correcting misaligned point-cloud maps. Our code is available at this https URL.
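
One way to read "cross-entropy plus Wasserstein" for ordered misalignment classes is sketched below. The exact loss in the paper may differ; the unit spacing between classes, the equal weighting, and the function names are assumptions. For distributions over ordered bins, the 1D Wasserstein-1 distance equals the sum of absolute differences between the two cumulative distributions, which is what `wasserstein_1d` computes against a one-hot target.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    return -math.log(probs[target_index])

def wasserstein_1d(probs, target_index):
    """Wasserstein-1 distance to a one-hot target over ordered, unit-spaced classes."""
    cdf_pred, total = 0.0, 0.0
    for i, p in enumerate(probs):
        cdf_pred += p
        cdf_true = 1.0 if i >= target_index else 0.0
        total += abs(cdf_pred - cdf_true)
    return total

def combined_loss(logits, target_index, wasserstein_weight=1.0):
    """Cross-entropy plus a weighted Wasserstein term (the weighting is hypothetical)."""
    probs = softmax(logits)
    return cross_entropy(probs, target_index) + wasserstein_weight * wasserstein_1d(probs, target_index)

# Same probability on the true class (index 1), but the leftover mass sits
# either in the neighboring class or in the farthest class.
near_miss = [0.0, 0.6, 0.4, 0.0, 0.0]
far_miss = [0.0, 0.6, 0.0, 0.0, 0.4]

print(cross_entropy(near_miss, 1), cross_entropy(far_miss, 1))
print(wasserstein_1d(near_miss, 1), wasserstein_1d(far_miss, 1))
```

Cross-entropy alone scores the two predictions identically, while the Wasserstein term penalizes the far miss more, which is the property that makes the classification behave like regression over ordered error levels.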

[CV-63] InstantSticker: Realistic Decal Blending via Disentangled Object Reconstruction AAAI2025

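[Quick Read]: This paper addresses the geometric warping, blurriness, and time-consuming optimization that image-based decal blending methods face when aiming for high realism, fast editing, and real-time rendering. The key is InstantSticker, a disentangled reconstruction pipeline based on Image-Based Lighting (IBL). A shadow factor is introduced into IBL and adaptively optimized during training, so that the shadow brightness of surfaces is accurately decomposed rather than baked into the diffuse color, and the edited texture shows authentic shading. To mitigate warping and blurriness, As-Rigid-As-Possible (ARAP) parameterization pre-unfolds a specified area of the mesh, and the local UV mapping is combined with a neural texture map to better express high-frequency detail. For instant editing, the Disney BRDF model explicitly defines material color with a 3-channel diffuse albedo, so RGB values can be replaced directly during editing, avoiding the prolonged optimization of earlier approaches. Experiments show that the method surpasses previous decal blending methods in editing quality, editing speed, and rendering speed.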

Link: https://arxiv.org/abs/2504.06620
Authors: Yi Zhang,Xiaoyang Huang,Yishun Dou,Yue Shi,Rui Shi,Ye Chen,Bingbing Ni,Wenjun Zhang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025

Abstract:We present InstantSticker, a disentangled reconstruction pipeline based on Image-Based Lighting (IBL), which focuses on highly realistic decal blending, simulates stickers attached to the reconstructed surface, and allows for instant editing and real-time rendering. To achieve stereoscopic impression of the decal, we introduce shadow factor into IBL, which can be adaptively optimized during training. This allows the shadow brightness of surfaces to be accurately decomposed rather than baked into the diffuse color, ensuring that the edited texture exhibits authentic shading. To address the issues of warping and blurriness in previous methods, we apply As-Rigid-As-Possible (ARAP) parameterization to pre-unfold a specified area of the mesh and use the local UV mapping combined with a neural texture map to enhance the ability to express high-frequency details in that area. For instant editing, we utilize the Disney BRDF model, explicitly defining material colors with 3-channel diffuse albedo. This enables instant replacement of albedo RGB values during the editing process, avoiding the prolonged optimization required in previous approaches. In our experiment, we introduce the Ratio Variance Warping (RVW) metric to evaluate the local geometric warping of the decal area. Extensive experimental results demonstrate that our method surpasses previous decal blending methods in terms of editing quality, editing speed and rendering speed, achieving the state-of-the-art.

[CV-64] Human-like compositional learning of visually-grounded concepts using synthetic environments

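[Quick Read]: This paper addresses a core question in multimodal learning: how an artificial agent can learn, through trial and error, to associate linguistic concept classes with visual cues and to understand the compositional structure of language instructions. Existing algorithms show some degree of compositionality, but they do not explain how humans learn to compose concept classes and ground them in visual scenes through trial and error. To study this, the authors designed a 3D synthetic environment in which an agent learns via reinforcement learning to navigate to a target specified by a natural language instruction. The instructions comprise nouns, attributes, and, critically, determiners, prepositions, or both, which substantially raises the compositional complexity of the visual grounding task.

The key to the solution is adopting a human-like learning strategy, namely curriculum learning. The study finds that it improves concept learning efficiency, reducing the required training episodes by 15% in determiner environments and enabling agents to learn prepositional concepts more easily. Agents trained on determiner or prepositional concepts can also decompose held-out test instructions and rapidly adapt their navigation policies to unseen combinations of visual objects. These results indicate that, in synthetic environments, multimodal reinforcement learning agents can reach a compositional understanding of complex concept classes, and that human-like learning strategies help improve the learning efficiency of artificial systems.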

Link: https://arxiv.org/abs/2504.06618
Authors: Zijun Lin,M Ganesh Kumar,Cheston Tan
Affiliations: Nanyang Technological University; Harvard University; Centre for Frontier AI Research, A*STAR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The compositional structure of language enables humans to decompose complex phrases and map them to novel visual concepts, showcasing flexible intelligence. While several algorithms exhibit compositionality, they fail to elucidate how humans learn to compose concept classes and ground visual cues through trial and error. To investigate this multi-modal learning challenge, we designed a 3D synthetic environment in which an agent learns, via reinforcement, to navigate to a target specified by a natural language instruction. These instructions comprise nouns, attributes, and critically, determiners, prepositions, or both. The vast array of word combinations heightens the compositional complexity of the visual grounding task, as navigating to a blue cube above red spheres is not rewarded when the instruction specifies navigating to “some blue cubes below the red sphere”. We first demonstrate that reinforcement learning agents can ground determiner concepts to visual targets but struggle with more complex prepositional concepts. Second, we show that curriculum learning, a strategy humans employ, enhances concept learning efficiency, reducing the required training episodes by 15% in determiner environments and enabling agents to easily learn prepositional concepts. Finally, we establish that agents trained on determiner or prepositional concepts can decompose held-out test instructions and rapidly adapt their navigation policies to unseen visual object combinations. Leveraging synthetic environments, our findings demonstrate that multi-modal reinforcement learning agents can achieve compositional understanding of complex concept classes and highlight the efficacy of human-like learning strategies in improving artificial systems’ learning efficiency.

[CV-65] Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization

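Quick read: This paper addresses sign language production (SLP): generating sign pose sequences directly from spoken-language text. It proposes a gloss-free, Transformer-based method with two key components. First, a pose autoencoder with an articulator-based disentanglement strategy encodes sign poses into a compact latent space, modeling the face, right hand, left hand, and body separately to promote structured, interpretable representations. Second, a non-autoregressive Transformer decoder is trained to predict these latent representations from sentence-level text embeddings. Decoding is guided by channel-aware regularization: a KL-divergence loss aligns the predicted latent distributions with priors extracted from the ground-truth encodings, and each channel's contribution to the loss is weighted by its associated articulator region, so that training accounts for the relative importance of the different articulators. The method relies on neither gloss supervision nor pretrained models and achieves state-of-the-art results on PHOENIX14T using only a modest training set.

Link: https://arxiv.org/abs/2504.06610
Authors: Sumeyye Meryem Tasyurek, Tugce Kiziltepe, Hacer Yalim Keles
Affiliations: Hacettepe University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures, 1 table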

Abstract: In this work, we propose a simple gloss-free, transformer-based sign language production (SLP) framework that directly maps spoken-language text to sign pose sequences. We first train a pose autoencoder that encodes sign poses into a compact latent space using an articulator-based disentanglement strategy, where features corresponding to the face, right hand, left hand, and body are modeled separately to promote structured and interpretable representation learning. Next, a non-autoregressive transformer decoder is trained to predict these latent representations from sentence-level text embeddings. To guide this process, we apply channel-aware regularization by aligning predicted latent distributions with priors extracted from the ground-truth encodings using a KL-divergence loss. The contribution of each channel to the loss is weighted according to its associated articulator region, enabling the model to account for the relative importance of different articulators during training. Our approach does not rely on gloss supervision or pretrained models, and achieves state-of-the-art results on the PHOENIX14T dataset using only a modest training set.
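
The channel-aware regularization described in this abstract can be illustrated with a minimal sketch. It assumes that each latent channel is summarized by a scalar Gaussian (mean, variance), which the abstract does not state; the region names and weights are hypothetical and are not taken from the paper.

```python
import math


def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for scalar Gaussians."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


def channel_aware_kl(pred, prior, channel_region, region_weight):
    """Weighted sum of per-channel KL terms.

    pred, prior    : lists of (mean, variance), one pair per latent channel
    channel_region : articulator region name for each channel
    region_weight  : dict mapping region name -> loss weight (hypothetical values)
    """
    total = 0.0
    for (mu_p, var_p), (mu_q, var_q), region in zip(pred, prior, channel_region):
        total += region_weight[region] * gaussian_kl(mu_p, var_p, mu_q, var_q)
    return total


# Hypothetical two-channel example: one hand channel, one body channel.
pred = [(1.0, 1.0), (0.0, 1.0)]
prior = [(0.0, 1.0), (0.0, 1.0)]
regions = ["right_hand", "body"]
weights = {"right_hand": 2.0, "body": 1.0}
loss = channel_aware_kl(pred, prior, regions, weights)
```

Only the hand channel deviates from its prior here, so its KL term is the only contribution, scaled by the hand weight.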

[CV-66] A Cross-Domain Few-Shot Learning Method Based on Domain Knowledge Mapping

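Quick read: This paper targets task-based few-shot learning when the distribution encountered at few-shot time differs from that of the existing data, i.e. the non-i.i.d. setting, and asks how existing data knowledge can be used so that a model adapts quickly to class variations. The key contribution is a new cross-domain few-shot learning method based on domain knowledge mapping, applied consistently across the pre-training, training, and testing phases. In pre-training, self-supervised and supervised losses are integrated by maximizing mutual information to mitigate mode collapse. In training, a domain knowledge mapping layer works with a domain classifier to learn both domain mapping and the ability to assess domain adaptation difficulty. In testing, the model adapts rapidly to domain variations through meta-training tasks on support sets, which improves its ability to transfer domain knowledge. Experiments on six datasets from diverse domains demonstrate the method's effectiveness.

Link: https://arxiv.org/abs/2504.06608
Authors: Jiajun Chen, Hongpeng Yin, Yifu Yang
Affiliations: School of Automation, Chongqing University, Chongqing, 400044, China; Bradley Department of Electrical & Computer Engineering, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA 24061
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: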

Abstract: In task-based few-shot learning paradigms, it is commonly assumed that different tasks are independently and identically distributed (i.i.d.). However, in real-world scenarios, the distribution encountered in few-shot learning can significantly differ from the distribution of existing data. Thus, how to effectively leverage existing data knowledge to enable models to quickly adapt to class variations under non-i.i.d. assumptions has emerged as a key research challenge. To address this challenge, this paper proposes a new cross-domain few-shot learning approach based on domain knowledge mapping, applied consistently throughout the pre-training, training, and testing phases. In the pre-training phase, our method integrates self-supervised and supervised losses by maximizing mutual information, thereby mitigating mode collapse. During the training phase, the domain knowledge mapping layer collaborates with a domain classifier to learn both domain mapping capabilities and the ability to assess domain adaptation difficulty. Finally, this approach is applied during the testing phase, rapidly adapting to domain variations through meta-training tasks on support sets, consequently enhancing the model’s capability to transfer domain knowledge effectively. Experimental validation conducted across six datasets from diverse domains demonstrates the effectiveness of the proposed method.

[CV-67] Visually Similar Pair Alignment for Robust Cross-Domain Object Detection

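Quick read: This paper addresses the performance drop in cross-domain object detection caused by the domain gap between training data (source) and real-world environments (target). Existing methods mostly bridge the gap by aligning features across the source and target domains, but they often ignore visual differences such as color or orientation within alignment pairs, so the model struggles to handle domain-specific shifts (e.g., fog) and visual variations at the same time, which limits adaptation. Using a custom-built dataset, the authors show for the first time that aligning visually similar pairs significantly improves domain adaptation, and they propose a memory-based system to strengthen domain alignment. The system stores precomputed features of source-domain foreground objects and background areas and updates them periodically during training. By retrieving source features that are visually similar to the target foreground and background features and aligning against them, the model handles domain-specific differences while reducing the impact of visual variations. The method reaches 53.1 mAP on Foggy Cityscapes and 62.3 mAP on Sim10k, surpassing prior state-of-the-art methods by 1.2 and 4.1 mAP, respectively.

Link: https://arxiv.org/abs/2504.06607
Authors: Onkar Krishna, Hiroki Ohashi
Affiliations: Hitachi Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, Journal paper submission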

Abstract: Domain gaps between training data (source) and real-world environments (target) often degrade the performance of object detection models. Most existing methods aim to bridge this gap by aligning features across source and target domains but often fail to account for visual differences, such as color or orientation, in alignment pairs. This limitation leads to less effective domain adaptation, as the model struggles to manage both domain-specific shifts (e.g., fog) and visual variations simultaneously. In this work, we demonstrate for the first time, using a custom-built dataset, that aligning visually similar pairs significantly improves domain adaptation. Based on this insight, we propose a novel memory-based system to enhance domain alignment. This system stores precomputed features of foreground objects and background areas from the source domain, which are periodically updated during training. By retrieving visually similar source features for alignment with target foreground and background features, the model effectively addresses domain-specific differences while reducing the impact of visual variations. Extensive experiments across diverse domain shift scenarios validate our method’s effectiveness, achieving 53.1 mAP on Foggy Cityscapes and 62.3 on Sim10k, surpassing prior state-of-the-art methods by 1.2 and 4.1 mAP, respectively.
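
The retrieval step of such a memory-based system can be sketched as a nearest-neighbour lookup. The sketch below assumes cosine similarity as the measure of visual similarity and a plain list as the memory bank; the abstract specifies neither, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_most_similar(target_feature, memory_bank):
    """Return the index of the stored source feature most similar to the target feature."""
    return max(range(len(memory_bank)), key=lambda i: cosine(target_feature, memory_bank[i]))


# Hypothetical memory of three precomputed source-domain features.
memory_bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = retrieve_most_similar([0.9, 0.1], memory_bank)
```

The retrieved source feature would then be the alignment partner for the target feature, in place of an arbitrary source sample.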

[CV-68] Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program

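Quick read: This paper addresses the challenges of carrying reward signals for large language models (LLMs) over to the multimodal domain: labor-intensive annotation, over-reliance on one-step rewards, and inadequate evaluation. It proposes SVIP, which automatically trains a step-level, multi-dimensional Chain-of-Thought (CoT) reward model. SVIP generates code for solving visual tasks and converts the analysis of code blocks into evaluations of CoT steps, which serve as training samples. The SVIP-Reward model is then trained with a multi-head attention mechanism called TriAtt-CoT, and it shows clear advantages throughout the training and inference of multimodal large language models (MLLMs). The paper also introduces a benchmark for training and testing CoT reward models. Experiments show that SVIP-Reward improves MLLM performance under both training and inference-time scaling, yielding better benchmark results while reducing hallucinations and enhancing reasoning ability.

Link: https://arxiv.org/abs/2504.06606
Authors: Minghe Gao, Xuqi Liu, Zhongqi Yue, Yang Wu, Shuang Chen, Juncheng Li, Siliang Tang, Fei Wu, Tat-Seng Chua, Yueting Zhuang
Affiliations: Zhejiang University; Nanyang Technological University; Ant Group; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: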

Abstract: Recent advancements in reward signal usage for Large Language Models (LLMs) are remarkable. However, significant challenges exist when transitioning reward signal to the multimodal domain, including labor-intensive annotations, over-reliance on one-step rewards, and inadequate evaluation. To address these issues, we propose SVIP, a novel approach to train a step-level multi-dimensional Chain-of-Thought (CoT) reward model automatically. It generates code for solving visual tasks and transforms the analysis of code blocks into the evaluation of CoT step as training samples. Then, we train SVIP-Reward model using a multi-head attention mechanism called TriAtt-CoT. The advantages of SVIP-Reward are evident throughout the entire process of MLLM. We also introduce a benchmark for CoT reward model training and testing. Experimental results demonstrate that SVIP-Reward improves MLLM performance across training and inference-time scaling, yielding better results on benchmarks while reducing hallucinations and enhancing reasoning ability.

[CV-69] Exploring Ordinal Bias in Action Recognition for Instructional Videos ICLR2025

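Quick read: This paper addresses the tendency of action recognition models to rely on dominant, dataset-specific action sequences (ordinal bias) rather than genuinely understanding instructional videos. It proposes two video manipulation methods: Action Masking, which masks frames of frequently co-occurring actions, and Sequence Shuffling, which randomizes the order of action segments. With these, the paper shows that current models suffer significant performance drops on nonstandard action sequences, which underscores the need to rethink evaluation strategies and to develop models that generalize beyond fixed action patterns.

Link: https://arxiv.org/abs/2504.06580
Authors: Joochan Kim, Minjoon Jung, Byoung-Tak Zhang
Affiliations: Korea Institute of Science and Technology; Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to SCSL @ ICLR 2025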

Abstract: Action recognition models have achieved promising results in understanding instructional videos. However, they often rely on dominant, dataset-specific action sequences rather than true video comprehension, a problem that we define as ordinal bias. To address this issue, we propose two effective video manipulation methods: Action Masking, which masks frames of frequently co-occurring actions, and Sequence Shuffling, which randomizes the order of action segments. Through comprehensive experiments, we demonstrate that current models exhibit significant performance drops when confronted with nonstandard action sequences, underscoring their vulnerability to ordinal bias. Our findings emphasize the importance of rethinking evaluation strategies and developing models capable of generalizing beyond fixed action patterns in diverse instructional videos.
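
The two manipulations named in this abstract can be sketched on a toy representation of a video. The sketch assumes a video is a list of (action label, frames) segments and that masking replaces frames with a constant; both are illustrative assumptions, and the paper's actual frame-level procedure and its rule for picking which actions to mask are not given here.

```python
import random


def sequence_shuffle(segments, seed=None):
    """Return a copy of the segments in random order; frames inside a segment stay in order."""
    rng = random.Random(seed)
    shuffled = list(segments)
    rng.shuffle(shuffled)
    return shuffled


def action_mask(segments, actions_to_mask, mask_value=0):
    """Replace every frame of the listed actions with mask_value, keeping segment lengths."""
    masked = []
    for label, frames in segments:
        if label in actions_to_mask:
            masked.append((label, [mask_value] * len(frames)))
        else:
            masked.append((label, list(frames)))
    return masked


# Hypothetical instructional video with three action segments (frames shown as integers).
video = [("crack egg", [1, 2]), ("whisk", [3]), ("pour", [4, 5, 6])]
shuffled_video = sequence_shuffle(video, seed=0)
masked_video = action_mask(video, {"whisk"})
```

A model that depends on the usual action order should lose accuracy on `shuffled_video`, while a model that reads the frames themselves should be less affected.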

[CV-70] Attributes-aware Visual Emotion Representation Learning

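Quick read: This paper addresses the affective gap in visual emotion analysis: the intricate relationship between general visual features and the different affective states they evoke. Conventional methods learn general features from the whole image and overlook specific emotional attributes such as brightness, colorfulness, scene understanding, and facial expressions. The proposed solution is A4Net, a deep representation network that fuses four key attributes, brightness (Attribute 1), colorfulness (Attribute 2), scene context (Attribute 3), and facial expressions (Attribute 4), for a fuller understanding of emotional content. By jointly training attribute recognition and visual emotion analysis, A4Net achieves competitive performance on several visual emotion datasets, and visualizations of its activation maps offer insight into its ability to generalize across datasets.

Link: https://arxiv.org/abs/2504.06578
Authors: Rahul Singh Maharjan, Marta Romeo, Angelo Cangelosi
Affiliations: Department of Computer Science, University of Manchester; School of Mathematical & Computer Sciences, Heriot-Watt University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 9 pages, 3 figures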

Abstract: Visual emotion analysis or recognition has gained considerable attention due to the growing interest in understanding how images can convey rich semantics and evoke emotions in human perception. However, visual emotion analysis poses distinctive challenges compared to traditional vision tasks, especially due to the intricate relationship between general visual features and the different affective states they evoke, known as the affective gap. Researchers have used deep representation learning methods to address this challenge of extracting generalized features from entire images. However, most existing methods overlook the importance of specific emotional attributes such as brightness, colorfulness, scene understanding, and facial expressions. Through this paper, we introduce A4Net, a deep representation network to bridge the affective gap by leveraging four key attributes: brightness (Attribute 1), colorfulness (Attribute 2), scene context (Attribute 3), and facial expressions (Attribute 4). By fusing and jointly training all aspects of attribute recognition and visual emotion analysis, A4Net aims to provide a better insight into emotional content in images. Experimental results show the effectiveness of A4Net, showcasing competitive performance compared to state-of-the-art methods across diverse visual emotion datasets. Furthermore, visualizations of activation maps generated by A4Net offer insights into its ability to generalize across different visual emotion datasets.

[CV-71] Domain Generalization via Discrete Codebook Learning ICME2025

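Quick read: This paper addresses the limited generalization in domain generalization (DG) caused by distribution shifts in continuous feature spaces. Current DG methods learn robust continuous feature representations, but with a large continuous feature space they may fail to close domain gaps and are susceptible to pixel details that carry spurious correlations or noise. The key contribution is a new learning paradigm, Discrete Domain Generalization (DDG). DDG quantizes the feature map into discrete codewords with a codebook and aligns semantically equivalent information in a shared discrete representation space, prioritizing semantic-level information over pixel-level intricacies. The core idea is that discretization reduces the domain gaps present in continuous representations, which makes better use of the representation space and lowers the risks of a wide-ranging continuous feature space. Experiments on several benchmarks show that DDG outperforms state-of-the-art methods, supporting its potential to reduce distribution gaps and improve generalization.

Link: https://arxiv.org/abs/2504.06572
Authors: Shaocong Long, Qianyu Zhou, Xikun Jiang, Chenhao Ying, Lizhuang Ma, Yuan Luo
Affiliations: Shanghai Jiao Tong University, Shanghai, China; Jilin University, Changchun, China; Copenhagen University, Copenhagen, Denmark
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICME 2025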

Abstract: Domain generalization (DG) strives to address distribution shifts across diverse environments to enhance model’s generalizability. Current DG approaches are confined to acquiring robust representations with continuous features, specifically training at the pixel level. However, this DG paradigm may struggle to mitigate distribution gaps in dealing with a large space of continuous features, rendering it susceptible to pixel details that exhibit spurious correlations or noise. In this paper, we first theoretically demonstrate that the domain gaps in continuous representation learning can be reduced by the discretization process. Based on this inspiring finding, we introduce a novel learning paradigm for DG, termed Discrete Domain Generalization (DDG). DDG proposes to use a codebook to quantize the feature map into discrete codewords, aligning semantic-equivalent information in a shared discrete representation space that prioritizes semantic-level information over pixel-level intricacies. By learning at the semantic level, DDG diminishes the number of latent features, optimizing the utilization of the representation space and alleviating the risks associated with the wide-ranging space of continuous features. Extensive experiments across widely employed benchmarks in DG demonstrate DDG’s superior performance compared to state-of-the-art approaches, underscoring its potential to reduce the distribution gaps and enhance the model’s generalizability.
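
Codebook quantization of a feature map can be sketched as a nearest-codeword lookup. The sketch below assumes squared Euclidean distance and a fixed codebook; how DDG learns the codebook and which distance it uses are not stated in the abstract, and the function name is hypothetical.

```python
def quantize(feature_vectors, codebook):
    """Map each feature vector to its nearest codeword (squared Euclidean distance).

    Returns (indices, quantized), where quantized[i] is the chosen codeword for vector i.
    """
    indices = []
    for vector in feature_vectors:
        distances = [sum((a - b) ** 2 for a, b in zip(vector, code)) for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices, [codebook[i] for i in indices]


# Hypothetical 2-entry codebook and two continuous features, e.g. from different domains.
codebook = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, -0.1], [0.9, 1.2]]
indices, quantized = quantize(features, codebook)
```

Features that differ only in small continuous details land on the same codeword, which is the sense in which discretization can hide pixel-level variation between domains.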

[CV-72] ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis

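Quick read: This paper addresses grounding abstract, high-level instructions to physical 3D scenes. Existing methods have made progress in grounding natural language to 3D environments but struggle to decompose high-level instructions into environment-dependent low-level subtasks. The paper notes that task decomposition is itself environment-dependent and that high-level instructions may not explicitly refer to semantic elements in the scene. It proposes ASHiTA, a framework that alternates LLM-assisted hierarchical task analysis (HTA), which produces a task breakdown aligned with a 3D scene graph, with task-driven 3D scene graph construction, which produces a suitable representation of the environment. This alternation performs significantly better than LLM baselines at breaking tasks into environment-dependent subtasks and achieves grounding performance comparable to state-of-the-art methods.

Link: https://arxiv.org/abs/2504.06553
Authors: Yun Chang, Leonor Fermoselle, Duy Ta, Bernadette Bucher, Luca Carlone, Jiuguang Wang
Affiliations: LIDS, Massachusetts Institute of Technology; Robotics and AI Institute; Department of Robotics, University of Michigan
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: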

Abstract: While recent work in scene reconstruction and understanding has made strides in grounding natural language to physical 3D environments, it is still challenging to ground abstract, high-level instructions to a 3D scene. High-level instructions might not explicitly invoke semantic elements in the scene, and even the process of breaking a high-level task into a set of more concrete subtasks, a process called hierarchical task analysis, is environment-dependent. In this work, we propose ASHiTA, the first framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. ASHiTA alternates LLM-assisted hierarchical task analysis, to generate the task breakdown, with task-driven 3D scene graph construction to generate a suitable representation of the environment. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks and is additionally able to achieve grounding performance comparable to state-of-the-art methods.

[CV-73] LCGC: Learning from Consistency Gradient Conflicting for Class-Imbalanced Semi-Supervised Debiasing AAAI2025

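Quick read: This paper addresses the bias that classifiers acquire on class-imbalanced datasets under semi-supervised learning (SSL). Prior work re-balances the classifier by subtracting the logits of a class-irrelevant image but lacks a firm theoretical basis. The paper analyzes theoretically why a baseline image can refine pseudo-labels and proves that a black image is the best choice. It then proposes a debiasing scheme, LCGC (Learning from Consistency Gradient Conflicting), which encourages biased class predictions during training and updates the pseudo-labels whose gradients conflict with the debiased logits. At test time, predictions are debiased by subtracting the baseline image logits. Experiments show that LCGC significantly improves the prediction accuracy of existing class-imbalanced SSL models on public benchmarks.

Link: https://arxiv.org/abs/2504.06544
Authors: Weiwei Xing, Yue Cheng, Hongzhu Yi, Xiaohui Gao, Xiang Wei, Xiaoyu Guo, Yuming Zhang, Xinyu Pang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by AAAI 2025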

Abstract: Classifiers often learn to be biased corresponding to the class-imbalanced dataset, especially under the semi-supervised learning (SSL) set. While previous work tries to appropriately re-balance the classifiers by subtracting a class-irrelevant image’s logit, but lacks a firm theoretical basis. We theoretically analyze why exploiting a baseline image can refine pseudo-labels and prove that the black image is the best choice. We also indicated that as the training process deepens, the pseudo-labels before and after refinement become closer. Based on this observation, we propose a debiasing scheme dubbed LCGC, which Learning from Consistency Gradient Conflicting, by encouraging biased class predictions during training. We intentionally update the pseudo-labels whose gradient conflicts with the debiased logits, representing the optimization direction offered by the over-imbalanced classifier predictions. Then, we debiased the predictions by subtracting the baseline image logits during testing. Extensive experiments demonstrate that LCGC can significantly improve the prediction accuracy of existing CISSL models on public benchmarks.
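
The test-time step mentioned at the end of this abstract, subtracting the baseline image logits, can be sketched as follows. The logit values are made up for illustration, and the function names are hypothetical; in practice both logit vectors would come from the same trained classifier, one for the test image and one for a black image.

```python
def debias_logits(logits, baseline_logits):
    """Subtract the classifier's logits for the baseline (e.g. black) image, class by class."""
    return [logit - base for logit, base in zip(logits, baseline_logits)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Made-up numbers: class 0 is a head class that the classifier favors even on a black image.
image_logits = [3.0, 2.5, 0.5]
black_image_logits = [2.0, 0.5, 0.0]

raw_prediction = argmax(image_logits)
debiased_prediction = argmax(debias_logits(image_logits, black_image_logits))
```

In this toy case the raw prediction is the head class, while the debiased prediction switches to class 1, because most of the head-class score was already present for the content-free baseline image.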

[CV-74] SP-OCS: A Time-Series Prediction for Optimal Camera Selection in Multi-Viewpoint Surgical Video Analysis

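Quick read: This paper addresses the limitations of single-camera recording of open surgery: occlusions caused by the surgeon's head and body and the constraints of a fixed camera angle, both of which reduce the comprehensibility of the video. It proposes a fully supervised time-series prediction method that selects the best shot sequence from multiple simultaneously recorded video streams so that the optimal viewpoint is used at each moment. Pre-trained models extract and fuse visual and semantic features from the surgical videos, and a temporal prediction network with TimeBlocks captures sequential dependencies. A linear embedding layer reduces dimensionality, and a Softmax classifier selects the optimal camera view by highest probability. Experiments show accuracy competitive with traditional supervised methods even over longer prediction horizons, and better results than state-of-the-art time-series prediction techniques on the authors' dataset. The authors present the framework as an advance in surgical video analysis with implications for surgical education and patient safety.

Link: https://arxiv.org/abs/2504.06527
Authors: Xinyu Liu, Xiaoguang Lin, Xiang Liu, Yong Yang, Hongqian Wang, Qilong Sun
Affiliations: Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; Southwest Hospital, Third Military Medical University; First Affiliated Hospital of Army Medical University; Chongqing School, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: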

Abstract: Recording the open surgery process is essential for educational and medical evaluation purposes; however, traditional single-camera methods often face challenges such as occlusions caused by the surgeon’s head and body, as well as limitations due to fixed camera angles, which reduce comprehensibility of the video content. This study addresses these limitations by employing a multi-viewpoint camera recording system, capturing the surgical procedure from six different angles to mitigate occlusions. We propose a fully supervised learning-based time series prediction method to choose the best shot sequences from multiple simultaneously recorded video streams, ensuring optimal viewpoints at each moment. Our time series prediction model forecasts future camera selections by extracting and fusing visual and semantic features from surgical videos using pre-trained models. These features are processed by a temporal prediction network with TimeBlocks to capture sequential dependencies. A linear embedding layer reduces dimensionality, and a Softmax classifier selects the optimal camera view based on the highest probability. In our experiments, we created five groups of open thyroidectomy videos, each with simultaneous recordings from six different angles. The results demonstrate that our method achieves competitive accuracy compared to traditional supervised methods, even when predicting over longer time horizons. Furthermore, our approach outperforms state-of-the-art time series prediction techniques on our dataset. This manuscript makes a unique contribution by presenting an innovative framework that advances surgical video analysis techniques, with significant implications for improving surgical education and patient safety.
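
The final selection step, a Softmax over camera views followed by picking the most probable one, can be sketched in a few lines. The six scores below are made up and stand in for the output of the temporal prediction network, which is not reproduced here; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def select_camera(scores):
    """Return (index of the most probable camera, probabilities over all cameras)."""
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best, probabilities


# Made-up scores for six simultaneously recording cameras at one time step.
camera_scores = [0.1, 2.0, -1.0, 0.5, 1.9, 0.0]
best_camera, camera_probabilities = select_camera(camera_scores)
```

Applying this at every predicted time step yields a shot sequence, one chosen camera index per moment.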

[CV-75] DUKAE: DUal-level Knowledge Accumulation and Ensemble for Pre-Trained Model-Based Continual Learning

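Quick read: This paper addresses two problems in pre-trained model-based continual learning (PTMCL): misaligned task-specific classification heads, which produce inconsistent decision boundaries and increase forgetting, and feature-level knowledge accumulation that is restricted to the initial task, which limits the model's representation capability. It proposes DUKAE (DUal-level Knowledge Accumulation and Ensemble), which accumulates knowledge at both the feature level and the decision level: it aligns classification heads into a unified feature space through Gaussian distribution sampling and introduces an adaptive expertise ensemble to fuse the knowledge learned across tasks. Experiments on CIFAR-100, ImageNet-R, CUB-200, and Cars-196 demonstrate superior performance.

Link: https://arxiv.org/abs/2504.06521
Authors: Songze Li, Tonghua Su, Xu-Yao Zhang, Qixing Xu, Zhongjie Wang
Affiliations: Harbin Institute of Technology; State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, UCAS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: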

Abstract: Pre-trained model-based continual learning (PTMCL) has garnered growing attention, as it enables more rapid acquisition of new knowledge by leveraging the extensive foundational understanding inherent in pre-trained model (PTM). Most existing PTMCL methods use Parameter-Efficient Fine-Tuning (PEFT) to learn new knowledge while consolidating existing memory. However, they often face some challenges. A major challenge lies in the misalignment of classification heads, as the classification head of each task is trained within a distinct feature space, leading to inconsistent decision boundaries across tasks and, consequently, increased forgetting. Another critical limitation stems from the restricted feature-level knowledge accumulation, with feature learning typically restricted to the initial task only, which constrains the model’s representation capabilities. To address these issues, we propose a method named DUal-level Knowledge Accumulation and Ensemble (DUKAE) that leverages both feature-level and decision-level knowledge accumulation by aligning classification heads into a unified feature space through Gaussian distribution sampling and introducing an adaptive expertise ensemble to fuse knowledge across feature [text lost in extraction]. Experiments on CIFAR-100, ImageNet-R, CUB-200, and Cars-196 datasets demonstrate the superior performance of our approach.
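
Gaussian distribution sampling for head alignment can be sketched as: store per-class feature statistics, then draw pseudo-features from them so that a single classification head can be trained on all classes in one feature space. The sketch assumes a diagonal Gaussian per class and uses hypothetical function names; the abstract does not say which covariance form DUKAE uses or how the head is retrained.

```python
import random
import statistics


def fit_class_gaussians(features_by_class):
    """Per-class, per-dimension mean and standard deviation (diagonal Gaussian assumption)."""
    stats = {}
    for label, features in features_by_class.items():
        dims = list(zip(*features))  # transpose: one tuple of values per feature dimension
        means = [statistics.fmean(dim) for dim in dims]
        stds = [statistics.pstdev(dim) for dim in dims]
        stats[label] = (means, stds)
    return stats


def sample_pseudo_features(stats, n_per_class, seed=0):
    """Draw n_per_class pseudo-features for every class from its stored Gaussian."""
    rng = random.Random(seed)
    samples = []
    for label, (means, stds) in stats.items():
        for _ in range(n_per_class):
            feature = [rng.gauss(mean, std) for mean, std in zip(means, stds)]
            samples.append((feature, label))
    return samples


# Hypothetical features from two classes seen in different tasks.
stats = fit_class_gaussians({"cat": [[0.0, 0.0], [2.0, 2.0]], "dog": [[3.0, 3.0], [3.0, 3.0]]})
pseudo = sample_pseudo_features(stats, n_per_class=5)
```

Only the statistics need to be kept between tasks, not the original samples, which is what makes this kind of replay compatible with continual learning.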

[CV-76] STaR: Seamless Spatial-Temporal Aware Motion Retargeting with Penetration and Consistency Constraints

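Quick read: This paper addresses two core challenges in motion retargeting: faithfully replicating the spatio-temporal motion characteristics of a source character on a target character with a different body shape, while ensuring geometric plausibility and temporal consistency. Existing methods tend to prioritize one of the two; neglecting geometric plausibility leads to interpenetration, and neglecting temporal consistency leads to motion jitter. The paper proposes a sequence-to-sequence model, Spatial-Temporal aware motion Retargeting (STaR), that combines a spatial module and a temporal module. The spatial module uses a dense shape representation and a novel limb penetration constraint to ensure geometric plausibility while preserving motion semantics. The temporal module uses a temporal Transformer and a novel temporal consistency constraint to predict the entire motion sequence at once while enforcing multi-level trajectory smoothness. Together the two modules balance the semantic, geometric, and temporal objectives.

Link: https://arxiv.org/abs/2504.06504
Authors: Xiaohang Yang, Qing Wang, Jiahao Yang, Gregory Slabaugh, Shanxin Yuan
Affiliations: Queen Mary University of London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 9 figures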

Abstract: Motion retargeting seeks to faithfully replicate the spatio-temporal motion characteristics of a source character onto a target character with a different body shape. Apart from motion semantics preservation, ensuring geometric plausibility and maintaining temporal consistency are also crucial for effective motion retargeting. However, many existing methods prioritize either geometric plausibility or temporal consistency. Neglecting geometric plausibility results in interpenetration while neglecting temporal consistency leads to motion jitter. In this paper, we propose a novel sequence-to-sequence model for seamless Spatial-Temporal aware motion Retargeting (STaR), with penetration and consistency constraints. STaR consists of two modules: (1) a spatial module that incorporates dense shape representation and a novel limb penetration constraint to ensure geometric plausibility while preserving motion semantics, and (2) a temporal module that utilizes a temporal transformer and a novel temporal consistency constraint to predict the entire motion sequence at once while enforcing multi-level trajectory smoothness. The seamless combination of the two modules helps us achieve a good balance between the semantic, geometric, and temporal targets. Extensive experiments on the Mixamo and ScanRet datasets demonstrate that our method produces plausible and coherent motions while significantly reducing interpenetration rates compared with other approaches.
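
One way to picture a multi-level temporal consistency term is to compare finite differences of the predicted and reference trajectories at several orders (velocity, acceleration). This is an illustrative assumption, not the constraint defined in the paper, whose exact form the abstract does not give; the function names are hypothetical.

```python
def finite_difference(sequence):
    """Frame-to-frame differences of a sequence of equal-length coordinate lists."""
    return [[b - a for a, b in zip(prev, curr)] for prev, curr in zip(sequence, sequence[1:])]


def temporal_consistency_penalty(predicted, reference, levels=2):
    """Sum over difference orders 1..levels of the mean squared mismatch between trajectories."""
    total = 0.0
    diff_pred, diff_ref = predicted, reference
    for _ in range(levels):
        diff_pred = finite_difference(diff_pred)
        diff_ref = finite_difference(diff_ref)
        if not diff_pred:
            break  # sequence too short for this difference order
        squared = sum(
            (p - r) ** 2
            for frame_pred, frame_ref in zip(diff_pred, diff_ref)
            for p, r in zip(frame_pred, frame_ref)
        )
        total += squared / len(diff_pred)
    return total


# One joint coordinate over three frames: the reference moves steadily, the prediction jitters.
reference = [[0.0], [1.0], [2.0]]
jittery = [[0.0], [1.5], [2.0]]
penalty = temporal_consistency_penalty(jittery, reference)
```

Because only differences are compared, a prediction that is a constant offset of the reference incurs no penalty, while frame-to-frame jitter is penalized at both the velocity and the acceleration level.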

[CV-77] Mind the Gap: Evaluating Vision Systems in Small Data Applications

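Quick read: This paper addresses the lack of evaluation of computer vision methods in the small-data regime. Many practical applications, such as ecological monitoring, medical diagnostics, and industrial quality control, work with hundreds to thousands of labeled samples, yet research evaluations increasingly focus on zero- and few-shot learning and neglect this regime. Using the Natural World Tasks (NeWT) benchmark, the paper compares multi-modal large language models (MLLMs) with vision-only methods across training set sizes. MLLMs plateau early, while vision-only methods keep improving throughout the small-data regime, and the performance gap widens beyond 10 training examples. The paper argues for explicit small-data evaluations in AI research to better connect theoretical advances with practical deployment.

Link: https://arxiv.org/abs/2504.06486
Authors: Samuel Stevens, S M Rayeed, Jenna Kline
Affiliations: The Ohio State University; Rensselaer Polytechnic Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages (main text), 5 figures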

Abstract: The practical application of AI tools for specific computer vision tasks relies on the “small-data regime” of hundreds to thousands of labeled samples. This small-data regime is vital for applications requiring expensive expert annotations, such as ecological monitoring, medical diagnostics or industrial quality control. We find, however, that computer vision research has ignored the small data regime as evaluations increasingly focus on zero- and few-shot learning. We use the Natural World Tasks (NeWT) benchmark to compare multi-modal large language models (MLLMs) and vision-only methods across varying training set sizes. MLLMs exhibit early performance plateaus, while vision-only methods improve throughout the small-data regime, with performance gaps widening beyond 10 training examples. We provide the first comprehensive comparison between these approaches in small-data contexts and advocate for explicit small-data evaluations in AI research to better bridge theoretical advances with practical deployments.

[CV-78] Holistic Fusion: Task- and Setup-Agnostic Robot Localization and State Estimation with Factor Graphs

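Quick read: This paper addresses the simultaneous need of mobile robots in challenging environments for low-latency local motion estimation (e.g., dynamic maneuvers) and accurate global localization (e.g., wayfinding). Most existing sensor-fusion approaches are designed for specific scenarios; this work introduces Holistic Fusion (HF), a flexible open-source solution that is task- and setup-agnostic. HF formulates sensor fusion as a combined estimation problem over the local and global robot state and an arbitrary number of dynamic context variables, including automatic alignment of reference frames. Its factor-graph solution directly fuses absolute, local, and landmark measurements expressed in different reference frames by explicitly including the reference frames as states in the optimization and modeling their evolution as random walks. Local smoothness and consistency receive particular attention to prevent jumps in the robot state belief. HF provides low-latency, smooth online state estimation on typical robot hardware together with low-drift global localization at the IMU measurement rate. The framework is demonstrated in five real-world scenarios on three robotic platforms.

Link: https://arxiv.org/abs/2504.06479
Authors: Julian Nubert, Turcan Tuna, Jonas Frey, Cesar Cadena, Katherine J. Kuchenbecker, Shehryar Khattak, Marco Hutter
Affiliations: Robotic Systems Lab, ETH Zürich, Switzerland; Max Planck Institute for Intelligent Systems, Stuttgart, Germany; NASA Jet Propulsion Laboratory, California Institute of Technology, USA
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: 21 pages, 25 figures, 9 tables, journal submission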

Abstract: Seamless operation of mobile robots in challenging environments requires low-latency local motion estimation (e.g., dynamic maneuvers) and accurate global localization (e.g., wayfinding). While most existing sensor-fusion approaches are designed for specific scenarios, this work introduces a flexible open-source solution for task- and setup-agnostic multimodal sensor fusion that is distinguished by its generality and usability. Holistic Fusion formulates sensor fusion as a combined estimation problem of i) the local and global robot state and ii) a (theoretically unlimited) number of dynamic context variables, including automatic alignment of reference frames; this formulation fits countless real-world applications without any conceptual modifications. The proposed factor-graph solution enables the direct fusion of an arbitrary number of absolute, local, and landmark measurements expressed with respect to different reference frames by explicitly including them as states in the optimization and modeling their evolution as random walks. Moreover, local smoothness and consistency receive particular attention to prevent jumps in the robot state belief. HF enables low-latency and smooth online state estimation on typical robot hardware while simultaneously providing low-drift global localization at the IMU measurement rate. The efficacy of this released framework is demonstrated in five real-world scenarios on three robotic platforms, each with distinct task requirements.

[CV-79] Implementation of a Zed 2i Stereo Camera for High-Frequency Shoreline Change and Coastal Elevation Monitoring

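Quick read: This paper addresses the growing need for short-term monitoring of coastal elevation and shoreline change, driven by increasing population and financial interests in coastal areas; existing resources often lack the required temporal resolution (e.g., hourly). The solution is a low-cost ZED 2i stereo camera system combined with close-range photogrammetry, which collects images at a localized scale and high temporal resolution to generate 3D point clouds, digital surface models (DSMs) of beach elevation, and georectified imagery. The main contributions are intrinsic camera calibration, georectification and registration of the acquired imagery and point cloud, generation of the beach-elevation DSM, and a comparison against products from uncrewed aircraft system structure-from-motion photogrammetry. Preliminary results show that, despite its limitations, the ZED 2i can provide the desired mapping products at localized and high temporal scales. The system achieved a mean reprojection error of 0.20 px, a point cloud registration error of 27 cm, a vertical error of 37.56 cm relative to ground truth, and georectification root mean square errors of 2.67 cm and 2.81 cm in x and y.

Link: https://arxiv.org/abs/2504.06464
Authors: José A. Pilartes-Congo, Matthew Kastl, Michael J. Starek, Marina Vicens-Miquel, Philippe Tissot
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in IGARSS 2023 - 2023 IEEE International Geoscience and Remote Sensing Symposium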

Abstract: The increasing population, thus financial interests, in coastal areas have increased the need to monitor coastal elevation and shoreline change. Though several resources exist to obtain this information, they often lack the required temporal resolution for short-term monitoring (e.g., every hour). To address this issue, this study implements a low-cost ZED 2i stereo camera system and close-range photogrammetry to collect images for generating 3D point clouds, digital surface models (DSMs) of beach elevation, and georectified imagery at a localized scale and high temporal resolution. The main contributions of this study are (i) intrinsic camera calibration, (ii) georectification and registration of acquired imagery and point cloud, (iii) generation of the DSM of the beach elevation, and (iv) a comparison of derived products against those from uncrewed aircraft system structure-from-motion photogrammetry. Preliminary results show that despite its limitations, the ZED 2i can provide the desired mapping products at localized and high temporal scales. The system achieved a mean reprojection error of 0.20 px, a point cloud registration of 27 cm, a vertical error of 37.56 cm relative to ground truth, and georectification root mean square errors of 2.67 cm and 2.81 cm for x and y.
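
The per-axis root mean square errors quoted in this abstract follow the standard RMSE definition, sketched below. The check-point coordinates are invented for illustration and are unrelated to the paper's survey data; the function name is hypothetical.

```python
import math


def rmse(estimated, reference):
    """Root mean square error between paired estimated and reference values."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented easting (x) coordinates in metres for three check points.
surveyed_x = [100.00, 105.00, 110.00]
georectified_x = [100.03, 104.98, 110.02]
rmse_x_cm = rmse(georectified_x, surveyed_x) * 100.0  # convert metres to centimetres
```

Computing the same quantity separately for the y coordinates gives the second of the two georectification figures.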

[CV-80] D-Feat Occlusions: Diffusion Features for Robustness to Partial Visual Occlusions in Object Recognition

【速读】：该论文旨在解决物体识别任务中分类模型对遮挡（occlusions）鲁棒性不足的问题。论文的关键在于提出了一种利用冻结的扩散模型（frozen diffusion model）的管道，并引入扩散特征（diffusion features）来增强模型对遮挡的适应能力。论文假设这些扩散特征能够帮助模型“脑补”被遮挡物体背后的视觉特征，从而提升模型的遮挡鲁棒性。为此，作者设计了基于输入增强（input-based augmentations）和基于特征增强（feature-based augmentations）的实验方案：前者通过修复遮挡像素后微调模型，后者则将中间层的扩散特征融入分类特征。实验结果表明，所提出的扩散特征方法显著提升了ImageNet数据集上Transformer和卷积网络（ConvNets）在模拟遮挡情况下的性能，并进一步验证了其在包含真实世界遮挡的数据集上的有效性。

Link: https://arxiv.org/abs/2504.06432
Authors: Rupayan Mallick,Sibo Dong,Nataniel Ruiz,Sarah Adel Bargal
Affiliations: Georgetown University; Boston University; Google
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Applications of diffusion models for visual tasks have been quite noteworthy. This paper targets making classification models more robust to occlusions for the task of object recognition by proposing a pipeline that utilizes a frozen diffusion model. Diffusion features have demonstrated success in image generation and image completion while understanding image context. Occlusion can be posed as an image completion problem by deeming the pixels of the occluder to be “missing.” We hypothesize that such features can help hallucinate object visual features behind occluding objects, and hence we propose using them to enable models to become more occlusion robust. We design experiments to include input-based augmentations as well as feature-based augmentations. Input-based augmentations involve finetuning on images where the occluder pixels are inpainted, and feature-based augmentations involve augmenting classification features with intermediate diffusion features. We demonstrate that our proposed use of diffusion-based features results in models that are more robust to partial object occlusions for both Transformers and ConvNets on ImageNet with simulated occlusions. We also propose a dataset that encompasses real-world occlusions and demonstrate that our method is more robust to partial object occlusions.

[CV-81] PEEL the Layers and Find Yourself: Revisiting Inference-time Data Leakage for Residual Neural Networks

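Quick read: This paper studies inference-time data leakage in deep neural networks, asking how an honest-but-curious model service provider can recover users' private inputs from the model's inference results alone. It focuses on residual networks because of their wide use in computer vision, and hypothesizes that skip connections make residual blocks a primary cause of leakage. The paper formulates inference-time data leakage as a constrained optimization problem and proposes PEEL, a novel backward feature inversion method that recovers block-wise input features from the intermediate outputs of a residual network. The key intuition is that the output of a residual block can be viewed as a noisy version of its input and therefore retains enough information to recover that input. Experiments on facial image datasets and pre-trained classifiers show that this layer-by-layer inversion outperforms state-of-the-art recovery methods by an order of magnitude in mean squared error (MSE). The code is open source.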

Link: https://arxiv.org/abs/2504.06410
Authors: Huzaifa Arif,Keerthiram Murugesan,Payel Das,Alex Gittens,Pin-Yu Chen
Affiliations: Rensselaer Polytechnic Institute; IBM Research
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper explores inference-time data leakage risks of deep neural networks (NNs), where a curious and honest model service provider is interested in retrieving users’ private data inputs solely based on the model inference results. Particularly, we revisit residual NNs due to their popularity in computer vision and our hypothesis that residual blocks are a primary cause of data leakage owing to the use of skip connections. By formulating inference-time data leakage as a constrained optimization problem, we propose a novel backward feature inversion method, PEEL, which can effectively recover block-wise input features from the intermediate output of residual NNs. The surprising results in high-quality input data recovery can be explained by the intuition that the output from these residual blocks can be considered as a noisy version of the input and thus the output retains sufficient information for input recovery. We demonstrate the effectiveness of our layer-by-layer feature inversion method on facial image datasets and pre-trained classifiers. Our results show that PEEL outperforms the state-of-the-art recovery methods by an order of magnitude when evaluated by mean squared error (MSE). The code is available at this https URL
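
The intuition that a residual block's output is a perturbed copy of its input can be illustrated with a toy example. The sketch below is not PEEL itself (the abstract describes a constrained optimization, which is not reproduced here); it is a plain fixed-point inversion of a hypothetical block y = x + F(x), which works when F is a contraction. The function names and the choice F(x) = weight * tanh(x) are assumptions made for illustration.

```python
import math


def residual_block(x, weight=0.3):
    """Toy residual block: y = x + F(x), with F(x) = weight * tanh(x)."""
    return [xi + weight * math.tanh(xi) for xi in x]


def invert_residual_block(y, weight=0.3, steps=50):
    """Recover x from y by iterating x <- y - F(x).

    The iteration converges because F is Lipschitz with constant
    weight < 1 (a contraction); this is an assumption of the sketch.
    """
    x = list(y)  # start from the output, which is already close to the input
    for _ in range(steps):
        x = [yi - weight * math.tanh(xi) for yi, xi in zip(y, x)]
    return x


original = [0.5, -1.2, 2.0]
recovered = invert_residual_block(residual_block(original))
max_error = max(abs(a - b) for a, b in zip(original, recovered))
```

In this toy setting the skip connection is what makes the output a good starting point for the inversion; without it, the iteration would have no such anchor.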

[CV-82] PromptHMR: Promptable Human Mesh Recovery

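Quick read: This paper addresses human pose and shape (HPS) estimation in challenging settings such as crowded scenes, person-person interactions, and single-view reconstruction. Existing methods lack a mechanism for using side information to improve reconstruction accuracy. The most accurate ones rely on cropped person detections and cannot exploit scene context, while methods that process the whole image capture more context but often fail to detect people and are less accurate than crop-based methods. Recent language-based methods attempt HPS reasoning with large language or vision-language models, but their metric accuracy remains well below the state of the art. The paper proposes PromptHMR, a Transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. Its key feature is that it processes full images to preserve scene context and accepts several input modalities: spatial prompts (bounding boxes and masks) and semantic prompts (language descriptions or interaction labels). Experiments show that PromptHMR is robust across several challenging tasks and reaches state-of-the-art performance while offering flexible prompt-based control over the HPS estimation process.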

Link: https://arxiv.org/abs/2504.06397
Authors: Yufu Wang,Yu Sun,Priyanka Patel,Kostas Daniilidis,Michael J. Black,Muhammed Kocabas
Affiliations: Meshcapade; MPI for Intelligent Systems; ETH Zürich; University of Pennsylvania; Archimedes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human pose and shape (HPS) estimation presents challenges in diverse scenarios such as crowded scenes, person-person interactions, and single-view reconstruction. Existing approaches lack mechanisms to incorporate auxiliary “side information” that could enhance reconstruction accuracy in such challenging scenarios. Furthermore, the most accurate methods rely on cropped person detections and cannot exploit scene context while methods that process the whole image often fail to detect people and are less accurate than methods that use crops. While recent language-based methods explore HPS reasoning through large language or vision-language models, their metric accuracy is well below the state of the art. In contrast, we present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. Our method processes full images to maintain scene context and accepts multiple input modalities: spatial prompts like bounding boxes and masks, and semantic prompts like language descriptions or interaction labels. PromptHMR demonstrates robust performance across challenging scenarios: estimating people from bounding boxes as small as faces in crowded scenes, improving body shape estimation through language descriptions, modeling person-person interactions, and producing temporally coherent motions in videos. Experiments on benchmarks show that PromptHMR achieves state-of-the-art performance while offering flexible prompt-based control over the HPS estimation process.

[CV-83] SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation CVPR2025

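Quick read: This paper tackles semi-supervised domain adaptation (SSDA) for semantic segmentation, in particular the confusion between classes with similar visual appearance and the skewed training data that biases representation learning toward the source domain. Its key contribution is language guidance: the rich semantic relationships encoded in pre-trained language models serve as a semantic bridge that enhances feature representations across domains, mitigating visual ambiguity and enabling robust class discrimination. To handle the class imbalance caused by long-tailed distributions, the paper also introduces class-balanced segmentation losses that regularize learning. Together these components form a synergistic framework that substantially improves performance across a range of domain adaptation scenarios.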

Link: https://arxiv.org/abs/2504.06389
Authors: Hritam Basak,Zhaozheng Yin
Affiliations: Dept. of Computer Science, Stony Brook University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025

Abstract:Domain Adaptation (DA) and Semi-supervised Learning (SSL) converge in Semi-supervised Domain Adaptation (SSDA), where the objective is to transfer knowledge from a source domain to a target domain using a combination of limited labeled target samples and abundant unlabeled target data. Although intuitive, a simple amalgamation of DA and SSL is suboptimal in semantic segmentation due to two major reasons: (1) previous methods, while able to learn good segmentation boundaries, are prone to confuse classes with similar visual appearance due to limited supervision; and (2) skewed and imbalanced training data distribution preferring source representation learning whereas impeding from exploring limited information about tailed classes. Language guidance can serve as a pivotal semantic bridge, facilitating robust class discrimination and mitigating visual ambiguities by leveraging the rich semantic relationships encoded in pre-trained language models to enhance feature representations across domains. Therefore, we propose the first language-guided SSDA setting for semantic segmentation in this work. Specifically, we harness the semantic generalization capabilities inherent in vision-language models (VLMs) to establish a synergistic framework within the SSDA paradigm. To address the inherent class-imbalance challenges in long-tailed distributions, we introduce class-balanced segmentation loss formulations that effectively regularize the learning process. Through extensive experimentation across diverse domain adaptation scenarios, our approach demonstrates substantial performance improvements over contemporary state-of-the-art (SoTA) methodologies. Code is available on GitHub: this https URL.

[CV-84] Fast Globally Optimal and Geometrically Consistent 3D Shape Matching

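Quick read: This paper addresses the computation of globally optimal and geometrically consistent matchings between 3D shapes, a goal that in practice is often overlooked or only achieved under severely limiting assumptions such as a good initialization. Its key innovation is a novel formalism that represents the surface of the source shape as a collection of cyclic paths and matches them consistently to the target shape. Mathematically, the authors construct a hyper product graph between the two shapes and cast 3D shape matching as a minimum-cost circulation flow problem on this hypergraph, which yields a globally geometrically consistent matching. The formalism is scalable in practice, and the authors show empirically that it is efficiently solvable and gives high-quality results.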

Link: https://arxiv.org/abs/2504.06385
Authors: Paul Roetzer,Florian Bernard
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages main paper

Abstract:Geometric consistency, i.e. the preservation of neighbourhoods, is a natural and strong prior in 3D shape matching. Geometrically consistent matchings are crucial for many downstream applications, such as texture transfer or statistical shape modelling. Yet, in practice, geometric consistency is often overlooked, or only achieved under severely limiting assumptions (e.g. a good initialisation). In this work, we propose a novel formalism for computing globally optimal and geometrically consistent matchings between 3D shapes which is scalable in practice. Our key idea is to represent the surface of the source shape as a collection of cyclic paths, which are then consistently matched to the target shape. Mathematically, we construct a hyper product graph (between source and target shape), and then cast 3D shape matching as a minimum-cost circulation flow problem in this hyper graph, which yields global geometrically consistent matchings between both shapes. We empirically show that our formalism is efficiently solvable and that it leads to high-quality results.

[CV-85] Towards Calibration Enhanced Network by Inverse Adversarial Attack

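Quick read: This paper addresses the difficulty that optical character recognition (OCR) models have with noise when the validation of human-machine interface (HMI) screens is automated. Its key solution is adversarial training to make OCR models more robust. Specifically, the authors design a new adversarial attack objective to discover the decision boundaries in HMI testing scenarios, then use adversarial training to optimize those boundaries toward a more robust and accurate OCR model. They also build an HMI screen dataset based on real-world requirements and apply several types of perturbation to the clean dataset to cover more potential scenarios. Experiments show that adversarially trained OCR models are more robust to various kinds of noise while maintaining high accuracy, and are even robust to some degree against other perturbation patterns.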

Link: https://arxiv.org/abs/2504.06358
Authors: Yupeng Cheng,Zi Pong Lim,Sarthak Ketanbhai Modi,Yon Shin Teo,Yushi Cao,Shang-Wei Lin
Affiliations: Nanyang Technological University; Continental Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages

Abstract:Test automation has become increasingly important as the complexity of both design and content in Human Machine Interface (HMI) software continues to grow. Current standard practice uses Optical Character Recognition (OCR) techniques to automatically extract textual information from HMI screens for validation. At present, one of the key challenges faced during the automation of HMI screen validation is the noise handling for the OCR models. In this paper, we propose to utilize adversarial training techniques to enhance OCR models in HMI testing scenarios. More specifically, we design a new adversarial attack objective for OCR models to discover the decision boundaries in the context of HMI testing. We then adopt adversarial training to optimize the decision boundaries towards a more robust and accurate OCR model. In addition, we also built an HMI screen dataset based on real-world requirements and applied multiple types of perturbation onto the clean HMI dataset to provide a more complete coverage for the potential scenarios. We conduct experiments to demonstrate how using adversarial training techniques yields more robust OCR models against various kinds of noises, while still maintaining high OCR model accuracy. Further experiments even demonstrate that the adversarial training models exhibit a certain degree of robustness against perturbations from other patterns.

[CV-86] From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction CVPR2025

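Quick read: This paper addresses Game State Reconstruction (GSR), a key task in sports video understanding whose goal is to track and localize every individual on a football field (players, goalkeepers, referees, and others) in real-world coordinates. It proposes a robust end-to-end pipeline for single-camera setups that tracks players across an entire match. The solution combines a fine-tuned YOLOv5m for object detection, a SegFormer-based camera parameter estimator, and a DeepSORT-based tracking framework enhanced with re-identification, orientation prediction, and jersey number recognition. By ensuring both spatial accuracy and temporal consistency, it achieves state-of-the-art game state reconstruction and took first place in the SoccerNet Game State Reconstruction Challenge 2024.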

Link: https://arxiv.org/abs/2504.06357
Authors: Vladimir Golovkin,Nikolay Nemtsev,Vasyl Shandyba,Oleg Udin,Nikita Kasatkin,Pavel Kononov,Anton Afanasiev,Sergey Ulasen,Andrei Boiarov
Affiliations: Constructor Tech; Sofia
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for presentation at the CVPR 2025 CVsports Workshop

Abstract:Game State Reconstruction (GSR), a critical task in Sports Video Understanding, involves precise tracking and localization of all individuals on the football field (players, goalkeepers, referees, and others) in real-world coordinates. This capability enables coaches and analysts to derive actionable insights into player movements, team formations, and game dynamics, ultimately optimizing training strategies and enhancing competitive advantage. Achieving accurate GSR using a single-camera setup is highly challenging due to frequent camera movements, occlusions, and dynamic scene content. In this work, we present a robust end-to-end pipeline for tracking players across an entire match using a single-camera setup. Our solution integrates a fine-tuned YOLOv5m for object detection, a SegFormer-based camera parameter estimator, and a DeepSORT-based tracking framework enhanced with re-identification, orientation prediction, and jersey number recognition. By ensuring both spatial accuracy and temporal consistency, our method delivers state-of-the-art game state reconstruction, securing first place in the SoccerNet Game State Reconstruction Challenge 2024 and significantly outperforming competing methods.

[CV-87] Analyzing the Impact of Low-Rank Adaptation for Cross-Domain Few-Shot Object Detection in Aerial Images

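Quick read: This paper addresses overfitting in cross-domain few-shot object detection for aerial images, especially the challenge of efficient fine-tuning in resource-constrained settings. Its key approach is to apply Low-Rank Adaptation (LoRA) to small models by integrating it into DiffusionDet, evaluated on the DOTA and DIOR datasets. The study finds that applying LoRA after an initial fine-tuning slightly improves performance in low-shot settings (e.g., 1-shot and 5-shot), while full fine-tuning remains more effective in higher-shot configurations. The main appeal of the approach is that parameter-efficient fine-tuning helps mitigate overfitting, which makes it a promising option for cross-domain adaptation in aerial object detection.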

Link: https://arxiv.org/abs/2504.06330
Authors: Hicham Talaoubrid,Anissa Mokraoui,Ismail Ben Ayed,Axel Prouvost,Sonimith Hang,Monit Korn,Rémi Harvey
Affiliations: L2TI & COSE, Université Sorbonne Paris Nord; LIVIA, ETS, Montreal, Canada; IMT Mines Alès, France; COSE, Montmagny, France
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper investigates the application of Low-Rank Adaptation (LoRA) to small models for cross-domain few-shot object detection in aerial images. Originally designed for large-scale models, LoRA helps mitigate overfitting, making it a promising approach for resource-constrained settings. We integrate LoRA into DiffusionDet, and evaluate its performance on the DOTA and DIOR datasets. Our results show that LoRA applied after an initial fine-tuning slightly improves performance in low-shot settings (e.g., 1-shot and 5-shot), while full fine-tuning remains more effective in higher-shot configurations. These findings highlight LoRA’s potential for efficient adaptation in aerial object detection, encouraging further research into parameter-efficient fine-tuning strategies for few-shot learning. Our code is available here: this https URL.
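
The abstract does not spell out how LoRA changes a layer, so the following is a minimal sketch of the general idea only, not the authors' DiffusionDet integration. A frozen weight matrix W is augmented with a trainable low-rank product A·B scaled by alpha / rank. All names here (`lora_forward`, `trainable_params`, the toy matrices) are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha, rank):
    """Compute y = x W + (alpha / rank) * x A B.

    W stays frozen; only A (d_in x rank) and B (rank x d_out) would be trained.
    """
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    scale = alpha / rank
    return [[b + scale * u for b, u in zip(base_row, update_row)]
            for base_row, update_row in zip(base, update)]


def trainable_params(d_in, d_out, rank):
    """Number of LoRA parameters, versus d_in * d_out for full fine-tuning."""
    return rank * (d_in + d_out)


# Toy 4x4 layer with rank-1 adapters; B starts at zero, so the adapted
# layer initially reproduces the frozen layer exactly.
x = [[1.0, 2.0, 3.0, 4.0]]
w = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[0.1], [0.2], [0.3], [0.4]]
b = [[0.0, 0.0, 0.0, 0.0]]
y = lora_forward(x, w, a, b, alpha=2.0, rank=1)
```

Here the adapter trains 8 parameters instead of 16. The smaller number of trainable parameters is the property the abstract links to reduced overfitting in low-shot settings.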

[CV-88] Analyzing How Text-to-Image Models Represent Nationalities in Everyday Tasks

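Quick read: This paper investigates how a popular Text-to-Image (T2I) model represents people of 208 different nationalities when asked to generate images of individuals performing typical everyday tasks. Two scenarios were designed, and images were generated from input prompts that specified a nationality. The core of the analysis is the share of generated images that show traditional attire and how it relates to region and income group, which exposes potential bias in the model's representations. CLIP was used to measure alignment scores between the generated images and various prompts and captions. The study also examines revised prompts (additional contextual information automatically added to the original prompt) for their potential influence on how individuals are depicted, and finds that the word “traditional” is commonly added to them. These findings offer useful insights for improving future models.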

Link: https://arxiv.org/abs/2504.06313
Authors: Abdulkareem Alsudais
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments:

Abstract:The primary objective of this paper is to investigate how a popular Text-to-Image (T2I) model represents people from 208 different nationalities when prompted to generate images of individuals performing typical everyday tasks. Two scenarios were developed, and images were generated based on input prompts that specified nationalities. The results show that in one scenario, the majority of images, and in the other, a substantial portion, depict individuals wearing traditional attire. This suggests that the model emphasizes such characteristics even when they are impractical for the given task. A statistically significant relationship was observed between this representation pattern and the regions associated with the specified countries. This indicates that the issue disproportionately affects certain areas, particularly the Middle East North Africa and Sub-Saharan Africa. A notable association with income groups was also found. CLIP was used to measure alignment scores between generated images and various prompts and captions. The findings indicate statistically significant higher scores for images featuring individuals in traditional attire in one scenario. The study also examined revised prompts (additional contextual information automatically added to the original input prompts) to assess their potential influence on how individuals are represented in the generated images, finding that the word “traditional” was commonly added to revised prompts. These findings provide valuable insights into how T2I models represent individuals from various countries and highlight potential areas for improvement in future models.

[CV-89] Ternarization of Vision Language Models for use on edge devices

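Quick read: This paper addresses how to efficiently compress a pre-trained Vision Language Model into a ternary version of itself instead of training a ternary model from scratch. Its key solution is a new initialization scheme, based on the k-means algorithm, that derives the ternary representation from the pre-trained weights and reduces ternarization time. The authors also implement custom operators for executing the ternary model on the TensorFlow Lite engine. Comparing the original model with its ternary and binary versions, they find that the ternary model with the custom ternary matrix multiplication operator offers a good compromise between memory usage and perplexity while having the fastest token generation speed.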

Link: https://arxiv.org/abs/2504.06298
Authors: Ben Crulis,Cyril De Runz,Barthelemy Serres,Gilles Venturini
Affiliations: LIFAT; University of Tours
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We propose a process to compress a pre-trained Vision Language Model into a ternary version of itself instead of training a ternary model from scratch. A new initialization scheme from pre-trained weights based on the k-means algorithm is proposed to reduce the ternarization time. We implement different custom operators for executing the ternary model on the TensorFlow Lite Engine. We compare the original model with its ternary and binary versions in terms of memory consumption, inference speed and perplexity. We find that the ternary model using our custom ternary matrix multiplication operator provides a good compromise in terms of memory usage and perplexity, while having the fastest token generation speed.
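
The abstract gives no details of the k-means-based initialization, so the sketch below is only one plausible reading, stated as an assumption. It runs 1-D k-means with three clusters on a list of weights, then maps the lowest, middle, and highest clusters to the codes -1, 0, and +1 with a single shared scale. The function names and the scale formula are hypothetical and are not taken from the paper.

```python
def kmeans_1d(values, centroids, iterations=20):
    """Plain Lloyd iterations on scalar values."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid when a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def ternarize(weights):
    """Return ternary codes in {-1, 0, +1} and one shared scale (assumed form)."""
    centroids = sorted(kmeans_1d(weights, [min(weights), 0.0, max(weights)]))
    # Assumed scale: mean magnitude of the two outer centroids.
    scale = (abs(centroids[0]) + abs(centroids[2])) / 2
    codes = []
    for w in weights:
        nearest = min(range(3), key=lambda i: abs(w - centroids[i]))
        codes.append(nearest - 1)  # cluster index 0, 1, 2 -> code -1, 0, +1
    return codes, scale


codes, scale = ternarize([-0.9, -1.1, 0.05, -0.05, 1.0, 0.8])
# A weight is then approximated as code * scale.
```

Starting from the pre-trained weights in this way avoids learning the ternary assignment from scratch, which is the time saving the abstract refers to.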

[CV-90] Temporal-contextual Event Learning for Pedestrian Crossing Intent Prediction ICONIP2024

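Quick read: This paper addresses pedestrian crossing intention (PCI) prediction from observed video frames, where the high redundancy of the frames makes it hard to capture critical behavioural events along the temporal dimension and leads to sub-optimal prediction. Its key solution is a new method called Temporal-contextual Event Learning (TCL), which combines a Temporal Merging Module (TMM) and a Contextual Attention Block (CAB). The TMM reduces redundancy by clustering the observed video frames into multiple key temporal events; the CAB adaptively aggregates the features of these events together with visual and non-visual data. By combining temporal feature extraction with contextual attention over the key information in the critical events, TCL learns a more expressive representation for PCI prediction. Experiments show that TCL outperforms state-of-the-art methods on three datasets: PIE, JAAD-beh, and JAAD-all.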

Link: https://arxiv.org/abs/2504.06292
Authors: Hongbin Liang,Hezhe Qiao,Wei Huang,Qizhou Wang,Mingsheng Shang,Lin Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in ICONIP2024

Abstract:Ensuring the safety of vulnerable road users through accurate prediction of pedestrian crossing intention (PCI) plays a crucial role in the context of autonomous and assisted driving. Analyzing the set of observation video frames in ego-view has been widely used in most PCI prediction methods to forecast the cross intent. However, they struggle to capture the critical events related to pedestrian behaviour along the temporal dimension due to the high redundancy of the video frames, which results in the sub-optimal performance of PCI prediction. Our research addresses the challenge by introducing a novel approach called Temporal-contextual Event Learning (TCL). The TCL is composed of the Temporal Merging Module (TMM), which aims to manage the redundancy by clustering the observed video frames into multiple key temporal events. Then, the Contextual Attention Block (CAB) is employed to adaptively aggregate multiple event features along with visual and non-visual data. By synthesizing the temporal feature extraction and contextual attention on the key information across the critical events, TCL can learn expressive representation for the PCI prediction. Extensive experiments are carried out on three widely adopted datasets, including PIE, JAAD-beh, and JAAD-all. The results show that TCL substantially surpasses the state-of-the-art methods. Our code can be accessed at this https URL.

[CV-91] Longitudinal Assessment of Lung Lesion Burden in CT

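Quick read: This paper addresses the tracking of longitudinal changes in total lung tumor burden. Most existing work focuses on lung nodule segmentation and volumetric analysis, and few studies look at how total tumor burden changes over time. The authors train two 3D nnUNet models, one with anatomical priors and one without, to segment lung lesions automatically and quantify the total lesion burden for each patient. The model without priors significantly outperformed the model with priors (p < 0.001). For detecting clinically significant lesions ≥ 1 cm, the method achieved a precision of 71.3%, a sensitivity of 68.4%, and an F1-score of 69.8%; for segmentation, it obtained a Dice score of 77.1 ± 20.3 and a Hausdorff distance error of 11.7 ± 24.1 mm. The median volume difference between manual and automated measurements was 0.02 cc (IQR: -2.8, 1.2), and agreement was also assessed with linear regression and Bland-Altman plots. The approach can give a personalized evaluation of a patient's total tumor burden and support tracking of interval change over time.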

Link: https://arxiv.org/abs/2504.06924
Authors: Tejas Sudharshan Mathai,Benjamin Hou,Ronald M. Summers
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at SPIE Medical Imaging 2025

Abstract:In the U.S., lung cancer is the second major cause of death. Early detection of suspicious lung nodules is crucial for patient treatment planning, management, and improving outcomes. Many approaches for lung nodule segmentation and volumetric analysis have been proposed, but few have looked at longitudinal changes in total lung tumor burden. In this work, we trained two 3D models (nnUNet) with and without anatomical priors to automatically segment lung lesions and quantified total lesion burden for each patient. The 3D model without priors significantly outperformed (p < 0.001) the model trained with anatomy priors. For detecting clinically significant lesions ≥ 1 cm, a precision of 71.3%, sensitivity of 68.4%, and F1-score of 69.8% was achieved. For segmentation, a Dice score of 77.1 ± 20.3 and Hausdorff distance error of 11.7 ± 24.1 mm was obtained. The median lesion burden was 6.4 cc (IQR: 2.1, 18.1) and the median volume difference between manual and automated measurements was 0.02 cc (IQR: -2.8, 1.2). Agreements were also evaluated with linear regression and Bland-Altman plots. The proposed approach can produce a personalized evaluation of the total tumor burden for a patient and facilitate interval change tracking over time.
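
As a quick consistency check on the detection figures in the abstract, the F1-score is the harmonic mean of precision and sensitivity. The short sketch below (the function name is chosen here for illustration) recomputes it from the two reported percentages.

```python
def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Reported precision and sensitivity, in percent.
reported_f1 = f1_score(71.3, 68.4)
```

Rounded to one decimal place, the result agrees with the 69.8% stated in the abstract.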

[CV-92] Leveraging Anatomical Priors for Automated Pancreas Segmentation on Abdominal CT

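Quick read: This paper addresses accurate segmentation of the pancreas on CT, which is crucial for identifying pancreatic pathologies and extracting imaging-based biomarkers. Prior work has focused mainly on modifying the segmentation model architecture or on pre- and post-processing techniques. This paper instead uses anatomical priors to improve pancreas segmentation. Two 3D full-resolution nnU-Net models were trained: one with 8 refined labels from the public PANORAMA dataset, and one that combined them with labels derived from the TotalSegmentator tool. Adding anatomical priors increased the Dice score by 6% (p < 0.001) and reduced the Hausdorff distance by 36.5 mm (p < 0.001). The pancreas was always detected when priors were used, whereas detection failed in 8 instances without them. The results suggest that anatomical priors hold promise for pancreas segmentation and the subsequent derivation of imaging biomarkers.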

Link: https://arxiv.org/abs/2504.06921
Authors: Anisa V. Prasad,Tejas Sudharshan Mathai,Pritam Mukherjee,Jianfei Liu,Ronald M. Summers
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at SPIE Medical Imaging 2025

Abstract:An accurate segmentation of the pancreas on CT is crucial to identify pancreatic pathologies and extract imaging-based biomarkers. However, prior research on pancreas segmentation has primarily focused on modifying the segmentation model architecture or utilizing pre- and post-processing techniques. In this article, we investigate the utility of anatomical priors to enhance the segmentation performance of the pancreas. Two 3D full-resolution nnU-Net models were trained, one with 8 refined labels from the public PANORAMA dataset, and another that combined them with labels derived from the public TotalSegmentator (TS) tool. The addition of anatomical priors resulted in a 6% increase in Dice score (p < 0.001) and a 36.5 mm decrease in Hausdorff distance for pancreas segmentation (p < 0.001). Moreover, the pancreas was always detected when anatomy priors were used, whereas there were 8 instances of failed detections without their use. The use of anatomy priors shows promise for pancreas segmentation and subsequent derivation of imaging biomarkers.
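
For readers unfamiliar with the Dice score quoted above, here is a minimal sketch of how it is computed for two binary masks: twice the overlap divided by the total foreground in both masks. The flattened toy masks and the function name are illustrative assumptions; real evaluations run on 3D voxel arrays.

```python
def dice_score(pred, truth):
    """Dice = 2 * |pred AND truth| / (|pred| + |truth|) for 0/1 masks."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2 * intersection / total


predicted_mask = [1, 1, 0, 0, 1]
reference_mask = [1, 0, 0, 1, 1]
dice = dice_score(predicted_mask, reference_mask)  # overlap 2, foreground 3 + 3
```

A missed detection, where the predicted mask is empty but the reference is not, gives a Dice score of 0, which is why the 8 failed detections without priors weigh on the comparison.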

[CV-93] DIMA: DIffusing Motion Artifacts for unsupervised correction in brain MRI images

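Quick read: This paper addresses motion artifacts in magnetic resonance imaging (MRI), a persistent challenge in clinical diagnosis that degrades image quality and can lead to misdiagnosis or repeated scans. Existing deep learning methods for motion artifact correction usually need paired motion-free and motion-affected images for training, which are rarely available in clinical practice. The paper proposes DIMA (DIffusing Motion Artifacts), a framework that uses diffusion models for unsupervised motion artifact correction in brain MRI. DIMA works in two phases: it first trains a diffusion model on unpaired motion-affected images to learn the distribution of motion artifacts, then uses that model to generate realistic motion artifacts on clean images, creating a paired dataset suitable for supervised training of correction networks. Unlike existing methods, DIMA requires neither k-space manipulation nor detailed knowledge of MRI sequence parameters, so it can adapt to different scanning protocols and hardware. Evaluations across multiple datasets and anatomical planes show performance comparable to state-of-the-art supervised methods, with better generalization to real clinical data.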

Link: https://arxiv.org/abs/2504.06767
Authors: Paolo Angella,Luca Balbi,Fabrizio Ferrando,Paolo Traverso,Rosario Varriale,Vito Paolo Pastore,Matteo Santacesaria
Affiliations: MaLGa Center, DIMA, University of Genoa, Italy; MaLGa Center, DIBRIS, University of Genoa, Italy; Esaote S.p.A., Italy
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 5 figures, 7 tables

Abstract:Motion artifacts remain a significant challenge in Magnetic Resonance Imaging (MRI), compromising diagnostic quality and potentially leading to misdiagnosis or repeated scans. Existing deep learning approaches for motion artifact correction typically require paired motion-free and motion-affected images for training, which are rarely available in clinical settings. To overcome this requirement, we present DIMA (DIffusing Motion Artifacts), a novel framework that leverages diffusion models to enable unsupervised motion artifact correction in brain MRI. Our two-phase approach first trains a diffusion model on unpaired motion-affected images to learn the distribution of motion artifacts. This model then generates realistic motion artifacts on clean images, creating paired datasets suitable for supervised training of correction networks. Unlike existing methods, DIMA operates without requiring k-space manipulation or detailed knowledge of MRI sequence parameters, making it adaptable across different scanning protocols and hardware. Comprehensive evaluations across multiple datasets and anatomical planes demonstrate that our method achieves comparable performance to state-of-the-art supervised approaches while offering superior generalizability to real clinical data. DIMA represents a significant advancement in making motion artifact correction more accessible for routine clinical use, potentially reducing the need for repeat scans and improving diagnostic accuracy.

[CV-94] Image registration of 2D optical thin sections in a 3D porous medium: Application to a Berea sandstone digital rock image

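Quick read: This paper addresses how to systematically align 2D optical thin-section images with a plane inside a 3D digital rock volume. The authors propose template image matching combined with differential evolution optimization, which registers a thin section by finding the most similar 2D plane in the 3D data. The key is using differential evolution to optimize the template matching so that registration is accurate. Validation on a synthetic porous medium shows exact registration, and application to a Berea sandstone sample gives a structural similarity index (SSIM) of 0.990.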

Link: https://arxiv.org/abs/2504.06604
Authors: Jaehong Chung,Wei Cai,Tapan Mukerji
Affiliations: Department of Geophysics, Stanford University, Stanford, CA, USA; Department of Mechanical Engineering, Stanford University, Stanford, CA, USA; Department of Energy Science and Engineering, Stanford University, Stanford, CA, USA
Categories: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study proposes a systematic image registration approach to align 2D optical thin-section images within a 3D digital rock volume. Using template image matching with differential evolution optimization, we identify the most similar 2D plane in 3D. The method is validated on a synthetic porous medium, achieving exact registration, and applied to Berea sandstone, where it achieves a structural similarity index (SSIM) of 0.990. With the registered images, we explore upscaling properties based on paired multimodal images, focusing on pore characteristics and effective elastic moduli. The thin-section image reveals 50 % more porosity and submicron pores than the registered CT plane. In addition, bulk and shear moduli from thin sections are 25 % and 30 % lower, respectively, than those derived from CT images. Beyond numerical comparisons, thin sections provide additional geological insights, including cementation, mineral phases, and weathering effects, which are not clear in CT images. This study demonstrates the potential of multimodal image registration to improve computed rock properties in digital rock physics by integrating complementary imaging modalities.
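
The registration idea can be illustrated with a heavily simplified sketch. The assumptions are stated plainly: it searches only axis-aligned slices of a tiny volume by brute force, where the paper uses differential evolution over plane parameters, and it scores candidates with a single-window (global) SSIM rather than a windowed implementation. All function names and data are hypothetical.

```python
def global_ssim(a, b, dynamic_range=1.0):
    """Single-window SSIM between two equally sized, flattened images."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2  # standard stabilizing constants
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_a * mean_b + c1) * (2 * cov + c2)
    denominator = (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def best_matching_slice(volume, template):
    """Return (index, score) of the slice most similar to the template."""
    flat_template = [v for row in template for v in row]
    scores = [global_ssim([v for row in plane for v in row], flat_template)
              for plane in volume]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy "CT volume" of three 2x2 slices and a "thin section" equal to slice 1.
volume = [
    [[0.0, 0.1], [0.2, 0.3]],
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.5, 0.5], [0.5, 0.6]],
]
thin_section = [[0.9, 0.1], [0.1, 0.9]]
index, score = best_matching_slice(volume, thin_section)
```

Brute force is only feasible here because the search space is three slices. For arbitrary plane positions and orientations the space is continuous, which is where a global optimizer such as differential evolution becomes useful.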

[CV-95] AstroClearNet: Deep image prior for multi-frame astronomical image restoration

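Quick read: This paper addresses the fundamental problem of recovering high-fidelity images of the night sky from blurred astronomical observations. In ground-based astronomy, the usual practice of combining multiple exposures to raise the signal-to-noise ratio is further complicated by variations in the point-spread function caused by atmospheric turbulence. The paper proposes a self-supervised multi-frame method, based on deep image priors, for denoising, deblurring, and coadding ground-based exposures. Its core is a carefully designed convolutional neural network that integrates information across multiple observations and enforces physically motivated constraints. Processing Hyper Suprime-Cam exposures demonstrates the method's potential, with preliminary results showing sharper restored images.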

Link: https://arxiv.org/abs/2504.06463
Authors: Yashil Sukurdeep,Fausto Navarro,Tamás Budavári
Affiliations: Unknown
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recovering high-fidelity images of the night sky from blurred observations is a fundamental problem in astronomy, where traditional methods typically fall short. In ground-based astronomy, combining multiple exposures to enhance signal-to-noise ratios is further complicated by variations in the point-spread function caused by atmospheric turbulence. In this work, we present a self-supervised multi-frame method, based on deep image priors, for denoising, deblurring, and coadding ground-based exposures. Central to our approach is a carefully designed convolutional neural network that integrates information across multiple observations and enforces physically motivated constraints. We demonstrate the method’s potential by processing Hyper Suprime-Cam exposures, yielding promising preliminary results with sharper restored images.

[CV-96] Retuve: Automated Multi-Modality Analysis of Hip Dysplasia with Open Source AI

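Quick read: This paper addresses the diagnostic challenges of developmental dysplasia of the hip (DDH), which hinder timely intervention. Current screening methods lack standardization, and AI-driven studies suffer from reproducibility problems because data and code are rarely available. To address these limitations, the paper introduces Retuve, an open-source framework for multi-modality DDH analysis covering both ultrasound (US) and X-ray imaging. Retuve provides a complete and reproducible workflow: open datasets of expert-annotated images, pre-trained models with training code and weights, and a user-friendly Python API. It integrates segmentation and landmark detection models to automate the measurement of key diagnostic parameters such as the alpha angle and the acetabular index. By following open-source principles, Retuve promotes transparency, collaboration, and accessibility in DDH research, with the potential to democratize screening, enable earlier diagnosis, and improve patient outcomes.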

Link: https://arxiv.org/abs/2504.06422
Authors: Adam McArthur,Stephanie Wichuk,Stephen Burnside,Andrew Kirby,Alexander Scammon,Damian Sol,Abhilash Hareendranathan,Jacob L. Jaremko
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures, submitted to Software Impacts

Abstract:Developmental dysplasia of the hip (DDH) poses significant diagnostic challenges, hindering timely intervention. Current screening methodologies lack standardization, and AI-driven studies suffer from reproducibility issues due to limited data and code availability. To address these limitations, we introduce Retuve, an open-source framework for multi-modality DDH analysis, encompassing both ultrasound (US) and X-ray imaging. Retuve provides a complete and reproducible workflow, offering open datasets comprising expert-annotated US and X-ray images, pre-trained models with training code and weights, and a user-friendly Python Application Programming Interface (API). The framework integrates segmentation and landmark detection models, enabling automated measurement of key diagnostic parameters such as the alpha angle and acetabular index. By adhering to open-source principles, Retuve promotes transparency, collaboration, and accessibility in DDH research. This initiative has the potential to democratize DDH screening, facilitate early diagnosis, and ultimately improve patient outcomes by enabling widespread screening and early intervention. The GitHub repository/code can be found here: this https URL

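[CV-97] Leveraging State Space Models in Long Range Genomics ICLR2025 ICLR

[Quick Read]: This paper addresses the weakness of existing methods at handling long-range dependencies in genomics. Conventional methods struggle with them, and widely adopted Transformer-based models are limited by the quadratic cost of attention (O(n²)) and by poor extrapolation to sequences longer than those seen in training. The key to the solution is to explore State Space Models (SSMs) as an alternative: two SSM-inspired architectures, Caduceus and Hawk, are benchmarked on long-range genomic modeling tasks under conditions parallel to a 50M-parameter Transformer baseline. The SSMs match Transformer performance and also extrapolate well zero-shot across multiple tasks, handling contexts 10 to 100 times longer than those seen in training, which suggests more generalizable representations that are better suited to the long and complex human genome. These models can also process sequences of 1M tokens efficiently on a single GPU, so entire genomic regions can be modeled at once, even in labs with limited compute. The core contribution is evidence that SSMs are efficient and scalable for long-context genomic analysis.

Link: https://arxiv.org/abs/2504.06304
Authors: Matvei Popov,Aymen Kallala,Anirudha Ramesh,Narimane Hennouni,Shivesh Khaitan,Rick Gentry,Alain-Sam Cohen
Affiliations: InstaDeep
Categories: Genomics (q-bio.GN); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at ICLR 2025 (LMRL) - Project page: this https URL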

Abstract:Long-range dependencies are critical for understanding genomic structure and function, yet most conventional methods struggle with them. Widely adopted transformer-based models, while excelling at short-context tasks, are limited by the attention module’s quadratic computational complexity and inability to extrapolate to sequences longer than those seen in training. In this work, we explore State Space Models (SSMs) as a promising alternative by benchmarking two SSM-inspired architectures, Caduceus and Hawk, on long-range genomics modeling tasks under conditions parallel to a 50M parameter transformer baseline. We discover that SSMs match transformer performance and exhibit impressive zero-shot extrapolation across multiple tasks, handling contexts 10 to 100 times longer than those seen during training, indicating more generalizable representations better suited for modeling the long and complex human genome. Moreover, we demonstrate that these models can efficiently process sequences of 1M tokens on a single GPU, allowing for modeling entire genomic regions at once, even in labs with limited compute. Our findings establish SSMs as efficient and scalable for long-context genomic analysis.
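The scaling argument can be illustrated with a toy linear SSM whose cost grows linearly with sequence length, in contrast to the quadratic cost of attention. This is a minimal sketch of a diagonal, time-invariant SSM with made-up parameters. It is not Caduceus or Hawk, which add gating, learned parameterizations, and other components.

```python
def ssm_scan(a, b, c, inputs):
    """Run a diagonal linear SSM: h_t = a*h_{t-1} + b*u_t, y_t = sum(c*h_t).

    a, b, c are equal-length lists, one entry per state dimension.
    Cost is O(len(inputs) * len(a)), i.e. linear in sequence length.
    """
    h = [0.0] * len(a)
    outputs = []
    for u in inputs:
        h = [ai * hi + bi * u for ai, hi, bi in zip(a, h, b)]
        outputs.append(sum(ci * hi for ci, hi in zip(c, h)))
    return outputs

def ssm_kernel(a, b, c, length):
    """Equivalent convolution kernel: k_j = sum_i c_i * a_i**j * b_i."""
    return [sum(ci * (ai ** j) * bi for ai, bi, ci in zip(a, b, c))
            for j in range(length)]

# Illustrative parameters (assumptions, not from the paper).
a, b, c = [0.9, 0.5], [1.0, 0.5], [0.3, -0.2]
u = [1.0, 0.0, 2.0, -1.0]
print(ssm_scan(a, b, c, u))
```

Because the state has a fixed size, the same recurrence runs unchanged on sequences longer than any seen in training, which is one intuition for why length extrapolation is at least possible for this model family.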

[CV-98] Subjective Visual Quality Assessment for High-Fidelity Learning-Based Image Compression

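[Quick Read]: This paper addresses the shortage of subjective quality assessment for modern learning-based image compression in the high-fidelity range. Conventional objective quality metrics tend to be overly optimistic when predicting the quality of JPEG AI-compressed images, which underlines the need for rigorous subjective evaluation. The key to the solution is the JPEG AIC-3 subjective assessment methodology: a large-scale crowdsourced experiment collects triplet comparisons, and a unified model over boosted and plain triplet comparisons reconstructs quality scales in Just Noticeable Difference (JND) units. The paper also introduces the Meng-Rosenthal-Rubin statistical test to assess reliably whether differences between quality metrics, measured by their correlation with ground truth, are significant, providing a reference for developing and benchmarking modern image codecs.

Link: https://arxiv.org/abs/2504.06301
Authors: Mohsen Jenadeleh,Jon Sneyers,Panqi Jia,Shima Mohammadi,Joao Ascenso,Dietmar Saupe
Affiliations: University of Konstanz, Germany; Cloudinary, Belgium; Huawei, Germany; IST-IT, Portugal
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 5 figures, 3 tables, submitted to QoMEX 2025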

Abstract:Learning-based image compression methods have recently emerged as promising alternatives to traditional codecs, offering improved rate-distortion performance and perceptual quality. JPEG AI represents the latest standardized framework in this domain, leveraging deep neural networks for high-fidelity image reconstruction. In this study, we present a comprehensive subjective visual quality assessment of JPEG AI-compressed images using the JPEG AIC-3 methodology, which quantifies perceptual differences in terms of Just Noticeable Difference (JND) units. We generated a dataset of 50 compressed images with fine-grained distortion levels from five diverse sources. A large-scale crowdsourced experiment collected 96,200 triplet responses from 459 participants. We reconstructed JND-based quality scales using a unified model based on boosted and plain triplet comparisons. Additionally, we evaluated the alignment of multiple objective image quality metrics with human perception in the high-fidelity range. The CVVDP metric achieved the overall highest performance; however, most metrics including CVVDP were overly optimistic in predicting the quality of JPEG AI-compressed images. These findings emphasize the necessity for rigorous subjective evaluations in the development and benchmarking of modern image codecs, particularly in the high-fidelity range. Another technical contribution is the introduction of the well-known Meng-Rosenthal-Rubin statistical test to the field of Quality of Experience research. This test can reliably assess the significance of difference in performance of quality metrics in terms of correlation between metrics and ground truth. The complete dataset, including all subjective scores, is publicly available at this https URL.
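The Meng-Rosenthal-Rubin test compares two correlations that share a variable, here two quality metrics each correlated with the same ground-truth scores. The sketch below follows the test as it is commonly stated, using Fisher z-transforms. The function name and the example numbers are illustrative assumptions, and the paper's exact procedure (for instance, which correlation coefficient it uses) may differ.

```python
import math

def mrr_test(r1, r2, r12, n):
    """Meng-Rosenthal-Rubin z-test for two dependent correlations.

    r1, r2: correlation of metric 1 and of metric 2 with the same ground truth.
    r12: correlation between the two metrics. n: number of items.
    Returns (z, two-sided p-value).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)           # Fisher z-transform
    r_sq_mean = (r1 ** 2 + r2 ** 2) / 2.0
    f = min((1.0 - r12) / (2.0 * (1.0 - r_sq_mean)), 1.0)  # capped at 1
    h = (1.0 - f * r_sq_mean) / (1.0 - r_sq_mean)
    z = (z1 - z2) * math.sqrt((n - 3) / (2.0 * (1.0 - r12) * h))
    p = math.erfc(abs(z) / math.sqrt(2.0))            # two-sided normal tail
    return z, p

# Hypothetical values: two metrics scored on the same 50 images.
z, p = mrr_test(r1=0.95, r2=0.80, r12=0.85, n=50)
print(z, p)
```

The point of the test is that a naive comparison of the two coefficients ignores how strongly the metrics correlate with each other (`r12`); accounting for that dependence changes the standard error of the difference.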

[CV-99] Going beyond explainability in multi-modal stroke outcome prediction models

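[Quick Read]: This paper aims to improve the interpretability and explainability of multi-modal prediction models that integrate imaging and tabular patient data. To that end, the authors adapt the explainable AI (xAI) methods Grad-CAM and Occlusion to deep transformation models (dTMs), which combine statistical and deep learning approaches. dTMs reach state-of-the-art prediction performance while also providing interpretable parameter estimates, such as odds ratios for tabular features. Using brain imaging and tabular data from 407 stroke patients, dTMs were trained to predict functional outcome three months after stroke and evaluated with several discriminatory metrics. The adapted xAI methods were used to generate explanation maps for identifying relevant image features and for error analysis.

The key to the solution is adapting xAI methods to dTMs to make them more explainable. The resulting explanation maps show how much specific image regions matter for the predicted outcome, which facilitates error analysis and supports hypothesis generation about the significance of those regions.

Link: https://arxiv.org/abs/2504.06299
Authors: Jonas Brändli,Maurice Schneeberger,Lisa Herzog,Loran Avci,Nordin Dari,Martin Häansel,Hakim Baazaoui,Pascal Bühler,Susanne Wegener,Beate Sick
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
Comments: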

Abstract:Aim: This study aims to enhance interpretability and explainability of multi-modal prediction models integrating imaging and tabular patient data. Methods: We adapt the xAI methods Grad-CAM and Occlusion to multi-modal, partly interpretable deep transformation models (dTMs). dTMs combine statistical and deep learning approaches to simultaneously achieve state-of-the-art prediction performance and interpretable parameter estimates, such as odds ratios for tabular features. Based on brain imaging and tabular data from 407 stroke patients, we trained dTMs to predict functional outcome three months after stroke. We evaluated the models using different discriminatory metrics. The adapted xAI methods were used to generate explanation maps for identification of relevant image features and error analysis. Results: The dTMs achieve state-of-the-art prediction performance, with area under the curve (AUC) values close to 0.8. The most important tabular predictors of functional outcome are functional independence before stroke and NIHSS on admission, a neurological score indicating stroke severity. Explanation maps calculated from brain imaging dTMs for functional outcome highlighted critical brain regions such as the frontal lobe, which is known to be linked to age, which in turn increases the risk of unfavorable outcomes. Similarity plots of the explanation maps revealed distinct patterns that give insight into stroke pathophysiology, support the development of novel predictors of stroke outcome, and enable the identification of false predictions. Conclusion: By adapting methods for explanation maps to dTMs, we enhanced the explainability of multi-modal and partly interpretable prediction models. The resulting explanation maps facilitate error analysis and support hypothesis generation regarding the significance of specific image regions in outcome prediction.
Cite as: arXiv:2504.06299 [eess.IV] (v1 submitted Mon, 7 Apr 2025 09:56:16 UTC by Beate Sick). DOI: https://doi.org/10.48550/arXiv.2504.06299
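Occlusion, one of the two xAI methods the authors adapt, can be sketched generically: mask one image patch at a time and record how much the model's score drops. The version below works on a 2D list with a toy scoring function standing in for a trained model. It is a plain occlusion sketch under those assumptions, not the dTM-specific adaptation described in the paper.

```python
def occlusion_map(image, predict, patch=2, fill=0.0):
    """Occlusion sensitivity: drop in predict(image) when each patch is masked.

    image: 2D list of floats; predict: callable returning a scalar score.
    """
    base = predict(image)
    rows, cols = len(image), len(image[0])
    heat = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            occluded = [row[:] for row in image]  # copy, then mask one patch
            cells = [(r, c)
                     for r in range(r0, min(r0 + patch, rows))
                     for c in range(c0, min(c0 + patch, cols))]
            for r, c in cells:
                occluded[r][c] = fill
            drop = base - predict(occluded)
            for r, c in cells:
                heat[r][c] = drop
    return heat

# Toy stand-in for a trained model: the score depends only on the top-left 2x2 block.
def toy_predict(img):
    return img[0][0] + img[0][1] + img[1][0] + img[1][1]

image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(image, toy_predict)
print(heat[0][0], heat[3][3])
```

Regions with a large drop are the ones the model relies on, which is what makes such maps useful for the error analysis the abstract describes.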

Artificial Intelligence

[AI-0] AssistanceZero: Scalably Solving Assistance Games

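[Quick Read]: This paper tackles the problem of training effective AI assistants in complex environments, where conventional approaches such as Reinforcement Learning from Human Feedback (RLHF) have drawbacks including incentives for deceptive behavior. The key contribution is AssistanceZero, a scalable approach that extends AlphaZero with a neural network that predicts human actions and rewards, enabling it to plan under uncertainty. This addresses the two obstacles to scaling assistance games: modeling human behavior accurately and solving intractable decision-making problems under uncertainty. Experiments show that AssistanceZero outperforms model-free RL algorithms and imitation learning in a Minecraft-based assistance game and significantly reduces the number of actions participants need to complete building tasks.

Link: https://arxiv.org/abs/2504.07091
Authors: Cassidy Laidlaw,Eli Bronstein,Timothy Guo,Dylan Feng,Lukas Berglund,Justin Svegliato,Stuart Russell,Anca Dragan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: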

Abstract:Assistance games are a promising alternative to reinforcement learning from human feedback (RLHF) for training AI assistants. Assistance games resolve key drawbacks of RLHF, such as incentives for deceptive behavior, by explicitly modeling the interaction between assistant and user as a two-player game where the assistant cannot observe their shared goal. Despite their potential, assistance games have only been explored in simple settings. Scaling them to more complex environments is difficult because it requires both solving intractable decision-making problems under uncertainty and accurately modeling human users’ behavior. We present the first scalable approach to solving assistance games and apply it to a new, challenging Minecraft-based assistance game with over 10^400 possible goals. Our approach, AssistanceZero, extends AlphaZero with a neural network that predicts human actions and rewards, enabling it to plan under uncertainty. We show that AssistanceZero outperforms model-free RL algorithms and imitation learning in the Minecraft-based assistance game. In a human study, our AssistanceZero-trained assistant significantly reduces the number of actions participants take to complete building tasks in Minecraft. Our results suggest that assistance games are a tractable framework for training effective AI assistants in complex environments. Our code and models are available at this https URL.

[AI-1] Π-NeSy: A Possibilistic Neuro-Symbolic Approach

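[Quick Read]: This paper addresses how to combine a low-level perception task with a high-level reasoning task, with the goal of deriving, for each input instance, the degree of possibility that it belongs to a target (meta-)concept. The (meta-)concept is connected to intermediate concepts by a possibilistic rule-based system. A neural network infers the probability of each intermediate concept for the input instance, and the probability distributions it outputs (through softmax activation) are transformed into possibility distributions, which links the two tasks. The key contribution is the design of efficient methods for defining the matrix relation and the equation system associated with a possibilistic rule-based system, together with an approach, built on recent results on handling inconsistent systems of fuzzy relational equations, for learning rule parameters from multiple training data samples. Experiments on MNIST addition problems and MNIST Sudoku puzzle problems show the approach is effective compared with state-of-the-art neuro-symbolic methods.

Link: https://arxiv.org/abs/2504.07055
Authors: Ismaïl Baaj,Pierre Marquis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: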

Abstract:In this article, we introduce a neuro-symbolic approach that combines a low-level perception task performed by a neural network with a high-level reasoning task performed by a possibilistic rule-based system. The goal is to be able to derive for each input instance the degree of possibility that it belongs to a target (meta-)concept. This (meta-)concept is connected to intermediate concepts by a possibilistic rule-based system. The probability of each intermediate concept for the input instance is inferred using a neural network. The connection between the low-level perception task and the high-level reasoning task lies in the transformation of neural network outputs modeled by probability distributions (through softmax activation) into possibility distributions. The use of intermediate concepts is valuable for the explanation purpose: using the rule-based system, the classification of an input instance as an element of the (meta-)concept can be justified by the fact that intermediate concepts have been recognized. From the technical side, our contribution consists of the design of efficient methods for defining the matrix relation and the equation system associated with a possibilistic rule-based system. The corresponding matrix and equation are key data structures used to perform inferences from a possibilistic rule-based system and to learn the values of the rule parameters in such a system according to a training data sample. Furthermore, leveraging recent results on the handling of inconsistent systems of fuzzy relational equations, an approach for learning rule parameters according to multiple training data samples is presented. Experiments carried out on the MNIST addition problems and the MNIST Sudoku puzzles problems highlight the effectiveness of our approach compared with state-of-the-art neuro-symbolic ones. 
Cite as: arXiv:2504.07055 [cs.AI]. DOI: https://doi.org/10.48550/arXiv.2504.07055 (pending registration)
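The bridge between the two levels is a transformation from a softmax probability distribution to a possibility distribution. The abstract does not say which transformation is used, so the sketch below assumes one well-known choice from the possibility theory literature, where each outcome's possibility is the total probability of all outcomes that are no more likely than it. The logits are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prob_to_poss(probs):
    """pi_i = sum of all p_j with p_j <= p_i (assumed transformation)."""
    return [sum(q for q in probs if q <= p) for p in probs]

# Illustrative network outputs for three intermediate concepts.
probs = softmax([2.0, 1.0, 0.1])
poss = prob_to_poss(probs)
print(probs, poss)
```

Under this choice the most probable outcome always gets possibility 1 and the ranking of outcomes is preserved, which is the normalization a possibility distribution requires.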

[AI-2] Enhancing Metabolic Syndrome Prediction with Hybrid Data Balancing and Counterfactuals

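[Quick Read]: This paper addresses challenges in predicting metabolic syndrome (MetS), including class imbalance, data scarcity, and methodological inconsistencies in existing studies. It systematically evaluates and optimizes several machine learning (ML) models for MetS prediction using data balancing techniques (random oversampling (ROS), SMOTE, ADASYN, and CTGAN) and counterfactual analysis. The key to the solution is MetaBoost, a novel hybrid framework that integrates SMOTE, ADASYN, and CTGAN and optimizes synthetic data generation through weighted averaging and iterative weight tuning, improving accuracy by 1.14% over individual balancing techniques. A comprehensive counterfactual analysis quantifies the feature-level changes needed to move individuals from high-risk to low-risk categories, and identifies blood glucose and triglycerides as the most frequently modified features and, in a probabilistic analysis, as the strongest predictors.

Link: https://arxiv.org/abs/2504.06987
Authors: Sanyam Paresh Shah,Abdullah Mamun,Shovito Barua Soumma,Hassan Ghasemzadeh
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the IEEE EMBC 2025 Conference. 7 pages, 3 figures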

Abstract:Metabolic Syndrome (MetS) is a cluster of interrelated risk factors that significantly increases the risk of cardiovascular diseases and type 2 diabetes. Despite its global prevalence, accurate prediction of MetS remains challenging due to issues such as class imbalance, data scarcity, and methodological inconsistencies in existing studies. In this paper, we address these challenges by systematically evaluating and optimizing machine learning (ML) models for MetS prediction, leveraging advanced data balancing techniques and counterfactual analysis. Multiple ML models, including XGBoost, Random Forest, TabNet, etc., were trained and compared under various data balancing techniques such as random oversampling (ROS), SMOTE, ADASYN, and CTGAN. Additionally, we introduce MetaBoost, a novel hybrid framework that integrates SMOTE, ADASYN, and CTGAN, optimizing synthetic data generation through weighted averaging and iterative weight tuning to enhance the model’s performance (achieving a 1.14% accuracy improvement over individual balancing techniques). A comprehensive counterfactual analysis is conducted to quantify feature-level changes required to shift individuals from high-risk to low-risk categories. The results indicate that blood glucose (50.3%) and triglycerides (46.7%) were the most frequently modified features, highlighting their clinical significance in MetS risk reduction. Additionally, probabilistic analysis shows elevated blood glucose (85.5% likelihood) and triglycerides (74.9% posterior probability) as the strongest predictors. This study not only advances the methodological rigor of MetS prediction but also provides actionable insights for clinicians and researchers, highlighting the potential of ML in mitigating the public health burden of metabolic syndrome.
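Of the balancing techniques listed, SMOTE is the simplest to show: it creates a synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step on made-up 2D points. It does not reproduce MetaBoost's weighted combination of SMOTE, ADASYN, and CTGAN, whose details the abstract does not give.

```python
import random

def smote_sample(minority, k=2, n_new=5, seed=0):
    """Generate n_new synthetic minority points by SMOTE-style interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Hypothetical minority-class samples with two features each.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sample(minority)
print(new_points)
```

Every synthetic point lies on a segment between two real minority samples, which is why SMOTE cannot create samples outside the region the minority class already occupies.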

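[AI-3] Review of Case-Based Reasoning for LLM Agents: Theoretical Foundations, Architectural Components, and Cognitive Integration

[Quick Read]: This paper addresses the limitations of agents powered by Large Language Models (LLMs) in tasks that require specific, structured knowledge, flexibility, or accountable decision-making, such as hallucinations and the lack of contextual memory across interactions. The key idea is to integrate Case-Based Reasoning (CBR), a strategy that solves new problems by referencing past experiences, into LLM agent frameworks, so that LLMs can leverage explicit knowledge and become more effective. The paper systematically reviews the theoretical foundations of these enhanced agents, identifies critical framework components, and formulates a mathematical model for the CBR processes of case retrieval, adaptation, and learning. It also evaluates CBR-enhanced agents against methods such as Chain-of-Thought reasoning and standard Retrieval-Augmented Generation, analyzing their relative strengths. It further explores how CBR's cognitive dimensions (self-reflection, introspection, and curiosity), leveraged through goal-driven autonomy mechanisms, can enhance LLM agent capabilities. As a contribution to ongoing research on neuro-symbolic hybrid systems, the work posits CBR as a viable technique for enhancing the reasoning skills and cognitive aspects of autonomous LLM agents.

Link: https://arxiv.org/abs/2504.06943
Authors: Kostas Hatalis,Despina Christou,Vyshnavi Kondapalli
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: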

Abstract:Agents powered by Large Language Models (LLMs) have recently demonstrated impressive capabilities in various tasks. Still, they face limitations in tasks requiring specific, structured knowledge, flexibility, or accountable decision-making. While agents are capable of perceiving their environments, forming inferences, planning, and executing actions towards goals, they often face issues such as hallucinations and lack of contextual memory across interactions. This paper explores how Case-Based Reasoning (CBR), a strategy that solves new problems by referencing past experiences, can be integrated into LLM agent frameworks. This integration allows LLMs to leverage explicit knowledge, enhancing their effectiveness. We systematically review the theoretical foundations of these enhanced agents, identify critical framework components, and formulate a mathematical model for the CBR processes of case retrieval, adaptation, and learning. We also evaluate CBR-enhanced agents against other methods like Chain-of-Thought reasoning and standard Retrieval-Augmented Generation, analyzing their relative strengths. Moreover, we explore how leveraging CBR’s cognitive dimensions (including self-reflection, introspection, and curiosity) via goal-driven autonomy mechanisms can further enhance the LLM agent capabilities. Contributing to the ongoing research on neuro-symbolic hybrid systems, this work posits CBR as a viable technique for enhancing the reasoning skills and cognitive aspects of autonomous LLM agents.
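The three CBR processes the abstract names (retrieval, adaptation, and learning) map onto a small loop over a case base. The sketch below uses cosine similarity over hand-written problem vectors and a plain function for adaptation. In an LLM agent the vectors would typically be embeddings and the adaptation step a model call. All names here are hypothetical, and this is not the paper's mathematical model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class CaseBase:
    def __init__(self):
        self.cases = []  # list of (problem_vector, solution) pairs

    def retain(self, problem, solution):
        """Learning step: store a solved case for future reuse."""
        self.cases.append((problem, solution))

    def retrieve(self, query, k=1):
        """Retrieval step: the k stored cases most similar to the query."""
        ranked = sorted(self.cases, key=lambda c: cosine(query, c[0]), reverse=True)
        return ranked[:k]

    def solve(self, query, adapt):
        """Retrieve the nearest case, adapt its solution, and retain the result."""
        problem, solution = self.retrieve(query, k=1)[0]
        new_solution = adapt(query, problem, solution)
        self.retain(query, new_solution)
        return new_solution

cb = CaseBase()
cb.retain([1.0, 0.0], "plan A")
cb.retain([0.0, 1.0], "plan B")
# Placeholder adaptation; a real agent might prompt an LLM with the retrieved case.
print(cb.solve([0.1, 0.9], lambda q, p, s: s + " (adapted)"))
```

Unlike standard retrieval-augmented generation, the loop writes its own solved problems back into the store, so the case base grows with experience.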

[AI-4] Beyond Tools: Generative AI as Epistemic Infrastructure in Education

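[Quick Read]: This paper addresses a gap in current discourse: as generative AI becomes epistemic infrastructure in education, its implications for teaching and learning are not adequately discussed. Specifically, it examines how AI systems affect teachers' epistemic agency along three dimensions: affordances for skilled epistemic actions, support for epistemic sensitivity, and implications for long-term habit formation. The analysis finds that current AI systems fall short on all three and may weaken rather than strengthen teachers' capabilities, cultivating problematic habits that prioritize efficiency over epistemic agency.

The key recommendations are: first, recognize the infrastructural transformation occurring in education; second, develop AI environments that stimulate skilled epistemic actions while upholding epistemic norms; and third, involve educators in AI design processes. These measures aim to foster AI integration that aligns with core educational values and maintains human epistemic agency.

Link: https://arxiv.org/abs/2504.06928
Authors: Bodong Chen
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 23 pages, 2 figures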

Abstract:As generative AI rapidly integrates into educational infrastructures worldwide, it transforms how knowledge gets created, validated, and shared, yet current discourse inadequately addresses its implications as epistemic infrastructure mediating teaching and learning. This paper investigates how AI systems function as epistemic infrastructures in education and their impact on human epistemic agency. Adopting a situated cognition perspective and following a value-sensitive design approach, the study conducts a technical investigation of two representative AI systems in educational settings, analyzing their impact on teacher practice across three dimensions: affordances for skilled epistemic actions, support for epistemic sensitivity, and implications for long-term habit formation. The analysis reveals that current AI systems inadequately support teachers’ skilled epistemic actions, insufficiently foster epistemic sensitivity, and potentially cultivate problematic habits that prioritize efficiency over epistemic agency. To address these challenges, the paper recommends recognizing the infrastructural transformation occurring in education, developing AI environments that stimulate skilled actions while upholding epistemic norms, and involving educators in AI design processes – recommendations aimed at fostering AI integration that aligns with core educational values and maintains human epistemic agency.

[AI-5] Adaptive Locally Linear Embedding

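[Quick Read]: This paper addresses a weakness of traditional Locally Linear Embedding (LLE): because it defines neighborhoods by Euclidean distance, it can struggle to capture the intrinsic geometric relationships in complex high-dimensional data. The proposed solution, Adaptive Locally Linear Embedding (ALLE), introduces a dynamic, data-driven metric that redefines proximity in terms of topological neighborhood inclusion rather than fixed distances. By adapting the metric to the local structure of the data, the method improves neighborhood preservation, particularly for datasets with complex geometries and high-dimensional structure. Experimental results indicate that ALLE significantly improves the alignment between neighborhoods in the input and feature spaces, yielding more accurate and topologically faithful embeddings. Tailoring the distance metric to the underlying data gives a robust way to capture intricate relationships in high-dimensional datasets.

Link: https://arxiv.org/abs/2504.06829
Authors: Ali Goli,Mahdieh Alizadeh,Hadi Sadoghi Yazdi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages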

Abstract:Manifold learning techniques, such as Locally Linear Embedding (LLE), are designed to preserve the local neighborhood structures of high-dimensional data during dimensionality reduction. Traditional LLE employs Euclidean distance to define neighborhoods, which can struggle to capture the intrinsic geometric relationships within complex data. A novel approach, Adaptive Locally Linear Embedding (ALLE), is introduced to address this limitation by incorporating a dynamic, data-driven metric that enhances topological preservation. This method redefines the concept of proximity by focusing on topological neighborhood inclusion rather than fixed distances. By adapting the metric based on the local structure of the data, it achieves superior neighborhood preservation, particularly for datasets with complex geometries and high-dimensional structures. Experimental results demonstrate that ALLE significantly improves the alignment between neighborhoods in the input and feature spaces, resulting in more accurate and topologically faithful embeddings. This approach advances manifold learning by tailoring distance metrics to the underlying data, providing a robust solution for capturing intricate relationships in high-dimensional datasets.
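For context, the step that standard LLE performs once neighbourhoods are chosen is to find weights that reconstruct each point from its neighbours. The sketch below shows that step with NumPy, with the neighbours passed in directly. ALLE's contribution is how the neighbourhood is selected, and that adaptive metric is not specified in the abstract, so it is not reproduced here.

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Weights w (summing to 1) minimising ||x - sum_j w_j * neighbors[j]||^2."""
    x = np.asarray(x, dtype=float)
    neighbors = np.asarray(neighbors, dtype=float)
    diffs = neighbors - x                 # (k, d): neighbours centred on x
    gram = diffs @ diffs.T                # (k, k): local Gram matrix
    # Regularise, since the Gram matrix is singular when k exceeds d.
    gram += reg * np.trace(gram) * np.eye(len(neighbors))
    w = np.linalg.solve(gram, np.ones(len(neighbors)))
    return w / w.sum()

# A point halfway between its two neighbours gets equal weights.
w = lle_weights([0.5, 0.0], [[0.0, 0.0], [1.0, 0.0]])
print(w)
```

Since the embedding is built entirely from these neighbourhood weights, a poor choice of neighbours propagates directly into the result, which is the motivation for replacing the fixed Euclidean neighbourhood.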

[AI-6] Learning in Spiking Neural Networks with a Calcium-based Hebbian Rule for Spike-timing-dependent Plasticity

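[Quick Read]: This paper studies how local plasticity mechanisms shape biological neural networks, with a view to designing energy-efficient and self-adaptive edge computing systems. Most existing models focus on either spike timing or mean firing rate, whereas biology uses both seamlessly to modulate synaptic strength. The paper therefore proposes a Hebbian local learning rule whose key idea is to model synaptic modification as a function of calcium traces that track neuronal activity. The rule reproduces results from spike-time and spike-rate protocols in neuroscientific studies, and it is used to train spiking neural networks on MNIST digit recognition to show which mechanisms are needed to learn real-world patterns. The model is also sensitive to correlated spiking activity, which lets it modulate the learning rate of the network without changing the neurons' mean firing rate or the hyperparameters of the learning rule. To the authors' knowledge, this is the first work to show that spike timing and spike rate can play complementary roles in shaping the connectivity of spiking neural networks.

Link: https://arxiv.org/abs/2504.06796
Authors: Willian Soares Girão,Nicoletta Risi,Elisabetta Chicca
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: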

Abstract:Understanding how biological neural networks are shaped via local plasticity mechanisms can lead to energy-efficient and self-adaptive information processing systems, which promises to mitigate some of the current roadblocks in edge computing systems. While biology seamlessly uses both spike timing and mean firing rate to modulate synaptic strength, most models focus on one of the two. In this work, we present a Hebbian local learning rule that models synaptic modification as a function of calcium traces tracking neuronal activity. We show how the rule reproduces results from spike time and spike rate protocols from neuroscientific studies. Moreover, we use the model to train spiking neural networks on MNIST digit recognition to show and explain what sort of mechanisms are needed to learn real-world patterns. We show how our model is sensitive to correlated spiking activity and how this enables it to modulate the learning rate of the network without altering the mean firing rate of the neurons or the hyperparameters of the learning rule. To the best of our knowledge, this is the first work that showcases how spike timing and rate can be complementary in their role of shaping the connectivity of spiking neural networks.
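To make the idea of trace-driven plasticity concrete, the sketch below simulates one synapse with two exponentially decaying traces, one per neuron, that are bumped on each spike. It is a generic pair-based trace rule chosen for illustration, with made-up constants. The paper's calcium-based rule is not specified in the abstract and is likely to differ in its details.

```python
import math

def run_trace_rule(pre_spikes, post_spikes, steps, tau=20.0, eta=0.01, w=0.5):
    """Simulate one synapse for `steps` time steps of 1 ms each.

    pre_spikes / post_spikes: sets of integer spike times in ms.
    Returns the final synaptic weight.
    """
    decay = math.exp(-1.0 / tau)
    c_pre = c_post = 0.0  # calcium-like activity traces
    for t in range(steps):
        c_pre *= decay
        c_post *= decay
        if t in pre_spikes:
            w -= eta * c_post  # recent post activity before a pre spike: depress
            c_pre += 1.0
        if t in post_spikes:
            w += eta * c_pre   # recent pre activity before a post spike: potentiate
            c_post += 1.0
    return w

print(run_trace_rule({10}, {15}, 50))  # pre before post
print(run_trace_rule({15}, {10}, 50))  # post before pre
```

Because the traces accumulate across spikes, the same rule responds to both the relative timing of individual spikes and how often the neurons fire, which is the kind of dual sensitivity the abstract highlights.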

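[AI-7] AI, Help Me Think—but for Myself: Assisting People in Complex Decision-Making by Providing Different Kinds of Cognitive Support

[Quick Read]: This paper asks how to design AI tools that effectively support human decision-making by complementing and enhancing users' reasoning. Conventional recommendation-centric approaches face challenges such as inappropriate reliance and a lack of integration with users' decision-making processes. The paper proposes an alternative interaction model, ExtendAI, whose key feature is that the AI's outputs build on users' own decision-making rationales instead of giving recommendations directly. In a mixed-methods user study on an investment decision-making task, ExtendAI is compared with a recommendation-based AI (RecommendAI): ExtendAI integrated better into users' decision-making and led to slightly better outcomes, while RecommendAI provided more novel insights and required less cognitive effort. The study discusses these findings together with three tensions of AI-assisted decision-making that it revealed.

Link: https://arxiv.org/abs/2504.06771
Authors: Leon Reicherts,Zelun Tony Zhang,Elisabeth von Oswald,Yuanting Liu,Yvonne Rogers,Mariam Hassib
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: To be published at ACM CHI 2025 Conference on Human Factors in Computing Systems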

Abstract:How can we design AI tools that effectively support human decision-making by complementing and enhancing users’ reasoning processes? Common recommendation-centric approaches face challenges such as inappropriate reliance or a lack of integration with users’ decision-making processes. Here, we explore an alternative interaction model in which the AI outputs build upon users’ own decision-making rationales. We compare this approach, which we call ExtendAI, with a recommendation-based AI. Participants in our mixed-methods user study interacted with both AIs as part of an investment decision-making task. We found that the AIs had different impacts, with ExtendAI integrating better into the decision-making process and people’s own thinking and leading to slightly better outcomes. RecommendAI was able to provide more novel insights while requiring less cognitive effort. We discuss the implications of these and other findings along with three tensions of AI-assisted decision-making which our study revealed.

[AI-8] Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception

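[Quick Read]: This paper addresses all-type audio deepfake detection (ADD). Existing countermeasures perform well on single-type detection, but their performance declines in cross-type scenarios. The paper establishes a comprehensive all-type ADD benchmark and introduces two methods. The first is the prompt tuning self-supervised learning (PT-SSL) paradigm, which optimizes the SSL frontend by learning specialized prompt tokens and requires 458x fewer trainable parameters than fine-tuning. The second is wavelet prompt tuning (WPT)-SSL, which captures type-invariant auditory deepfake information from the frequency domain without additional training parameters. With co-training on all types of deepfake audio, WPT-XLSR-AASIST achieved the best performance, with an average equal error rate (EER) of 3.58% across all evaluation sets.

Link: https://arxiv.org/abs/2504.06753
Authors: Yuankun Xie,Ruibo Fu,Zhiyong Wang,Xiaopeng Wang,Songjun Cao,Long Ma,Haonan Cheng,Long Ye
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: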

Abstract:The rapid advancement of audio generation technologies has escalated the risks of malicious deepfake audio across speech, sound, singing voice, and music, threatening multimedia security and trust. While existing countermeasures (CMs) perform well in single-type audio deepfake detection (ADD), their performance declines in cross-type scenarios. This paper is dedicated to studying the all-type ADD task. We are the first to comprehensively establish an all-type ADD benchmark to evaluate current CMs, incorporating cross-type deepfake detection across speech, sound, singing voice, and music. Then, we introduce the prompt tuning self-supervised learning (PT-SSL) training paradigm, which optimizes SSL frontend by learning specialized prompt tokens for ADD, requiring 458x fewer trainable parameters than fine-tuning (FT). Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)-SSL method to capture type-invariant auditory deepfake information from the frequency domain without requiring additional training parameters, thereby enhancing performance over FT in the all-type ADD task. To achieve a universal CM, we utilize all types of deepfake audio for co-training. Experimental results demonstrate that WPT-XLSR-AASIST achieved the best performance, with an average EER of 3.58% across all evaluation sets. The code is available online.
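The headline number is an equal error rate (EER), the operating point where the false acceptance rate and the false rejection rate coincide. The sketch below approximates it by sweeping thresholds over the observed scores. It assumes higher scores mean "more likely bona fide" and uses made-up scores. Evaluation toolkits usually interpolate between thresholds, so their values can differ slightly.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER, assuming higher scores mean 'more likely bona fide'."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for th in thresholds:
        # False rejection: a bona fide sample scored below the threshold.
        frr = sum(s < th for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed sample scored at or above the threshold.
        far = sum(s >= th for s in spoof_scores) / len(spoof_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

# Hypothetical detector scores.
print(equal_error_rate([0.9, 0.8, 0.4], [0.1, 0.3, 0.6]))
```

A lower EER is better; an EER of 3.58% means that at the balanced threshold roughly 3.58% of bona fide clips are rejected and the same share of deepfakes is accepted.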

[AI-9] Learning global control of underactuated systems with Model-Based Reinforcement Learning

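[Quick Read]: This paper presents a solution for the third edition of the “AI Olympics with RealAIGym” competition, where the goal is to learn a global control policy for the pendubot and acrobot systems. The key is Monte-Carlo Probabilistic Inference for Learning Control (MC-PILCO), a Model-Based Reinforcement Learning (MBRL) algorithm known for its data efficiency. MC-PILCO optimizes a system dynamics model from interaction data and refines the policy through simulation rather than by optimizing directly on system data, which makes better use of the data. The approach has proven highly effective on physical systems and is more data-efficient than Model-Free (MF) alternatives; it won the first two editions of the competition, demonstrating robustness in both simulated and real-world environments.

Link: https://arxiv.org/abs/2504.06721
Authors: Niccolò Turcato,Marco Calì,Alberto Dalla Libera,Giulio Giacomuzzo,Ruggero Carli,Diego Romeres
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2409.05811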

Abstract:This short paper describes our proposed solution for the third edition of the “AI Olympics with RealAIGym” competition, held at ICRA 2025. We employed Monte-Carlo Probabilistic Inference for Learning Control (MC-PILCO), an MBRL algorithm recognized for its exceptional data efficiency across various low-dimensional robotic tasks, including cart-pole, ball & plate, and Furuta pendulum systems. MC-PILCO optimizes a system dynamics model using interaction data, enabling policy refinement through simulation rather than direct system data optimization. This approach has proven highly effective in physical systems, offering greater data efficiency than Model-Free (MF) alternatives. Notably, MC-PILCO has previously won the first two editions of this competition, demonstrating its robustness in both simulated and real-world environments. Besides briefly reviewing the algorithm, we discuss the most critical aspects of the MC-PILCO implementation in the tasks at hand: learning a global policy for the pendubot and acrobot systems.
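The model-based loop the abstract describes (fit a dynamics model from interaction data, then improve the policy by simulating that model) can be shown on a toy problem. The sketch below substitutes much simpler pieces than MC-PILCO uses: a scalar linear model fitted by least squares stands in for its learned dynamics model, and a grid search over feedback gains stands in for its policy optimization. The system, constants, and function names are all hypothetical.

```python
import random

def fit_linear_model(transitions):
    """Least-squares fit of x_next = a*x + b*u from (x, u, x_next) tuples."""
    sxx = sum(x * x for x, u, y in transitions)
    sxu = sum(x * u for x, u, y in transitions)
    suu = sum(u * u for x, u, y in transitions)
    sxy = sum(x * y for x, u, y in transitions)
    suy = sum(u * y for x, u, y in transitions)
    det = sxx * suu - sxu * sxu
    return (sxy * suu - suy * sxu) / det, (suy * sxx - sxy * sxu) / det

def mc_cost(a, b, gain, rng, particles=50, horizon=20, noise=0.01):
    """Monte Carlo estimate of the quadratic cost of u = -gain * x in the model."""
    total = 0.0
    for _ in range(particles):
        x = rng.gauss(1.0, 0.1)  # sampled initial state
        for _ in range(horizon):
            x = a * x + b * (-gain * x) + rng.gauss(0.0, noise)
            total += x * x
    return total / particles

# 1) Collect interaction data from a hypothetical unstable system.
rng = random.Random(0)
true_a, true_b = 1.1, 0.5  # unknown to the learner
data, x = [], 1.0
for _ in range(30):
    u = rng.uniform(-1.0, 1.0)  # random exploration
    x_next = true_a * x + true_b * u
    data.append((x, u, x_next))
    x = x_next

# 2) Fit the model, then 3) pick the best gain using simulated rollouts only.
a_hat, b_hat = fit_linear_model(data)
gains = [0.0, 0.5, 1.0, 1.5, 2.0]
best_gain = min(gains, key=lambda g: mc_cost(a_hat, b_hat, g, random.Random(1)))
print(a_hat, b_hat, best_gain)
```

The data efficiency comes from step 3: every candidate policy is evaluated in the learned model, so the real system is only touched during the 30 exploration steps.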

[AI-10] Hyperparameter Optimisation with Practical Interpretability and Explanation Methods in Probabilistic Curriculum Learning

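[Quick Read]: This paper addresses the challenge of hyperparameter optimisation (HPO) in reinforcement learning (RL), in particular for the Probabilistic Curriculum Learning (PCL) strategy on standard RL tasks such as point-maze navigation and DC motor control. The key is combining the AlgOS framework with Optuna's Tree-Structured Parzen Estimator (TPE) and presenting strategies to refine hyperparameter search spaces, which improves optimisation efficiency. The paper also introduces a SHAP-based interpretability approach for analysing how individual hyperparameters and their interactions influence RL performance. Together these provide practical guidelines and tools for making HPO in RL more effective and computationally feasible.

Link: https://arxiv.org/abs/2504.06683
Authors: Llewyn Salt,Marcus Gallagher
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: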

Abstract:Hyperparameter optimisation (HPO) is crucial for achieving strong performance in reinforcement learning (RL), as RL algorithms are inherently sensitive to hyperparameter settings. Probabilistic Curriculum Learning (PCL) is a curriculum learning strategy designed to improve RL performance by structuring the agent’s learning process, yet effective hyperparameter tuning remains challenging and computationally demanding. In this paper, we provide an empirical analysis of hyperparameter interactions and their effects on the performance of a PCL algorithm within standard RL tasks, including point-maze navigation and DC motor control. Using the AlgOS framework integrated with Optuna’s Tree-Structured Parzen Estimator (TPE), we present strategies to refine hyperparameter search spaces, enhancing optimisation efficiency. Additionally, we introduce a novel SHAP-based interpretability approach tailored specifically for analysing hyperparameter impacts, offering clear insights into how individual hyperparameters and their interactions influence RL performance. Our work contributes practical guidelines and interpretability tools that significantly improve the effectiveness and computational feasibility of hyperparameter optimisation in reinforcement learning.

[AI-11] GRAIN: Multi-Granular and Implicit Information Aggregation Graph Neural Network for Heterophilous Graphs AAAI2025

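[Quick Read]: This paper addresses the poor performance of Graph Neural Networks (GNNs) on heterophilous graphs, where connected nodes may differ in features or labels: in this setting GNNs often fail to outperform simple multilayer perceptrons (MLPs), which challenges the homophily assumption behind them. Existing methods overlook the importance of information granularity and rarely consider implicit relationships between distant nodes. To overcome these limitations, the paper proposes GRAIN (Granular and Implicit Graph Network). Its key innovation is to aggregate multi-view information at several granularity levels and to incorporate implicit data from distant, non-neighboring nodes, integrating local and global information into smoother, more accurate node representations. GRAIN also introduces an adaptive graph information aggregator that efficiently combines multi-granularity and implicit data. In experiments on 13 datasets with varying homophily and heterophily, GRAIN consistently outperforms 12 state-of-the-art models on both homophilous and heterophilous graphs.

Link: https://arxiv.org/abs/2504.06649
Authors: Songwei Zhao,Yuan Jiang,Zijing Zhang,Yang Yu,Hechang Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2025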

Abstract:Graph neural networks (GNNs) have shown significant success in learning graph representations. However, recent studies reveal that GNNs often fail to outperform simple MLPs on heterophilous graph tasks, where connected nodes may differ in features or labels, challenging the homophily assumption. Existing methods addressing this issue often overlook the importance of information granularity and rarely consider implicit relationships between distant nodes. To overcome these limitations, we propose the Granular and Implicit Graph Network (GRAIN), a novel GNN model specifically designed for heterophilous graphs. GRAIN enhances node embeddings by aggregating multi-view information at various granularity levels and incorporating implicit data from distant, non-neighboring nodes. This approach effectively integrates local and global information, resulting in smoother, more accurate node representations. We also introduce an adaptive graph information aggregator that efficiently combines multi-granularity and implicit data, significantly improving node representation quality, as shown by experiments on 13 datasets covering varying homophily and heterophily. GRAIN consistently outperforms 12 state-of-the-art models, excelling on both homophilous and heterophilous graphs.
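The "varying homophily and heterophily" of a dataset is often summarised by the edge homophily ratio, the fraction of edges whose endpoints share a label. The sketch below computes it for a tiny made-up graph. Whether the paper reports this particular measure is an assumption.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share a label."""
    same = sum(labels[u] == labels[v] for u, v in edges)
    return same / len(edges)

# Hypothetical 4-node path graph with two classes.
edges = [(0, 1), (1, 2), (2, 3)]
labels = [0, 0, 1, 1]
print(edge_homophily(edges, labels))
```

Values near 1 indicate a homophilous graph, where averaging over neighbours tends to help; values near 0 indicate a heterophilous graph, where neighbour averaging mixes different classes.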

[AI-12] AMAD: AutoMasked Attention for Unsupervised Multivariate Time Series Anomaly Detection

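Quick read: This paper addresses two main problems in unsupervised multivariate time series anomaly detection (UMTSAD). First, existing models based on the Transformer and self-attention assume sequence anomaly associations that are limited to specific predefined patterns or scenarios (such as concentrated or peak anomaly patterns), which limits their generalization. Second, the absence of labels makes it hard to handle diverse anomaly situations. The paper proposes AMAD, whose key idea is a generalized anomaly association representation framework built from an AutoMask mechanism and an attention mixup module. A Max-Min training strategy and local-global contrastive learning further strengthen robustness and adaptability. By combining multi-scale feature extraction with automatic relative association modeling, AMAD offers an efficient and flexible solution.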

Link: https://arxiv.org/abs/2504.06643
Authors: Tiange Huang, Yongjun Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 7 figures, first upload

Abstract:Unsupervised multivariate time series anomaly detection (UMTSAD) plays a critical role in various domains, including finance, networks, and sensor systems. In recent years, due to the outstanding performance of deep learning in general sequential tasks, many models have been specialized for deep UMTSAD tasks and have achieved impressive results, particularly those based on the Transformer and self-attention mechanisms. However, the sequence anomaly association assumptions underlying these models are often limited to specific predefined patterns and scenarios, such as concentrated or peak anomaly patterns. These limitations hinder their ability to generalize to diverse anomaly situations, especially where the lack of labels poses significant challenges. To address these issues, we propose AMAD, which integrates AutoMasked Attention for UMTSAD scenarios. AMAD introduces a novel structure based on the AutoMask mechanism and an attention mixup module, forming a simple yet generalized anomaly association representation framework. This framework is further enhanced by a Max-Min training strategy and a Local-Global contrastive learning approach. By combining multi-scale feature extraction with automatic relative association modeling, AMAD provides a robust and adaptable solution to UMTSAD challenges. Extensive experimental results demonstrate that the proposed model achieves competitive performance compared to SOTA benchmarks across a variety of datasets.

[AI-13] InteractRank: Personalized Web-Scale Search Pre-Ranking with Cross Interaction Features

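Quick read: This paper addresses the trade-off between efficiency and the ability to capture complex interaction features in the pre-ranking stage of personalized search systems. Two-tower models are widely used because they are computationally efficient, but they capture complex query-item interactions poorly. At the same time, integrating the query-item cross interaction features required by full ranking into pre-ranking models raises efficiency challenges. The paper proposes InteractRank, a novel two-tower pre-ranking model. Its key innovation is to add query-item interaction features based on historical user engagement to the scoring function, alongside the two-tower dot product, which strengthens cross interaction modeling. This design significantly improves pre-ranking performance while keeping latency and computation cost low. In experiments, InteractRank improves the online engagement metric by 6.5% over a BM25 baseline and by 3.7% over a standard two-tower model. The paper also discusses other components, such as real-time user-sequence modeling, and analyzes their contributions through offline ablation studies.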

Link: https://arxiv.org/abs/2504.06609
Authors: Sujay Khandagale, Bhawna Juneja, Prabhat Agarwal, Aditya Subramanian, Jaewon Yang, Yuting Wang
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, to appear at TheWebConf Industry Track 2025

Abstract:Modern search systems use a multi-stage architecture to deliver personalized results efficiently. Key stages include retrieval, pre-ranking, full ranking, and blending, which refine billions of items to top selections. The pre-ranking stage, vital for scoring and filtering hundreds of thousands of items down to a few thousand, typically relies on two tower models due to their computational efficiency, despite often lacking in capturing complex interactions. While query-item cross interaction features are paramount for full ranking, integrating them into pre-ranking models presents efficiency-related challenges. In this paper, we introduce InteractRank, a novel two tower pre-ranking model with robust cross interaction features used at Pinterest. By incorporating historical user engagement-based query-item interactions in the scoring function along with the two tower dot product, InteractRank significantly boosts pre-ranking performance with minimal latency and computation costs. In real-world A/B experiments at Pinterest, InteractRank improves the online engagement metric by 6.5% over a BM25 baseline and by 3.7% over a vanilla two tower baseline. We also highlight other components of InteractRank, like real-time user-sequence modeling, and analyze their contributions through offline ablation studies. The code for InteractRank is available at this https URL.
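The scoring idea described in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the scoring function, so the additive combination below, the linear weighting of cross features, and all names (`pre_rank_score`, `cross_weights`, the candidate dictionary keys) are assumptions for illustration only, not the model used at Pinterest.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pre_rank_score(query_emb, item_emb, cross_features, cross_weights):
    """Two-tower dot product plus a weighted sum of cross features.

    The additive, linear form is an assumption; the abstract only says
    that engagement-based cross features enter the scoring function
    alongside the two-tower dot product.
    """
    return dot(query_emb, item_emb) + dot(cross_features, cross_weights)


def pre_rank(query_emb, candidates, cross_weights, k):
    """Return the ids of the k highest-scoring candidates."""
    scored = [
        (pre_rank_score(query_emb, c["emb"], c["cross"], cross_weights), c["id"])
        for c in candidates
    ]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]
```

Both terms are cheap per item: the item embedding can be precomputed, and the cross features are assumed here to be looked up rather than computed by a deep network, which is one plausible way to keep pre-ranking latency low.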

[AI-14] Right Prediction Wrong Reasoning: Uncovering LLM Misalignment in RA Disease Diagnosis

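Quick read: This paper examines the mismatch between diagnostic accuracy and soundness of reasoning when large language models (LLMs) are used for clinical pre-screening. Specifically, it evaluates LLMs on rheumatoid arthritis (RA) prediction and investigates the contradiction between their high prediction accuracy and their incorrect reasoning. The key to the approach is a series of multi-round analyses on real-world patient data. The best-performing model predicts RA with roughly 95% accuracy, yet nearly 68% of its reasoning was judged incorrect. The results show that LLMs predict well but reason poorly, which underlines the need for caution when relying on LLM explanations in clinical settings.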

Link: https://arxiv.org/abs/2504.06581
Authors: Umakanta Maharana, Sarthak Verma, Avarna Agarwal, Prakashini Mruthyunjaya, Dwarikanath Mahapatra, Sakir Ahmed, Murari Mandal
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) offer a promising pre-screening tool, improving early disease detection and providing enhanced healthcare access for underprivileged communities. The early diagnosis of various diseases continues to be a significant challenge in healthcare, primarily due to the nonspecific nature of early symptoms, the shortage of expert medical practitioners, and the need for prolonged clinical evaluations, all of which can delay treatment and adversely affect patient outcomes. With impressive accuracy in prediction across a range of diseases, LLMs have the potential to revolutionize clinical pre-screening and decision-making for various medical conditions. In this work, we study the diagnostic capability of LLMs for Rheumatoid Arthritis (RA) with real-world patient data. Patient data was collected alongside diagnoses from medical experts, and the performance of LLMs was evaluated in comparison to expert diagnoses for RA disease prediction. We notice an interesting pattern in disease diagnosis and find an unexpected misalignment between prediction and explanation. We conduct a series of multi-round analyses using different LLM agents. The best-performing model accurately predicts rheumatoid arthritis (RA) diseases approximately 95% of the time. However, when medical experts evaluated the reasoning generated by the model, they found that nearly 68% of the reasoning was incorrect. This study highlights a clear misalignment between LLMs' high prediction accuracy and their flawed reasoning, raising important questions about relying on LLM explanations in clinical settings. LLMs provide incorrect reasoning to arrive at the correct answer for RA disease diagnosis.

[AI-15] Societal Impacts Research Requires Benchmarks for Creative Composition Tasks ICLR2025

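Quick read: This paper is concerned with evaluating the societal impact of foundation models that automate cognitive tasks, in particular the potential negative consequences of the content they generate, such as flooding the information ecosystem with formulaic, homogeneous, and potentially misleading synthetic content. It argues that existing benchmarks do not adequately cover the high-risk areas of real-world use, especially tasks such as creative writing that require everyday creativity.
The key to the proposed direction is to analyze real use cases, identify the mismatches between current benchmarks and actual needs, and stress the importance of new benchmarks for tasks such as creative writing, so that the societal impact of content generated by AI with creative capabilities can be measured and understood more fully. It also calls for greater transparency to guide benchmark development.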

Link: https://arxiv.org/abs/2504.06549
Authors: Judy Hanwen Shen, Carlos Guestrin
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: v1: ICLR 2025 Workshop on Bidirectional Human-AI Alignment (BiAlign)

Abstract:Foundation models that are capable of automating cognitive tasks represent a pivotal technological shift, yet their societal implications remain unclear. These systems promise exciting advances, yet they also risk flooding our information ecosystem with formulaic, homogeneous, and potentially misleading synthetic content. Developing benchmarks grounded in real use cases where these risks are most significant is therefore critical. Through a thematic analysis using 2 million language model user prompts, we identify creative composition tasks as a prevalent usage category where users seek help with personal tasks that require everyday creativity. Our fine-grained analysis identifies mismatches between current benchmarks and usage patterns among these tasks. Crucially, we argue that the same use cases that currently lack thorough evaluations can lead to negative downstream impacts. This position paper argues that benchmarks focused on creative composition tasks are a necessary step towards understanding the societal harms of AI-generated content. We call for greater transparency in usage patterns to inform the development of new benchmarks that can effectively measure both the progress and the impacts of models with creative capabilities.

[AI-16] Polygon: Symbolic Reasoning for SQL using Conflict-Driven Under-Approximation Search

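Quick read: This paper addresses efficient input generation for symbolic reasoning about SQL queries. The goal is to generate an input I for a set of queries P_1, ..., P_n such that their outputs on I satisfy a given property expressed in Satisfiability Modulo Theories (SMT). The problem matters in settings such as refuting the equivalence of SQL queries and disambiguating queries. The key is a method that is both semantics-aware and lightweight. First, it reasons about an under-approximation of each query P_i, that is, a subset of its input-output behaviors, which makes the approach semantics-aware and lightweight. Second, it searches over an expressive family of under-approximations that collectively cover all program behaviors of interest, which makes the approach complete. The authors implement these ideas in a tool, Polygon, and evaluate it at scale on two tasks; the results show that it significantly outperforms prior techniques.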

Link: https://arxiv.org/abs/2504.06542
Authors: Pinhan Zhao, Yuepeng Wang, Xinyu Wang
Affiliations: Unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Databases (cs.DB); Software Engineering (cs.SE)
Comments: PLDI 2025

Abstract:We present a novel symbolic reasoning engine for SQL which can efficiently generate an input I for n queries P_1, ..., P_n, such that their outputs on I satisfy a given property (expressed in SMT). This is useful in different contexts, such as disproving equivalence of two SQL queries and disambiguating a set of queries. Our first idea is to reason about an under-approximation of each P_i, that is, a subset of P_i's input-output behaviors. While it makes our approach both semantics-aware and lightweight, this idea alone is incomplete (as a fixed under-approximation might miss some behaviors of interest). Therefore, our second idea is to perform search over an expressive family of under-approximations (which collectively cover all program behaviors of interest), thereby making our approach complete. We have implemented these ideas in a tool, Polygon, and evaluated it on over 30,000 benchmarks across two tasks (namely, SQL equivalence refutation and query disambiguation). Our evaluation results show that Polygon significantly outperforms all prior techniques.
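To make the equivalence-refutation task concrete, the sketch below looks for an input table on which two SQL queries return different results. It is deliberately naive: it enumerates small tables exhaustively with Python's built-in `sqlite3` module and is not Polygon's under-approximation search or an SMT encoding. The single-column table `t(a)`, the value domain, and the function names are assumptions made for this example.

```python
import itertools
import sqlite3


def run(query, rows):
    """Run a query against a fresh in-memory table t(a) holding the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in rows])
    # Sort so that two results are compared as bags, ignoring row order.
    result = sorted(conn.execute(query).fetchall())
    conn.close()
    return result


def find_counterexample(q1, q2, domain=(0, 1, 2), max_rows=2):
    """Return a small input on which q1 and q2 differ, or None if none is found."""
    for n in range(max_rows + 1):
        for rows in itertools.combinations_with_replacement(domain, n):
            if run(q1, rows) != run(q2, rows):
                return list(rows)
    return None


Q1 = "SELECT a FROM t WHERE a > 1"
Q2 = "SELECT a FROM t WHERE a >= 1"
witness = find_counterexample(Q1, Q2)  # a table on which Q1 and Q2 disagree
```

Exhaustive enumeration grows quickly with table size and value domain, which is the kind of cost a symbolic engine is meant to avoid; returning `None` here only means no difference was found within the bounds tried.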

[AI-17] OPAL: Encoding Causal Understanding of Physical Systems for Robot Learning

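Quick read: This paper addresses the generation and execution of action sequences in robot control, with the aim of improving performance on complex manipulation tasks. Traditional methods for long-horizon action planning often suffer from an overly large search space and a lack of physical constraints, which limits performance or makes computation expensive. The paper proposes OPAL (Operant Physical Agent with Language), a new architecture that combines vision, language, and action and introduces topological attention. Its key innovation is to model action sequences as topologically structured representations with non-trivial constraints; topological constraints derived from physical laws restrict the search space of the learning problem and yield more coherent long-horizon action sequences. The method significantly improves zero-shot performance without task-specific fine-tuning and reduces inference computational requirements by 42%. At its core, the approach combines topological principles with the Transformer architecture to embed causal understanding, offering both theoretical guarantees and practical efficiency for robot control.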

Link: https://arxiv.org/abs/2504.06538
Authors: Daniel Tcheurekdjian, Joshua Klasmeier, Tom Cooney, Christopher McCann, Tyler Fenstermaker
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures, 3 tables, 24 equations

Abstract:We present OPAL (Operant Physical Agent with Language), a novel vision-language-action architecture that introduces topological constraints to flow matching for robotic control. To do so, we further introduce topological attention. Our approach models action sequences as topologically-structured representations with non-trivial constraints. Experimental results across 10 complex manipulation tasks demonstrate OPAL’s superior performance compared to previous approaches, including Octo, OpenVLA, and π0. Our architecture achieves significant improvements in zero-shot performance without requiring task-specific fine-tuning, while reducing inference computational requirements by 42%. The theoretical guarantees provided by our topological approach result in more coherent long-horizon action sequences. Our results highlight the potential of constraining the search space of learning problems in robotics by deriving from fundamental physical laws, and the possibility of using topological attention to embed causal understanding into transformer architectures.

[AI-18] Flexible Graph Similarity Computation With A Proactive Optimization Strategy

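Quick read: This paper addresses two limitations of learning-based Graph Edit Distance (GED) computation: difficulty handling varying operation costs, and reliance on isolated node distances, which makes matching inefficient and requires repeated refinement. The key is a new method, Graph Edit Network (GEN). First, it incorporates operation costs before establishing graph mappings, which makes GED computation more flexible. Second, it proactively optimizes matching guidance from a graph perspective: each node's alignment difficulty serves as the initial guidance, and a difficulty propagation mechanism captures the interdependencies between matches, enabling better-informed matching decisions. As a result, GEN selects optimal matches in a single step, which greatly reduces the need for costly iterative refinement and improves both efficiency and adaptability.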

Link: https://arxiv.org/abs/2504.06533
Authors: Zhouyang Liu, Ning Liu, Yixin Chen, Jiezhong He, Dongsheng Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments:

Abstract:Graph Edit Distance (GED) is an important similarity measure in graph retrieval, which quantifies the minimum cost of transforming one graph into another through edit operations, and offers flexibility by allowing customizable operation costs. Recent learning-based approaches approximate GEDs with the distances between representations in vector spaces. However, these methods often struggle with varying operation costs due to neglecting the impact of these costs on determining optimal graph mappings. Furthermore, they rely on isolated node distances as guidance, necessitating inefficient reactive refinements of mappings. To address these issues, we propose Graph Edit Network (GEN), a novel learning-based approach for flexible GED computation. By identifying the limitations of existing methods in capturing flexibility of GED, we introduce a principled yet simple solution that incorporates the operation costs before establishing mappings. To improve matching efficiency, we propose a strategy that proactively optimizes guidance from a graph perspective. This strategy initializes guidance as each node’s alignment difficulty and captures the interdependencies between matches within and across graphs through a difficulty propagation mechanism, enabling more informed decisions. As a result, GEN selects optimal matches in a single step, minimizing the need for costly refinements. Results on real-world and synthetic datasets demonstrate the effectiveness, time efficiency, and adaptability of GEN, achieving up to 37.8% error reduction and 72.7% inference time reduction compared with state-of-the-art models, while performing robustly under varying cost settings and graph sizes.

[AI-19] WaveHiTS: Wavelet-Enhanced Hierarchical Time Series Modeling for Wind Direction Nowcasting in Eastern Inner Mongolia

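Quick read: This paper addresses three challenges in wind direction forecasting: the circular nature of directional data, error accumulation in multi-step forecasting, and complex meteorological interactions. The proposed solution, WaveHiTS (Wavelet Transform with Neural Hierarchical Interpolation for Time Series), combines the wavelet transform with neural hierarchical interpolation. It decomposes wind direction into U-V components, applies the wavelet transform to capture multi-scale frequency patterns, and uses a hierarchical structure to model temporal dependencies at multiple scales, which mitigates error propagation. Experiments show that WaveHiTS significantly outperforms a range of baseline models on several metrics, and that each of its core components contributes meaningfully to overall performance.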

Link: https://arxiv.org/abs/2504.06532
Authors: Hailong Shu, Weiwei Song, Yue Wang, Jiping Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Wind direction forecasting plays a crucial role in optimizing wind energy production, but faces significant challenges due to the circular nature of directional data, error accumulation in multi-step forecasting, and complex meteorological interactions. This paper presents a novel model, WaveHiTS, which integrates wavelet transform with Neural Hierarchical Interpolation for Time Series to address these challenges. Our approach decomposes wind direction into U-V components, applies wavelet transform to capture multi-scale frequency patterns, and utilizes a hierarchical structure to model temporal dependencies at multiple scales, effectively mitigating error propagation. Experiments conducted on real-world meteorological data from Inner Mongolia, China demonstrate that WaveHiTS significantly outperforms deep learning models (RNN, LSTM, GRU), transformer-based approaches (TFT, Informer, iTransformer), and hybrid models (EMD-LSTM). The proposed model achieves RMSE values of approximately 19.2°-19.4° compared to 56°-64° for deep learning recurrent models, maintaining consistent accuracy across all forecasting steps up to 60 minutes ahead. Moreover, WaveHiTS demonstrates superior robustness with vector correlation coefficients (VCC) of 0.985-0.987 and hit rates of 88.5%-90.1%, substantially outperforming baseline models. Ablation studies confirm that each component (wavelet transform, hierarchical structure, and U-V decomposition) contributes meaningfully to overall performance. These improvements in wind direction nowcasting have significant implications for enhancing wind turbine yaw control efficiency and grid integration of wind energy.
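The circularity problem mentioned in this abstract is easy to see in code: 359° and 1° are only 2° apart, but a naive difference gives 358°. The sketch below shows a U-V decomposition and a wrap-aware error in plain Python. It assumes unit wind speed and a simple convention (u = sin θ, v = cos θ); the paper's exact convention and its use of wind speed are not stated in the abstract, and the function names are hypothetical.

```python
import math


def direction_to_uv(deg):
    """Map a direction in degrees to unit-vector components (u, v).

    Assumed convention: u = sin(theta), v = cos(theta), unit speed.
    """
    rad = math.radians(deg)
    return math.sin(rad), math.cos(rad)


def uv_to_direction(u, v):
    """Recover the direction in degrees, wrapped into [0, 360]."""
    return math.degrees(math.atan2(u, v)) % 360.0


def circular_error(pred_deg, true_deg):
    """Smallest angular distance between two directions, in degrees."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)
```

Because u and v vary smoothly across the 0°/360° boundary, a model can regress on them without seeing an artificial jump, and the direction is recovered afterwards with `uv_to_direction`.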

[AI-20] Beyond Moore's Law: Harnessing the Redshift of Generative AI with Effective Hardware-Software Co-Design

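Quick read: This paper addresses the challenge facing the increasingly rigid and outdated philosophy of decoupled hardware and software design under the traditional computing paradigm. With the performance gains of Moore's Law slowing and physical limits emerging, it asks how system design principles should be redefined to meet the demands of modern computing. The paper argues that the once-clear boundary between hardware and software is rapidly dissolving and that system abstractions and design principles need to be reshaped through hardware-software co-design. The key is to integrate hardware and software deeply in system design, elevate system abstractions to first-class status, and rethink design principles to meet the performance demands of modern computing. The paper also discusses the concept of the "hardware lottery" and its constraining influence on the next era of computing innovation, and explores directions for mitigating it.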

Link: https://arxiv.org/abs/2504.06531
Authors: Amir Yazdanbakhsh
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Abstract:For decades, Moore’s Law has served as a steadfast pillar in computer architecture and system design, promoting a clear abstraction between hardware and software. This traditional Moore’s computing paradigm has deepened the rift between the two, enabling software developers to achieve near-exponential performance gains often without needing to delve deeply into hardware-specific optimizations. Yet today, Moore’s Law, with its once relentless performance gains now diminished to incremental improvements, faces inevitable physical barriers. This stagnation necessitates a reevaluation of the conventional system design philosophy. The traditional decoupled system design philosophy, which maintains strict abstractions between hardware and software, is increasingly obsolete. The once-clear boundary between software and hardware is rapidly dissolving, replaced by co-design. It is imperative for the computing community to intensify its commitment to hardware-software co-design, elevating system abstractions to first-class citizens and reimagining design principles to satisfy the insatiable appetite of modern computing. Hardware-software co-design is not a recent innovation. To illustrate its historical evolution, I classify its development into five relatively distinct "epochs". This post also highlights the growing influence of the architecture community in interdisciplinary teams, particularly alongside ML researchers, and explores why current co-design paradigms are struggling in today's computing landscape. Additionally, I will examine the concept of the "hardware lottery" and explore directions to mitigate its constraining influence on the next era of computing innovation.

[AI-21] The Power of the Pareto Front: Balancing Uncertain Rewards for Adaptive Experimentation in scanning probe microscopy

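Quick read: This paper addresses the challenge that optimization targets in automated experimentation are often uncertain or probabilistic, and proposes Multi-Objective Bayesian Optimization (MOBO) to balance multiple competing rewards in autonomous experiments. The key is to compute and analyze the Pareto front, which both guides optimization and provides physical insight into the trade-offs between objectives. Combined with a human-in-the-loop decision framework, this lets researchers fine-tune experimental parameters based on domain expertise, improving measurement quality, reproducibility, and efficiency and advancing autonomous scientific discovery.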

Link: https://arxiv.org/abs/2504.06525
Authors: Yu Liu, Sergei V. Kalinin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: 23 pages, 9 figures

Abstract:Automated experimentation has the potential to revolutionize scientific discovery, but its effectiveness depends on well-defined optimization targets, which are often uncertain or probabilistic in real-world settings. In this work, we demonstrate the application of Multi-Objective Bayesian Optimization (MOBO) to balance multiple, competing rewards in autonomous experimentation. Using scanning probe microscopy (SPM) imaging, one of the most widely used and foundational SPM modes, we show that MOBO can optimize imaging parameters to enhance measurement quality, reproducibility, and efficiency. A key advantage of this approach is the ability to compute and analyze the Pareto front, which not only guides optimization but also provides physical insights into the trade-offs between different objectives. Additionally, MOBO offers a natural framework for human-in-the-loop decision-making, enabling researchers to fine-tune experimental trade-offs based on domain expertise. By standardizing high-quality, reproducible measurements and integrating human input into AI-driven optimization, this work highlights MOBO as a powerful tool for advancing autonomous scientific discovery.
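For readers unfamiliar with the Pareto front named in this abstract, the following sketch extracts the non-dominated points from a set of already-measured reward tuples. It covers only that final step, not the Bayesian optimization loop or its surrogate models, and it assumes every objective is to be maximized. The function names are illustrative.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere.

    All objectives are assumed to be maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Each point on the front represents a different trade-off between the objectives, which is what allows a human in the loop to pick an operating point instead of collapsing the rewards into a single score up front. This quadratic-time scan is adequate for the small candidate sets typical of experimental campaigns.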

[AI-22] Exploiting Meta-Learning-based Poisoning Attacks for Graph Link Prediction

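Quick read: This paper addresses the vulnerability of the Variational Graph Auto-Encoder (VGAE) to graph adversarial attacks, filling a gap left by prior work, which has focused mainly on the robustness of the Graph Convolution Network (GCN) and paid little attention to VGAE. The key is an unweighted graph poisoning attack based on meta-learning, designed to undermine VGAE's link prediction performance. Experiments show that the proposed method substantially reduces link prediction accuracy and outperforms other state-of-the-art methods.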

Link: https://arxiv.org/abs/2504.06492
Authors: Mingchen Li, Di Zhuang, Keyu Chen, Dumindu Samaraweera, Morris Chang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Link prediction in graph data utilizes various algorithms and machine learning/deep learning models to predict potential relationships between graph nodes. This technique has found widespread use in numerous real-world applications, including recommendation systems, community networks, and biological structures. However, recent research has highlighted the vulnerability of link prediction models to adversarial attacks, such as poisoning and evasion attacks. Addressing the vulnerability of these models is crucial to ensure stable and robust performance in link prediction applications. While many works have focused on enhancing the robustness of the Graph Convolution Network (GCN) model, the Variational Graph Auto-Encoder (VGAE), a sophisticated model for link prediction, has not been thoroughly investigated in the context of graph adversarial attacks. To bridge this gap, this article proposes an unweighted graph poisoning attack approach using meta-learning techniques to undermine VGAE’s link prediction performance. We conducted comprehensive experiments on diverse datasets to evaluate the proposed method and its parameters, comparing it with existing approaches in similar settings. Our results demonstrate that our approach significantly diminishes link prediction performance and outperforms other state-of-the-art methods.

[AI-23] Agent-Arena: A General Framework for Evaluating Control Algorithms

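Quick read: This paper addresses the difficulty of adapting algorithms in robotics research, which stems from the diversity of environments and control algorithms and from the extensive hyper-parameter tuning that data-driven methods require. The key is Agent-Arena, a general Python framework for integrating, replicating, developing, and testing decision-making policies across a wide range of benchmark environments. It supports both simulation and real-robot scenarios, which significantly lowers the difficulty of adapting algorithms across environments.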

Link: https://arxiv.org/abs/2504.06468
Authors: Halid Abdulrahim Kadi, Kasim Terzić
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 20 pages and 1 figure

Abstract:Robotic research is inherently challenging, requiring expertise in diverse environments and control algorithms. Adapting algorithms to new environments often poses significant difficulties, compounded by the need for extensive hyper-parameter tuning in data-driven methods. To address these challenges, we present Agent-Arena, a Python framework designed to streamline the integration, replication, development, and testing of decision-making policies across a wide range of benchmark environments. Unlike existing frameworks, Agent-Arena is uniquely generalised to support all types of control algorithms and is adaptable to both simulation and real-robot scenarios. Please see our GitHub repository this https URL.

[AI-24] Federated Neural Architecture Search with Model-Agnostic Meta Learning

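Quick read: This paper addresses the long running time of neural architecture search (NAS) in federated learning (FL), where user data distributions are heterogeneous. Federated NAS is slow because the search space is large and models must be retrained. The paper proposes FedMetaNAS, a framework that combines meta-learning with NAS to speed up architecture search and improve accuracy. The key is to use the Gumbel-Softmax reparameterization to relax the mixed operations in the search space, to refine the local search process with Model-Agnostic Meta-Learning, and to apply soft pruning that gradually sparsifies the architecture. This removes the retraining stage while keeping the performance of the selected architecture robust, so the model can be used immediately. Experiments show that, compared with FedNAS, FedMetaNAS reduces search time by more than 50% and achieves higher accuracy.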

Link: https://arxiv.org/abs/2504.06457
Authors: Xinyuan Huang, Jiechao Gao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Federated Learning (FL) often struggles with data heterogeneity due to the naturally uneven distribution of user data across devices. Federated Neural Architecture Search (NAS) enables collaborative search for optimal model architectures tailored to heterogeneous data to achieve higher accuracy. However, this process is time-consuming due to extensive search space and retraining. To overcome this, we introduce FedMetaNAS, a framework that integrates meta-learning with NAS within the FL context to expedite the architecture search by pruning the search space and eliminating the retraining stage. Our approach first utilizes the Gumbel-Softmax reparameterization to facilitate relaxation of the mixed operations in the search space. We then refine the local search process by incorporating Model-Agnostic Meta-Learning, where a task-specific learner adapts both weights and architecture parameters (alphas) for individual tasks, while a meta learner adjusts the overall model weights and alphas based on the gradient information from task learners. Following the meta-update, we propose soft pruning using the same trick on search space to gradually sparsify the architecture, ensuring that the performance of the chosen architecture remains robust after pruning which allows for immediate use of the model without retraining. Experimental evaluations demonstrate that FedMetaNAS significantly accelerates the search process by more than 50% with higher accuracy compared to FedNAS.
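The Gumbel-Softmax relaxation mentioned in this abstract can be sketched without a deep learning framework. The code below samples soft selection weights over candidate operations from architecture parameters (alphas) and forms the weighted mixed operation. It is a scalar toy under stated assumptions: real NAS code operates on tensors with automatic differentiation, and the function names and the `rng` argument are choices made for this example, not FedMetaNAS's interface.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Sample soft one-hot weights: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    top = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_operation(x, ops, alphas, temperature, rng):
    """Weighted sum of candidate operations, weights drawn by Gumbel-Softmax."""
    weights = gumbel_softmax(alphas, temperature, rng)
    return sum(w * op(x) for w, op in zip(weights, ops))
```

Lower temperatures push the sampled weights toward a single operation, which is the property that makes a gradual, soft form of pruning possible; how FedMetaNAS schedules the temperature or thresholds the weights is not specified in the abstract.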

[AI-25] Can you Finetune your Binoculars? Embedding Text Watermarks into the Weights of Large Language Models

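Quick read: This paper addresses the transparency and accountability challenges that arise because AI-generated content is hard to distinguish from human text, and in particular the problem of embedding a watermarking strategy directly into model weights so that it shows up in generated outputs. The key is to fine-tune a pair of low-rank adapters of a model, one serving as the text generator and the other as the detector, so that a subtle watermark is embedded in the generated text and simultaneously optimized for detectability. The watermarking strategy is thus learned fully end-to-end, but optimization is difficult because watermark robustness, naturalness, and task performance must be balanced. The paper discusses strategies for optimizing this min-max objective and presents results showing the effect of this modification on instruction fine-tuning.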

Link: https://arxiv.org/abs/2504.06446
Authors: Fay Elhassan, Niccolò Ajroldi, Antonio Orvieto, Jonas Geiping
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The indistinguishability of AI-generated content from human text raises challenges in transparency and accountability. While several methods exist to watermark models behind APIs, embedding watermark strategies directly into model weights that are later reflected in the outputs of the model is challenging. In this study we propose a strategy to finetune a pair of low-rank adapters of a model, one serving as the text-generating model, and the other as the detector, so that a subtle watermark is embedded into the text generated by the first model and simultaneously optimized for detectability by the second. In this way, the watermarking strategy is fully learned end-to-end. This process imposes an optimization challenge, as balancing watermark robustness, naturalness, and task performance requires trade-offs. We discuss strategies on how to optimize this min-max objective and present results showing the effect of this modification to instruction finetuning.

[AI-26] Physical spline for denoising object trajectory data by combining splines, ML feature regression and model knowledge

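Quick read: This paper addresses the estimation of dynamic driving states (position, velocity, acceleration, and heading) from noisy measurement data. The key is a method that works with both complete and partial observations and produces refined, kinematically consistent trajectory signals by enforcing that velocity is the integral of acceleration, that position is the integral of velocity, and that a vehicle can only move in the direction of its orientation. Regularization is applied to prevent extreme state variations. The method is implemented as a configurable Python library and can estimate trajectories from position data alone. A key application is refining recorded trajectory data for use as reference inputs to machine learning models. The paper closes by presenting results and comparing them with ground truth data.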

Link: https://arxiv.org/abs/2504.06404
Authors: Jonas Torzewski
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: 7 pages, 7 figures, this https URL

Abstract:This article presents a method for estimating the dynamic driving states (position, velocity, acceleration and heading) from noisy measurement data. The proposed approach is effective with both complete and partial observations, producing refined trajectory signals with kinematic consistency, ensuring that velocity is the integral of acceleration and position is the integral of velocity. Additionally, the method accounts for the constraint that vehicles can only move in the direction of their orientation. The method is implemented as a configurable Python library that also enables trajectory estimation solely based on position data. Regularization is applied to prevent extreme state variations. A key application is enhancing recorded trajectory data for use as reference inputs in machine learning models. At the end, the article presents the results of the method along with a comparison to ground truth data.
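The kinematic consistency constraints listed in this abstract can be illustrated by constructing a trajectory in which they hold by design: velocity is integrated from acceleration, and position is integrated from velocity along the heading. The sketch uses forward Euler steps and hypothetical names; it does not reproduce the paper's spline fitting, its regularization, or the API of the published library.

```python
import math


def integrate_trajectory(accels, headings, dt, v0=0.0, x0=0.0, y0=0.0):
    """Build (x, y, v) states that are kinematically consistent by construction.

    accels:   longitudinal acceleration per step
    headings: heading angle per step, in radians
    dt:       step length in seconds
    """
    v, x, y = v0, x0, y0
    states = [(x, y, v)]
    for a, heading in zip(accels, headings):
        # The vehicle moves only along its heading (no sideways motion).
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        # Velocity is the running integral of acceleration.
        v += a * dt
        states.append((x, y, v))
    return states
```

A denoising method in this spirit could search for the acceleration and heading signals whose integrated trajectory best matches the noisy position measurements, so that the estimated states never contradict each other.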

[AI-27] MM-STFlowNet: A Transportation Hub-Oriented Multi-Mode Passenger Flow Prediction Method via Spatial-Temporal Dynamic Graph Modeling

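Quick read: This paper addresses a limitation of traditional methods for the collaborative management of multiple transport modes in large transportation hubs: they consider only overall passenger volume and neglect the interdependence between modes within the hub. The paper proposes MM-STFlowNet, a comprehensive multi-mode prediction framework based on dynamic spatial-temporal graph modeling. First, an integrated temporal feature processing strategy uses signal decomposition and convolution to handle data spikes and high volatility. Next, the Spatial-Temporal Dynamic Graph Convolutional Recurrent Network (STDGCRN), enhanced by an adaptive channel attention mechanism, captures detailed spatial-temporal dependencies across traffic modes. Finally, a self-attention mechanism incorporates external factors to further improve prediction accuracy. Experiments on a real-world dataset show that MM-STFlowNet achieves state-of-the-art performance, particularly during peak periods, and provides useful insight for hub management.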

Link: https://arxiv.org/abs/2504.06325
Authors: Ronghui Zhang, Wenbin Xing, Mengran Li, Zihan Wang, Junzhou Chen, Xiaolei Ma, Zhiyuan Liu, Zhengbing He
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate and refined passenger flow prediction is essential for optimizing the collaborative management of multiple collection and distribution modes in large-scale transportation hubs. Traditional methods often focus only on the overall passenger volume, neglecting the interdependence between different modes within the hub. To address this limitation, we propose MM-STFlowNet, a comprehensive multi-mode prediction framework grounded in dynamic spatial-temporal graph modeling. Initially, an integrated temporal feature processing strategy is implemented using signal decomposition and convolution techniques to address data spikes and high volatility. Subsequently, we introduce the Spatial-Temporal Dynamic Graph Convolutional Recurrent Network (STDGCRN) to capture detailed spatial-temporal dependencies across multiple traffic modes, enhanced by an adaptive channel attention mechanism. Finally, the self-attention mechanism is applied to incorporate various external factors, further enhancing prediction accuracy. Experiments on a real-world dataset from Guangzhounan Railway Station in China demonstrate that MM-STFlowNet achieves state-of-the-art performance, particularly during peak periods, providing valuable insight for transportation hub management.

[AI-28] From Stability to Inconsistency: A Study of Moral Preferences in LLMs

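Quick read: This paper addresses the limited understanding of the implicit biases and moral tendencies of large language models (LLMs) in everyday use. The key is a Moral Foundations LLM dataset (MFD-LLM) grounded in Moral Foundations Theory, which conceptualizes human morality through six core foundations. The paper also proposes a new evaluation method that captures the full spectrum of LLMs' revealed moral preferences by having them answer a range of real-world moral dilemmas. It finds that state-of-the-art models have remarkably homogeneous value preferences yet lack consistency.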

Link: https://arxiv.org/abs/2504.06324
Authors: Monika Jotautaite, Mary Phuong, Chatrik Singh Mangat, Maria Angelica Martinez
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:As large language models (LLMs) increasingly integrate into our daily lives, it becomes crucial to understand their implicit biases and moral tendencies. To address this, we introduce a Moral Foundations LLM dataset (MFD-LLM) grounded in Moral Foundations Theory, which conceptualizes human morality through six core foundations. We propose a novel evaluation method that captures the full spectrum of LLMs’ revealed moral preferences by answering a range of real-world moral dilemmas. Our findings reveal that state-of-the-art models have remarkably homogeneous value preferences, yet demonstrate a lack of consistency.

[AI-29] Mosaic: Composite Projection Pruning for Resource-efficient LLMs

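Quick read: This paper targets the compute and memory constraints that limit the deployment of large language models (LLMs) on hardware. Existing coarse-grained pruning methods reduce model size, but they are time-consuming and inherently remove critical parameters, which degrades the quality of the pruned model. The paper introduces projection pruning, a novel fine-grained pruning method, and enhances it with composite projection pruning, which combines unstructured pruning (to retain accuracy) with structured pruning (to reduce model size). The authors also build Mosaic, a system for creating and deploying pruned LLMs with composite projection pruning, and evaluate it across multiple hardware platforms, LLMs, and datasets. Mosaic produces models 7.19x faster than existing approaches; its models achieve up to 84.2% lower perplexity and 31.4% higher accuracy than models from coarse-grained pruning, with up to 67% faster inference and 68% lower GPU memory use.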

Link: https://arxiv.org/abs/2504.06323
Authors: Bailey J. Eccles,Leon Wong,Blesson Varghese
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Extensive compute and memory requirements limit the deployment of large language models (LLMs) on any hardware. Compression methods, such as pruning, can reduce model size, which in turn reduces resource requirements. State-of-the-art pruning is based on coarse-grained methods. They are time-consuming and inherently remove critical model parameters, adversely impacting the quality of the pruned model. This paper introduces projection pruning, a novel fine-grained method for pruning LLMs. In addition, LLM projection pruning is enhanced by a new approach we refer to as composite projection pruning - the synergistic combination of unstructured pruning that retains accuracy and structured pruning that reduces model size. We develop Mosaic, a novel system to create and deploy pruned LLMs using composite projection pruning. Mosaic is evaluated using a range of performance and quality metrics on multiple hardware platforms, LLMs, and datasets. Mosaic is 7.19x faster in producing models than existing approaches. Mosaic models achieve up to 84.2% lower perplexity and 31.4% higher accuracy than models obtained from coarse-grained pruning. Up to 67% faster inference and 68% lower GPU memory use is noted for Mosaic models.
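
The difference between the two pruning families named in the abstract can be sketched on a toy weight matrix. This is a generic magnitude-based illustration, not Mosaic's projection pruning: the function names, the magnitude and row-norm criteria, and the matrix `W` are all assumptions made for the example.

```python
def unstructured_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries; the matrix shape is unchanged."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of entries to zero
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    pruned, removed = [], 0
    for row in weights:
        new_row = []
        for w in row:
            # The counter stops ties from zeroing more than k entries.
            if abs(w) <= threshold and removed < k:
                new_row.append(0.0)
                removed += 1
            else:
                new_row.append(w)
        pruned.append(new_row)
    return pruned


def structured_prune(weights, keep_rows):
    """Drop whole rows with the smallest squared L2 norm; the matrix shrinks."""
    norms = [(sum(w * w for w in row), i) for i, row in enumerate(weights)]
    keep = sorted(i for _, i in sorted(norms, reverse=True)[:keep_rows])
    return [weights[i][:] for i in keep]


# Hypothetical 4x3 weight matrix.
W = [
    [0.9, -0.1, 0.4],
    [0.05, 0.02, -0.03],
    [-0.7, 0.6, 0.2],
    [0.3, -0.8, 0.01],
]

# Combine both: remove the weakest row, then sparsify what is left.
smaller = structured_prune(W, keep_rows=3)
composite = unstructured_prune(smaller, sparsity=0.5)
```

Structured pruning changes the tensor shape, so it saves memory without sparse kernels; unstructured pruning keeps the shape and only introduces zeros.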

[AI-30] Assessing employment and labour issues implicated by using AI

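Quick read: This paper addresses the limitations of the reductionist approach that dominates current research on AI and work, which treats tasks and skills as replaceable components and overlooks the interdependence of tasks, roles, and workplace contexts. It proposes two complementary approaches: an ethnographic, context-rich method that highlights how AI reconfigures work environments and expertise, and a relational task-based analysis that links micro-level work descriptions to macro-level labor trends. The authors argue that effective AI impact assessments must go beyond predicting automation rates to cover ethics, well-being, and expertise. Drawing on empirical case studies, the paper shows how AI reshapes human-technology relations, professional roles, and tacit knowledge practices, and it calls for a human-centric, holistic framework to guide organizational and policy decisions, balancing technological possibilities with the social desirability and sustainability of work.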

Link: https://arxiv.org/abs/2504.06322
Authors: Thijs Willems,Darion Jin Hotan,Jiawen Cheryl Tang,Norakmal Hakim bin Norhashim,King Wang Poon,Zi An Galvyn Goh,Radha Vinod
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: This manuscript is accepted for publication in Emad Yaghmaei, et al., eds., Global Perspectives on AI Impact Assessment (Oxford University Press, forthcoming 2025)

Abstract:This chapter critiques the dominant reductionist approach in AI and work studies, which isolates tasks and skills as replaceable components. Instead, it advocates for a systemic perspective that emphasizes the interdependence of tasks, roles, and workplace contexts. Two complementary approaches are proposed: an ethnographic, context-rich method that highlights how AI reconfigures work environments and expertise; and a relational task-based analysis that bridges micro-level work descriptions with macro-level labor trends. The authors argue that effective AI impact assessments must go beyond predicting automation rates to include ethical, well-being, and expertise-related questions. Drawing on empirical case studies, they demonstrate how AI reshapes human-technology relations, professional roles, and tacit knowledge practices. The chapter concludes by calling for a human-centric, holistic framework that guides organizational and policy decisions, balancing technological possibilities with social desirability and sustainability of work.

[AI-31] Hybrid Temporal Differential Consistency Autoencoder for Efficient and Sustainable Anomaly Detection in Cyber-Physical Systems

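Quick read: This paper addresses the rise in cyberattacks on critical infrastructure, particularly water distribution systems, driven by rapid digitalization and the integration of IoT devices with industrial control systems, and it develops an efficient intrusion detection system to handle the new vulnerabilities that cyber-physical systems introduce. The key ideas are to exploit time correlations in sensor data, integrate physical principles into machine learning models, and optimize computational efficiency for edge applications. The proposed hybrid autoencoder, hybrid TDC-AE, extends the temporal differential consistency (TDC) loss. It achieves state-of-the-art classification performance while improving time to detect anomalies by 3%, maintains the computational efficiency of conventional autoencoders, and uses fewer fully connected layers, yielding a more sustainable and efficient solution. The method shows how physics-inspired consistency principles can enhance anomaly detection and strengthen the resilience of cyber-physical systems.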

Link: https://arxiv.org/abs/2504.06320
Authors: Michael Somma
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Cyberattacks on critical infrastructure, particularly water distribution systems, have increased due to rapid digitalization and the integration of IoT devices and industrial control systems (ICS). These cyber-physical systems (CPS) introduce new vulnerabilities, requiring robust and automated intrusion detection systems (IDS) to mitigate potential threats. This study addresses key challenges in anomaly detection by leveraging time correlations in sensor data, integrating physical principles into machine learning models, and optimizing computational efficiency for edge applications. We build upon the concept of temporal differential consistency (TDC) loss to capture the dynamics of the system, ensuring meaningful relationships between dynamic states. Expanding on this foundation, we propose a hybrid autoencoder-based approach, referred to as hybrid TDC-AE, which extends TDC by incorporating both deterministic nodes and conventional statistical nodes. This hybrid structure enables the model to account for non-deterministic processes. Our approach achieves state-of-the-art classification performance while improving time to detect anomalies by 3%, outperforming the BATADAL challenge leader without requiring domain-specific knowledge, making it broadly applicable. Additionally, it maintains the computational efficiency of conventional autoencoders while reducing the number of fully connected layers, resulting in a more sustainable and efficient solution. The method demonstrates how leveraging physics-inspired consistency principles enhances anomaly detection and strengthens the resilience of cyber-physical systems.

[AI-32] Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching

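Quick read: This paper addresses the pronounced memory-bandwidth bottleneck that large language models (LLMs) exhibit during inference due to High Bandwidth Memory (HBM) bandwidth constraints. It proposes an L2 cache-oriented asynchronous key-value (KV) cache prefetching method that breaks through the bandwidth limit by overlapping computation with memory loads. The key is to schedule idle memory bandwidth during active computation windows so that the required KV cache is prefetched into the GPU L2 cache, turning subsequent accesses into fast L2 cache hits and hiding HBM access latency. Experiments show a 2.15x improvement in attention kernel efficiency and up to 1.97x higher end-to-end throughput, surpassing the state-of-the-art baseline FlashAttention-3. The method is orthogonal to existing optimization techniques and can be integrated into current inference frameworks.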

Link: https://arxiv.org/abs/2504.06319
Authors: Yanhao Dong,Yubo Miao,Weinan Li,Xiao Zheng,Chao Wang,Feng Lyu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures

Abstract:Large Language Models (LLMs) exhibit pronounced memory-bound characteristics during inference due to High Bandwidth Memory (HBM) bandwidth constraints. In this paper, we propose an L2 Cache-oriented asynchronous KV Cache prefetching method to break through the memory bandwidth bottleneck in LLM inference through computation-load overlap. By strategically scheduling idle memory bandwidth during active computation windows, our method proactively prefetches required KV Cache into GPU L2 cache, enabling high-speed L2 cache hits for subsequent accesses and effectively hiding HBM access latency within computational cycles. Extensive experiments on NVIDIA H20 GPUs demonstrate that the proposed method achieves 2.15x improvement in attention kernel efficiency and up to 1.97x end-to-end throughput enhancement, surpassing state-of-the-art baseline FlashAttention-3. Notably, our solution maintains orthogonality to existing optimization techniques and can be integrated with current inference frameworks, providing a scalable latency-hiding solution for next-generation LLM inference engines.

[AI-33] DMol: A Schedule-Driven Diffusion Model for Highly Efficient and Versatile Molecule Generation

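Quick read: This paper aims to improve validity in small molecule generation while also improving generation efficiency. Existing methods such as DiGress already reach high validity, but there is still room for improvement. The paper proposes DMol, a new graph diffusion model whose key ingredients are a redesigned objective function and a "graph noise" scheduling strategy. At each diffusion step, only a subset of nodes of varying size in the molecule graph is changed, which reduces the number of diffusion steps by at least 10-fold and cuts the running time to roughly one half. The method can also be combined with junction-tree-like graph representations: compressing frequent carbon rings into supernodes improves validity by roughly a further 2%, increases the novelty of the method, and further reduces running time because the graph is smaller.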

Link: https://arxiv.org/abs/2504.06312
Authors: Peizhi Niu,Yu-Hsiang Wang,Vishal Rana,Chetan Rupakheti,Abhishek Pandey,Olgica Milenkovic
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce a new graph diffusion model for small molecule generation, DMol, which outperforms the state-of-the-art DiGress model in terms of validity by roughly 1.5% across all benchmarking datasets while reducing the number of diffusion steps by at least 10-fold, and the running time to roughly one half. The performance improvements are a result of a careful change in the objective function and a "graph noise" scheduling approach which, at each diffusion step, allows one to only change a subset of nodes of varying size in the molecule graph. Another relevant property of the method is that it can be easily combined with junction-tree-like graph representations that arise by compressing a collection of relevant ring structures into supernodes. Unlike classical junction-tree techniques that involve VAEs and require complicated reconstruction steps, compressed DMol directly performs graph diffusion on a graph that compresses only a carefully selected set of frequent carbon rings into supernodes, which results in straightforward sample generation. This compressed DMol method offers additional validity improvements over generic DMol of roughly 2%, increases the novelty of the method, and further improves the running time due to reductions in the graph size.

[AI-34] Rethinking RoPE: A Mathematical Blueprint for N-dimensional Positional Encoding

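Quick read: This paper addresses the lack of a unified theoretical foundation for existing Rotary Position Embedding (RoPE) variants, especially in higher dimensions. Its key contribution is a systematic mathematical framework for RoPE grounded in Lie group and Lie algebra theory. Within this framework, the authors identify two core properties of RoPE, relativity and reversibility, and derive general constraints and constructions for valid RoPE in 1D, 2D, and N dimensions. They prove that RoPE must lie in the basis of a maximal abelian subalgebra (MASA) of the special orthogonal Lie algebra, and show that standard RoPE corresponds to the maximal toral subalgebra. They further propose modeling inter-dimensional interactions by learning an orthogonal basis transformation. The framework unifies and explains existing RoPE designs and offers a principled path for extending RoPE to new modalities and tasks.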

Link: https://arxiv.org/abs/2504.06308
Authors: Haiping Liu,Hongpeng Zhou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Rotary Position Embedding (RoPE) is widely adopted in Transformers due to its ability to encode relative positions with high efficiency and extrapolation capability. However, existing RoPE variants lack a unified theoretical foundation, especially in higher dimensions. In this paper, we propose a systematic mathematical framework for RoPE grounded in Lie group and Lie algebra theory. We identify two core properties of RoPE, named relativity and reversibility, and derive general constraints and constructions for valid RoPE in 1D, 2D, and N-dimensional (ND). We prove that RoPE must lie in the basis of a maximal abelian subalgebra (MASA) of the special orthogonal Lie algebra, and show that standard RoPE corresponds to the maximal toral subalgebra. Furthermore, we propose to model inter-dimensional interactions by learning an orthogonal basis transformation. Our framework unifies and explains existing RoPE designs, while enabling principled extensions to new modalities and tasks.
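
The relative-position behavior of standard 1D RoPE can be checked numerically on a single 2D feature pair. This is a minimal sketch, assuming one rotation frequency `theta` and made-up query/key vectors; it only illustrates the familiar identity R(m)^T R(n) = R(n - m) and is not the paper's Lie-algebraic construction.

```python
import math


def rotate(vec, position, theta=0.5):
    """Apply the 2D rotation R(position * theta) to a 2-vector."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


# Hypothetical query and key features.
q, k = (1.0, 2.0), (0.5, -1.5)

# Same offset (n - m = 3) at two different absolute positions.
score_a = dot(rotate(q, 2), rotate(k, 5))
score_b = dot(rotate(q, 10), rotate(k, 13))

# Both equal the score obtained by rotating only the key by the offset.
score_rel = dot(q, rotate(k, 3))
```

The three scores agree up to floating-point error, so the attention logit depends only on the offset between the two positions.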

[AI-35] Optimizing Large Language Models: Metrics, Energy Efficiency, and Case Study Insights

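Quick read: This paper addresses the significant energy consumption and carbon emissions caused by the rapid adoption of large language models (LLMs), which pose a serious challenge to the sustainability of generative AI. Its key contribution is the use of energy-efficient optimization techniques: through a case study and framework, it shows how strategic quantization and local inference can substantially lower the carbon footprint of LLMs without compromising operational effectiveness. Experiments show that these methods can reduce energy consumption and carbon emissions by up to 45% after quantization, making them particularly suitable for resource-constrained environments. The study offers actionable guidance for making AI sustainable while maintaining high accuracy and responsiveness.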

Link: https://arxiv.org/abs/2504.06307
Authors: Tahniat Khan,Soroor Motie,Sedef Akinli Kocak,Shaina Raza
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid adoption of large language models (LLMs) has led to significant energy consumption and carbon emissions, posing a critical challenge to the sustainability of generative AI technologies. This paper explores the integration of energy-efficient optimization techniques in the deployment of LLMs to address these environmental concerns. We present a case study and framework that demonstrate how strategic quantization and local inference techniques can substantially lower the carbon footprints of LLMs without compromising their operational effectiveness. Experimental results reveal that these methods can reduce energy consumption and carbon emissions by up to 45% post quantization, making them particularly suitable for resource-constrained environments. The findings provide actionable insights for achieving sustainability in AI while maintaining high levels of accuracy and responsiveness.
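
The abstract does not say which quantization scheme was used. As a rough illustration of the general idea, the sketch below assumes symmetric per-tensor int8 quantization of a few made-up weights; the function names and values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [x * scale for x in q]


# Hypothetical weights; a float32 value takes 4 bytes, an int8 value takes 1.
weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest level bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing `q` plus a single `scale` instead of the float values is what reduces memory traffic; whether that translates into the energy savings reported above depends on the hardware and runtime.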

[AI-36] Well2Flow: Reconstruction of reservoir states from sparse wells using score-based generative models

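Quick read: This paper addresses the reconstruction of spatially varying permeability and saturation fields in saline aquifers from sparse well observations. It proposes a score-based generative approach: by modeling the joint distribution of permeability and saturation derived from high-fidelity reservoir simulations, a neural network is trained to learn the complex spatiotemporal dynamics of multiphase fluid flow in porous media. The key is a new way of incorporating physical constraints and well log guidance into generative models, so that conditional generation reconstructs both fields with significantly improved accuracy and physical plausibility. The framework also generalizes well across varying geological scenarios, highlighting its potential for practical use in data-scarce reservoir management tasks.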

Link: https://arxiv.org/abs/2504.06305
Authors: Shiqin Zeng,Haoyun Li,Abhinav Prakash Gahlot,Felix J. Herrmann
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 5 figures

Abstract:This study investigates the use of score-based generative models for reservoir simulation, with a focus on reconstructing spatially varying permeability and saturation fields in saline aquifers, inferred from sparse observations at two well locations. By modeling the joint distribution of permeability and saturation derived from high-fidelity reservoir simulations, the proposed neural network is trained to learn the complex spatiotemporal dynamics governing multiphase fluid flow in porous media. During inference, the framework effectively reconstructs both permeability and saturation fields by conditioning on sparse vertical profiles extracted from well log data. This approach introduces a novel methodology for incorporating physical constraints and well log guidance into generative models, significantly enhancing the accuracy and physical plausibility of the reconstructed subsurface states. Furthermore, the framework demonstrates strong generalization capabilities across varying geological scenarios, highlighting its potential for practical deployment in data-scarce reservoir management tasks.

[AI-37] Resurrecting Socrates in the Age of AI: A Study Protocol for Evaluating a Socratic Tutor to Support Research Question Development in Higher Education

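Quick read: This paper asks how generative AI can support, rather than replace, the development of students' research skills in higher education, in particular the academic skill of formulating research questions. Its key contribution is a novel AI-based Socratic Tutor grounded in constructivist learning theory, which uses dialogic pedagogy to guide students through iterative, reflective questioning, aiming to promote System 2 thinking and reduce overreliance on AI-generated outputs. The study uses a quasi-experimental design: the quality of the research questions that students formulate from background texts is assessed through double-blind expert review, and mixed-methods analysis examines skill transfer and student perceptions, with the goal of deriving design principles for human-AI collaboration in education.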

Link: https://arxiv.org/abs/2504.06294
Authors: Ben Degen
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Formulating research questions is a foundational yet challenging academic skill, one that generative AI systems often oversimplify by offering instant answers at the expense of student reflection. This protocol lays out a study grounded in constructivist learning theory to evaluate a novel AI-based Socratic Tutor, designed to foster cognitive engagement and scaffold research question development in higher education. Anchored in dialogic pedagogy, the tutor engages students through iterative, reflective questioning, aiming to promote System 2 thinking and counteract overreliance on AI-generated outputs. In a quasi-experimental design, approximately 80 German pre-service biology teacher students will be randomly assigned to one of two groups: an AI Socratic Tutor condition and an uninstructed chatbot control. Across multiple cycles, students are expected to formulate research questions based on background texts, with quality assessed through double-blind expert review. The study also examines transfer of skills to novel phenomena and captures student perceptions through mixed-methods analysis, including surveys, interviews and reflective journals. This study aims to advance the understanding of how generative AI can be pedagogically aligned to support, not replace, human cognition and offers design principles for human-AI collaboration in education.

[AI-38] Dynamic Evaluation Framework for Personalized and Trustworthy Agents: A Multi-Session Approach to Preference Adaptability

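Quick read: This paper addresses shortcomings in how the trustworthiness of personalized agents' decision-making and action-taking capabilities is evaluated; current evaluation methods fail to capture the dynamic and evolving nature of user interactions. The key contribution is a comprehensive new framework that models user personas with unique attributes and preferences. Agents interact with these simulated users through structured interviews to gather their preferences and offer customized recommendations. The recommendations are then assessed dynamically using simulations driven by large language models (LLMs), enabling an adaptive and iterative evaluation process. The flexible framework is designed to support a variety of agents and applications, providing a comprehensive and versatile evaluation of recommendation strategies with a focus on proactive, personalized, and trustworthy aspects.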

Link: https://arxiv.org/abs/2504.06277
Authors: Chirag Shah,Hideo Joho,Kirandeep Kaur,Preetam Prabhu Srikar Dammu
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in generative AI have significantly increased interest in personalized agents. With increased personalization, there is also a greater need for being able to trust decision-making and action taking capabilities of these agents. However, the evaluation methods for these agents remain outdated and inadequate, often failing to capture the dynamic and evolving nature of user interactions. In this conceptual article, we argue for a paradigm shift in evaluating personalized and adaptive agents. We propose a comprehensive novel framework that models user personas with unique attributes and preferences. In this framework, agents interact with these simulated users through structured interviews to gather their preferences and offer customized recommendations. These recommendations are then assessed dynamically using simulations driven by Large Language Models (LLMs), enabling an adaptive and iterative evaluation process. Our flexible framework is designed to support a variety of agents and applications, ensuring a comprehensive and versatile evaluation of recommendation strategies that focus on proactive, personalized, and trustworthy aspects.

[AI-39] A Cascaded Architecture for Extractive Summarization of Multimedia Content via Audio-to-Text Alignment

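Quick read: This paper addresses the extraction of key insights from multimedia content such as YouTube videos. Its key contribution is a cascaded architecture that performs extractive summarization via audio-to-text alignment. The framework integrates audio-to-text conversion using Microsoft Azure Speech with advanced extractive summarization models (Whisper, Pegasus, and Facebook BART XSum). It uses tools such as Pytube, Pydub, and SpeechRecognition for content retrieval, audio extraction, and transcription, and enhances linguistic analysis with named entity recognition and semantic role labeling. Evaluation with ROUGE and F1 scores shows that the cascaded architecture outperforms conventional summarization methods, despite challenges such as transcription errors.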

Link: https://arxiv.org/abs/2504.06275
Authors: Tanzir Hossain,Ar-Rafi Islam,Md. Sabbir Hossain,Annajiat Alim Rasel
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:This study presents a cascaded architecture for extractive summarization of multimedia content via audio-to-text alignment. The proposed framework addresses the challenge of extracting key insights from multimedia sources like YouTube videos. It integrates audio-to-text conversion using Microsoft Azure Speech with advanced extractive summarization models, including Whisper, Pegasus, and Facebook BART XSum. The system employs tools such as Pytube, Pydub, and SpeechRecognition for content retrieval, audio extraction, and transcription. Linguistic analysis is enhanced through named entity recognition and semantic role labeling. Evaluation using ROUGE and F1 scores demonstrates that the cascaded architecture outperforms conventional summarization methods, despite challenges like transcription errors. Future improvements may include model fine-tuning and real-time processing. This study contributes to multimedia summarization by improving information retrieval, accessibility, and user experience.
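
For readers unfamiliar with the ROUGE metric mentioned above, the sketch below computes ROUGE-1 (unigram overlap) precision, recall, and F1 for one summary pair. It assumes simple whitespace tokenization and lowercasing, with made-up sentences; published evaluations typically rely on a dedicated ROUGE package with stemming and other preprocessing, and the abstract does not say which variant was used.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Return (precision, recall, F1) based on clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reference and system summaries.
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
p, r, f = rouge1(candidate, reference)
```

Five of the six tokens match on each side, so precision, recall, and F1 all come out to 5/6.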

[AI-40] Joint Group Profiling and Recommendation via Deep Neural Network-based Multi-Task Learning

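Quick read: This paper addresses the challenge of generating recommendations that match the collective preferences of a group, which differs significantly from individual recommendation. It proposes Joint Group Profiling and Recommendation, a deep neural network-based multi-task learning framework that unifies group profiling and recommendation in a single model. The key is that joint learning deepens the model's understanding of group dynamics: representations shared between the two tasks help uncover latent features important to both, yielding richer and more informative group embeddings. An attention mechanism dynamically evaluates the relevance of different group features and item attributes so that the model prioritizes the most impactful information. Experiments on real-world datasets show that the multi-task approach consistently outperforms baseline models in accuracy, supporting its effectiveness and robustness.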

Link: https://arxiv.org/abs/2504.06274
Authors: Ngoc Luyen Le,Marie-Hélène Abel
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Group recommender systems aim to generate recommendations that align with the collective preferences of a group, introducing challenges that differ significantly from those in individual recommendation scenarios. This paper presents Joint Group Profiling and Recommendation via Deep Neural Network-based Multi-Task Learning, a framework that unifies group profiling and recommendation tasks within a single model. By jointly learning these tasks, the model develops a deeper understanding of group dynamics, leading to improved recommendation accuracy. The shared representations between the two tasks facilitate the discovery of latent features essential to both, resulting in richer and more informative group embeddings. To further enhance performance, an attention mechanism is integrated to dynamically evaluate the relevance of different group features and item attributes, ensuring the model prioritizes the most impactful information. Experiments and evaluations on real-world datasets demonstrate that our multi-task learning approach consistently outperforms baseline models in terms of accuracy, validating its effectiveness and robustness.

[AI-41] RAVEN: An Agentic Framework for Multimodal Entity Discovery from Large-Scale Video Collections AAAI-2025

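Quick read: This paper addresses multimodal entity discovery and retrieval in large-scale video collections. Its key contribution is RAVEN, an adaptive AI agent framework whose core innovations are: (1) a category understanding step to infer video themes and general-purpose entities; (2) a schema generation mechanism that dynamically defines domain-specific entities and their attributes; and (3) a rich entity extraction process that uses semantic retrieval and schema-guided prompting. RAVEN's design allows different vision-language models (VLMs) and large language models (LLMs) to be integrated flexibly, supporting applications in personalized search, content discovery, and scalable information retrieval.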

Link: https://arxiv.org/abs/2504.06272
Authors: Kevin Dela Rosa
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Presented at AI Agent for Information Retrieval: Generating and Ranking (Agent4IR) @ AAAI 2025

Abstract:We present RAVEN, an adaptive AI agent framework designed for multimodal entity discovery and retrieval in large-scale video collections. Synthesizing information across visual, audio, and textual modalities, RAVEN autonomously processes video data to produce structured, actionable representations for downstream tasks. Key contributions include (1) a category understanding step to infer video themes and general-purpose entities, (2) a schema generation mechanism that dynamically defines domain-specific entities and attributes, and (3) a rich entity extraction process that leverages semantic retrieval and schema-guided prompting. RAVEN is designed to be model-agnostic, allowing the integration of different vision-language models (VLMs) and large language models (LLMs) based on application-specific requirements. This flexibility supports diverse applications in personalized search, content discovery, and scalable information retrieval, enabling practical applications across vast datasets.

[AI-42] Addressing Cold-start Problem in Click-Through Rate Prediction via Supervised Diffusion Modeling

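Quick read: This paper addresses the cold-start problem in click-through rate (CTR) prediction on recommendation and advertising platforms: when user action data is limited or missing, item ID embeddings are poorly learned, which hurts the performance of new items. The paper proposes a novel diffusion model whose key idea is a new diffusion process defined between the ID embedding space and the side information space, used to generate warmed-up embeddings that alleviate cold start. Because the diffusion model is non-Markovian, a sub-sequence of the diffusion steps can be derived to speed up training. The model is supervised by both variational inference and binary cross-entropy objectives, so that it generates useful embeddings in both the cold-start and warm-up phases. Experimental results confirm the effectiveness of the approach.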

Link: https://arxiv.org/abs/2504.06270
Authors: Wenqiao Zhu,Lulu Wang,Jun Wu
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Predicting Click-Through Rates is a crucial function within recommendation and advertising platforms, as the output of CTR prediction determines the order of items shown to users. The Embedding & MLP paradigm has become a standard approach for industrial recommendation systems and has been widely deployed. However, this paradigm suffers from cold-start problems, where there is either no or only limited user action data available, leading to poorly learned ID embeddings. The cold-start problem hampers the performance of new items. To address this problem, we designed a novel diffusion model to generate a warmed-up embedding for new items. Specifically, we define a novel diffusion process between the ID embedding space and the side information space. In addition, we can derive a sub-sequence from the diffusion steps to expedite training, given that our diffusion model is non-Markovian. Our diffusion model is supervised by both the variational inference and binary cross-entropy objectives, enabling it to generate warmed-up embeddings for items in both the cold-start and warm-up phases. Additionally, we have conducted extensive experiments on three recommendation datasets. The results confirmed the effectiveness of our approach.

[AI-43] Leveraging LLMs for User Stories in AI Systems: UStAI Dataset

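Quick read: This paper addresses the challenges of eliciting and analyzing requirements for AI systems, in particular their uncertain nature and heavy reliance on sensitive data. Its key idea is to use large language models (LLMs) to generate user stories for AI systems inspired by the abstracts of scholarly papers, as a new way of producing high-quality requirements artifacts. Using three different LLMs, the authors generated 1260 user stories from 42 abstracts across 26 domains, assessed their quality with the Quality User Story (QUS) framework, and identified relevant non-functional requirements (NFRs) and ethical principles. The results indicate that the investigated LLMs can generate promising user stories based on the needs of various stakeholders, providing a useful aid for the early requirements elicitation phase of AI systems. The paper also releases a dataset of stories generated by various LLMs (UStAI) for further research.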

Link: https://arxiv.org/abs/2504.00513
Authors: Asma Yamani,Malak Baslyman,Moataz Ahmed
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:AI systems are gaining widespread adoption across various sectors and domains. Creating high-quality AI system requirements is crucial for aligning the AI system with business goals and consumer values and for social responsibility. However, with the uncertain nature of AI systems and the heavy reliance on sensitive data, more research is needed to address the elicitation and analysis of AI systems requirements. With the proprietary nature of many AI systems, there is a lack of open-source requirements artifacts and technical requirements documents for AI systems, limiting broader research and investigation. With Large Language Models (LLMs) emerging as a promising alternative to human-generated text, this paper investigates the potential use of LLMs to generate user stories for AI systems based on abstracts from scholarly papers. We conducted an empirical evaluation using three LLMs and generated 1260 user stories from 42 abstracts from 26 domains. We assess their quality using the Quality User Story (QUS) framework. Moreover, we identify relevant non-functional requirements (NFRs) and ethical principles. Our analysis demonstrates that the investigated LLMs can generate user stories inspired by the needs of various stakeholders, offering a promising approach for generating user stories for research purposes and for aiding in the early requirements elicitation phase of AI systems. We have compiled and curated a collection of stories generated by various LLMs into a dataset (UStAI), which is now publicly available for use.

[AI-44] Data-driven Power Loss Identification through Physics-Based Thermal Model Backpropagation

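Quick read: This paper addresses the difficulty of measuring power losses directly in real-world applications. It proposes a novel hybrid framework that combines physics-based thermal modeling with data-driven techniques to identify and correct power losses accurately from temperature measurements. The key is a cascaded architecture in which a neural network learns to correct the outputs of a nominal power loss model by backpropagating through a reduced-order thermal model, together with normalization strategies and physics-guided training loss functions that preserve stability and physical consistency. Experiments show that the method substantially reduces temperature estimation errors (from 7.2±6.8°C to 0.3±0.3°C) and power loss prediction errors (from 5.4±6.6W to 0.2±0.3W), making it particularly suitable for real-time industrial applications where the thermal model is uncertain.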

Link: https://arxiv.org/abs/2504.00133
Authors: Mattia Scarpa,Francesco Pase,Ruggero Carli,Mattia Bruschetta,Franscesco Toso
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (cs.LG)
Comments: Accepted by European Control Conference (ECC) 2020, 8 pages, 7 figures

Abstract:Digital twins for power electronics require accurate power losses whose direct measurements are often impractical or impossible in real-world applications. This paper presents a novel hybrid framework that combines physics-based thermal modeling with data-driven techniques to identify and correct power losses accurately using only temperature measurements. Our approach leverages a cascaded architecture where a neural network learns to correct the outputs of a nominal power loss model by backpropagating through a reduced-order thermal model. We explore two neural architectures, a bootstrapped feedforward network, and a recurrent neural network, demonstrating that the bootstrapped feedforward approach achieves superior performance while maintaining computational efficiency for real-time applications. Between the interconnection, we included normalization strategies and physics-guided training loss functions to preserve stability and ensure physical consistency. Experimental results show that our hybrid model reduces both temperature estimation errors (from 7.2±6.8°C to 0.3±0.3°C) and power loss prediction errors (from 5.4±6.6W to 0.2±0.3W) compared to traditional physics-based approaches, even in the presence of thermal model uncertainties. This methodology allows us to accurately estimate power losses without direct measurements, making it particularly helpful for real-time industrial applications where sensor placement is hindered by cost and physical limitations.

[AI-45] Multi-objective Optimization in CPU Design Space Exploration: Attention is All You Need

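Quick read: This paper addresses the poor efficiency and accuracy of design space exploration (DSE) as the number of micro-architectural parameters in modern CPUs grows sharply. Conventional DSE frameworks suffer from inaccurate models and limited insight into parameter impact in large design spaces, making it hard to identify optimal micro-architectural configurations within tight timeframes.

The key to the solution is a new method called AttentionDSE, whose core idea is to use the attention mechanism to map micro-architectural parameters directly to their contributions to predicted performance. This improves both the prediction accuracy and the interpretability of the performance model. In addition, by dynamically adjusting the attention weights, AttentionDSE can respond to design changes and pinpoint the key micro-architectural parameters or components responsible for performance bottlenecks. Experiments on SPEC 2017 show that, compared with state-of-the-art DSE frameworks, AttentionDSE reduces exploration time by over 80% and improves Pareto hypervolume by 3.9%, while maintaining high prediction accuracy and efficiency.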

Link: https://arxiv.org/abs/2410.18368
Authors: Runzhen Xue,Hao Wu,Mingyu Yan,Ziheng Xiao,Xiaochun Ye,Dongrui Fan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments:

Abstract:Design space exploration (DSE) enables architects to systematically evaluate various design options, guiding decisions on the most suitable configurations to meet specific objectives such as optimizing performance, power, and area. However, the growing complexity of modern CPUs has dramatically increased the number of micro-architectural parameters and expanded the overall design space, making DSE more challenging and time-consuming. Existing DSE frameworks struggle in large-scale design spaces due to inaccurate models and limited insights into parameter impact, hindering efficient identification of optimal micro-architectures within tight timeframes. In this work, we introduce AttentionDSE. Its key idea is to use the attention mechanism to establish a direct mapping of micro-architectural parameters to their contributions to predicted performance. This approach enhances both the prediction accuracy and interpretability of the performance model. Furthermore, the weights are dynamically adjusted, enabling the model to respond to design changes and effectively pinpoint the key micro-architectural parameters/components responsible for performance bottlenecks. Thus, AttentionDSE accurately, purposefully, and rapidly discovers optimal designs. Experiments on SPEC 2017 demonstrate that AttentionDSE significantly reduces exploration time by over 80% and achieves 3.9% improvement in Pareto Hypervolume compared to state-of-the-art DSE frameworks while maintaining superior prediction accuracy and efficiency with an increasing number of parameters.
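
The Pareto hypervolume figure quoted above measures how much of the objective space a set of designs dominates. The sketch below computes it for two objectives that are both minimized (for example, delay and power), relative to a reference point; the design points, the reference point, and the two-objective restriction are assumptions for illustration and do not come from the paper.

```python
def pareto_front(points):
    """Keep points not dominated by any other point (both objectives minimized)."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(points, ref):
    """Area dominated by the front and bounded by the reference point."""
    area, prev_y = 0.0, ref[1]
    # The front is sorted by x ascending, so y is strictly decreasing.
    for x, y in pareto_front(points):
        if x >= ref[0] or y >= prev_y:
            continue  # point lies outside the region bounded by ref
        area += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return area


# Hypothetical (delay, power) pairs for four candidate designs.
designs = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
ref_point = (5.0, 5.0)
hv = hypervolume_2d(designs, ref_point)
```

Here `(3.0, 3.0)` is dominated by `(2.0, 2.0)` and contributes nothing; the remaining three points dominate an area of 11.0. A larger hypervolume means the explored designs push closer to the ideal corner.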
zh

[AI-46] Continuous-Variable Quantum Encoding Techniques: A Comparative Study of Embedding Techniques and Their Impact on Machine Learning Performance

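[Quick Read]: This paper explores the intersection of continuous-variable quantum computing (CVQC) and classical machine learning, focusing on how CVQC data encoding techniques (such as displacement encoding and squeezing encoding), together with Instantaneous Quantum Polynomial (IQP) encoding from discrete quantum computing, affect classical machine learning models. Through extensive empirical analysis, the study evaluates how much these encodings improve the performance of logistic regression, support vector machines, K-nearest neighbors, and ensemble methods such as Random Forest and LightGBM. The key finding is that CVQC-based encodings significantly enhance feature expressivity, improving classification accuracy and F1 scores especially on high-dimensional, complex datasets, at varying computational cost. The study also examines the trade-off between quantum expressibility and classical learnability, offering insights for integrating these quantum encodings into practical applications. The paper's core contribution is therefore to highlight the role of CVQC in quantum data representation and its integration into classical machine learning workflows, laying groundwork for research on quantum-classical hybrid learning.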

Link: https://arxiv.org/abs/2504.06497
Authors: Minati Rath, Hema Date
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study explores the intersection of continuous-variable quantum computing (CVQC) and classical machine learning, focusing on CVQC data encoding techniques, including Displacement encoding and squeezing encoding, alongside Instantaneous Quantum Polynomial (IQP) encoding from discrete quantum computing. We perform an extensive empirical analysis to assess the impact of these encoding methods on classical machine learning models, such as Logistic Regression, Support Vector Machines, K-Nearest Neighbors, and ensemble methods like Random Forest and LightGBM. Our findings indicate that CVQC-based encoding methods significantly enhance feature expressivity, resulting in improved classification accuracy and F1 scores, especially in high-dimensional and complex datasets. However, these improvements come with varying computational costs, which depend on the complexity of the encoding and the architecture of the machine learning models. Additionally, we examine the trade-off between quantum expressibility and classical learnability, offering valuable insights into the practical feasibility of incorporating these quantum encodings into real-world applications. This study contributes to the growing body of research on quantum-classical hybrid learning, emphasizing the role of CVQC in advancing quantum data representation and its integration into classical machine learning workflows.

[AI-47] AI-Assisted Transport of Radioactive Ion Beams

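[Quick Read]: This paper addresses the reliance of radioactive beam extraction and transport on time-consuming manual tuning, in which experts hand-optimize hundreds of parameters. The key to the solution is a system that uses Artificial Intelligence (AI) to assist radioactive beam transport, which the authors apply to real-life scenarios and report as showing advantages over standard tuning methods. The approach aims to improve operational efficiency and enhance scientific output, and it can be extended to other radioactive beam facilities around the world.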

Link: https://arxiv.org/abs/2504.06469
Authors: Sergio Lopez-Caceres, Daniel Santiago-Gonzalez
Affiliation: Unknown
Categories: Accelerator Physics (physics.acc-ph); Artificial Intelligence (cs.AI); Nuclear Experiment (nucl-ex)
Comments: 6 pages, 6 figures

Abstract:Beams of radioactive heavy ions allow researchers to study rare and unstable atomic nuclei, shedding light on the internal structure of exotic nuclei and on how chemical elements are formed in stars. However, the extraction and transport of radioactive beams rely on time-consuming expert-driven tuning methods, where hundreds of parameters are manually optimized. Here, we introduce a system that uses Artificial Intelligence (AI) to assist in the radioactive beam transport process. We apply our methodology to real-life scenarios showing advantages when compared with standard tuning methods. Our method can be extended to other radioactive beam facilities around the world to improve operational efficiency and enhance scientific output.

[AI-48] Evaluating Mutation Techniques in Genetic Algorithm-Based Quantum Circuit Synthesis GECCO2025

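[Quick Read]: This paper addresses quantum circuit optimization, particularly the challenges posed by noisy intermediate-scale quantum (NISQ) devices, which are constrained by limited qubit counts and high error rates. The key to the solution is using Genetic Algorithms (GAs) to automate quantum circuit optimization; by analyzing the impact of different mutation strategies, the paper finds that combining the "delete" and "swap" strategies outperforms the other approaches tested, which informs the design of more efficient GA-based quantum circuit optimizers.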

Link: https://arxiv.org/abs/2504.06413
Authors: Michael Kölle, Tom Bintener, Maximilian Zorn, Gerhard Stenzel, Leo Sünkel, Thomas Gabor, Claudia Linnhoff-Popien
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: Accepted at GECCO 2025

Abstract:Quantum computing leverages the unique properties of qubits and quantum parallelism to solve problems intractable for classical systems, offering unparalleled computational potential. However, the optimization of quantum circuits remains critical, especially for noisy intermediate-scale quantum (NISQ) devices with limited qubits and high error rates. Genetic algorithms (GAs) provide a promising approach for efficient quantum circuit synthesis by automating optimization tasks. This work examines the impact of various mutation strategies within a GA framework for quantum circuit synthesis. By analyzing how different mutations transform circuits, it identifies strategies that enhance efficiency and performance. Experiments utilized a fitness function emphasizing fidelity, while accounting for circuit depth and T operations, to optimize circuits with four to six qubits. Comprehensive hyperparameter testing revealed that combining delete and swap strategies outperformed other approaches, demonstrating their effectiveness in developing robust GA-based quantum circuit optimizers.
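
As a rough illustration of the two mutation operators the abstract singles out, the sketch below applies a delete and a swap mutation to a circuit represented as a list of gates. The gate-tuple representation, the function names, and the reading of "swap" as exchanging the positions of two gates are all assumptions made for illustration; the abstract does not define the operators, so the paper's versions may differ.

```python
import random

# Assumed representation: a circuit is a list of (gate_name, *qubits) tuples.

def delete_mutation(circuit, rng):
    """Return a copy of the circuit with one randomly chosen gate removed."""
    if len(circuit) <= 1:
        return list(circuit)  # keep at least one gate
    i = rng.randrange(len(circuit))
    return circuit[:i] + circuit[i + 1:]

def swap_mutation(circuit, rng):
    """Return a copy with two randomly chosen gate positions exchanged."""
    if len(circuit) < 2:
        return list(circuit)
    i, j = rng.sample(range(len(circuit)), 2)  # two distinct positions
    mutated = list(circuit)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

rng = random.Random(0)
circuit = [("h", 0), ("cx", 0, 1), ("t", 1), ("h", 1)]
shorter = delete_mutation(circuit, rng)    # one gate fewer
reordered = swap_mutation(circuit, rng)    # same gates, different order
```

Both operators return new lists and leave the parent circuit untouched, which is convenient when a GA keeps parents in the population alongside their offspring.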

[AI-49] A Geometric-Aware Perspective and Beyond: Hybrid Quantum-Classical Machine Learning Methods

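[Quick Read]: This paper addresses how to combine Geometric Machine Learning (GML) with Quantum Machine Learning (QML) to improve how machine learning models handle complex data structures. Its key contribution is a unifying perspective that casts QML as a more expressive branch of GML. The core idea is that quantum states, whether pure or mixed, reside on curved manifolds (such as projective Hilbert spaces or density-operator manifolds), much as covariance matrices in classical data lie on the manifold of symmetric positive definite (SPD) matrices or image sets occupy Grassmann manifolds. QML additionally benefits from purely quantum properties, such as entanglement-induced curvature, which can yield richer kernel structures and more nuanced data embeddings. By combining classical manifold-based feature extraction with quantum embeddings, hybrid architectures already show tangible benefits despite near-term hardware limitations. The paper gives a detailed mathematical treatment of the geometric underpinnings of quantum states and emphasizes parallels to classical Riemannian geometry and manifold-based optimization. Finally, it outlines open research challenges and future directions, including quantum large language models, quantum reinforcement learning, and emerging hardware approaches, and argues that combining GML and QML principles can advance the next generation of machine intelligence.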

Link: https://arxiv.org/abs/2504.06328
Authors: Azadeh Alavi, Hossein Akhoundi, Fatemeh Kouchmeshki, Mojtaba Mahmoodian, Sanduni Jayasinghe, Yongli Ren, Abdolrahman Alavi
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 3 figures

Abstract:Geometric Machine Learning (GML) has shown that respecting non-Euclidean geometry in data spaces can significantly improve performance over naive Euclidean assumptions. In parallel, Quantum Machine Learning (QML) has emerged as a promising paradigm that leverages superposition, entanglement, and interference within quantum state manifolds for learning tasks. This paper offers a unifying perspective by casting QML as a specialized yet more expressive branch of GML. We argue that quantum states, whether pure or mixed, reside on curved manifolds (e.g., projective Hilbert spaces or density-operator manifolds), mirroring how covariance matrices inhabit the manifold of symmetric positive definite (SPD) matrices or how image sets occupy Grassmann manifolds. However, QML also benefits from purely quantum properties, such as entanglement-induced curvature, that can yield richer kernel structures and more nuanced data embeddings. We illustrate these ideas with published and newly discussed results, including hybrid classical-quantum pipelines for diabetic foot ulcer classification and structural health monitoring. Despite near-term hardware limitations that constrain purely quantum solutions, hybrid architectures already demonstrate tangible benefits by combining classical manifold-based feature extraction with quantum embeddings. We present a detailed mathematical treatment of the geometrical underpinnings of quantum states, emphasizing parallels to classical Riemannian geometry and manifold-based optimization. Finally, we outline open research challenges and future directions, including Quantum Large Language Models (LLMs), quantum reinforcement learning, and emerging hardware approaches, demonstrating how synergizing GML and QML principles can unlock the next generation of machine intelligence.

[AI-50] Predicting Survivability of Cancer Patients with Metastatic Patterns Using Explainable AI

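[Quick Read]: This paper addresses survival prediction for cancer patients, specifically those with metastatic patterns. The key to the solution is applying machine learning to the genomic and clinical data of 25,775 patients across 27 cancer types to build high-performing predictive models. The study evaluates five models (XGBoost, Naïve Bayes, Decision Tree, Logistic Regression, and Random Forest), optimized with hyperparameter tuning and grid search; XGBoost performs best, with an AUC of 0.82. To improve interpretability, SHapley Additive exPlanations (SHAP) are applied, revealing key predictors such as metastatic site count, tumor mutation burden, fraction of genome altered, and organ-specific metastases. These results identify indicators that significantly affect patient outcomes and offer actionable insights for clinicians planning personalized treatment, which could ultimately improve patient care.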

Link: https://arxiv.org/abs/2504.06306
Authors: Polycarp Nalela, Deepthi Rao, Praveen Rao
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cancer remains a leading global health challenge and a major cause of mortality. This study leverages machine learning (ML) to predict the survivability of cancer patients with metastatic patterns using the comprehensive MSK-MET dataset, which includes genomic and clinical data from 25,775 patients across 27 cancer types. We evaluated five ML models (XGBoost, Naïve Bayes, Decision Tree, Logistic Regression, and Random Forest) using hyperparameter tuning and grid search. XGBoost emerged as the best performer with an area under the curve (AUC) of 0.82. To enhance model interpretability, SHapley Additive exPlanations (SHAP) were applied, revealing key predictors such as metastatic site count, tumor mutation burden, fraction of genome altered, and organ-specific metastases. Further survival analysis using Kaplan-Meier curves, Cox Proportional Hazards models, and XGBoost Survival Analysis identified significant predictors of patient outcomes, offering actionable insights for clinicians. These findings could aid in personalized prognosis and treatment planning, ultimately improving patient care.

Machine Learning

[LG-0] Neural Motion Simulator: Pushing the Limit of World Models in Reinforcement Learning CVPR2025

Link: https://arxiv.org/abs/2504.07095
Authors: Chenjie Hao, Weyl Lu, Yifan Xu, Yubei Chen
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 8 pages (main), 2-page appendix, 8 figures, accepted by CVPR 2025

Abstract:An embodied system must not only model the patterns of the external world but also understand its own motion dynamics. A motion dynamic model is essential for efficient skill acquisition and effective planning. In this work, we introduce the neural motion simulator (MoSim), a world model that predicts the future physical state of an embodied system based on current observations and actions. MoSim achieves state-of-the-art performance in physical state prediction and provides competitive performance across a range of downstream tasks. This work shows that when a world model is accurate enough and performs precise long-horizon predictions, it can facilitate efficient skill acquisition in imagined worlds and even enable zero-shot reinforcement learning. Furthermore, MoSim can transform any model-free reinforcement learning (RL) algorithm into a model-based approach, effectively decoupling physical environment modeling from RL algorithm development. This separation allows for independent advancements in RL algorithms and world modeling, significantly improving sample efficiency and enhancing generalization capabilities. Our findings highlight that world models for motion dynamics are a promising direction for developing more versatile and capable embodied systems.

[LG-1] Identifying Unknown Stochastic Dynamics via Finite expression methods

Link: https://arxiv.org/abs/2504.07085
Authors: Senwei Liang, Chunmei Wang, Xingjian Xu
Categories: Machine Learning (cs.LG)
Comments: 27 pages, 20 figures

Abstract:Modeling stochastic differential equations (SDEs) is crucial for understanding complex dynamical systems in various scientific fields. Recent methods often employ neural network-based models, which typically represent SDEs through a combination of deterministic and stochastic terms. However, these models usually lack interpretability and have difficulty generalizing beyond their training domain. This paper introduces the Finite Expression Method (FEX), a symbolic learning approach designed to derive interpretable mathematical representations of the deterministic component of SDEs. For the stochastic component, we integrate FEX with advanced generative modeling techniques to provide a comprehensive representation of SDEs. The numerical experiments on linear, nonlinear, and multidimensional SDEs demonstrate that FEX generalizes well beyond the training domain and delivers more accurate long-term predictions compared to neural network-based methods. The symbolic expressions identified by FEX not only improve prediction accuracy but also offer valuable scientific insights into the underlying dynamics of the systems, paving the way for new scientific discoveries.

[LG-2] To Backtrack or Not to Backtrack: When Sequential Search Limits Model Reasoning

Link: https://arxiv.org/abs/2504.07052
Authors: Tian Qin, David Alvarez-Melis, Samy Jelassi, Eran Malach
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements in large language models have significantly improved their reasoning abilities, particularly through techniques involving search and backtracking. Backtracking naturally scales test-time compute by enabling sequential, linearized exploration via long chain-of-thought (CoT) generation. However, this is not the only strategy for scaling test-time compute: parallel sampling with best-of-n selection provides an alternative that generates diverse solutions simultaneously. Despite the growing adoption of sequential search, its advantages over parallel sampling, especially under a fixed compute budget, remain poorly understood. In this paper, we systematically compare these two approaches on two challenging reasoning tasks: CountDown and Sudoku. Surprisingly, we find that sequential search underperforms parallel sampling on CountDown but outperforms it on Sudoku, suggesting that backtracking is not universally beneficial. We identify two factors that can cause backtracking to degrade performance: (1) training on fixed search traces can lock models into suboptimal strategies, and (2) explicit CoT supervision can discourage “implicit” (non-verbalized) reasoning. Extending our analysis to reinforcement learning (RL), we show that models with backtracking capabilities benefit significantly from RL fine-tuning, while models without backtracking see limited, mixed gains. Together, these findings challenge the assumption that backtracking universally enhances LLM reasoning, instead revealing a complex interaction between task structure, training data, model scale, and learning paradigm.
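
To make the parallel-sampling baseline concrete, here is a minimal sketch of best-of-n selection on a toy CountDown-style task. The random expression sampler stands in for a language model and the distance-to-target score stands in for a verifier; both, along with all function names, are illustrative assumptions rather than the paper's setup.

```python
import random

def sample_expression(rng, numbers):
    """Toy stand-in for one model sample: shuffle the numbers, pick random
    operators, and evaluate strictly left to right."""
    nums = list(numbers)
    rng.shuffle(nums)
    ops = [rng.choice("+-*") for _ in nums[1:]]
    value = nums[0]
    for op, x in zip(ops, nums[1:]):
        if op == "+":
            value += x
        elif op == "-":
            value -= x
        else:
            value *= x
    return value, nums, ops

def best_of_n(n, numbers, target, seed=0):
    """Draw n candidates that do not depend on one another and keep the one
    whose value is closest to the target."""
    rng = random.Random(seed)
    candidates = [sample_expression(rng, numbers) for _ in range(n)]
    return min(candidates, key=lambda c: abs(c[0] - target))

value, nums, ops = best_of_n(n=64, numbers=(3, 5, 7), target=22)
```

Because no candidate conditions on another, the n samples could be generated in parallel; sequential search with backtracking instead spends the same budget on one long trace that revises earlier choices.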

[LG-3] Identifying Key Challenges of Hardness-Based Resampling

Link: https://arxiv.org/abs/2504.07031
Authors: Pawel Pukowski, Venet Osmani
Categories: Machine Learning (cs.LG)
Comments: Submitted to IEEE TPAMI

Abstract:Performance gap across classes remains a persistent challenge in machine learning, often attributed to variations in class hardness. One way to quantify class hardness is through sample complexity - the minimum number of samples required to effectively learn a given class. Sample complexity theory suggests that class hardness is driven by differences in the amount of data required for generalization. That is, harder classes need substantially more samples to achieve generalization. Therefore, hardness-based resampling is a promising approach to mitigate these performance disparities. While resampling has been studied extensively in data-imbalanced settings, its impact on balanced datasets remains unexplored. This raises the fundamental question whether resampling is effective because it addresses data imbalance or hardness imbalance. We begin addressing this question by introducing class imbalance into balanced datasets and evaluate its effect on performance disparities. We oversample hard classes and undersample easy classes to bring hard classes closer to their sample complexity requirements while maintaining a constant dataset size for fairness. We estimate class-level hardness using the Area Under the Margin (AUM) hardness estimator and leverage it to compute resampling ratios. Using these ratios, we perform hardness-based resampling on the well-known CIFAR-10 and CIFAR-100 datasets. Contrary to theoretical expectations, our results show that hardness-based resampling does not meaningfully affect class-wise performance disparities. To explain this discrepancy, we conduct detailed analyses to identify key challenges unique to hardness-based imbalance, distinguishing it from traditional data-based imbalance. Our insights help explain why theoretical sample complexity expectations fail to translate into practical performance gains and we provide guidelines for future research. 
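
The abstract does not give the formula that turns AUM hardness estimates into resampling ratios, so the following sketch assumes the simplest option: per-class target counts proportional to a positive hardness score (higher meaning harder), with the total dataset size held fixed. The function names and the proportional rule are assumptions for illustration only.

```python
import random

def resampling_counts(hardness, samples_per_class):
    """Per-class target counts proportional to hardness, total size unchanged."""
    if any(h <= 0 for h in hardness):
        raise ValueError("hardness scores must be positive")
    total = samples_per_class * len(hardness)
    scale = sum(hardness)
    counts = [int(total * h / scale) for h in hardness]
    # Hand out samples lost to rounding, hardest classes first.
    order = sorted(range(len(hardness)), key=lambda c: -hardness[c])
    for c in order[: max(0, total - sum(counts))]:
        counts[c] += 1
    return counts

def resample_indices(labels, counts, seed=0):
    """Undersample easy classes without replacement; oversample hard classes
    by keeping every sample and duplicating random ones.
    Assumes every class has at least one sample."""
    rng = random.Random(seed)
    picked = []
    for c, n in enumerate(counts):
        pool = [i for i, y in enumerate(labels) if y == c]
        if n <= len(pool):
            picked += rng.sample(pool, n)
        else:
            picked += pool + rng.choices(pool, k=n - len(pool))
    return picked

labels = [0] * 100 + [1] * 100 + [2] * 100
counts = resampling_counts([1.0, 1.0, 2.0], samples_per_class=100)
indices = resample_indices(labels, counts)
```

With these inputs the hardest class doubles its share while the two easier classes shrink, and the dataset still holds 300 samples, which mirrors the constant-size constraint described in the abstract.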

[LG-4] Using ML filters to help automated vulnerability repairs: when it helps and when it doesn't

Link: https://arxiv.org/abs/2504.07027
Authors: Maria Camporese, Fabio Massacci
Categories: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:[Context:] The acceptance of candidate patches in automated program repair has typically been based on testing oracles. Testing typically requires a costly process of building the application, while ML models can be used to quickly classify patches, thus allowing more candidate patches to be generated in a positive feedback loop. [Problem:] If the model predictions are unreliable (as in vulnerability detection), they can hardly replace the more reliable oracles based on testing. [New Idea:] We propose to use an ML model as a preliminary filter of candidate patches which is put in front of a traditional filter based on testing. [Preliminary Results:] We identify some theoretical bounds on the precision and recall of the ML algorithm that make such an operation meaningful in practice. With these bounds and the results published in the literature, we calculate how fast some state-of-the-art vulnerability detectors must be in order to be more effective than a traditional AVR pipeline such as APR4Vuln based just on testing.

[LG-5] Adapting GT2-FLS for Uncertainty Quantification: A Blueprint Calibration Strategy

Link: https://arxiv.org/abs/2504.07017
Authors: Yusuf Guven, Tufan Kumbasar
Categories: Machine Learning (cs.LG)
Comments: in IEEE International Conference on Fuzzy Systems, 2025

Abstract:Uncertainty Quantification (UQ) is crucial for deploying reliable Deep Learning (DL) models in high-stakes applications. Recently, General Type-2 Fuzzy Logic Systems (GT2-FLSs) have been proven to be effective for UQ, offering Prediction Intervals (PIs) to capture uncertainty. However, existing methods often struggle with computational efficiency and adaptability, as generating PIs for new coverage levels ($\phi_d$) typically requires retraining the model. Moreover, methods that directly estimate the entire conditional distribution for UQ are computationally expensive, limiting their scalability in real-world scenarios. This study addresses these challenges by proposing a blueprint calibration strategy for GT2-FLSs, enabling efficient adaptation to any desired $\phi_d$ without retraining. By exploring the relationship between $\alpha$-plane type reduced sets and uncertainty coverage, we develop two calibration methods: a lookup table-based approach and a derivative-free optimization algorithm. These methods allow GT2-FLSs to produce accurate and reliable PIs while significantly reducing computational overhead. Experimental results on high-dimensional datasets demonstrate that the calibrated GT2-FLS achieves superior performance in UQ, highlighting its potential for scalable and practical applications.

[LG-6] FAME: Introducing Fuzzy Additive Models for Explainable AI

Link: https://arxiv.org/abs/2504.07011
Authors: Omer Bahadir Gokmen, Yusuf Guven, Tufan Kumbasar
Categories: Machine Learning (cs.LG)
Comments: in the IEEE International Conference on Fuzzy Systems, 2025

Abstract:In this study, we introduce the Fuzzy Additive Model (FAM) and FAM with Explainability (FAME) as a solution for Explainable Artificial Intelligence (XAI). The family consists of three layers: (1) a Projection Layer that compresses the input space, (2) a Fuzzy Layer built upon Single Input-Single Output Fuzzy Logic Systems (SFLS), where SFLS functions as subnetworks within an additive index model, and (3) an Aggregation Layer. This architecture integrates the interpretability of SFLS, which uses human-understandable if-then rules, with the explainability of input-output relationships, leveraging the additive model structure. Furthermore, using SFLS inherently addresses issues such as the curse of dimensionality and rule explosion. To further improve interpretability, we propose a method for sculpting antecedent space within FAM, transforming it into FAME. We show that FAME captures the input-output relationships with fewer active rules, thus improving clarity. To learn the FAM family, we present a deep learning framework. Through the presented comparative results, we demonstrate the promising potential of FAME in reducing model complexity while retaining interpretability, positioning it as a valuable tool for XAI.

[LG-7] Neural Signal Compression using RAMAN tinyML Accelerator for BCI Applications

Link: https://arxiv.org/abs/2504.06996
Authors: Adithya Krishna, Sohan Debnath, André van Schaik, Mahesh Mehendale, Chetan Singh Thakur
Categories: Hardware Architecture (cs.AR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:High-quality, multi-channel neural recording is indispensable for neuroscience research and clinical applications. Large-scale brain recordings often produce vast amounts of data that must be wirelessly transmitted for subsequent offline analysis and decoding, especially in brain-computer interfaces (BCIs) utilizing high-density intracortical recordings with hundreds or thousands of electrodes. However, transmitting raw neural data presents significant challenges due to limited communication bandwidth and resultant excessive heating. To address this challenge, we propose a neural signal compression scheme utilizing Convolutional Autoencoders (CAEs), which achieves a compression ratio of up to 150 for compressing local field potentials (LFPs). The CAE encoder section is implemented on RAMAN, an energy-efficient tinyML accelerator designed for edge computing, and subsequently deployed on an Efinix Ti60 FPGA with 37.3k LUTs and 8.6k register utilization. RAMAN leverages sparsity in activation and weights through zero skipping, gating, and weight compression techniques. Additionally, we employ hardware-software co-optimization by pruning CAE encoder model parameters using a hardware-aware balanced stochastic pruning strategy, resolving workload imbalance issues and eliminating indexing overhead to reduce parameter storage requirements by up to 32.4%. Using the proposed compact depthwise separable convolutional autoencoder (DS-CAE) model, the compressed neural data from RAMAN is reconstructed offline with superior signal-to-noise and distortion ratios (SNDR) of 22.6 dB and 27.4 dB, along with $R^2$ scores of 0.81 and 0.94, respectively, evaluated on two monkey neural recordings.

[LG-8] Dissimilar Batch Decompositions of Random Datasets

Link: https://arxiv.org/abs/2504.06991
Authors: Ghurumuruhan Ganesan
Categories: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Comments: Accepted for publication in Sankhya A

Abstract:For better learning, large datasets are often split into small batches and fed sequentially to the predictive model. In this paper, we study such batch decompositions from a probabilistic perspective. We assume that data points (possibly corrupted) are drawn independently from a given space and define a concept of similarity between two data points. We then consider decompositions that restrict the amount of similarity within each batch and obtain high probability bounds for the minimum size. We demonstrate an inherent tradeoff between relaxing the similarity constraint and the overall size and also use martingale methods to obtain bounds for the maximum size of data subsets with a given similarity.

[LG-9] Free Random Projection for In-Context Reinforcement Learning

Link: https://arxiv.org/abs/2504.06983
Authors: Tomohiro Hayase, Benoît Collins, Nakamasa Inoue
Categories: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Comments: 25 pages

Abstract:Hierarchical inductive biases are hypothesized to promote generalizable policies in reinforcement learning, as demonstrated by explicit hyperbolic latent representations and architectures. Therefore, a more flexible approach is to have these biases emerge naturally from the algorithm. We introduce Free Random Projection, an input mapping grounded in free probability theory that constructs random orthogonal matrices where hierarchical structure arises inherently. The free random projection integrates seamlessly into existing in-context reinforcement learning frameworks by encoding hierarchical organization within the input space without requiring explicit architectural modifications. Empirical results on multi-environment benchmarks show that free random projection consistently outperforms the standard random projection, leading to improvements in generalization. Furthermore, analyses within linearly solvable Markov decision processes and investigations of the spectrum of kernel random matrices reveal the theoretical underpinnings of free random projection’s enhanced performance, highlighting its capacity for effective adaptation in hierarchically structured state spaces.

[LG-10] ASRL: A robust loss function with potential for development

Link: https://arxiv.org/abs/2504.06935
Authors: Chenyu Hui, Anran Zhang, Xintong Li
Categories: Machine Learning (cs.LG)
Comments: five pages and three figures

Abstract:In this article, we propose a partition-wise robust loss function based on a previous robust loss function. This loss function achieves high robustness and a wide range of applicability through partition-wise design and adaptive parameter adjustment. Finally, the advantages and development potential of this loss function are verified by applying it to regression problems and comparing it with other loss functions on five different datasets (with different dimensions, different sample numbers, and different fields). The results of multiple experiments support the advantages of our loss function.

[LG-11] RO-FIGS: Efficient and Expressive Tree-Based Ensembles for Tabular Data

Link: https://arxiv.org/abs/2504.06927
Authors: Urška Matjašec, Nikola Simidjievski, Mateja Jamnik
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Tree-based models are often robust to uninformative features and can accurately capture non-smooth, complex decision boundaries. Consequently, they often outperform neural network-based models on tabular datasets at a significantly lower computational cost. Nevertheless, the capability of traditional tree-based ensembles to express complex relationships efficiently is limited by using a single feature to make splits. To improve the efficiency and expressiveness of tree-based methods, we propose Random Oblique Fast Interpretable Greedy-Tree Sums (RO-FIGS). RO-FIGS builds on Fast Interpretable Greedy-Tree Sums, and extends it by learning trees with oblique or multivariate splits, where each split consists of a linear combination learnt from random subsets of features. This helps uncover interactions between features and improves performance. The proposed method is suitable for tabular datasets with both numerical and categorical features. We evaluate RO-FIGS on 22 real-world tabular datasets, demonstrating superior performance and much smaller models over other tree- and neural network-based methods. Additionally, we analyse their splits to reveal valuable insights into feature interactions, enriching the information learnt from SHAP summary plots, and thereby demonstrating the enhanced interpretability of RO-FIGS models. The proposed method is well-suited for applications, where balance between accuracy and interpretability is essential.

[LG-12] The Importance of Being Discrete: Measuring the Impact of Discretization in End-to-End Differentially Private Synthetic Data

Link: https://arxiv.org/abs/2504.06923
Authors: Georgi Ganev, Meenatchi Sundaram Muthu Selva Annamalai, Sofiane Mahiou, Emiliano De Cristofaro
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Differentially Private (DP) generative marginal models are often used in the wild to release synthetic tabular datasets in lieu of sensitive data while providing formal privacy guarantees. These models approximate low-dimensional marginals or query workloads; crucially, they require the training data to be pre-discretized, i.e., continuous values need to first be partitioned into bins. However, as the range of values (or their domain) is often inferred directly from the training data, with the number of bins and bin edges typically defined arbitrarily, this approach can ultimately break end-to-end DP guarantees and may not always yield optimal utility. In this paper, we present an extensive measurement study of four discretization strategies in the context of DP marginal generative models. More precisely, we design DP versions of three discretizers (uniform, quantile, and k-means) and reimplement the PrivTree algorithm. We find that optimizing both the choice of discretizer and bin count can improve utility, on average, by almost 30% across six DP marginal models, compared to the default strategy and number of bins, with PrivTree being the best-performing discretizer in the majority of cases. We demonstrate that, while DP generative models with non-private discretization remain vulnerable to membership inference attacks, applying DP during discretization effectively mitigates this risk. Finally, we propose an optimized approach for automatically selecting the optimal number of bins, achieving high utility while reducing both privacy budget consumption and computational overhead.

[LG-13] Regret Bounds for Robust Online Decision Making

Link: https://arxiv.org/abs/2504.06820
Authors: Alexander Appel, Vanessa Kosoy
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We propose a framework which generalizes “decision making with structured observations” by allowing robust (i.e. multivalued) models. In this framework, each model associates each decision with a convex set of probability distributions over outcomes. Nature can choose distributions out of this set in an arbitrary (adversarial) manner, that can be nonoblivious and depend on past history. The resulting framework offers much greater generality than classical bandits and reinforcement learning, since the realizability assumption becomes much weaker and more realistic. We then derive a theory of regret bounds for this framework. Although our lower and upper bounds are not tight, they are sufficient to fully characterize power-law learnability. We demonstrate this theory in two special cases: robust linear bandits and tabular robust online reinforcement learning. In both cases, we derive regret bounds that improve state-of-the-art (except that we do not address computational efficiency).

[LG-14] Deep Neural Koopman Operator-based Economic Model Predictive Control of Shipboard Carbon Capture System

Link: https://arxiv.org/abs/2504.06818
Authors: Minghao Han, Xunyuan Yin
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:Shipboard carbon capture is a promising solution to help reduce carbon emissions in international shipping. In this work, we propose a data-driven dynamic modeling and economic predictive control approach within the Koopman framework. This integrated modeling and control approach is used to achieve safe and energy-efficient process operation of shipboard post-combustion carbon capture plants. Specifically, we propose a deep neural Koopman operator modeling approach, based on which a Koopman model with time-varying model parameters is established. This Koopman model predicts the overall economic operational cost and key system outputs, based on accessible partial state measurements. By leveraging this learned model, a constrained economic predictive control scheme is developed. Despite time-varying parameters involved in the formulated model, the formulated optimization problem associated with the economic predictive control design is convex, and it can be solved efficiently during online control implementations. Extensive tests are conducted on a high-fidelity simulation environment for shipboard post-combustion carbon capture processes. Four ship operational conditions are taken into account. The results show that the proposed method significantly improves the overall economic operational performance and carbon capture rate. Additionally, the proposed method guarantees safe operation by ensuring that hard constraints on the system outputs are satisfied.

[LG-15] Robust Classification with Noisy Labels Based on Posterior Maximization

Link: https://arxiv.org/abs/2504.06805
Authors: Nicola Novello, Andrea M. Tonello
Categories: Machine Learning (cs.LG)

Abstract:Designing objective functions robust to label noise is crucial for real-world classification algorithms. In this paper, we investigate the robustness to label noise of an f-divergence-based class of objective functions recently proposed for supervised classification, herein referred to as f-PML. We show that, in the presence of label noise, any of the f-PML objective functions can be corrected to obtain a neural network that is equal to the one learned with the clean dataset. Additionally, we propose an alternative and novel correction approach that, during the test phase, refines the posterior estimated by the neural network trained in the presence of label noise. Then, we demonstrate that, even if the considered f-PML objective functions are not symmetric, they are robust to symmetric label noise for any choice of f-divergence, without the need for any correction approach. This allows us to prove that the cross-entropy, which belongs to the f-PML class, is robust to symmetric label noise. Finally, we show that such a class of objective functions can be used together with refined training strategies, achieving competitive performance against state-of-the-art techniques of classification with label noise.
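For readers unfamiliar with the term, symmetric label noise can be sketched as follows. This assumes the usual definition: each label is kept with probability 1 - eta and otherwise replaced by one of the other classes chosen uniformly. The function names and the noise rate are illustrative and not taken from the paper.

```python
import random


def symmetric_noise_matrix(num_classes, noise_rate):
    """Transition matrix T[i][j] = P(noisy label = j | clean label = i)."""
    off_diagonal = noise_rate / (num_classes - 1)
    return [
        [1.0 - noise_rate if i == j else off_diagonal for j in range(num_classes)]
        for i in range(num_classes)
    ]


def corrupt_labels(labels, num_classes, noise_rate, seed=0):
    """Flip each label to a uniformly chosen different class with probability noise_rate."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            others = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(others))
        else:
            noisy.append(y)
    return noisy


T = symmetric_noise_matrix(4, 0.2)
clean = [0, 1, 2, 3] * 5
noisy = corrupt_labels(clean, num_classes=4, noise_rate=0.2)
```

Every row of `T` has the same diagonal entry and the same off-diagonal entry, which is what makes the noise symmetric.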

[LG-16] Beware of “Explanations” of AI

Link: https://arxiv.org/abs/2504.06791
Authors: David Martens, Galit Shmueli, Theodoros Evgeniou, Kevin Bauer, Christian Janiesch, Stefan Feuerriegel, Sebastian Gabel, Sofie Goethals, Travis Greene, Nadja Klein, Mathias Kraus, Niklas Kühl, Claudia Perlich, Wouter Verbeke, Alona Zharova, Patrick Zschech, Foster Provost
Categories: Machine Learning (cs.LG)
Comments: This work was inspired by Dagstuhl Seminar 24342

Abstract:Understanding the decisions made and actions taken by increasingly complex AI systems remains a key challenge. This has led to an expanding field of research in explainable artificial intelligence (XAI), highlighting the potential of explanations to enhance trust, support adoption, and meet regulatory standards. However, the question of what constitutes a “good” explanation is dependent on the goals, stakeholders, and context. At a high level, psychological insights such as the concept of mental model alignment can offer guidance, but success in practice is challenging due to social and technical factors. As a result of this ill-defined nature of the problem, explanations can be of poor quality (e.g. unfaithful, irrelevant, or incoherent), potentially leading to substantial risks. Instead of fostering trust and safety, poorly designed explanations can actually cause harm, including wrong decisions, privacy violations, manipulation, and even reduced AI adoption. Therefore, we caution stakeholders to beware of explanations of AI: while they can be vital, they are not automatically a remedy for transparency or responsible AI adoption, and their misuse or limitations can exacerbate harm. Attention to these caveats can help guide future research to improve the quality and impact of AI explanations.

[LG-17] FedMerge: Federated Personalization via Model Merging

Link: https://arxiv.org/abs/2504.06768
Authors: Shutong Chen, Tianyi Zhou, Guodong Long, Jing Jiang, Chengqi Zhang
Categories: Machine Learning (cs.LG)

Abstract:One global model in federated learning (FL) might not be sufficient to serve many clients with non-IID tasks and distributions. While there have been advances in FL to train multiple global models for better personalization, they only provide limited choices to clients, so local finetuning is still indispensable. In this paper, we propose a novel “FedMerge” approach that can create a personalized model per client by simply merging multiple global models with automatically optimized and customized weights. In FedMerge, a few global models can serve many non-IID clients, even without further local finetuning. We formulate this problem as a joint optimization of global models and the merging weights for each client. Unlike existing FL approaches where the server broadcasts one or multiple global models to all clients, the server only needs to send a customized, merged model to each client. Moreover, instead of periodically interrupting the local training and re-initializing it to a global model, the merged model aligns better with each client’s task and data distribution, smoothing the local-global gap between consecutive rounds caused by client drift. We evaluate FedMerge on three different non-IID settings applied to different domains with diverse tasks and data types, in which FedMerge consistently outperforms existing FL approaches, including clustering-based and mixture-of-experts (MoE) based methods.
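The merging step described above can be sketched as a per-client weighted average of the global models' parameters. The softmax parameterisation of the merging weights, the dictionary layout of the models, and all names below are assumptions made for illustration; the abstract only states that the weights are optimized and customized per client.

```python
import math


def softmax(scores):
    """Turn unconstrained per-client scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_models(global_models, client_scores):
    """Build one client's model as a weighted average of the global models."""
    weights = softmax(client_scores)
    merged = {}
    for name in global_models[0]:
        size = len(global_models[0][name])
        merged[name] = [
            sum(w * model[name][i] for w, model in zip(weights, global_models))
            for i in range(size)
        ]
    return merged, weights


# Two hypothetical global models, each with a single flattened parameter tensor.
global_models = [{"w": [0.0, 2.0]}, {"w": [4.0, 6.0]}]
merged, weights = merge_models(global_models, client_scores=[0.0, 0.0])
print(weights, merged)
```

In a full system the scores would be trained jointly with the global models; here they are fixed, so equal scores reduce to a plain average.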

[LG-18] Efficient Deployment of Spiking Neural Networks on SpiNNaker2 for DVS Gesture Recognition Using Neuromorphic Intermediate Representation

Link: https://arxiv.org/abs/2504.06748
Authors: Sirine Arfa, Bernhard Vogginger, Chen Liu, Johannes Partzsch, Mark Schone, Christian Mayr
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, 8 tables, Conference-2025 Neuro Inspired Computational Elements (NICE)

Abstract:Spiking Neural Networks (SNNs) are highly energy-efficient during inference, making them particularly suitable for deployment on neuromorphic hardware. Their ability to process event-driven inputs, such as data from dynamic vision sensors (DVS), further enhances their applicability to edge computing tasks. However, the resource constraints of edge hardware necessitate techniques like weight quantization, which reduce the memory footprint of SNNs while preserving accuracy. Despite its importance, existing quantization methods typically focus on synaptic weights quantization without taking account of other critical parameters, such as scaling neuron firing thresholds. To address this limitation, we present the first benchmark for the DVS gesture recognition task using SNNs optimized for the many-core neuromorphic chip SpiNNaker2. Our study evaluates two quantization pipelines for fixed-point computations. The first approach employs post training quantization (PTQ) with percentile-based threshold scaling, while the second uses quantization aware training (QAT) with adaptive threshold scaling. Both methods achieve accurate 8-bit on-chip inference, closely approximating 32-bit floating-point performance. Additionally, our baseline SNNs perform competitively against previously reported results without specialized techniques. These models are deployed on SpiNNaker2 using the neuromorphic intermediate representation (NIR). Ultimately, we achieve 94.13% classification accuracy on-chip, demonstrating the SpiNNaker2’s potential for efficient, low-energy neuromorphic computing. 
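As a rough illustration of post-training quantization with a percentile-derived scale, the sketch below quantizes one layer's weights to signed 8-bit integers and applies the same scale to the neuron firing threshold. This is an assumption about how threshold scaling might be wired up, not the paper's pipeline; the nearest-rank percentile, the function names, and the sample values are hypothetical.

```python
import math


def percentile(sorted_values, q):
    """Nearest-rank percentile (q in [0, 100]) of an already sorted list."""
    rank = max(1, math.ceil(q / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def quantize_layer(weights, threshold, q=99.0, bits=8):
    """Quantize weights to signed integers and rescale the firing threshold to match."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    reference = percentile(sorted(abs(w) for w in weights), q)
    if reference == 0:
        raise ValueError("all weights are zero; no scale can be derived")
    scale = qmax / reference
    # Values beyond the chosen percentile saturate at +/- qmax.
    q_weights = [max(-qmax, min(qmax, round(w * scale))) for w in weights]
    # Using the same scale keeps the weight-to-threshold ratio roughly unchanged.
    q_threshold = round(threshold * scale)
    return q_weights, q_threshold, scale


q_weights, q_threshold, scale = quantize_layer([0.5, -0.25, 0.1, 1.0], threshold=1.0)
print(q_weights, q_threshold, scale)
```

Lowering `q` trades clipping of large weights for finer resolution of the small ones.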

[LG-19] PETNet – Coincident Particle Event Detection using Spiking Neural Networks

Link: https://arxiv.org/abs/2504.06730
Authors: Jan Debus, Charlotte Debus, Günther Dissertori, Markus Götz
Categories: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)

Abstract:Spiking neural networks (SNN) hold the promise of being a more biologically plausible, low-energy alternative to conventional artificial neural networks. Their time-variant nature makes them particularly suitable for processing time-resolved, sparse binary data. In this paper, we investigate the potential of leveraging SNNs for the detection of photon coincidences in positron emission tomography (PET) data. PET is a medical imaging technique based on injecting a patient with a radioactive tracer and detecting the emitted photons. One central post-processing task for inferring an image of the tracer distribution is the filtering of invalid hits occurring due to e.g. absorption or scattering processes. Our approach, coined PETNet, interprets the detector hits as a binary-valued spike train and learns to identify photon coincidence pairs in a supervised manner. We introduce a dedicated multi-objective loss function and demonstrate the effects of explicitly modeling the detector geometry on simulation data for two use-cases. Our results show that PETNet can outperform the state-of-the-art classical algorithm with a maximal coincidence detection F_1 of 95.2%. At the same time, PETNet is able to predict photon coincidences up to 36 times faster than the classical approach, highlighting the great potential of SNNs in particle physics applications.

[LG-20] Plastic tensor networks for interpretable generative modeling

Link: https://arxiv.org/abs/2504.06722
Authors: Katsuya O. Akamatsu, Kenji Harada, Tsuyoshi Okubo, Naoki Kawashima
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech)
Comments: 37 pages, 16 figures

Abstract:A structural optimization scheme for a single-layer nonnegative adaptive tensor tree (NATT) that models a target probability distribution is proposed. The NATT scheme, by construction, has the advantage that it is interpretable as a probabilistic graphical model. We consider the NATT scheme and a recently proposed Born machine adaptive tensor tree (BMATT) optimization scheme and demonstrate their effectiveness on a variety of generative modeling tasks where the objective is to infer the hidden structure of a provided dataset. Our results show that in terms of minimizing the negative log-likelihood, the single-layer scheme has model performance comparable to the Born machine scheme, though not better. The tasks include deducing the structure of binary bitwise operations, learning the internal structure of random Bayesian networks given only visible sites, and a real-world example related to hierarchical clustering where a cladogram is constructed from mitochondrial DNA sequences. In doing so, we also show the importance of the choice of network topology and the versatility of a least-mutual information criterion in selecting a candidate structure for a tensor tree, as well as discuss aspects of these tensor tree generative models including their information content and interpretability.

[LG-21] Clustering and novel class recognition: evaluating bioacoustic deep learning feature extractors

Link: https://arxiv.org/abs/2504.06710
Authors: Vincent S. Kather, Burooj Ghani, Dan Stowell
Categories: Machine Learning (cs.LG)
Comments: conference

Abstract:In computational bioacoustics, deep learning models are composed of feature extractors and classifiers. The feature extractors generate vector representations of the input sound segments, called embeddings, which can be input to a classifier. While benchmarking of classification scores provides insights into specific performance statistics, it is limited to species that are included in the models’ training data. Furthermore, it makes it impossible to compare models trained on very different taxonomic groups. This paper aims to address this gap by analyzing the embeddings generated by the feature extractors of 15 bioacoustic models spanning a wide range of setups (model architectures, training data, training paradigms). We evaluated and compared different ways in which models structure embedding spaces through clustering and kNN classification, which allows us to focus our comparison on feature extractors independent of their classifiers. We believe that this approach lets us evaluate the adaptability and generalization potential of models going beyond the classes they were trained on.
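Evaluating a feature extractor through kNN classification of its embeddings, independent of any trained classifier head, can be sketched as a leave-one-out loop. The Euclidean metric, the majority vote, and the toy embeddings are illustrative assumptions and not the paper's protocol.

```python
import math
from collections import Counter


def knn_leave_one_out_accuracy(embeddings, labels, k=3):
    """Classify each embedding by majority vote of its k nearest other embeddings."""
    correct = 0
    for i, query in enumerate(embeddings):
        distances = [
            (math.dist(query, other), labels[j])
            for j, other in enumerate(embeddings)
            if j != i  # exclude the query itself
        ]
        distances.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in distances[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(embeddings)


# Hypothetical 2-D embeddings of sound segments from two well-separated classes.
embeddings = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = ["a", "a", "a", "b", "b", "b"]
accuracy = knn_leave_one_out_accuracy(embeddings, labels, k=1)
print(accuracy)
```

Because only the embeddings and labels enter the computation, the same loop can score extractors trained on different taxonomic groups.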

[LG-22] Benchmarking Convolutional Neural Network and Graph Neural Network based Surrogate Models on a Real-World Car External Aerodynamics Dataset

Link: https://arxiv.org/abs/2504.06699
Authors: Sam Jacob Jacob, Markus Mrosek, Carsten Othmer, Harald Köstler
Categories: Machine Learning (cs.LG)

Abstract:Aerodynamic optimization is crucial for developing eco-friendly, aerodynamic, and stylish cars, which requires close collaboration between aerodynamicists and stylists, a collaboration impaired by the time-consuming nature of aerodynamic simulations. Surrogate models offer a viable solution to reduce this overhead, but they are untested in real-world aerodynamic datasets. We present a comparative evaluation of two surrogate modeling approaches for predicting drag on a real-world dataset: a Convolutional Neural Network (CNN) model that uses a signed distance field as input and a commercial tool based on Graph Neural Networks (GNN) that directly processes a surface mesh. In contrast to previous studies based on datasets created from parameterized geometries, our dataset comprises 343 geometries derived from 32 baseline vehicle geometries across five distinct car projects, reflecting the diverse, free-form modifications encountered in the typical vehicle development process. Our results show that the CNN-based method achieves a mean absolute error of 2.3 drag counts, while the GNN-based method achieves 3.8. Both methods achieve approximately 77% accuracy in predicting the direction of drag change relative to the baseline geometry. While both methods effectively capture the broader trends between baseline groups (set of samples derived from a single baseline geometry), they struggle to varying extents in capturing the finer intra-baseline group variations. In summary, our findings suggest that aerodynamicists can effectively use both methods to predict drag in under two minutes, which is at least 600 times faster than performing a simulation. However, there remains room for improvement in capturing the finer details of the geometry.
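The two reported metrics can be computed as in the sketch below. It assumes the common convention that one drag count equals 0.001 in drag coefficient, and that direction is judged by the sign of the change relative to the baseline's drag coefficient. The function names and sample values are made up for illustration.

```python
def mean_absolute_error_counts(cd_true, cd_pred):
    """Mean absolute error of drag coefficients, expressed in drag counts."""
    mae = sum(abs(t - p) for t, p in zip(cd_true, cd_pred)) / len(cd_true)
    return mae * 1000.0  # 1 drag count = 0.001 Cd


def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def direction_accuracy(cd_true, cd_pred, cd_baseline):
    """Fraction of samples where the predicted and true change from baseline share a sign."""
    hits = sum(
        sign(t - b) == sign(p - b)
        for t, p, b in zip(cd_true, cd_pred, cd_baseline)
    )
    return hits / len(cd_true)


# Hypothetical drag coefficients for three geometry variants and their baselines.
cd_true = [0.300, 0.310, 0.295]
cd_pred = [0.302, 0.308, 0.297]
cd_baseline = [0.305, 0.305, 0.296]
mae_counts = mean_absolute_error_counts(cd_true, cd_pred)
accuracy = direction_accuracy(cd_true, cd_pred, cd_baseline)
print(mae_counts, accuracy)
```

The third sample shows why the two metrics differ: a 2-count error is small, yet it flips the predicted direction because the true change from baseline is only 1 count.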

[LG-23] Robust and Noise-resilient Long-Term Prediction of Spatiotemporal Data Using Variational Mode Graph Neural Networks with 3D Attention IJCNN

Link: https://arxiv.org/abs/2504.06660
Authors: Osama Ahmad, Zubair Khalid
Categories: Machine Learning (cs.LG)
Comments: Accepted in IJCNN, 2025

Abstract:This paper focuses on improving the robustness of spatiotemporal long-term prediction using a variational mode graph convolutional network (VMGCN) by introducing 3D channel attention. The deep learning network for this task relies on historical data inputs, yet real-time data can be corrupted by sensor noise, altering its distribution. We model this noise as independent and identically distributed (i.i.d.) Gaussian noise and incorporate it into the LargeST traffic volume dataset, resulting in data with both inherent and additive noise components. Our approach involves decomposing the corrupted signal into modes using variational mode decomposition, followed by feeding the data into a learning pipeline for prediction. We integrate a 3D attention mechanism encompassing spatial, temporal, and channel attention. The spatial and temporal attention modules learn their respective correlations, while the channel attention mechanism is used to suppress noise and highlight the significant modes in the spatiotemporal signals. Additionally, a learnable soft thresholding method is implemented to exclude unimportant modes from the feature vector, and a feature reduction method based on the signal-to-noise ratio (SNR) is applied. We compare the performance of our approach against baseline models, demonstrating that our method achieves superior long-term prediction accuracy, robustness to noise, and improved performance with mode truncation compared to the baseline models. The code of the paper is available at this https URL.
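Soft thresholding, in its standard form, shrinks each value toward zero by a threshold and zeroes anything smaller in magnitude. The sketch below uses a fixed threshold `tau`; in the paper the threshold is described as learnable, and how it is applied to mode features is not specified in the abstract, so this is only the generic operator with hypothetical inputs.

```python
def soft_threshold(x, tau):
    """Shrink x toward zero by tau; values with |x| <= tau become exactly zero."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def suppress_weak_features(features, tau):
    """Apply soft thresholding element-wise to a feature vector."""
    return [soft_threshold(x, tau) for x in features]


# Hypothetical per-mode feature magnitudes; small entries are treated as unimportant.
features = [0.05, -0.8, 0.3, -0.1]
shrunk = suppress_weak_features(features, tau=0.2)
print(shrunk)
```

Entries that survive are reduced in magnitude by `tau`, so the operator both prunes weak components and damps the rest.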

[LG-24] NAPER: Fault Protection for Real-Time Resource-Constrained Deep Neural Networks

Link: https://arxiv.org/abs/2504.06591
Authors: Rian Adam Rajagede, Muhammad Husni Santriaji, Muhammad Arya Fikriansyah, Hilal Hudan Nuha, Yanjie Fu, Yan Solihin
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 8 pages, 8 figures

Abstract:Fault tolerance in Deep Neural Networks (DNNs) deployed on resource-constrained systems presents unique challenges for high-accuracy applications with strict timing requirements. Memory bit-flips can severely degrade DNN accuracy, while traditional protection approaches like Triple Modular Redundancy (TMR) often sacrifice accuracy to maintain reliability, creating a three-way dilemma between reliability, accuracy, and timeliness. We introduce NAPER, a novel protection approach that addresses this challenge through ensemble learning. Unlike conventional redundancy methods, NAPER employs heterogeneous model redundancy, where diverse models collectively achieve higher accuracy than any individual model. This is complemented by an efficient fault detection mechanism and a real-time scheduler that prioritizes meeting deadlines by intelligently scheduling recovery operations without interrupting inference. Our evaluations demonstrate NAPER’s superiority: 40% faster inference in both normal and fault conditions, maintained accuracy 4.2% higher than TMR-based strategies, and guaranteed uninterrupted operation even during fault recovery. NAPER effectively balances the competing demands of accuracy, reliability, and timeliness in real-time DNN applications.

[LG-25] CAFE-AD: Cross-Scenario Adaptive Feature Enhancement for Trajectory Planning in Autonomous Driving ICRA2025

Link: https://arxiv.org/abs/2504.06584
Authors: Junrui Zhang, Chenjie Wang, Jie Peng, Haoyu Li, Jianmin Ji, Yu Zhang, Yanyong Zhang
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: ICRA 2025; first two authors contributed equally

Abstract:Imitation learning based planning tasks on the nuPlan dataset have gained great interest due to their potential to generate human-like driving behaviors. However, open-loop training on the nuPlan dataset tends to cause causal confusion during closed-loop testing, and the dataset also presents a long-tail distribution of scenarios. These issues introduce challenges for imitation learning. To tackle these problems, we introduce CAFE-AD, a Cross-Scenario Adaptive Feature Enhancement for Trajectory Planning in Autonomous Driving method, designed to enhance feature representation across various scenario types. We develop an adaptive feature pruning module that ranks feature importance to capture the most relevant information while reducing the interference of noisy information during training. Moreover, we propose a cross-scenario feature interpolation module that enhances scenario information to introduce diversity, enabling the network to alleviate over-fitting in dominant scenarios. We evaluate our method CAFE-AD on the challenging public nuPlan Test14-Hard closed-loop simulation benchmark. The results demonstrate that CAFE-AD outperforms state-of-the-art methods including rule-based and hybrid planners, and exhibits the potential in mitigating the impact of long-tail distribution within the dataset. Additionally, we further validate its effectiveness in real-world environments. The code and models will be made available at this https URL.

[LG-26] TabKAN: Advancing Tabular Data Analysis using Kolmograv-Arnold Network

Link: https://arxiv.org/abs/2504.06559
Authors: Ali Eslamian, Alireza Afzal Aghaei, Qiang Cheng
Categories: Machine Learning (cs.LG)
Comments: 27 pages, 12 figures, 13 tables

Abstract:Tabular data analysis presents unique challenges due to its heterogeneous feature types, missing values, and complex interactions. While traditional machine learning methods, such as gradient boosting, often outperform deep learning approaches, recent advancements in neural architectures offer promising alternatives. This paper introduces TabKAN, a novel framework that advances tabular data modeling using Kolmogorov-Arnold Networks (KANs). Unlike conventional deep learning models, KANs leverage learnable activation functions on edges, enhancing both interpretability and training efficiency. Our contributions include: (1) the introduction of modular KAN-based architectures tailored for tabular data analysis, (2) the development of a transfer learning framework for KAN models, enabling effective knowledge transfer between domains, (3) the development of model-specific interpretability for tabular data learning, reducing reliance on post hoc and model-agnostic analysis, and (4) comprehensive evaluation of vanilla supervised learning across binary and multi-class classification tasks. Through extensive benchmarking on diverse public datasets, TabKAN demonstrates superior performance in supervised learning while significantly outperforming classical and Transformer-based models in transfer learning scenarios. Our findings highlight the advantage of KAN-based architectures in efficiently transferring knowledge across domains, bridging the gap between traditional machine learning and deep learning for structured data.

[LG-27] Controller Distillation Reduces Fragile Brain-Body Co-Adaptation and Enables Migrations in MAP-Elites

Link: https://arxiv.org/abs/2504.06523
Authors: Alican Mertan, Nick Cheney
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted at the Genetic and Evolutionary Computation Conference 2025 Complex Systems track as a full paper

Abstract:Brain-body co-optimization suffers from fragile co-adaptation where brains become over-specialized for particular bodies, hindering their ability to transfer well to others. Evolutionary algorithms tend to discard such low-performing solutions, eliminating promising morphologies. Previous work considered applying MAP-Elites, where niche descriptors are based on morphological features, to promote better search over morphology space. In this work, we show that this approach still suffers from fragile co-adaptation, where a core mechanism of MAP-Elites, creating stepping stones through solutions that migrate from one niche to another, is disrupted. We suggest that this disruption occurs because the body mutations that move an offspring to a new morphological niche break the robots’ fragile brain-body co-adaptation and thus significantly decrease the performance of those potential solutions – reducing their likelihood of outcompeting an existing elite in that new niche. We utilize a technique we call Pollination, which periodically replaces the controllers of certain solutions with a distilled controller with better generalization across morphologies to reduce fragile brain-body co-adaptation and thus promote MAP-Elites migrations. Pollination increases the success of body mutations and the number of migrations, resulting in better quality-diversity metrics. We believe we develop important insights that could apply to other domains where MAP-Elites is used.

[LG-28] GTS-LUM: Reshaping User Behavior Modeling with LLMs in Telecommunications Industry

Link: https://arxiv.org/abs/2504.06511
Authors: Liu Shi, Tianwu Zhou, Wei Xu, Li Liu, Zhexin Cui, Shaoyi Liang, Haoxing Niu, Yichong Tian, Jianwei Guo
Categories: Machine Learning (cs.LG)

Abstract:As telecommunication service providers shift their focus to analyzing user behavior for package design and marketing interventions, a critical challenge lies in developing a unified, end-to-end framework capable of modeling long-term and periodic user behavior sequences with diverse time granularities, multi-modal data inputs, and heterogeneous labels. This paper introduces GTS-LUM, a novel user behavior model that redefines modeling paradigms in telecommunication settings. GTS-LUM adopts a (multi-modal) encoder-adapter-LLM decoder architecture, enhanced with several telecom-specific innovations. Specifically, the model incorporates an advanced timestamp processing method to handle varying time granularities. It also supports multi-modal data inputs – including structured tables and behavior co-occurrence graphs – and aligns these with semantic information extracted by a tokenizer using a Q-former structure. Additionally, GTS-LUM integrates a front-placed target-aware mechanism to highlight historical behaviors most relevant to the target. Extensive experiments on an industrial dataset validate the effectiveness of this end-to-end framework and also demonstrate that GTS-LUM outperforms LLM4Rec approaches, which are popular in recommendation systems, offering an effective and generalizing solution for user behavior modeling in telecommunications.

[LG-29] Data-driven Fuzzy Control for Time-Optimal Aggressive Trajectory Following

Link: https://arxiv.org/abs/2504.06500
Authors: August Phelps, Juan Augusto Paredes Salazar, Ankit Goel
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 6 pages, 10 figures, submitted to MECC 2025

Abstract:Optimal trajectories that minimize a user-defined cost function in dynamic systems require the solution of a two-point boundary value problem. The optimization process yields an optimal control sequence that depends on the initial conditions and system parameters. However, the optimal sequence may result in undesirable behavior if the system’s initial conditions and parameters are erroneous. This work presents a data-driven fuzzy controller synthesis framework that is guided by a time-optimal trajectory for multicopter tracking problems. In particular, we consider an aggressive maneuver consisting of a mid-air flip and generate a time-optimal trajectory by numerically solving the two-point boundary value problem. A fuzzy controller consisting of a stabilizing controller near hover conditions and an autoregressive moving average (ARMA) controller, trained to mimic the time-optimal aggressive trajectory, is constructed using the Takagi-Sugeno fuzzy framework.

[LG-30] Classifying Subjective Time Perception in a Multi-robot Control Scenario Using Eye-tracking Information

Link: https://arxiv.org/abs/2504.06442
Authors: Till Aust, Julian Kaduk, Heiko Hamann
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:As automation and mobile robotics reshape work environments, rising expectations for productivity increase cognitive demands on human operators, leading to potential stress and cognitive overload. Accurately assessing an operator’s mental state is critical for maintaining performance and well-being. We use subjective time perception, which can be altered by stress and cognitive load, as a sensitive, low-latency indicator of well-being and cognitive strain. Distortions in time perception can affect decision-making, reaction times, and overall task effectiveness, making it a valuable metric for adaptive human-swarm interaction systems. We study how human physiological signals can be used to estimate a person’s subjective time perception in a human-swarm interaction scenario as an example. A human operator needs to guide and control a swarm of small mobile robots. We obtain eye-tracking data that is classified for subjective time perception based on questionnaire data. Our results show that we successfully estimate a person’s time perception from eye-tracking data. The approach can profit from individual-based pretraining using only 30 seconds of data. In future work, we aim for robots that respond to human operator needs by automatically classifying physiological data in a closed control loop.

[LG-31] Graph Neural Network-Based Distributed Optimal Control for Linear Networked Systems: An Online Distributed Training Approach

Link: https://arxiv.org/abs/2504.06439
Authors: Zihao Song, Panos J. Antsaklis, Hai Lin
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures

Abstract:In this paper, we consider the distributed optimal control problem for linear networked systems. In particular, we are interested in learning distributed optimal controllers using graph recurrent neural networks (GRNNs). Most of the existing approaches result in centralized optimal controllers with offline training processes. However, with the increasing demand for network resilience, optimal controllers are further expected to be distributed and to be trained in an online distributed fashion; these are also the main contributions of our work. To solve this problem, we first propose a GRNN-based distributed optimal control method, and we cast the problem as a self-supervised learning problem. Then, the distributed online training is achieved via distributed gradient computation, and inspired by the (consensus-based) distributed optimization idea, a distributed online training optimizer is designed. Furthermore, the local closed-loop stability of the linear networked system under our proposed GRNN-based controller is provided by assuming that the nonlinear activation function of the GRNN-based controller is both locally sector-bounded and slope-restricted. The effectiveness of our proposed method is illustrated by numerical simulations using a specifically developed simulator.

[LG-32] SPIRe: Boosting LLM Inference Throughput with Speculative Decoding

Link: https://arxiv.org/abs/2504.06419
Authors: Sanjit Neelam, Daniel Heinlein, Vaclav Cvicek, Akshay Mishra, Reiner Pope
Categories: Machine Learning (cs.LG)

Abstract:Speculative decoding (SD) has been shown to reduce the latency of autoregressive decoding (AD) by 2-3x for small batch sizes. However, increasing throughput and therefore reducing the cost per token requires decoding with large batch sizes. Recent work shows that SD can accelerate decoding with large batch sizes too if the context is sufficiently long and the draft model’s KV cache is sparse. We introduce SPIRe, a draft model that combines static sparse attention, pruned initialization, and feedback memory to increase the modeled throughput of speculative decoding by over 100% compared to speculation with a much smaller draft model and by over 35% compared to the strong baseline of sparse self-speculation. Our approach is particularly effective when context lengths vary significantly across requests.
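The basic draft-then-verify loop behind speculative decoding can be sketched with toy next-token functions. This is the greedy variant, in which a drafted token is accepted only if it equals the target model's greedy choice. It does not model SPIRe's sparse attention, pruned initialization, or feedback memory, and both "models" below are hypothetical stand-ins.

```python
def speculative_step(target_next, draft_next, prefix, k):
    """Run one draft-then-verify round and return the tokens to append to prefix."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal = []
    context = list(prefix)
    for _ in range(k):
        token = draft_next(context)
        proposal.append(token)
        context.append(token)

    # 2) The target model checks each position (one batched pass in a real system).
    accepted = []
    context = list(prefix)
    for token in proposal:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(token)
        context.append(token)

    # 3) Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(context))
    return accepted


def target_next(context):
    """Toy target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def imperfect_draft(context):
    """Toy draft model: agrees with the target except right after token 3."""
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


tokens = speculative_step(target_next, imperfect_draft, prefix=[1], k=4)
print(tokens)
```

The output always matches what the target model would have produced on its own; the gain comes from emitting several tokens per target pass whenever the draft agrees.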

[LG-33] Releasing Differentially Private Event Logs Using Generative Models

Link: https://arxiv.org/abs/2504.06418
Authors: Frederik Wangelik, Majid Rafiei, Mahsa Pourbafrani, Wil M.P. van der Aalst
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: arXiv admin note: text overlap with arXiv:2303.16704

Abstract:In recent years, the industry has been witnessing an extended usage of process mining and automated event data analysis. Consequently, there is a rising significance in addressing privacy apprehensions related to the inclusion of sensitive and private information within event data utilized by process mining algorithms. State-of-the-art research mainly focuses on providing quantifiable privacy guarantees, e.g., via differential privacy, for trace variants that are used by the main process mining techniques, e.g., process discovery. However, privacy preservation techniques designed for the release of trace variants are still insufficient to meet all the demands of industry-scale utilization. Moreover, ensuring privacy guarantees in situations characterized by a high occurrence of infrequent trace variants remains a challenging endeavor. In this paper, we introduce two novel approaches for releasing differentially private trace variants based on trained generative models. With TraVaG, we leverage Generative Adversarial Networks (GANs) to sample from a privatized implicit variant distribution. Our second method employs Denoising Diffusion Probabilistic Models that reconstruct artificial trace variants from noise via trained Markov chains. Both methods offer industry-scale benefits and elevate the degree of privacy assurances, particularly in scenarios featuring a substantial prevalence of infrequent variants. Also, they overcome the shortcomings of conventional privacy preservation techniques, such as bounding the length of variants and introducing fake variants. Experimental results on real-life event data demonstrate that our approaches surpass state-of-the-art techniques in terms of privacy guarantees and utility preservation.

[LG-34] Unifying Autoregressive and Diffusion-Based Sequence Generation

Link: https://arxiv.org/abs/2504.06416
Authors: Nima Fathi, Torsten Scholak, Pierre-André Noël
Categories: Machine Learning (cs.LG)

Abstract:We present significant extensions to diffusion-based sequence generation models, blurring the line with autoregressive language models. First, we introduce hyperschedules, which assign distinct noise schedules to individual token positions, generalizing both autoregressive models (e.g., GPT) and conventional diffusion models (e.g., SEDD, MDLM) as special cases. Second, we propose two hybrid token-wise noising processes that interpolate between absorbing and uniform processes, enabling the model to fix past mistakes, and we introduce a novel inference algorithm that leverages this new feature in a simplified context inspired by MDLM. To support efficient training and inference, we design attention masks compatible with KV-caching. Our methods achieve state-of-the-art perplexity and generate diverse, high-quality sequences across standard benchmarks, suggesting a promising path for autoregressive diffusion-based sequence generation.

[LG-35] Low Rank Learning for Offline Query Optimization SIGMOD2025

Link: https://arxiv.org/abs/2504.06399
Authors: Zixuan Yi, Yao Tian, Zachary G. Ives, Ryan Marcus
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Comments: To appear in SIGMOD 2025

Abstract:Recent deployments of learned query optimizers use expensive neural networks and ad-hoc search policies. To address these issues, we introduce LimeQO, a framework for offline query optimization leveraging low-rank learning to efficiently explore alternative query plans with minimal resource usage. By modeling the workload as a partially observed, low-rank matrix, we predict unobserved query plan latencies using purely linear methods, significantly reducing computational overhead compared to neural networks. We formalize offline exploration as an active learning problem, and present simple heuristics that reduce a 3-hour workload to 1.5 hours after just 1.5 hours of exploration. Additionally, we propose a transductive Tree Convolutional Neural Network (TCNN) that, despite higher computational costs, achieves the same workload reduction with only 0.5 hours of exploration. Unlike previous approaches that place expensive neural networks directly in the query processing “hot” path, our approach offers a low-overhead solution and a no-regressions guarantee, all without making assumptions about the underlying DBMS. The code is available at this https URL.

[LG-36] Sharpness-Aware Parameter Selection for Machine Unlearning

Link: https://arxiv.org/abs/2504.06398
Authors: Saber Malekmohammadi, Hong kyu Lee, Li Xiong
Categories: Machine Learning (cs.LG)

Abstract:It often happens that sensitive personal information, such as credit card numbers or passwords, is mistakenly incorporated in the training of machine learning models and needs to be removed afterwards. The removal of such information from a trained model is a complex task that needs to partially reverse the training process. Various machine unlearning techniques have been proposed in the literature to address this problem. Most of the proposed methods revolve around removing individual data samples from a trained model. Another, less explored direction is when features/labels of a group of data samples need to be reverted. While the existing methods for these tasks perform unlearning by updating the whole set of model parameters or only the last layer of the model, we show that there is a subset of model parameters that contributes the most to unlearning the target features. More precisely, the model parameters with the largest corresponding diagonal value in the Hessian matrix (computed at the learned model parameter) contribute the most to the unlearning task. By selecting these parameters and updating them during the unlearning stage, we can make the most progress in unlearning. We provide theoretical justifications for the proposed strategy by connecting it to sharpness-aware minimization and robust unlearning. We empirically show the effectiveness of the proposed strategy in improving the efficacy of unlearning with a low computational cost.

[LG-37] SPoRt – Safe Policy Ratio: Certified Training and Deployment of Task Policies in Model-Free RL

Link: https://arxiv.org/abs/2504.06386
Authors: Jacques Cloete, Nikolaus Vertovec, Alessandro Abate
Categories: Machine Learning (cs.LG)
Comments: 9 pages + 16 pages supplementary material, 3 figures + 6 figures supplementary material

Abstract:To apply reinforcement learning to safety-critical applications, we ought to provide safety guarantees during both policy training and deployment. In this work we present novel theoretical results that provide a bound on the probability of violating a safety property for a new task-specific policy in a model-free, episodic setup: the bound, based on a ‘maximum policy ratio’ that is computed with respect to a ‘safe’ base policy, can also be more generally applied to temporally-extended properties (beyond safety) and to robust control problems. We thus present SPoRt, which also provides a data-driven approach for obtaining such a bound for the base policy, based on scenario theory, and which includes Projected PPO, a new projection-based approach for training the task-specific policy while maintaining a user-specified bound on property violation. Hence, SPoRt enables the user to trade off safety guarantees in exchange for task-specific performance. Accordingly, we present experimental results demonstrating this trade-off, as well as a comparison of the theoretical bound to posterior bounds based on empirical violation rates.

[LG-38] An Information-Geometric Approach to Artificial Curiosity

Link: https://arxiv.org/abs/2504.06355
Authors: Alexander Nedergaard, Pablo A. Morales
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 1 figure

Abstract:Learning in environments with sparse rewards remains a fundamental challenge in reinforcement learning. Artificial curiosity addresses this limitation through intrinsic rewards to guide exploration; however, the precise formulation of these rewards has remained elusive. Ideally, such rewards should depend on the agent’s information about the environment, remaining agnostic to the representation of the information – an invariance central to information geometry. Leveraging information geometry, we show that invariance under congruent Markov morphisms and the agent-environment interaction uniquely constrains intrinsic rewards to concave functions of the reciprocal occupancy. Additional geometrically motivated restrictions effectively limit the candidates to those determined by a real parameter that governs the occupancy space geometry. Remarkably, special values of this parameter are found to correspond to count-based and maximum entropy exploration, revealing a geometric exploration-exploitation trade-off. This framework provides important constraints to the engineering of intrinsic reward while integrating foundational exploration methods into a single, cohesive model.

[LG-39] Physics-informed KAN PointNet: Deep learning for simultaneous solutions to inverse problems in incompressible flow on numerous irregular geometries

Link: https://arxiv.org/abs/2504.06327
Authors: Ali Kashefi, Tapan Mukerji
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)

Abstract:Kolmogorov-Arnold Networks (KANs) have gained attention as a promising alternative to traditional Multilayer Perceptrons (MLPs) for deep learning applications in computational physics, especially within the framework of physics-informed neural networks (PINNs). Physics-informed Kolmogorov-Arnold Networks (PIKANs) and their variants have been introduced and evaluated to solve inverse problems. However, similar to PINNs, current versions of PIKANs are limited to obtaining solutions for a single computational domain per training run; consequently, a new geometry requires retraining the model from scratch. Physics-informed PointNet (PIPN) was introduced to address this limitation for PINNs. In this work, we introduce physics-informed Kolmogorov-Arnold PointNet (PI-KAN-PointNet) to extend this capability to PIKANs. PI-KAN-PointNet enables the simultaneous solution of an inverse problem over multiple irregular geometries within a single training run, reducing computational costs. We construct KANs using Jacobi polynomials and investigate their performance by considering Jacobi polynomials of different degrees and types in terms of both computational cost and prediction accuracy. As a benchmark test case, we consider natural convection in a square enclosure with a cylinder, where the cylinder’s shape varies across a dataset of 135 geometries. We compare the performance of PI-KAN-PointNet with that of PIPN (i.e., physics-informed PointNet with MLPs) and observe that, with approximately an equal number of trainable parameters and similar computational cost, PI-KAN-PointNet provides more accurate predictions. Finally, we explore the combination of KAN and MLP in constructing a physics-informed PointNet. Our findings indicate that a physics-informed PointNet model employing MLP layers as the encoder and KAN layers as the decoder represents the optimal configuration among all models investigated.

[LG-40] Different Paths Same Destination: Designing New Physics-Inspired Dynamical Systems with Engineered Stability to Minimize the Ising Hamiltonian

Link: https://arxiv.org/abs/2504.06280
Authors: E.M.H.E.B. Ekanayake, N. Shukla
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 3 figures

Abstract:Oscillator Ising machines (OIMs) represent an exemplar case of using physics-inspired non-linear dynamical systems to solve computationally challenging combinatorial optimization problems (COPs). The computational performance of such systems is highly sensitive to the underlying dynamical properties, the topology of the input graph, and their relative compatibility. In this work, we explore the concept of designing different dynamical systems that minimize the same objective function but exhibit drastically different dynamical properties. Our goal is to leverage this diversification in dynamics to reduce the sensitivity of the computational performance to the underlying graph, and subsequently, enhance the overall effectiveness of such physics-based computational methods. To this end, we introduce a novel dynamical system, the Dynamical Ising Machine (DIM), which, like the OIM, minimizes the Ising Hamiltonian but offers significantly different dynamical properties. We analyze the characteristic properties of the DIM and compare them with those of the OIM. We also show that the relative performance of each model is dependent on the input graph. Our work illustrates that using multiple dynamical systems with varying properties to solve the same COP enables an effective method that is less sensitive to the input graph, while producing robust solutions.
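
For readers unfamiliar with the objective that both the OIM and the DIM minimize, the Ising Hamiltonian can be evaluated directly. The sketch below assumes the common sign convention H(s) = -Σ J_ij s_i s_j with spins in {-1, +1}, and pairs it with a plain single-spin-flip descent; that descent is a swapped-in discrete heuristic for illustration, not the continuous dynamics studied in the paper. The names `ising_energy` and `greedy_descent` are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def ising_energy(J: Dict[Edge, float], s: List[int]) -> float:
    """H(s) = -sum over edges (i, j) of J_ij * s_i * s_j, with s_i in {-1, +1}."""
    return -sum(w * s[i] * s[j] for (i, j), w in J.items())


def greedy_descent(J: Dict[Edge, float], s: List[int]) -> List[int]:
    """Flip single spins while any flip strictly lowers the energy (local minimum only)."""
    s = list(s)
    improved = True
    while improved:
        improved = False
        for i in range(len(s)):
            before = ising_energy(J, s)
            s[i] = -s[i]                  # try flipping spin i
            if ising_energy(J, s) < before:
                improved = True           # keep the flip
            else:
                s[i] = -s[i]              # undo the flip
    return s


# Antiferromagnetic triangle: a frustrated graph where no assignment satisfies every edge.
J = {(0, 1): -1.0, (1, 2): -1.0, (0, 2): -1.0}
start = [1, 1, 1]
local_min = greedy_descent(J, start)
```

On this triangle the all-up state has energy 3 and the descent stops at energy -1, which is the best achievable here; on larger graphs such a descent can get stuck in poor local minima, which is part of why the choice of dynamics matters.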

[LG-41] Enhancing Downstream Analysis in Genome Sequencing: Species Classification While Basecalling ICLR

Link: https://arxiv.org/abs/2504.07065
Authors: Riselda Kodra, Hadjer Benmeziane, Irem Boybat, William Andrew Simon
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)
Comments: Accepted as Tiny Paper at MLGenX workshop, ICLR, 2025

Abstract:The ability to quickly and accurately identify microbial species in a sample, known as metagenomic profiling, is critical across various fields, from healthcare to environmental science. This paper introduces a novel method to profile signals coming from sequencing devices in parallel with determining their nucleotide sequences, a process known as basecalling, via a multi-objective deep neural network for simultaneous basecalling and multi-class genome classification. We introduce a new loss strategy where losses for basecalling and classification are back-propagated separately, with model weights combined for the shared layers, and a pre-configured ranking strategy allowing top-K species accuracy, giving users flexibility to choose between higher accuracy or higher speed at identifying the species. We achieve state-of-the-art basecalling accuracies, while classification accuracies meet and exceed the results of state-of-the-art binary classifiers, attaining an average of 92.5%/98.9% accuracy at identifying the top-1/3 species among a total of 17 genomes in the Wick bacterial dataset. The work presented here has implications for future studies in metagenomic profiling by accelerating the bottleneck step of matching the DNA sequence to the correct genome.

[LG-42] Assumption-free fidelity bounds for hardware noise characterization

Link: https://arxiv.org/abs/2504.07010
Authors: Nicolo Colombo
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 30 pages, 3 figures, 2 tables

Abstract:In the Quantum Supremacy regime, quantum computers may overcome classical machines on several tasks if we can estimate, mitigate, or correct unavoidable hardware noise. Estimating the error requires classical simulations, which become unfeasible in the Quantum Supremacy regime. We leverage Machine Learning data-driven approaches and Conformal Prediction, a Machine Learning uncertainty quantification tool known for its mild assumptions and finite-sample validity, to find theoretically valid upper bounds of the fidelity between noiseless and noisy outputs of quantum devices. Under reasonable extrapolation assumptions, the proposed scheme applies to any Quantum Computing hardware, does not require modeling the device’s noise sources, and can be used when classical simulations are unavailable, e.g. in the Quantum Supremacy regime.
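
The finite-sample flavor of Conformal Prediction mentioned above can be illustrated with a one-sided split-conformal bound. This is a generic sketch, not the paper's scheme: it assumes a hypothetical predictor of fidelity and a calibration set of residuals (true fidelity minus predicted fidelity) that is exchangeable with the test point. The function name `conformal_upper_offset` is an assumption.

```python
import math
from typing import List


def conformal_upper_offset(residuals: List[float], alpha: float) -> float:
    """Offset q such that (prediction + q) upper-bounds the true value
    with probability at least 1 - alpha, under exchangeability.

    residuals: calibration values of (true - predicted).
    """
    n = len(residuals)
    # Finite-sample corrected rank: the ceil((n + 1)(1 - alpha))-th smallest residual.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for the requested confidence level.
        return math.inf
    return sorted(residuals)[rank - 1]


# Hypothetical calibration residuals for a fidelity predictor.
calibration = [0.03, -0.01, 0.05, 0.00, 0.02, -0.02, 0.04]
q = conformal_upper_offset(calibration, alpha=0.25)
upper_bound = 0.90 + q  # 0.90 stands in for a new predicted fidelity
```

The `(n + 1)` correction is what gives validity at finite sample sizes; with only a handful of calibration points and a small `alpha`, the only valid bound is the trivial one, which the sketch signals by returning infinity.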

[LG-43] Artificial Intelligence for Pediatric Height Prediction Using Large-Scale Longitudinal Body Composition Data

Link: https://arxiv.org/abs/2504.06979
Authors: Dohyun Chun, Hae Woon Jung, Jongho Kang, Woo Young Jang, Jihun Kim
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 23 pages, 7 figures, 2 tables

Abstract:This study developed an accurate artificial intelligence model for predicting future height in children and adolescents using anthropometric and body composition data from the GP Cohort Study (588,546 measurements from 96,485 children aged 7-18). The model incorporated anthropometric measures, body composition, standard deviation scores, and growth velocity parameters, with performance evaluated using RMSE, MAE, and MAPE. Results showed high accuracy with males achieving average RMSE, MAE, and MAPE of 2.51 cm, 1.74 cm, and 1.14%, and females showing 2.28 cm, 1.68 cm, and 1.13%, respectively. Explainable AI approaches identified height SDS, height velocity, and soft lean mass velocity as crucial predictors. The model generated personalized growth curves by estimating individual-specific height trajectories, offering a robust tool for clinical decision support, early identification of growth disorders, and optimization of growth outcomes.
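
The three error metrics reported above have standard definitions, sketched here for reference. The height values in the example are made up for illustration and are not data from the study.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, in the units of the target (cm for height)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent; requires nonzero true values."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative heights in cm (hypothetical values).
true_heights = [150.0, 160.0]
predicted = [153.0, 156.0]
print(rmse(true_heights, predicted), mae(true_heights, predicted), mape(true_heights, predicted))
```

RMSE penalizes large misses more heavily than MAE, so a gap between the two (as in the reported 2.51 cm versus 1.74 cm for males) suggests that a minority of predictions carry larger errors.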

[LG-44] CRYSIM: Prediction of Symmetric Structures of Large Crystals with GPU-based Ising Machines

Link: https://arxiv.org/abs/2504.06878
Authors: Chen Liang, Diptesh Das, Jiang Guo, Ryo Tamura, Zetian Mao, Koji Tsuda
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 18 pages, 4 figures, 1 table

Abstract:Solving black-box optimization problems with Ising machines is increasingly common in materials science. However, their application to crystal structure prediction (CSP) is still ineffective due to symmetry-agnostic encoding of atomic coordinates. We introduce CRYSIM, an algorithm that encodes the space group, the Wyckoff positions combination, and coordinates of independent atomic sites as separate variables. This encoding reduces the search space substantially by exploiting the symmetry in space groups. When CRYSIM was interfaced to Fixstars Amplify, a GPU-based Ising machine, its prediction performance was competitive with CALYPSO and Bayesian optimization for crystals containing more than 150 atoms in a unit cell. Although it is not realistic to interface CRYSIM to current small-scale quantum devices, it has the potential to become the standard CSP algorithm in the coming quantum age.

[LG-45] Mass Balance Approximation of Unfolding Improves Potential-Like Methods for Protein Stability Predictions

Link: https://arxiv.org/abs/2504.06806
Authors: Ivan Rossi, Guido Barducci, Tiziana Sanavia, Paola Turina, Emidio Capriotti, Piero Fariselli
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)

Abstract:The prediction of protein stability changes following single-point mutations plays a pivotal role in computational biology, particularly in areas like drug discovery, enzyme reengineering, and genetic disease analysis. Although deep-learning strategies have pushed the field forward, their use in standard workflows remains limited due to resource demands. Conversely, potential-like methods are fast, intuitive, and efficient. Yet, these typically estimate Gibbs free energy shifts without considering the free-energy variations in the unfolded protein state, an omission that may breach mass balance and diminish accuracy. This study shows that incorporating a mass-balance correction (MBC) to account for the unfolded state significantly enhances these methods. While many machine learning models partially model this balance, our analysis suggests that a refined representation of the unfolded state may improve the predictive performance.

[LG-46] Hybrid machine learning models based on physical patterns to accelerate CFD simulations: a short guide on autoregressive models

Link: https://arxiv.org/abs/2504.06774
Authors: Arindam Sengupta, Rodrigo Abadía-Heredia, Ashton Hetherington, José Miguel Pérez, Soledad Le Clainche
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)

Abstract:Accurate modeling of the complex dynamics of fluid flows is a fundamental challenge in computational physics and engineering. This study presents an innovative integration of High-Order Singular Value Decomposition (HOSVD) with Long Short-Term Memory (LSTM) architectures to address the complexities of reduced-order modeling (ROM) in fluid dynamics. HOSVD improves the dimensionality reduction process by preserving multidimensional structures, surpassing the limitations of Singular Value Decomposition (SVD). The methodology is tested across numerical and experimental data sets, including two- and three-dimensional (2D and 3D) cylinder wake flows, spanning both laminar and turbulent regimes. The emphasis is also on exploring how the depth and complexity of LSTM architectures contribute to improving predictive performance. Simpler architectures with a single dense layer effectively capture the periodic dynamics, demonstrating the network’s ability to model non-linearities and chaotic dynamics. The addition of extra layers provides higher accuracy at minimal computational cost. These additional layers enable the network to expand its representational capacity, improving the prediction accuracy and reliability. The results demonstrate that HOSVD outperforms SVD in all tested scenarios, as evidenced by using different error metrics. Efficient mode truncation by HOSVD-based models enables the capture of complex temporal patterns, offering reliable predictions even in challenging, noise-influenced data sets. The findings underscore the adaptability and robustness of HOSVD-LSTM architectures, offering a scalable framework for modeling fluid dynamics.

[LG-47] Quantum neural networks facilitating quantum state classification

Link: https://arxiv.org/abs/2504.06622
Authors: Diksha Sharma, Vivek Balasaheb Sabale, Thirumalai M., Atul Kumar
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:The classification of quantum states into distinct classes poses a significant challenge. In this study, we address this problem using quantum neural networks in combination with a problem-inspired circuit and customised as well as predefined ansätze. To facilitate resource-efficient quantum state classification, we construct the dataset of quantum states using the proposed problem-inspired circuit. The problem-inspired circuit incorporates two-qubit parameterised unitary gates of varying entangling power and is further integrated with the ansatz to form a complete quantum neural network. To demonstrate the capability of the selected ansatz, we visualise the mitigated barren plateaus. The designed quantum neural network demonstrates efficiency in binary and multi-class classification tasks. This work establishes a foundation for the classification of multi-qubit quantum states and offers the potential for generalisation to multi-qubit pure quantum states.

[LG-48] Diffusion Factor Models: Generating High-Dimensional Returns with Factor Structure

Link: https://arxiv.org/abs/2504.06566
Authors: Minshuo Chen, Renyuan Xu, Yumin Xu, Ruixun Zhang
Categories: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); Mathematical Finance (q-fin.MF)

Abstract:Financial scenario simulation is essential for risk management and portfolio optimization, yet it remains challenging especially in high-dimensional and small data settings common in finance. We propose a diffusion factor model that integrates latent factor structure into generative diffusion processes, bridging econometrics with modern generative AI to address the challenges of the curse of dimensionality and data scarcity in financial simulation. By exploiting the low-dimensional factor structure inherent in asset returns, we decompose the score function – a key component in diffusion models – using time-varying orthogonal projections, and this decomposition is incorporated into the design of neural network architectures. We derive rigorous statistical guarantees, establishing nonasymptotic error bounds for both score estimation at $O(d^{5/2} n^{-2/(k+5)})$ and generated distribution at $O(d^{5/4} n^{-1/2(k+5)})$, primarily driven by the intrinsic factor dimension $k$ rather than the number of assets $d$, surpassing the dimension-dependent limits in the classical nonparametric statistics literature and making the framework viable for markets with thousands of assets. Numerical studies confirm superior performance in latent subspace recovery under small data regimes. Empirical analysis demonstrates the economic significance of our framework in constructing mean-variance optimal portfolios and factor portfolios. This work presents the first theoretical integration of factor structure with diffusion models, offering a principled approach for high-dimensional financial simulation with limited data.

[LG-49] Sparsified-Learning for Heavy-Tailed Locally Stationary Processes

Link: https://arxiv.org/abs/2504.06477
Authors: Yingjie Wang, Mokhtar Z. Alaya, Salim Bouzebda, Xinsheng Liu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Sparsified Learning is ubiquitous in many machine learning tasks. It aims to regularize the objective function by adding a penalization term that considers the constraints made on the learned parameters. This paper considers the problem of learning heavy-tailed locally stationary processes (LSP). We develop a flexible and robust sparse learning framework capable of handling heavy-tailed data with locally stationary behavior and propose concentration inequalities. We further provide non-asymptotic oracle inequalities for different types of sparsity, including $\ell_1$-norm and total variation penalization for the least square loss.

[LG-50] Deep Fair Learning: A Unified Framework for Fine-tuning Representations with Sufficient Networks

Link: https://arxiv.org/abs/2504.06470
Authors: Enze Shi, Linglong Kong, Bei Jiang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Ensuring fairness in machine learning is a critical and challenging task, as biased data representations often lead to unfair predictions. To address this, we propose Deep Fair Learning, a framework that integrates nonlinear sufficient dimension reduction with deep learning to construct fair and informative representations. By introducing a novel penalty term during fine-tuning, our method enforces conditional independence between sensitive attributes and learned representations, addressing bias at its source while preserving predictive performance. Unlike prior methods, it supports diverse sensitive attributes, including continuous, discrete, binary, or multi-group types. Experiments on various types of data structure show that our approach achieves a superior balance between fairness and utility, significantly outperforming state-of-the-art baselines.

[LG-51] Deep spatio-temporal point processes: Advances and new directions

Link: https://arxiv.org/abs/2504.06364
Authors: Xiuyuan Cheng, Zheng Dong, Yao Xie
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Spatio-temporal point processes (STPPs) model discrete events distributed in time and space, with important applications in areas such as criminology, seismology, epidemiology, and social networks. Traditional models often rely on parametric kernels, limiting their ability to capture heterogeneous, nonstationary dynamics. Recent innovations integrate deep neural architectures – either by modeling the conditional intensity function directly or by learning flexible, data-driven influence kernels, substantially broadening their expressive power. This article reviews the development of the deep influence kernel approach, which enjoys statistical explainability, since the influence kernel remains in the model to capture the spatiotemporal propagation of event influence and its impact on future events, while also possessing strong expressive power, thereby benefiting from both worlds. We explain the main components in developing deep kernel point processes, leveraging tools such as functional basis decomposition and graph neural networks to encode complex spatial or network structures, as well as estimation using both likelihood-based and likelihood-free methods, and address computational scalability for large-scale data. We also discuss the theoretical foundation of kernel identifiability. Simulated and real-data examples highlight applications to crime analysis, earthquake aftershock prediction, and sepsis prediction modeling, and we conclude by discussing promising directions for the field.

[LG-52] DeepGDel: Deep Learning-based Gene Deletion Prediction Framework for Growth-Coupled Production in Genome-Scale Metabolic Models

Link: https://arxiv.org/abs/2504.06316
Authors: Ziwei Yang, Takeyuki Tamura
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)

Abstract:In genome-scale constraint-based metabolic models, gene deletion strategies are crucial for achieving growth-coupled production, where cell growth and target metabolite production are simultaneously achieved. While computational methods for calculating gene deletions have been widely explored and contribute to developing gene deletion strategy databases, current approaches are limited in leveraging new data-driven paradigms, such as machine learning, for more efficient strain design. Therefore, it is necessary to propose a fundamental framework for this objective. In this study, we first formulate the problem of gene deletion strategy prediction and then propose a framework for predicting gene deletion strategies for growth-coupled production in genome-scale metabolic models. The proposed framework leverages deep learning algorithms to learn and integrate sequential gene and metabolite data representation, enabling the automatic gene deletion strategy prediction. Computational experiment results demonstrate the feasibility of the proposed framework, showing substantial improvements over the baseline method. Specifically, the proposed framework achieves a 17.64%, 27.15%, and 18.07% increase in overall accuracy across three metabolic models of different scales under study, while maintaining balanced precision and recall in predicting gene deletion statuses. The source code and examples for the framework are publicly available at this https URL.

[LG-53] Generative AI Enhanced Financial Risk Management Information Retrieval

Link: https://arxiv.org/abs/2504.06293
Authors: Amin Haeri, Jonathan Vitrano, Mahdi Ghelichi
Categories: Risk Management (q-fin.RM); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, 2 tables, 1 equation

Abstract:Risk management in finance involves recognizing, evaluating, and addressing financial risks to maintain stability and ensure regulatory compliance. Extracting relevant insights from extensive regulatory documents is a complex challenge requiring advanced retrieval and language models. This paper introduces RiskData, a dataset specifically curated for finetuning embedding models in risk management, and RiskEmbed, a finetuned embedding model designed to improve retrieval accuracy in financial question-answering systems. The dataset is derived from 94 regulatory guidelines published by the Office of the Superintendent of Financial Institutions (OSFI) from 1991 to 2024. We finetune a state-of-the-art sentence BERT embedding model to enhance domain-specific retrieval performance typically for Retrieval-Augmented Generation (RAG) systems. Experimental results demonstrate that RiskEmbed significantly outperforms general-purpose and financial embedding models, achieving substantial improvements in ranking metrics. By open-sourcing both the dataset and the model, we provide a valuable resource for financial institutions and researchers aiming to develop more accurate and efficient risk management AI solutions.

Information Retrieval

[IR-0] CHIME: A Compressive Framework for Holistic Interest Modeling

Link: https://arxiv.org/abs/2504.06780
Authors: Yong Bai, Rui Xiang, Kaiyuan Li, Yongxiang Tang, Yanhua Cheng, Xialong Liu, Peng Jiang, Kun Gai
Subjects: Information Retrieval (cs.IR)

Abstract:Modeling holistic user interests is important for improving recommendation systems but is challenged by high computational cost and difficulty in handling diverse information with full behavior context. Existing search-based methods might lose critical signals during behavior selection. To overcome these limitations, we propose CHIME: A Compressive Framework for Holistic Interest Modeling. It uses adapted large language models to encode complete user behaviors with heterogeneous inputs. We introduce multi-granular contrastive learning objectives to capture both persistent and transient interest patterns and apply residual vector quantization to generate compact embeddings. CHIME demonstrates superior ranking performance across diverse datasets, establishing a robust solution for scalable holistic interest modeling in recommendation systems.
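
For readers unfamiliar with residual vector quantization, the general idea can be sketched as follows. This is a minimal, generic illustration and not CHIME's implementation: the codebooks are hand-picked toy values (in practice they are learned), and the names `rvq_encode` and `rvq_decode` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Codebook = List[Vector]  # each codebook is a list of code vectors


def nearest(codebook: Codebook, v: Vector) -> int:
    """Index of the code vector closest to v in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - a) ** 2 for c, a in zip(codebook[k], v)),
    )


def rvq_encode(x: Vector, codebooks: List[Codebook]) -> Tuple[List[int], Vector]:
    """Quantize x stage by stage; each stage encodes what the previous one missed."""
    residual = list(x)
    codes = []
    for cb in codebooks:
        k = nearest(cb, residual)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return codes, residual


def rvq_decode(codes: List[int], codebooks: List[Codebook]) -> Vector:
    """Reconstruct the vector as the sum of the selected code vectors."""
    out = [0.0] * len(codebooks[0][0])
    for k, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[k])]
    return out


# Toy codebooks (assumed values): a coarse stage followed by a finer stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
codes, residual = rvq_encode([1.2, 0.9], codebooks)
print(codes, rvq_decode(codes, codebooks), residual)
```

The compact embedding is the short list of integer codes: each additional stage shrinks the reconstruction error at the cost of one more index per vector.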

[IR-1] Unifying Search and Recommendation: A Generative Paradigm Inspired by Information Theory

Link: https://arxiv.org/abs/2504.06714
Authors: Jujia Zhao, Wenjie Wang, Chen Xu, Xiuying Wang, Zhaochun Ren, Suzan Verberne
Subjects: Information Retrieval (cs.IR)

Abstract:Recommender systems and search engines serve as foundational elements of online platforms, with the former delivering information proactively and the latter enabling users to seek information actively. Unifying both tasks in a shared model is promising since it can enhance user modeling and item understanding. Previous approaches mainly follow a discriminative paradigm, utilizing shared encoders to process input features and task-specific heads to perform each task. However, this paradigm encounters two key challenges: gradient conflict and manual design complexity. From the information theory perspective, these challenges potentially both stem from the same issue – low mutual information between the input features and task-specific outputs during the optimization process. To tackle these issues, we propose GenSR, a novel generative paradigm for unifying search and recommendation (SR), which leverages task-specific prompts to partition the model’s parameter space into subspaces, thereby enhancing mutual information. To construct effective subspaces for each task, GenSR first prepares informative representations for each subspace and then optimizes both subspaces in one unified model. Specifically, GenSR consists of two main modules: (1) Dual Representation Learning, which independently models collaborative and semantic historical information to derive expressive item representations; and (2) SR Task Unifying, which utilizes contrastive learning together with instruction tuning to generate task-specific outputs effectively. Extensive experiments on two public datasets show GenSR outperforms state-of-the-art methods across SR tasks. Our work introduces a new generative paradigm compared with previous discriminative methods and establishes its superiority from the mutual information perspective. 

[IR-2] Toward Holistic Evaluation of Recommender Systems Powered by Generative Models

Link: https://arxiv.org/abs/2504.06667
Authors: Yashar Deldjoo, Nikhil Mehta, Maheswaran Sathiamoorthy, Shuai Zhang, Pablo Castells, Julian McAuley
Subjects: Information Retrieval (cs.IR)

Abstract:Recommender systems powered by generative models (Gen-RecSys) extend beyond classical item ranking by producing open-ended content, which simultaneously unlocks richer user experiences and introduces new risks. On one hand, these systems can enhance personalization and appeal through dynamic explanations and multi-turn dialogues. On the other hand, they might venture into unknown territory: hallucinating nonexistent items, amplifying bias, or leaking private information. Traditional accuracy metrics cannot fully capture these challenges, as they fail to measure factual correctness, content safety, or alignment with user intent. This paper makes two main contributions. First, we categorize the evaluation challenges of Gen-RecSys into two groups: (i) existing concerns that are exacerbated by generative outputs (e.g., bias, privacy) and (ii) entirely new risks (e.g., item hallucinations, contradictory explanations). Second, we propose a holistic evaluation approach that includes scenario-based assessments and multi-metric checks incorporating relevance, factual grounding, bias detection, and policy compliance. Our goal is to provide a guiding framework so researchers and practitioners can thoroughly assess Gen-RecSys, ensuring effective personalization and responsible deployment.

[IR-3] BBQRec: Behavior-Bind Quantization for Multi-Modal Sequential Recommendation

Link: https://arxiv.org/abs/2504.06636
Authors: Kaiyuan Li, Rui Xiang, Yong Bai, Yongxiang Tang, Yanhua Cheng, Xialong Liu, Peng Jiang, Kun Gai
Subjects: Information Retrieval (cs.IR)

Abstract:Multi-modal sequential recommendation systems leverage auxiliary signals (e.g., text, images) to alleviate data sparsity in user-item interactions. While recent methods exploit large language models to encode modalities into discrete semantic IDs for autoregressive prediction, we identify two critical limitations: (1) Existing approaches adopt fragmented quantization, where modalities are independently mapped to semantic spaces misaligned with behavioral objectives, and (2) Over-reliance on semantic IDs disrupts inter-modal semantic coherence, thereby weakening the expressive power of multi-modal representations for modeling diverse user preferences. To address these challenges, we propose a Behavior-Bind multi-modal Quantization for Sequential Recommendation (BBQRec for short) featuring dual-aligned quantization and semantics-aware sequence modeling. First, our behavior-semantic alignment module disentangles modality-agnostic behavioral patterns from noisy modality-specific features through contrastive codebook learning, ensuring semantic IDs are inherently tied to recommendation tasks. Second, we design a discretized similarity reweighting mechanism that dynamically adjusts self-attention scores using quantized semantic relationships, preserving multi-modal synergies while avoiding invasive modifications to the sequence modeling architecture. Extensive evaluations across four real-world benchmarks demonstrate BBQRec’s superiority over the state-of-the-art baselines.

[IR-4] A Serendipitous Recommendation System Considering User Curiosity

Link: https://arxiv.org/abs/2504.06633
Authors: Zhelin Xu, Atsushi Matsumura
Subjects: Information Retrieval (cs.IR)
Comments: 15 pages, 3 figures, accepted as a full paper at iiWAS 2024

Abstract:To address the problem of narrow recommendation ranges caused by an emphasis on prediction accuracy, serendipitous recommendations, which consider both usefulness and unexpectedness, have attracted attention. However, realizing serendipitous recommendations is challenging due to the varying proportions of usefulness and unexpectedness preferred by different users, which is influenced by their differing desires for knowledge. In this paper, we propose a method to estimate the proportion of usefulness and unexpectedness that each user desires based on their curiosity, and make recommendations that match this preference. The proposed method estimates a user’s curiosity by considering both their long-term and short-term interests. Offline experiments were conducted using the MovieLens-1M dataset to evaluate the effectiveness of the proposed method. The experimental results demonstrate that our method achieves the same level of performance as state-of-the-art methods while successfully providing serendipitous recommendations.

[IR-5] Diversity-aware Dual-promotion Poisoning Attack on Sequential Recommendation SIGIR2025

Link: https://arxiv.org/abs/2504.06586
Authors: Yuchuan Zhao, Tong Chen, Junliang Yu, Kai Zheng, Lizhen Cui, Hongzhi Yin
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by SIGIR 2025

Abstract:Sequential recommender systems (SRSs) excel in capturing users’ dynamic interests, thus playing a key role in various industrial applications. The popularity of SRSs has also driven emerging research on their security aspects, where data poisoning attack for targeted item promotion is a typical example. Existing attack mechanisms primarily focus on increasing the ranks of target items in the recommendation list by injecting carefully crafted interactions (i.e., poisoning sequences), which comes at the cost of demoting users’ real preferences. Consequently, noticeable recommendation accuracy drops are observed, restricting the stealthiness of the attack. Additionally, the generated poisoning sequences are prone to substantial repetition of target items, which is a result of the unitary objective of boosting their overall exposure and lack of effective diversity regularizations. Such homogeneity not only compromises the authenticity of these sequences, but also limits the attack effectiveness, as it ignores the opportunity to establish sequential dependencies between the target and many more items in the SRS. To address the issues outlined, we propose a Diversity-aware Dual-promotion Sequential Poisoning attack method named DDSP for SRSs. Specifically, by theoretically revealing the conflict between recommendation and existing attack objectives, we design a revamped attack objective that promotes the target item while maintaining the relevance of preferred items in a user’s ranking list. We further develop a diversity-aware, auto-regressive poisoning sequence generator, where a re-ranking method is in place to sequentially pick the optimal items by integrating diversity constraints.
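
The abstract does not spell out how the re-ranking step integrates diversity constraints. As a rough illustration of the general idea only, the sketch below uses a generic greedy selection with a repetition penalty, in the spirit of maximal marginal relevance. It is not DDSP's generator: the function name `greedy_diverse_rerank`, the `penalty` parameter, and the toy scores and categories are all assumptions.

```python
from typing import Dict, List


def greedy_diverse_rerank(
    scores: Dict[str, float],
    categories: Dict[str, str],
    k: int,
    penalty: float = 0.5,
) -> List[str]:
    """Pick k items one at a time, discounting categories already chosen."""
    selected: List[str] = []
    used: Dict[str, int] = {}  # how many times each category was picked
    candidates = set(scores)
    while candidates and len(selected) < k:
        # Sorting first makes tie-breaking deterministic (alphabetical order).
        best = max(
            sorted(candidates),
            key=lambda i: scores[i] - penalty * used.get(categories[i], 0),
        )
        selected.append(best)
        used[categories[best]] = used.get(categories[best], 0) + 1
        candidates.remove(best)
    return selected


# Toy data (assumed): items "a" and "b" share a category, so "b" is discounted
# once "a" has been picked.
scores = {"a": 0.9, "b": 0.85, "c": 0.6, "d": 0.5}
categories = {"a": "x", "b": "x", "c": "y", "d": "z"}
print(greedy_diverse_rerank(scores, categories, k=3))
```

With `penalty=0.0` the function reduces to plain top-k by score, which shows what the diversity term changes.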

[IR-6] Bridging Queries and Tables through Entities in Table Retrieval

Link: https://arxiv.org/abs/2504.06551
Authors: Da Li, Keping Bi, Jiafeng Guo, Xueqi Cheng
Subjects: Information Retrieval (cs.IR)

Abstract:Table retrieval is essential for accessing information stored in structured tabular formats; however, it remains less explored than text retrieval. The content of the table primarily consists of phrases and words, which include a large number of entities, such as time, locations, persons, and organizations. Entities are well-studied in the context of text retrieval, but there is a noticeable lack of research on their applications in table retrieval. In this work, we explore how to leverage entities in tables to improve retrieval performance. First, we investigate the important role of entities in table retrieval from a statistical perspective and propose an entity-enhanced training framework. Subsequently, we use the type of entities to highlight entities instead of introducing an external knowledge base. Moreover, we design an interaction paradigm based on entity representations. Our proposed framework is plug-and-play and flexible, making it easy to integrate into existing table retriever training processes. Empirical results on two table retrieval benchmarks, NQ-TABLES and OTT-QA, show that our proposed framework is both simple and effective in enhancing existing retrievers. We also conduct extensive analyses to confirm the efficacy of different components. Overall, our work provides a promising direction for elevating table retrieval, enlightening future research in this area.

[IR-7] DiffusionCom: Structure-Aware Multimodal Diffusion Model for Multimodal Knowledge Graph Completion

Link: https://arxiv.org/abs/2504.06543
Authors: Wei Huang, Meiyu Liang, Peining Li, Xu Hou, Yawen Li, Junping Du, Zhe Xue, Zeli Guan
Subjects: Information Retrieval (cs.IR)
Comments: 11 pages, 6 figures

Abstract:Most current multimodal knowledge graph completion (MKGC) approaches are predominantly based on discriminative models that maximize conditional likelihood. These approaches struggle to efficiently capture the complex connections in real-world knowledge graphs, thereby limiting their overall performance. To address this issue, we propose a structure-aware multimodal Diffusion model for multimodal knowledge graph Completion (DiffusionCom). DiffusionCom innovatively approaches the problem from the perspective of generative models, modeling the association between the (head, relation) pair and candidate tail entities as their joint probability distribution p((head, relation), (tail)), and framing the MKGC task as a process of gradually generating the joint probability distribution from noise. Furthermore, to fully leverage the structural information in MKGs, we propose Structure-MKGformer, an adaptive and structure-aware multimodal knowledge representation learning method, as the encoder for DiffusionCom. Structure-MKGformer captures rich structural information through a multimodal graph attention network (MGAT) and adaptively fuses it with entity representations, thereby enhancing the structural awareness of these representations. This design effectively addresses the limitations of existing MKGC methods, particularly those based on multimodal pre-trained models, in utilizing structural information. DiffusionCom is trained using both generative and discriminative losses for the generator, while the feature extractor is optimized exclusively with discriminative loss. This dual approach allows DiffusionCom to harness the strengths of both generative and discriminative models. Extensive experiments on the FB15k-237-IMG and WN18-IMG datasets demonstrate that DiffusionCom outperforms state-of-the-art models.

[IR-8] Can we repurpose multiple-choice question-answering models to rerank retrieved documents?

Link: https://arxiv.org/abs/2504.06276
Authors: Jasper Kyle Catapang
Subjects: Information Retrieval (cs.IR)
Comments: Accepted to The 38th Pacific Asia Conference on Language, Information and Computation; PACLIC 38 (2024)

Abstract:Yes, repurposing multiple-choice question-answering (MCQA) models for document reranking is both feasible and valuable. This preliminary work is founded on mathematical parallels between MCQA decision-making and cross-encoder semantic relevance assessments, leading to the development of R*, a proof-of-concept model that harmonizes these approaches. Designed to assess document relevance with depth and precision, R* showcases how MCQA’s principles can improve reranking in information retrieval (IR) and retrieval-augmented generation (RAG) systems, ultimately enhancing search and dialogue in AI-powered systems. In experimental validation, R* improves retrieval accuracy while remaining lightweight, providing a practical prototype of MCQA-based reranking.
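
The parallel between answering a multiple-choice question and reranking can be pictured by treating each retrieved document as an answer option and applying a softmax over per-option scores. The sketch below is a toy illustration under that assumption and is not the R* model: `overlap_logit` is a hypothetical stand-in for a real MCQA model's per-option logit, and `rerank_as_mcqa` is an illustrative name.

```python
import math
from typing import Callable, List, Tuple


def overlap_logit(query: str, doc: str) -> float:
    """Toy stand-in for an MCQA model's option logit: count of shared tokens."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def rerank_as_mcqa(
    query: str,
    docs: List[str],
    logit_fn: Callable[[str, str], float] = overlap_logit,
) -> List[Tuple[str, float]]:
    """Score each document as an answer option, softmax, and sort by probability."""
    if not docs:
        return []
    logits = [logit_fn(query, d) for d in docs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    order = sorted(range(len(docs)), key=lambda i: -probs[i])
    return [(docs[i], probs[i]) for i in order]


docs = [
    "cats sleep a lot",
    "residual vector quantization compresses embeddings",
    "vector search",
]
for doc, p in rerank_as_mcqa("how does vector quantization work", docs):
    print(f"{p:.3f}  {doc}")
```

Replacing `overlap_logit` with a model-based scorer keeps the rest of the ranking logic unchanged.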
