This post contains the latest paper listing fetched from Arxiv.org on 2025-04-08, updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: the paper data is fetched from Arxiv.org daily and updated automatically around 12:00 each morning.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Contents

Overview (2025-04-08)

A total of 927 papers were updated today, including:

  • Natural Language Processing: 124 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 288 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 190 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 286 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations

[Quick Read]: This paper addresses the reduced faithfulness of chain-of-thought explanations generated by large language models (LLMs), caused by potential conflicts during preference optimization. Specifically, the reward model (RM) used in the alignment phase must optimize both the quality of the response and the appropriateness of the explanation (e.g., minimizing bias or adhering to safety standards), yet it has no mechanism for assessing the consistency between the model's internal decision process and the generated explanation. As a result, the model may engage in "reward hacking", producing misleading explanations tailored to maximize reward rather than accurately reflect its reasoning.

The key to the solution is enriching the RM's input with a causal attribution of the prediction, enabling the RM to detect discrepancies between the generated self-explanation and the model's decision process. In controlled settings, this approach reduces the LLM's tendency to generate misleading explanations.

Link: https://arxiv.org/abs/2504.05294
Authors: Pedro Ferreira, Wilker Aziz, Ivan Titov
Affiliations: University of Amsterdam; University of Edinburgh
Subjects: Computation and Language (cs.CL)
Comments: 22 pages, 10 figures, 5 tables


Abstract:Chain-of-thought explanations are widely used to inspect the decision process of large language models (LLMs) and to evaluate the trustworthiness of model outputs, making them important for effective collaboration between LLMs and humans. We demonstrate that preference optimization - a key step in the alignment phase - can inadvertently reduce the faithfulness of these explanations. This occurs because the reward model (RM), which guides alignment, is tasked with optimizing both the expected quality of the response and the appropriateness of the explanations (e.g., minimizing bias or adhering to safety standards), creating potential conflicts. The RM lacks a mechanism to assess the consistency between the model’s internal decision process and the generated explanation. Consequently, the LLM may engage in “reward hacking” by producing a final response that scores highly while giving an explanation tailored to maximize reward rather than accurately reflecting its reasoning. To address this issue, we propose enriching the RM’s input with a causal attribution of the prediction, allowing the RM to detect discrepancies between the generated self-explanation and the model’s decision process. In controlled settings, we show that this approach reduces the tendency of the LLM to generate misleading explanations.

[NLP-1] LiveVQA: Live Visual Knowledge Seeking

[Quick Read]: This paper addresses the problem of exploiting up-to-date visual knowledge from the Internet in visual question answering (VQA). The authors build LiveVQA, a dataset of 3,602 single- and multi-hop visual questions collected from 6 news websites across 14 news categories, and use it to evaluate how well current multimodal large language models (MLLMs) handle questions that require such knowledge. They find that although these models perform excellently on textual questions, a significant gap remains on visual questions requiring the latest visual knowledge, even when the models are equipped with tools such as search engines. The paper therefore stresses the importance of advanced visual reasoning for complex multi-hop visual questions and identifies improving models' ability to acquire and apply up-to-date visual information as an important direction for future research.

Link: https://arxiv.org/abs/2504.05288
Authors: Mingyang Fu, Yuyang Peng, Benlin Liu, Yao Wan, Dongping Chen
Affiliations: Huazhong University of Science and Technology; University of Washington
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Work in progress


Abstract:We introduce LiveVQA, an automatically collected dataset of latest visual knowledge from the Internet with synthesized VQA problems. LiveVQA consists of 3,602 single- and multi-hop visual questions from 6 news websites across 14 news categories, featuring high-quality image-text coherence and authentic information. Our evaluation across 15 MLLMs (e.g., GPT-4o, Gemma-3, and Qwen-2.5-VL family) demonstrates that stronger models perform better overall, with advanced visual reasoning capabilities proving crucial for complex multi-hop questions. Despite excellent performance on textual problems, models with tools like search engines still show significant gaps when addressing visual questions requiring latest visual knowledge, highlighting important areas for future research.

[NLP-2] Enhancing LLM-Based Short Answer Grading with Retrieval-Augmented Generation

[Quick Read]: This paper addresses the limitations of large language models (LLMs) in grading short answers in science education, in particular their insufficient grasp of task-specific requirements and unsatisfactory grading performance caused by a lack of domain knowledge. The key solution is an adaptive retrieval-augmented generation (RAG) framework that dynamically retrieves and incorporates domain-specific knowledge relevant to the question and the student's answer to improve grading accuracy. The approach combines semantic search with curated educational sources to extract valuable reference material; experiments on a science education dataset show a clear improvement in grading accuracy over baseline LLM approaches, suggesting that RAG-enhanced grading systems can provide reliable support with efficient performance gains.

Link: https://arxiv.org/abs/2504.05276
Authors: Yucheng Chu, Peng He, Hang Li, Haoyu Han, Kaiqi Yang, Yu Xue, Tingting Li, Joseph Krajcik, Jiliang Tang
Affiliations: Michigan State University; Washington State University
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Short answer assessment is a vital component of science education, allowing evaluation of students’ complex three-dimensional understanding. Large language models (LLMs) that possess human-like ability in linguistic tasks are increasingly popular in assisting human graders to reduce their workload. However, LLMs’ limitations in domain knowledge restrict their understanding in task-specific requirements and hinder their ability to achieve satisfactory performance. Retrieval-augmented generation (RAG) emerges as a promising solution by enabling LLMs to access relevant domain-specific knowledge during assessment. In this work, we propose an adaptive RAG framework for automated grading that dynamically retrieves and incorporates domain-specific knowledge based on the question and student answer context. Our approach combines semantic search and curated educational sources to retrieve valuable reference materials. Experimental results in a science education dataset demonstrate that our system achieves an improvement in grading accuracy compared to baseline LLM approaches. The findings suggest that RAG-enhanced grading systems can serve as reliable support with efficient performance gains.

[NLP-3] Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models

[Quick Read]: This paper investigates whether large language models (LLMs) genuinely learn mathematical principles or merely memorize patterns. Using elementary two-integer addition (0 to 2^64), it probes two core properties: commutativity (A+B=B+A) and compositional generalization (via isomorphic symbolic mappings, e.g., 7 → y). The key lies in a simple but effective experimental setup: contrasting performance on numerical addition with performance under symbolic mapping reveals that state-of-the-art LLMs collapse to ≤7.5% accuracy on the symbolic task and frequently violate commutativity (over 1,700 cases of A+B ≠ B+A), showing a failure to generalize learned rules. Moreover, explicitly providing the addition rules degrades performance by 81.2% on average, while self-explanation maintains baseline accuracy, further indicating that LLM arithmetic processing is misaligned with human-defined principles. The paper concludes that current LLMs rely on memorized patterns rather than genuine rule learning, highlighting architectural limitations and calling for new approaches to achieve true mathematical reasoning.
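The symbolic-mapping probe is easy to picture in code. Below is a minimal sketch of that style of test (the helper names and the digit-to-letter mapping are ours for illustration, not the paper's actual protocol):

```python
def make_symbolic_probe(a: int, b: int, digit_map: dict) -> str:
    """Rewrite an addition problem under an isomorphic digit-to-symbol mapping."""
    encode = lambda n: "".join(digit_map[d] for d in str(n))
    return f"{encode(a)} + {encode(b)} = ?"

def count_commutativity_violations(answer, pairs):
    """Count cases where a model answers A+B and B+A differently."""
    return sum(1 for a, b in pairs if answer(a, b) != answer(b, a))

# Map each digit to a letter: 0 -> a, 1 -> b, ..., 9 -> j
digit_map = {str(i): chr(ord("a") + i) for i in range(10)}
print(make_symbolic_probe(7, 12, digit_map))  # h + bc = ?
```

Replacing the `answer` callable with real model queries would reproduce the commutativity check at scale.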

Link: https://arxiv.org/abs/2504.05262
Authors: Yang Yan, Yu Lu, Renjun Xu, Zhenzhong Lan
Affiliations: Zhejiang University; School of Engineering, Westlake University
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Despite high benchmark scores, Large Language Models (LLMs) often fail simple problems, raising a critical question: Do LLMs learn mathematical principles or merely memorize patterns? Rather than designing increasingly complex benchmarks like recent works, we investigate this using elementary two-integer addition ( 0 to 2^64 ), probing two core properties: commutativity ( A+B=B+A ) and compositional generalization (via isomorphic symbolic mappings, e.g., 7 \rightarrow y ). While state-of-the-art LLMs achieve 73.8-99.8% accuracy on numerical addition, performance collapses to \leq 7.5% under symbolic mapping, indicating failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases of A+B \neq B+A ) further support this. Explicitly providing addition rules degrades performance by 81.2% on average, while self-explanation maintains baseline accuracy, suggesting LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate current LLMs rely on memorized patterns over genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.

[NLP-4] Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models

[Quick Read]: This paper addresses the weakness of large language models (LLMs) in temporal reasoning, i.e., processing time-related information in tasks such as event ordering, durations, and inter-temporal relationships. Temporal reasoning is critical for applications such as question answering, scheduling, and historical analysis, yet conventional LLMs perform poorly on it.

解决方案的关键在于提出了一种名为TISER的新框架,它通过结合时间线构建与迭代自省的多阶段过程来增强LLMs的时间推理能力。TISER利用测试时扩展(test-time scaling)技术延长推理轨迹的长度,从而更有效地捕捉复杂的时间依赖关系。这一策略不仅提升了推理准确性,还增强了推理过程的可追溯性。实验结果表明,TISER在多个基准数据集(包括分布外测试集)上达到了最先进的性能,并使较小的开源模型在具有挑战性的时间推理任务中超越了较大的闭源模型。

Link: https://arxiv.org/abs/2504.05258
Authors: Adrián Bazaga, Rexhina Blloshmi, Bill Byrne, Adrià de Gispert
Affiliations: University of Cambridge; Amazon AGI; Microsoft
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:


Abstract:Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks. However, they struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships. These capabilities are critical for applications including question answering, scheduling, and historical analysis. In this paper, we introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection. Our approach leverages test-time scaling to extend the length of reasoning traces, enabling models to capture complex temporal dependencies more effectively. This strategy not only boosts reasoning accuracy but also improves the traceability of the inference process. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, including out-of-distribution test sets, and reveal that TISER enables smaller open-source models to surpass larger closed-weight models on challenging temporal reasoning tasks.

[NLP-5] LLM-based Automated Grading with Human-in-the-Loop

[Quick Read]: This paper tackles the difficulty of automatic short answer grading (ASAG) reaching human-level performance in rubric-based assessment. Existing LLM-based methods rely on fully automated pipelines and remain limited when handling complex rubrics. The key solution is GradeHITL, a human-in-the-loop (HITL) framework that uses the generative properties of LLMs to pose questions to human experts and incorporates their feedback to refine the grading rubric dynamically, significantly improving grading accuracy and bringing ASAG closer to human-level evaluation.
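A rough sketch of such a refine-by-asking loop, with the LLM calls and the expert reply injected as callables (all names and the stopping condition are our assumptions, not GradeHITL's actual interface):

```python
def grade_hitl(rubric, answers, llm_grade, llm_ask, expert, max_rounds=3):
    """HITL sketch: grade, let the LLM query the expert, fold the reply into the rubric."""
    grades = [llm_grade(rubric, a) for a in answers]
    for _ in range(max_rounds):
        question = llm_ask(rubric, grades)
        if question is None:  # the LLM sees no remaining ambiguity in the rubric
            break
        rubric = rubric + "\nClarification: " + expert(question)
        grades = [llm_grade(rubric, a) for a in answers]  # re-grade with refined rubric
    return rubric, grades
```

The design point is that the rubric, not the model, is what gets updated each round, so expert effort is spent only on the ambiguities the LLM surfaces.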

Link: https://arxiv.org/abs/2504.05239
Authors: Hang Li, Yucheng Chu, Kaiqi Yang, Yasemin Copur-Gencturk, Jiliang Tang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:The rise of artificial intelligence (AI) technologies, particularly large language models (LLMs), has brought significant advancements to the field of education. Among various applications, automatic short answer grading (ASAG), which focuses on evaluating open-ended textual responses, has seen remarkable progress with the introduction of LLMs. These models not only enhance grading performance compared to traditional ASAG approaches but also move beyond simple comparisons with predefined “golden” answers, enabling more sophisticated grading scenarios, such as rubric-based evaluation. However, existing LLM-powered methods still face challenges in achieving human-level grading performance in rubric-based assessments due to their reliance on fully automated approaches. In this work, we explore the potential of LLMs in ASAG tasks by leveraging their interactive capabilities through a human-in-the-loop (HITL) approach. Our proposed framework, GradeHITL, utilizes the generative properties of LLMs to pose questions to human experts, incorporating their insights to refine grading rubrics dynamically. This adaptive process significantly improves grading accuracy, outperforming existing methods and bringing ASAG closer to human-level evaluation.

[NLP-6] NoveltyBench: Evaluating Creativity and Diversity in Language Models

[Quick Read]: This paper addresses language models' shortfall in generating diverse and novel outputs, i.e., "mode collapse". The key contribution is NoveltyBench, a benchmark built from prompts curated to elicit diverse answers together with filtered real-world user queries. Evaluating 20 leading models, the authors find that current state-of-the-art systems generate significantly less diversity than human writers, and that larger models within a family often exhibit less diversity than their smaller counterparts. While prompting strategies such as in-context regeneration can elicit diversity, current models fundamentally lack distributional diversity, which limits their utility for users seeking varied responses and suggests the need for training and evaluation paradigms that prioritize creativity alongside quality.
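To make the notion of generation diversity concrete, here is a toy distinct-output proxy (NoveltyBench's actual metric and judging setup are more sophisticated; this is only an illustration):

```python
def distinct_ratio(generations):
    """Fraction of distinct outputs among repeated samples for one prompt --
    a crude proxy for the kind of diversity NoveltyBench measures."""
    normalized = [" ".join(g.lower().split()) for g in generations]
    return len(set(normalized)) / len(normalized)

# Three samples, two of which are near-duplicates -> ratio 2/3
print(distinct_ratio(["A tale of two cities", "a  tale of two cities", "Bleak House"]))
```

A fully mode-collapsed model would score near 1/k over k samples, a maximally diverse one near 1.0.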

Link: https://arxiv.org/abs/2504.05228
Authors: Yiming Zhang, Harshita Diddee, Susan Holm, Hanchen Liu, Xinyue Liu, Vinay Samuel, Barry Wang, Daphne Ippolito
Affiliations: Carnegie Mellon University
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Language models have demonstrated remarkable capabilities on standard benchmarks, yet they struggle increasingly from mode collapse, the inability to generate diverse and novel outputs. Our work introduces NoveltyBench, a benchmark specifically designed to evaluate the ability of language models to produce multiple distinct and high-quality outputs. NoveltyBench utilizes prompts curated to elicit diverse answers and filtered real-world user queries. Evaluating 20 leading language models, we find that current state-of-the-art systems generate significantly less diversity than human writers. Notably, larger models within a family often exhibit less diversity than their smaller counterparts, challenging the notion that capability on standard benchmarks translates directly to generative utility. While prompting strategies like in-context regeneration can elicit diversity, our findings highlight a fundamental lack of distributional diversity in current models, reducing their utility for users seeking varied responses and suggesting the need for new training and evaluation paradigms that prioritize creativity alongside quality.

[NLP-7] Proposing TAGbank as a Corpus of Tree-Adjoining Grammar Derivations

[Quick Read]: This paper addresses the lack of large-scale corpora grounded in lexicalized grammar formalisms, in particular Tree-Adjoining Grammar (TAG), in NLP. While syntactic resources such as the Penn Treebank and Universal Dependencies offer rich phrase-structure and dependency annotations, no comparable large-scale TAG resource exists. To fill this gap, the paper introduces TAGbank, a corpus of TAG derivations automatically extracted from existing syntactic treebanks.

The key to the solution is a methodology for mapping phrase-structure annotations onto TAG derivations, leveraging TAG's generative power to support parsing, grammar induction, and semantic analysis. The approach builds on CCGbank and extends it to incorporate the unique structural properties of TAG, such as its transparent derivation trees and its ability to capture long-distance dependencies. The paper also discusses the challenges of the extraction process, including ensuring consistency across treebank schemes and handling language-specific syntactic idiosyncrasies. Finally, the authors propose extending TAGbank to multilingual corpora, focusing on the Penn Korean and Penn Chinese Treebanks, to explore cross-linguistic applications of the TAG formalism. By providing a robust, derivation-based resource, TAGbank aims to support a wide range of computational tasks and to advance theoretical understanding of TAG's generative capacity.

Link: https://arxiv.org/abs/2504.05226
Authors: Jungyeul Park
Affiliations: The University of British Columbia
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:The development of lexicalized grammars, particularly Tree-Adjoining Grammar (TAG), has significantly advanced our understanding of syntax and semantics in natural language processing (NLP). While existing syntactic resources like the Penn Treebank and Universal Dependencies offer extensive annotations for phrase-structure and dependency parsing, there is a lack of large-scale corpora grounded in lexicalized grammar formalisms. To address this gap, we introduce TAGbank, a corpus of TAG derivations automatically extracted from existing syntactic treebanks. This paper outlines a methodology for mapping phrase-structure annotations to TAG derivations, leveraging the generative power of TAG to support parsing, grammar induction, and semantic analysis. Our approach builds on the work of CCGbank, extending it to incorporate the unique structural properties of TAG, including its transparent derivation trees and its ability to capture long-distance dependencies. We also discuss the challenges involved in the extraction process, including ensuring consistency across treebank schemes and dealing with language-specific syntactic idiosyncrasies. Finally, we propose the future extension of TAGbank to include multilingual corpora, focusing on the Penn Korean and Penn Chinese Treebanks, to explore the cross-linguistic application of TAG’s formalism. By providing a robust, derivation-based resource, TAGbank aims to support a wide range of computational tasks and contribute to the theoretical understanding of TAG’s generative capacity.

[NLP-8] Leveraging LLMs for Utility-Focused Annotation: Reducing Manual Effort for Retrieval and RAG

[Quick Read]: This paper addresses the high cost of human-labeled query-document relevance annotations for training retrieval models and retrieval-augmented generation (RAG) systems, and explores whether annotations generated by large language models (LLMs) can replace them. The key ideas are a utility-focused annotation scheme, in which documents are labeled by how much they contribute to answer generation rather than by topical relevance alone, and a novel loss function, Disj-InfoNCE, designed to reduce the impact of low-quality positives labeled by LLMs. Experiments show that retrievers trained on utility-focused annotations significantly outperform those trained on human annotations in out-of-domain settings, demonstrating stronger generalization, while in-domain, adding a small share of human-annotated data lets utility-annotated models match models trained entirely on human annotations.
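The summary does not spell the loss out, but the name suggests a "disjunctive" InfoNCE in which all positives are pooled inside a single log term, so that one noisy LLM-labeled positive cannot dominate the objective. A hedged sketch under that assumption (the exact form is our reading of the name, not taken from the paper):

```python
import math

def disj_infonce(pos_scores, neg_scores, tau=0.05):
    """Disjunctive InfoNCE sketch: the retriever only needs *some* positive
    to score high, rather than every positive individually."""
    pos = sum(math.exp(s / tau) for s in pos_scores)
    neg = sum(math.exp(s / tau) for s in neg_scores)
    return -math.log(pos / (pos + neg))
```

Compared with summing a standard InfoNCE term per positive, pooling makes the gradient insensitive to a low-scoring (possibly mislabeled) positive once another positive already scores high.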

Link: https://arxiv.org/abs/2504.05220
Authors: Hengran Zhang, Minghao Tang, Keping Bi, Jiafeng Guo, Shihao Liu, Daiting Shi, Dawei Yin, Xueqi Cheng
Affiliations: CAS Key Lab of Network Data Science and Technology, ICT, CAS; University of Chinese Academy of Sciences; Nankai University; Baidu Inc
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 4 figures


Abstract:Retrieval models typically rely on costly human-labeled query-document relevance annotations for training and evaluation. To reduce this cost and leverage the potential of Large Language Models (LLMs) in relevance judgments, we aim to explore whether LLM-generated annotations can effectively replace human annotations in training retrieval models. Retrieval usually emphasizes relevance, which indicates “topic-relatedness” of a document to a query, while in RAG, the value of a document (or utility) depends on how it contributes to answer generation. Recognizing this mismatch, some researchers use LLM performance on downstream tasks with documents as labels, but this approach requires manual answers for specific tasks, leading to high costs and limited generalization. In another line of work, prompting LLMs to select useful documents as RAG references eliminates the need for human annotation and is not task-specific. If we leverage LLMs’ utility judgments to annotate retrieval data, we may retain cross-task generalization without human annotation in large-scale corpora. Therefore, we investigate utility-focused annotation via LLMs for large-scale retriever training data across both in-domain and out-of-domain settings on the retrieval and RAG tasks. To reduce the impact of low-quality positives labeled by LLMs, we design a novel loss function, i.e., Disj-InfoNCE. Our experiments reveal that: (1) Retrievers trained on utility-focused annotations significantly outperform those trained on human annotations in the out-of-domain setting on both tasks, demonstrating superior generalization capabilities. (2) LLM annotation does not replace human annotation in the in-domain setting. However, incorporating just 20% human-annotated data enables retrievers trained with utility-focused annotations to match the performance of models trained entirely with human annotations.

[NLP-9] Unleashing the Power of LLMs in Dense Retrieval with Query Likelihood Modeling

[Quick Read]: This paper addresses a difficulty in using large language models (LLMs) for dense retrieval: as decoder-style generators, their causal attention makes it hard to model global document information. The paper proposes exploiting LLMs' generative ability through query likelihood (QL) maximization, used as an auxiliary task to yield a better backbone for contrastively training a discriminative retriever. The key components of the resulting model, LLM-QL, are Attention Stop (AS), which stops predictive tokens from attending to previous tokens until the document's ending token so that global document semantics are condensed into a single vector, and Input Corruption (IC), which masks a portion of the input document's tokens during prediction to further strengthen the document-level representation. Experiments on MSMARCO show that LLM-QL significantly outperforms other LLM-based retrievers, and that ranking with QL estimated by LLM-QL beats word-based QL by a large margin.
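Input Corruption is the easier of the two components to sketch; the mask token, ratio, and sampling scheme below are illustrative assumptions rather than the paper's exact configuration:

```python
import random

def corrupt_document(tokens, mask_token="[MASK]", ratio=0.3, seed=0):
    """Input Corruption (IC) sketch: mask a fraction of document tokens during
    query-likelihood prediction, forcing reliance on the condensed document vector."""
    rng = random.Random(seed)
    n_mask = int(len(tokens) * ratio)
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in masked_idx else t for i, t in enumerate(tokens)]
```

Because some surface tokens are hidden, the model cannot copy from the document verbatim and must route information through the single document representation that AS produces.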

Link: https://arxiv.org/abs/2504.05216
Authors: Hengran Zhang, Keping Bi, Jiafeng Guo, Xiaojie Sun, Shihao Liu, Daiting Shi, Dawei Yin, Xueqi Cheng
Affiliations: CAS Key Lab of Network Data Science and Technology, ICT, CAS; University of Chinese Academy of Sciences; Baidu Inc
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 3 figures


Abstract:Dense retrieval is a crucial task in Information Retrieval (IR) and is the foundation for downstream tasks such as re-ranking. Recently, large language models (LLMs) have shown compelling semantic understanding capabilities and are appealing to researchers studying dense retrieval. LLMs, as decoder-style generative models, are competent at language generation while falling short on modeling global information due to the lack of attention to tokens afterward. Inspired by the classical word-based language modeling approach for IR, i.e., the query likelihood (QL) model, we seek to sufficiently utilize LLMs’ generative ability by QL maximization. However, instead of ranking documents with QL estimation, we introduce an auxiliary task of QL maximization to yield a better backbone for contrastively learning a discriminative retriever. We name our model as LLM-QL. To condense global document semantics to a single vector during QL modeling, LLM-QL has two major components, Attention Stop (AS) and Input Corruption (IC). AS stops the attention of predictive tokens to previous tokens until the ending token of the document. IC masks a portion of tokens in the input documents during prediction. Experiments on MSMARCO show that LLM-QL can achieve significantly better performance than other LLM-based retrievers and using QL estimated by LLM-QL for ranking outperforms word-based QL by a large margin.

[NLP-10] Post-Training Language Models for Continual Relation Extraction

[Quick Read]: This paper addresses the challenge of building knowledge graphs (KGs) in real time from dynamic, non-stationary real-world data, where relation extraction (RE) models trained on static, outdated datasets struggle to adapt. It focuses on continual relation extraction (CRE), which learns new relations incrementally while preserving previously acquired knowledge to avoid catastrophic forgetting. The key solution is to apply pre-trained language models (PLMs), especially large language models (LLMs), to CRE via task-incremental fine-tuning combined with memory replay. On TACRED, decoder-only models (e.g., Mistral-7B and Llama2-7B) and the encoder-decoder Flan-T5 Base clearly outperform earlier approaches built on encoder-only models such as BERT, excelling in seen-task accuracy and overall performance (whole and average accuracy); results on FewRel are similarly promising. The work highlights the roles of knowledge transfer, language model architecture, and KG completeness, advancing CRE with LLMs and memory replay for dynamic, real-time relation extraction.
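The memory-replay ingredient can be sketched as a batch generator that pads each new-task batch with stored examples from earlier tasks (batch sizes and the replay share are made-up defaults, not the paper's settings):

```python
import random

def replay_batches(new_task_data, memory, batch_size=8, n_replay=2, seed=0):
    """Memory-replay sketch: mix a few stored examples from earlier tasks into
    every fine-tuning batch to counter catastrophic forgetting."""
    rng = random.Random(seed)
    fresh_per_batch = batch_size - (n_replay if memory else 0)
    for i in range(0, len(new_task_data), fresh_per_batch):
        batch = list(new_task_data[i:i + fresh_per_batch])
        if memory:
            batch += rng.sample(memory, min(n_replay, len(memory)))
        yield batch
```

Keeping the replay share small preserves throughput on the new task while still rehearsing old relations on every step.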

Link: https://arxiv.org/abs/2504.05214
Authors: Sefika Efeoglu, Adrian Paschke, Sonja Schimmler
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 17 pages


Abstract:Real-world data, such as news articles, social media posts, and chatbot conversations, is inherently dynamic and non-stationary, presenting significant challenges for constructing real-time structured representations through knowledge graphs (KGs). Relation Extraction (RE), a fundamental component of KG creation, often struggles to adapt to evolving data when traditional models rely on static, outdated datasets. Continual Relation Extraction (CRE) methods tackle this issue by incrementally learning new relations while preserving previously acquired knowledge. This study investigates the application of pre-trained language models (PLMs), specifically large language models (LLMs), to CRE, with a focus on leveraging memory replay to address catastrophic forgetting. We evaluate decoder-only models (e.g., Mistral-7B and Llama2-7B) and encoder-decoder models (e.g., Flan-T5 Base) on the TACRED and FewRel datasets. Task-incremental fine-tuning of LLMs demonstrates superior performance over earlier approaches using encoder-only models like BERT on TACRED, excelling in seen-task accuracy and overall performance (measured by whole and average accuracy), particularly with the Mistral and Flan-T5 models. Results on FewRel are similarly promising, achieving second place in whole and average accuracy metrics. This work underscores critical factors in knowledge transfer, language model architecture, and KG completeness, advancing CRE with LLMs and memory replay for dynamic, real-time relation extraction.

[NLP-11] Exploiting individual differences to bootstrap communication

[Quick Read]: This paper asks how a communication system capable of expressing an unbounded number of meanings can be bootstrapped from non-communicative behaviours. The key lies in two cognitive capabilities: individuals behaving predictably in a given situation, and an alignment of psychological states ahead of signal production that derives from shared intentionality. Since both capabilities can exist independently of communication, the results are compatible with theories in which large, flexible, socially learned communication systems such as language are the product of a general but well-developed capacity for social cognition.

Link: https://arxiv.org/abs/2504.05211
Authors: Richard A. Blythe, Casimir Fisch
Affiliations: School of Physics and Astronomy, University of Edinburgh, Peter Guthrie Tait Road, Edinburgh EH9 3FD, UK; Institute for Atmospheric and Climate Sciences, ETH Zürich, Universitätstrasse 16, 8092 Zürich, Switzerland
Subjects: Computation and Language (cs.CL); Physics and Society (physics.soc-ph); Populations and Evolution (q-bio.PE)
Comments: 13 pages including supplementary information, 3 figures


Abstract:Establishing a communication system is hard because the intended meaning of a signal is unknown to its receiver when first produced, and the signaller also has no idea how that signal will be interpreted. Most theoretical accounts of the emergence of communication systems rely on feedback to reinforce behaviours that have led to successful communication in the past. However, providing such feedback requires already being able to communicate the meaning that was intended or interpreted. Therefore these accounts cannot explain how communication can be bootstrapped from non-communicative behaviours. Here we present a model that shows how a communication system, capable of expressing an unbounded number of meanings, can emerge as a result of individual behavioural differences in a large population without any pre-existing means to determine communicative success. The two key cognitive capabilities responsible for this outcome are behaving predictably in a given situation, and an alignment of psychological states ahead of signal production that derives from shared intentionality. Since both capabilities can exist independently of communication, our results are compatible with theories in which large flexible socially-learned communication systems like language are the product of a general but well-developed capacity for social cognition.

[NLP-12] Concise Reasoning via Reinforcement Learning

[Quick Read]: This paper targets the enormous token usage of LLM reasoning models trained with reinforcement learning (RL), which inflates computational cost, resource requirements, and response time. Through mathematical analysis, the authors show that the tendency to generate lengthy responses arises inherently from RL-based optimization during training, questioning the prevailing assumption that longer responses necessarily improve reasoning accuracy; instead, they uncover a largely overlooked natural correlation between conciseness and accuracy. The key solution is a secondary RL post-training phase on a small set of problems with limited resources, which significantly shortens the model's chain of thought while maintaining or even improving accuracy, as validated by extensive experiments.

Link: https://arxiv.org/abs/2504.05185
Authors: Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, Kartik Talamadupula
Affiliations: Wand AI
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Despite significant advancements in large language models (LLMs), a major drawback of reasoning models is their enormous token usage, which increases computational cost, resource requirements, and response time. In this work, we revisit the core principles of reinforcement learning (RL) and, through mathematical analysis, demonstrate that the tendency to generate lengthy responses arises inherently from RL-based optimization during training. This finding questions the prevailing assumption that longer responses inherently improve reasoning accuracy. Instead, we uncover a natural correlation between conciseness and accuracy that has been largely overlooked. Moreover, we show that introducing a secondary phase of RL post-training, using a small set of problems and limited resources, can significantly reduce a model’s chain of thought while maintaining or even enhancing accuracy. Finally, we validate our conclusions through extensive experimental results.

[NLP-13] CARE: Aligning Language Models for Regional Cultural Awareness

[Quick Read]: This paper addresses the Western-centric bias of existing language models (LMs) and their poor representation of diverse cultural knowledge; earlier attempts to address this relied on synthetic data and expressed cultural knowledge only in English. The paper studies whether a small amount of human-written, multilingual cultural preference data can improve LMs across model families and sizes. To this end it introduces CARE, a multilingual resource of 24.1k responses with human preferences on 2,580 questions about Chinese and Arab cultures, all carefully annotated by native speakers and offering more balanced coverage. Using CARE, the authors show that cultural alignment improves existing LMs without compromising general capabilities. They also compare the cultural awareness of LMs, native speakers, and retrieved web content under queries in different languages. The experiments reveal regional disparities among LMs, which may also reflect a documentation gap: native speakers tend to take everyday cultural commonsense and social norms for granted, while non-natives are more likely to actively seek them out and document them. CARE is publicly available, and Japanese data is planned for the near future.

Link: https://arxiv.org/abs/2504.05154
Authors: Geyang Guo, Tarek Naous, Hiromi Wakaki, Yukiko Nishimura, Yuki Mitsufuji, Alan Ritter, Wei Xu
Affiliations: Georgia Institute of Technology; Sony Group Corporation
Subjects: Computation and Language (cs.CL)
Comments: 24 pages


Abstract:Existing language models (LMs) often exhibit a Western-centric bias and struggle to represent diverse cultural knowledge. Previous attempts to address this rely on synthetic data and express cultural knowledge only in English. In this work, we study whether a small amount of human-written, multilingual cultural preference data can improve LMs across various model families and sizes. We first introduce CARE, a multilingual resource of 24.1k responses with human preferences on 2,580 questions about Chinese and Arab cultures, all carefully annotated by native speakers and offering more balanced coverage. Using CARE, we demonstrate that cultural alignment improves existing LMs beyond generic resources without compromising general capabilities. Moreover, we evaluate the cultural awareness of LMs, native speakers, and retrieved web content when queried in different languages. Our experiment reveals regional disparities among LMs, which may also be reflected in the documentation gap: native speakers often take everyday cultural commonsense and social norms for granted, while non-natives are more likely to actively seek out and document them. CARE is publicly available at this https URL (we plan to add Japanese data in the near future).

[NLP-14] DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation

[Quick Read]: This paper addresses the discourse challenges that automatic speech recognition (ASR) noise introduces into speech translation (ST), and the fact that document-level context remains under-explored in ST. The key contribution is DoCIA, an online framework that decomposes the ST pipeline into four stages and incorporates document-level context into the ASR-refinement, machine translation (MT), and MT-refinement stages via auxiliary large language model (LLM) modules. DoCIA exploits document-level information at multiple levels while keeping computational overhead low, and introduces a simple yet effective determination mechanism that prevents hallucinations from excessive refinement, ensuring reliable final results. Experiments show that DoCIA significantly outperforms conventional ST baselines on both sentence- and discourse-level metrics across four LLMs.

Link: https://arxiv.org/abs/2504.05122
Authors: Xinglin Lyu, Wei Tang, Yuang Li, Xiaofeng Zhao, Ming Zhu, Junhui Li, Yunfei Lu, Min Zhang, Daimeng Wei, Hao Yang, Min Zhang
Affiliations: Huawei Translation Services Center, Beijing, China; Huawei Consumer Business Group, Beijing, China; School of Computer Science and Technology, Soochow University, Suzhou, China
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Document-level context is crucial for handling discourse challenges in text-to-text document-level machine translation (MT). Despite the increased discourse challenges introduced by noise from automatic speech recognition (ASR), the integration of document-level context in speech translation (ST) remains insufficiently explored. In this paper, we develop DoCIA, an online framework that enhances ST performance by incorporating document-level context. DoCIA decomposes the ST pipeline into four stages. Document-level context is integrated into the ASR refinement, MT, and MT refinement stages through auxiliary LLM (large language model)-based modules. Furthermore, DoCIA leverages document-level information in a multi-level manner while minimizing computational overhead. Additionally, a simple yet effective determination mechanism is introduced to prevent hallucinations from excessive refinement, ensuring the reliability of the final results. Experimental results show that DoCIA significantly outperforms traditional ST baselines in both sentence and discourse metrics across four LLMs, demonstrating its effectiveness in improving ST performance.

[NLP-15] AI for Climate Finance: Agentic Retrieval and Multi-Step Reasoning for Early Warning System Investments

[Quick Read]: This paper addresses the lack of standardized reporting of climate-adaptation financing for Early Warning Systems (EWS) across multilateral development banks (MDBs) and funds. The key contribution is an LLM-based agentic AI system that integrates contextual retrieval, fine-tuning, and multi-step reasoning to extract relevant financial data, classify investments, and ensure compliance with funding guidelines. Applied to a real-world task, tracking EWS investments in the Climate Risk and Early Warning Systems (CREWS) Fund, the agent-based retrieval-augmented generation (RAG) approach significantly outperforms the alternatives, achieving 87% accuracy, 89% precision, and 83% recall.

Link: https://arxiv.org/abs/2504.05104
Authors: Saeid Ario Vaghefi, Aymane Hachcham, Veronica Grasso, Jiska Manicus, Nakiete Msemo, Chiara Colesanti Senni, Markus Leippold
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Tracking financial investments in climate adaptation is a complex and expertise-intensive task, particularly for Early Warning Systems (EWS), which lack standardized financial reporting across multilateral development banks (MDBs) and funds. To address this challenge, we introduce an LLM-based agentic AI system that integrates contextual retrieval, fine-tuning, and multi-step reasoning to extract relevant financial data, classify investments, and ensure compliance with funding guidelines. Our study focuses on a real-world application: tracking EWS investments in the Climate Risk and Early Warning Systems (CREWS) Fund. We analyze 25 MDB project documents and evaluate multiple AI-driven classification methods, including zero-shot and few-shot learning, fine-tuned transformer-based classifiers, chain-of-thought (CoT) prompting, and an agent-based retrieval-augmented generation (RAG) approach. Our results show that the agent-based RAG approach significantly outperforms other methods, achieving 87% accuracy, 89% precision, and 83% recall. Additionally, we contribute a benchmark dataset and expert-annotated corpus, providing a valuable resource for future research in AI-driven financial tracking and climate finance transparency.
zh

[NLP-16] State Tuning: State-based Test-Time Scaling on RWKV-7

【速读】: 本文旨在解决在资源受限条件下提升模型性能的问题,特别是在保持原始RWKV-7架构效率的前提下,通过测试时扩展(test-time scaling)技术增强其表达能力。论文提出了一种名为状态微调(state tuning)的新颖方法,针对RNN-based RWKV-7语言模型设计。解决方案的关键在于三个创新点:首先,开发了一个观察者框架,使较小的模型能够复制并学习RWKV-7模型的状态动态;其次,采用核方法动态增加状态大小,以提高捕捉复杂模式的能力;最后,引入去相关反向传播(Decorrelated Backpropagation, DBP)优化扩增后的状态矩阵,从而改善收敛性和表达性。通过仅调整状态矩阵,证明了较小的模型可以在给定任务上超越更大模型,同时保留了RWKV-7架构的高效性,并利用测试时扩展技术提供更优的结果。

链接: https://arxiv.org/abs/2504.05097
作者: Liu Xiao,Li Zhiyuan,Lin Yueyu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Test-time scaling has emerged as a prominent research direction in machine learning, enabling models to enhance their expressive capabilities during inference. Transformers, renowned for striking a delicate balance between efficiency and expressiveness, have benefited from test-time scaling techniques that leverage an expanding key-value (KV) cache to significantly improve performance. In this paper, we introduce a novel state-based approach to test-time scaling, which we term state tuning, tailored to the RNN-based RWKV-7 model. By exploiting the unique strengths of RWKV-7, our method achieves state-of-the-art performance on the target task without altering the model’s pre-trained weights. Our approach centers on three key innovations. First, we develop an observer framework that allows a smaller model to replicate and learn the state dynamics of the RWKV-7 model. Second, we employ a kernel method to dynamically upscale the state size, enhancing the model’s capacity to capture intricate patterns. Third, we integrate Decorrelated Backpropagation (DBP) to optimize the upscaled state matrix, thereby improving convergence and expressivity. By tuning only the state matrix, we demonstrate that a smaller model can outperform larger models on the given task. This method preserves the efficiency of the original RWKV-7 architecture while harnessing the power of test-time scaling to deliver superior results. Our findings underscore the potential of state tuning as an effective strategy for advancing model performance in resource-constrained settings. Our code is this https URL.
zh
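为直观说明"冻结全部权重、仅优化状态"这一思路,下面给出一个纯 Python 的玩具示意(非论文实现):冻结一个一维循环模型的权重,仅用数值梯度调整其初始状态 h0。其中的权重、输入序列与目标值均为本示例假设,与 RWKV-7 的真实结构和论文中的核方法、DBP 优化无关。

```python
import math

# 玩具示意:冻结循环权重 W、U,仅优化初始状态 h0,
# 使序列末端输出逼近目标值。数值均为假设。
W, U = 0.5, 0.8           # 冻结的循环权重(始终不更新)
xs = [1.0, -0.5, 0.3]     # 假设的输入序列
target = 0.25             # 假设的目标输出

def run(h0):
    h = h0
    for x in xs:
        h = math.tanh(W * h + U * x)   # 固定的状态转移
    return h

def loss(h0):
    return (run(h0) - target) ** 2

h0, lr, eps = 0.0, 5.0, 1e-5
for _ in range(500):
    grad = (loss(h0 + eps) - loss(h0 - eps)) / (2 * eps)  # 中心差分数值梯度
    h0 -= lr * grad                                        # 只更新状态,不更新权重
print(round(loss(h0), 6))
```

即便模型容量完全固定,仅调整状态也足以在该玩具任务上把损失压到接近零,这与论文"仅微调状态矩阵即可提升任务表现"的直觉一致。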

[NLP-17] The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning

【速读】: 该论文试图解决的问题是如何解释链式思维提示(Chain-of-Thought, CoT)在基于模式的上下文学习(In-Context Learning, ICL)中相对于直接作答方法的性能下降。研究发现,尽管CoT提示被广泛认为能够提升大语言模型(Large Language Models, LLMs)的推理能力,但在多种模型规模和基准任务复杂度下,CoT及其变体的表现始终逊于直接作答方法。为了解决这一矛盾现象,论文通过系统性实验验证了几个假设性解释,并揭示了CoT性能背后的关键因素在于显式推理与隐式推理之间的二元对立。具体而言,虽然显式推理因LLMs难以从示例中推断潜在模式而表现不佳,但隐式推理机制尽管受到CoT理由增加的上下文距离干扰,仍能在一定程度上弥补缺陷并给出正确答案。然而,显式推理中的弱信号噪声会削弱整体过程的有效性,即便隐式机制部分缓解了结果偏差。关键解决方案在于理解这种显式-隐式二元性的本质,从而指导未来研究探索更精细且有效的推理方法以优化LLMs的表现。

链接: https://arxiv.org/abs/2504.05081
作者: Tianshi Zheng,Yixiang Chen,Chengxi Li,Chunyang Li,Qing Zong,Haochen Shi,Baixuan Xu,Yangqiu Song,Ginny Y. Wong,Simon See
机构: 未知
类目: Computation and Language (cs.CL)
备注: 30 pages, 12 tables, 6 figures

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs) through the generation of explicit explanatory rationales. However, our study reveals a surprising contradiction to this prevailing perspective. Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based in-context learning (ICL) datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental explicit-implicit duality driving CoT’s performance in pattern-based ICL: while explicit reasoning falters due to LLMs’ struggles to infer underlying patterns from demonstrations, implicit reasoning-disrupted by the increased contextual distance of CoT rationales-often compensates, delivering correct answers despite flawed rationales. This duality explains CoT’s relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.
zh

[NLP-18] On the Performance of an Explainable Language Model on PubMedQA

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在医学领域应用中存在的不可解释性(non-interpretable)、幻觉生成(hallucination)、难以维护以及对训练和推理计算资源需求巨大的问题。论文提出了解决方案的关键在于开发一种基于替代架构的可解释语言模型——Gyan。Gyan是一种组合式语言模型(compositional language model),其知识与模型本身解耦(decoupled from knowledge),具备可信(trustable)、透明(transparent)、无幻觉生成且无需大量训练资源的特点,并能够轻松跨领域迁移。通过这些特性,Gyan在PubmedQA数据集上的表现达到了87.1%的准确率,超越了基于GPT-4的MedPrompt(82%)和Med-PaLM 2(81.8%)。

链接: https://arxiv.org/abs/2504.05074
作者: Venkat Srinivasan,Vishaal Jatav,Anushka Chandrababu,Geetika Sharma
机构: 未知
类目: Computation and Language (cs.CL)
备注: Working Paper

点击查看摘要

Abstract:Large language models (LLMs) have shown significant abilities in retrieving medical knowledge, reasoning over it and answering medical questions comparably to physicians. However, these models are not interpretable, hallucinate, are difficult to maintain and require enormous compute resources for training and inference. In this paper, we report results from Gyan, an explainable language model based on an alternative architecture, on the PubmedQA data set. The Gyan LLM is a compositional language model and the model is decoupled from knowledge. Gyan is trustable, transparent, does not hallucinate and does not require significant training or compute resources. Gyan is easily transferable across domains. Gyan-4.3 achieves SOTA results on PubmedQA with 87.1% accuracy compared to 82% by MedPrompt based on GPT-4 and 81.8% by Med-PaLM 2 (Google and DeepMind). We will be reporting results for other medical data sets - MedQA, MedMCQA, MMLU - Medicine in the future.
zh

[NLP-19] Not All Data Are Unlearned Equally

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中知识遗忘(unlearning)的问题,特别是针对从模型中移除特定数据点所学习到的知识,尤其是在隐私保护场景下移除关于命名实体的知识。论文指出,现有方法大多假设所有待遗忘的知识同等重要,即遗忘“蒙特利尔是加拿大的一座城市”与遗忘本文第一作者的电话号码被视为相同难度。然而,研究发现这一假设在LLMs的知识遗忘任务中并不成立。

论文的关键在于揭示遗忘成功与否依赖于待遗忘知识在预训练数据中的频率,并发现高频知识更难被有效遗忘。此外,研究还发现概率评估与基于生成的评估之间存在不一致现象,且这种不匹配随着模型规模增大而加剧。因此,论文强调需要改进现有的评估方法并提出新的LLMs知识遗忘技术,这些技术应充分考虑模型的训练数据特性。

链接: https://arxiv.org/abs/2504.05058
作者: Aravind Krishnan,Siva Reddy,Marius Mosbach
机构: Mila – Quebec AI Institute (魁北克人工智能研究所), McGill University (麦吉尔大学); Saarland University (萨尔兰大学); Canada CIFAR AI Chair (加拿大 CIFAR 人工智能主席)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Machine unlearning is concerned with the task of removing knowledge learned from particular data points from a trained model. In the context of large language models (LLMs), unlearning has recently received increased attention, particularly for removing knowledge about named entities from models for privacy purposes. While various approaches have been proposed to address the unlearning problem, most existing approaches treat all data points to be unlearned equally, i.e., unlearning that Montreal is a city in Canada is treated exactly the same as unlearning the phone number of the first author of this paper. In this work, we show that this all data is equal assumption does not hold for LLM unlearning. We study how the success of unlearning depends on the frequency of the knowledge we want to unlearn in the pre-training data of a model and find that frequency strongly affects unlearning, i.e., more frequent knowledge is harder to unlearn. Additionally, we uncover a misalignment between probability and generation-based evaluations of unlearning and show that this problem worsens as models become larger. Overall, our experiments highlight the need for better evaluation practices and novel methods for LLM unlearning that take the training data of models into account.
zh

[NLP-20] Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在指令微调和偏好学习后仍无法深度契合人类价值观的问题。论文指出,预训练过程中嵌入的有害知识会以“暗模式”形式持久存在于LLMs的参数记忆中,逃避对齐机制的保护,并在分布偏移下通过对抗诱导重新浮现。解决方案的关键在于理论分析与实证验证相结合:首先,论文从理论上证明当前对齐方法仅能在知识流形中形成局部“安全区域”,而预训练知识通过高似然对抗轨迹在全球范围内与有害概念保持连接;其次,通过分布偏移下的语义连贯性诱导方法,设计优化的对抗提示,系统性地绕过对齐约束,从而在23个最先进的对齐LLMs中实现19个模型100%的攻击成功率,揭示其普遍存在的脆弱性。

链接: https://arxiv.org/abs/2504.05050
作者: Jiawei Lian,Jianhong Pan,Lefan Wang,Yi Wang,Shaohui Mei,Lap-Pui Chau
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are foundational explorations to artificial general intelligence, yet their alignment with human values via instruction tuning and preference learning achieves only superficial compliance. Here, we demonstrate that harmful knowledge embedded during pretraining persists as indelible “dark patterns” in LLMs’ parametric memory, evading alignment safeguards and resurfacing under adversarial inducement at distributional shifts. In this study, we first theoretically analyze the intrinsic ethical vulnerability of aligned LLMs by proving that current alignment methods yield only local “safety regions” in the knowledge manifold. In contrast, pretrained knowledge remains globally connected to harmful concepts via high-likelihood adversarial trajectories. Building on this theoretical insight, we empirically validate our findings by employing semantic coherence inducement under distributional shifts–a method that systematically bypasses alignment constraints through optimized adversarial prompts. This combined theoretical and empirical approach achieves a 100% attack success rate across 19 out of 23 state-of-the-art aligned LLMs, including DeepSeek-R1 and LLaMA-3, revealing their universal vulnerabilities.
zh

[NLP-21] Batch Aggregation: An Approach to Enhance Text Classification with Correlated Augmented Data

【速读】: 该论文旨在解决自然语言处理模型在领域特定任务(如临床试验)中因标注数据有限而导致的分类性能不足问题。传统文本增强方法通过生成人工数据扩充样本量,但忽视了增强文本之间的相关性,可能导致分类错误。为解决此问题,论文提出了一种名为“批量聚合”(Batch Aggregation, BAGG)的新方法,其关键是通过引入额外的聚合层显式建模增强文本之间的依赖关系,从而提升分类准确性。研究结果表明,BAGG在领域特定数据集上的性能提升尤为显著,最高可达10%-29%,并在标注数据受限的情况下展现出更强的鲁棒性和优于传统方法的表现。

链接: https://arxiv.org/abs/2504.05020
作者: Charco Hui,Yalu Wen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Natural language processing models often face challenges due to limited labeled data, especially in domain specific areas, e.g., clinical trials. To overcome this, text augmentation techniques are commonly used to increases sample size by transforming the original input data into artificial ones with the label preserved. However, traditional text classification methods ignores the relationship between augmented texts and treats them as independent samples which may introduce classification error. Therefore, we propose a novel approach called ‘Batch Aggregation’ (BAGG) which explicitly models the dependence of text inputs generated through augmentation by incorporating an additional layer that aggregates results from correlated texts. Through studying multiple benchmark data sets across different domains, we found that BAGG can improve classification accuracy. We also found that the increase of performance with BAGG is more obvious in domain specific data sets, with accuracy improvements of up to 10-29%. Through the analysis of benchmark data, the proposed method addresses limitations of traditional techniques and improves robustness in text classification tasks. Our result demonstrates that BAGG offers more robust results and outperforms traditional approaches when training data is limited.
zh
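"批量聚合"的核心想法——同一原文的多条增强文本先汇总再判决,而非当作独立样本——可以用下面的纯 Python 示意说明(非论文实现,类别得分为人为构造,仅演示聚合层的作用):

```python
from collections import defaultdict

# (原文id, 各类别得分):同一原文的增强文本相关但噪声方向不一
scores = [
    ("doc1", [0.60, 0.40]),
    ("doc1", [0.45, 0.55]),   # 单独判决会被误判为类别 1
    ("doc1", [0.70, 0.30]),
    ("doc2", [0.20, 0.80]),
    ("doc2", [0.30, 0.70]),
]

def aggregate(scores):
    groups = defaultdict(list)
    for doc_id, s in scores:
        groups[doc_id].append(s)
    preds = {}
    for doc_id, ss in groups.items():
        # 按类别维度对组内得分取均值,再做一次判决
        mean = [sum(col) / len(ss) for col in zip(*ss)]
        preds[doc_id] = mean.index(max(mean))
    return preds

print(aggregate(scores))
```

示例中 doc1 的第二条增强文本若独立判决会出错,但按原文分组取均值后整组被正确归为类别 0,对应论文中"显式建模增强文本依赖关系可减少分类错误"的直觉。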

[NLP-22] Mixture-of-Personas Language Models for Population Simulation

【速读】: 该论文试图解决预训练大语言模型(Pretrained Large Language Models, LLMs)在模拟目标人群行为多样性时存在的不足,即由于个体与群体间的固有变异性,LLMs难以充分捕捉目标人群的行为多样性。为了解决这一问题,论文提出了一种名为“Mixture of Personas (MoP)”的概率提示方法,其关键是通过一种上下文混合模型实现对LLM响应的对齐与多样化。MoP中的每个组件由一个带有角色(persona)和示例(exemplar)表征子群体行为的语言模型代理组成,角色和示例根据学习到的混合权重随机选择,从而在模拟过程中诱发多样化的LLM响应。该方法无需对模型进行微调,且具有灵活性和跨基础模型的可迁移性。实验表明,MoP在合成数据生成任务中优于竞争方法,在对齐和多样性指标上表现更优。

链接: https://arxiv.org/abs/2504.05019
作者: Ngoc Bui,Hieu Trung Nguyen,Shantanu Kumar,Julian Theodore,Weikang Qiu,Viet Anh Nguyen,Rex Ying
机构: Yale University (耶鲁大学); The Chinese University of Hong Kong (香港中文大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Advances in Large Language Models (LLMs) paved the way for their emerging applications in various domains, such as human behavior simulations, where LLMs could augment human-generated data in social science research and machine learning model training. However, pretrained LLMs often fail to capture the behavioral diversity of target populations due to the inherent variability across individuals and groups. To address this, we propose \textitMixture of Personas (MoP), a \textitprobabilistic prompting method that aligns the LLM responses with the target population. MoP is a contextual mixture model, where each component is an LM agent characterized by a persona and an exemplar representing subpopulation behaviors. The persona and exemplar are randomly chosen according to the learned mixing weights to elicit diverse LLM responses during simulation. MoP is flexible, requires no model finetuning, and is transferable across base models. Experiments for synthetic data generation show that MoP outperforms competing methods in alignment and diversity metrics.
zh
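"按混合权重随机抽取 persona 与示例再拼入提示"这一机制可以用如下纯 Python 草图表达(非论文实现;persona、示例文本与权重均为本示例假设,真实系统中权重是学习得到的):

```python
import random

# 每个混合组件 = 一个 persona + 一条代表子群体行为的示例 + 混合权重(均为假设)
components = [
    {"persona": "节俭的大学生", "exemplar": "我会先比价再下单。", "weight": 0.5},
    {"persona": "忙碌的上班族", "exemplar": "方便省时最重要。",   "weight": 0.3},
    {"persona": "退休教师",     "exemplar": "我更信任老品牌。",   "weight": 0.2},
]

def build_prompt(question, rng=random):
    # 按混合权重随机抽取一个组件,拼出带角色与示例的提示
    c = rng.choices(components, weights=[x["weight"] for x in components])[0]
    return (f"你是{c['persona']}。示例回答:{c['exemplar']}\n"
            f"问题:{question}\n请以该身份作答。")

random.seed(0)
prompts = {build_prompt("你会如何挑选一部新手机?") for _ in range(50)}
print(len(prompts))
```

重复采样会在不同 persona 之间切换,从而在不微调基础模型的情况下诱发多样化的回复分布。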

[NLP-23] Surveying Professional Writers on AI: Limitations Expectations and Fears

【速读】: 该论文试图探讨大型语言模型(Large Language Models, LLMs)在专业写作中的应用及其带来的挑战,重点关注非英语使用者的语言支持、伦理问题以及其对作者声音和创造力的长期影响等尚未充分探索的关键方面。研究通过问卷调查(N=301)和交互式调研(N=36),针对经常使用AI工具的专业作家,考察了多语言环境下LLMs辅助写作的实践、伦理关切及用户期望。研究的关键在于揭示LLMs对非英语使用者的重要性、误信息传播的程度、领域与风格适应性、可用性及关键功能,这些发现可为LLMs的进一步发展提供指导,从而惠及作家群体及更广泛的用户群体。

链接: https://arxiv.org/abs/2504.05008
作者: Anastasiia Ivanova,Natalia Fedorova,Sergey Tilga,Ekaterina Artemova
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The rapid development of AI-driven tools, particularly large language models (LLMs), is reshaping professional writing. Still, key aspects of their adoption such as language support, ethics, and long-term impact on writers’ voice and creativity remain underexplored. In this work, we conducted a questionnaire (N = 301) and an interactive survey (N = 36) targeting professional writers regularly using AI. We examined LLM-assisted writing practices across 25+ languages, ethical concerns, and user expectations. The findings of the survey demonstrate important insights, reflecting upon the importance of: LLM adoption for non-English speakers; the degree of misinformation, domain and style adaptation; usability and key features of LLMs. These insights can guide further development, benefiting both writers and a broader user base.
zh

[NLP-24] Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中未被充分理解的价值机制问题,特别是其行为驱动的社会价值观编码机制。当前研究主要通过外部响应评估LLMs中的价值,侧重于AI安全性,但缺乏可解释性且无法有效评估现实世界中的社会价值。为了解决这一问题,论文提出了一种名为ValueExploration的新框架,专注于在神经元层面探索LLMs内部的社会价值观行为驱动机制。方案的关键在于构建了一个大规模双语基准C-voice,用于识别和评估LLMs中的中文社会价值观,并通过激活差异定位负责编码这些价值观的神经元。进一步通过神经元失活分析模型行为的变化,揭示价值观如何影响LLMs的决策机制。实验验证了该框架的有效性,相关基准和代码将公开。

链接: https://arxiv.org/abs/2504.04994
作者: Ling Hu,Yuemei Xu,Xiaoyang Gu,Letao Han
机构: School of Information Science and Technology (信息科学与技术学院), Beijing Foreign Studies University (北京外国语大学), Beijing, China (中国)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the impressive performance of large language models (LLMs), they can present unintended biases and harmful behaviors driven by encoded values, emphasizing the urgent need to understand the value mechanisms behind them. However, current research primarily evaluates these values through external responses with a focus on AI safety, lacking interpretability and failing to assess social values in real-world contexts. In this paper, we propose a novel framework called ValueExploration, which aims to explore the behavior-driven mechanisms of National Social Values within LLMs at the neuron level. As a case study, we focus on Chinese Social Values and first construct C-voice, a large-scale bilingual benchmark for identifying and evaluating Chinese Social Values in LLMs. By leveraging C-voice, we then identify and locate the neurons responsible for encoding these values according to activation difference. Finally, by deactivating these neurons, we analyze shifts in model behavior, uncovering the internal mechanism by which values influence LLM decision-making. Extensive experiments on four representative LLMs validate the efficacy of our framework. The benchmark and code will be available.
zh

[NLP-25] A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models

【速读】: 本文旨在解决大型语言模型(Large Language Models, LLMs)中的“越狱”(jailbreak)漏洞问题,这是一种绕过对齐机制防护导致不安全输出的攻击方式,从而威胁到LLMs的完整性。论文的关键在于提出了一种基于LLMs训练领域的新型“越狱”攻击分类法,通过分析模型在泛化能力、目标设定及鲁棒性方面的对齐失败,将“越狱”行为归因于训练过程中不同语言领域的影响。这一视角揭示了现有方法的局限性,并基于模型潜在缺陷重新定义了“越狱”攻击的分类标准。与传统按提示构造方法(如模板提示)划分攻击的方式不同,本文提供的是一种更深层次理解LLMs行为的方法,其提出的四类分类法——不匹配的泛化、竞争的目标、对抗鲁棒性以及混合攻击——深入剖析了“越狱”漏洞的本质特征。最终,论文总结了从这一分类研究中得出的重要经验。

链接: https://arxiv.org/abs/2504.04976
作者: Carlos Peláez-González,Andrés Herrera-Poyatos,Cristina Zuheros,David Herrera-Poyatos,Virilo Tejedor,Francisco Herrera
机构: Department of Computer Science and Artificial Intelligence, Andalusian Institute of Data Science and Computational Intelligence (DaSCI), University of Granada (格拉纳达大学), Spain.
类目: Computation and Language (cs.CL)
备注: 21 pages, 5 figures

点击查看摘要

Abstract:The study of large language models (LLMs) is a key area in open-world machine learning. Although LLMs demonstrate remarkable natural language processing capabilities, they also face several challenges, including consistency issues, hallucinations, and jailbreak vulnerabilities. Jailbreaking refers to the crafting of prompts that bypass alignment safeguards, leading to unsafe outputs that compromise the integrity of LLMs. This work specifically focuses on the challenge of jailbreak vulnerabilities and introduces a novel taxonomy of jailbreak attacks grounded in the training domains of LLMs. It characterizes alignment failures through generalization, objectives, and robustness gaps. Our primary contribution is a perspective on jailbreak, framed through the different linguistic domains that emerge during LLM training and alignment. This viewpoint highlights the limitations of existing approaches and enables us to classify jailbreak attacks on the basis of the underlying model deficiencies they exploit. Unlike conventional classifications that categorize attacks based on prompt construction methods (e.g., prompt templating), our approach provides a deeper understanding of LLM behavior. We introduce a taxonomy with four categories – mismatched generalization, competing objectives, adversarial robustness, and mixed attacks – offering insights into the fundamental nature of jailbreak vulnerabilities. Finally, we present key lessons derived from this taxonomic study.
zh

[NLP-26] Towards Visual Text Grounding of Multimodal Large Language Model

【速读】: 该论文试图解决多模态大型语言模型(Multimodal Large Language Models, MLLMs)在视觉文本定位(visual text grounding)方面存在的显著挑战,尤其是在处理文档类文本丰富图像(如扫描表格和信息图表)时。当前基准测试主要集中于自然图像上的视觉定位任务,而忽视了文本密集型文档图像中的复杂布局和文本内容所带来的独特挑战。为填补这一空白,论文引入了TRIG任务,这是一个带有新设计指令数据集的任务,用于评估和提升MLLMs在文档问答中的文本丰富图像定位能力。

解决方案的关键在于提出了一种OCR-LLM-人类交互管道,用于创建包含800个人工标注的问题-答案对的基准数据集以及基于四个多样化数据集的90个大规模合成训练样本。此外,还提出了两种简单有效的TRIG方法:一种基于通用指令微调,另一种采用即插即用高效嵌入技术。通过在合成数据集上对MLLMs进行微调,这些方法显著提升了模型的空间推理和定位能力。

链接: https://arxiv.org/abs/2504.04974
作者: Ming Li,Ruiyi Zhang,Jian Chen,Jiuxiang Gu,Yufan Zhou,Franck Dernoncourt,Wanrong Zhu,Tianyi Zhou,Tong Sun
机构: Adobe Research (Adobe 研究院); University of Maryland (马里兰大学); University at Buffalo (布法罗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite the existing evolution of Multimodal Large Language Models (MLLMs), a non-neglectable limitation remains in their struggle with visual text grounding, especially in text-rich images of documents. Document images, such as scanned forms and infographics, highlight critical challenges due to their complex layouts and textual content. However, current benchmarks do not fully address these challenges, as they mostly focus on visual grounding on natural images, rather than text-rich document images. Thus, to bridge this gap, we introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs in document question-answering. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90 synthetic data based on four diverse datasets. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images. In addition, we propose two simple and effective TRIG methods based on general instruction tuning and plug-and-play efficient embedding, respectively. By finetuning MLLMs on our synthetic dataset, they promisingly improve spatial reasoning and grounding capabilities.
zh

[NLP-27] Few Dimensions are Enough: Fine-tuning BERT with Selected Dimensions Revealed Its Redundant Nature

【速读】: 该论文试图解决在微调 BERT 模型时如何选择最终层输出以及各维度所含信息不明确的问题。论文通过在 GLUE 任务上对 token 向量、层和维度的有效性和冗余性进行全面研究,发现最终层中除 CLS 向量外的其他输出包含等效信息,大多数任务仅需 2-3 个维度,且较低层的贡献减少而高层间差异较小。关键解决方案在于评估不同层和维度的重要性,并验证冻结预训练层及跨任务微调的效果,揭示隐藏层在微调中的显著变化、BERT 的冗余性及其潜在的维度冗余问题。

链接: https://arxiv.org/abs/2504.04966
作者: Shion Fukuhata,Yoshinobu Kano
机构: Shizuoka University (静冈大学)
类目: Computation and Language (cs.CL)
备注: 11 pages, 2 figures

点击查看摘要

Abstract:When fine-tuning BERT models for specific tasks, it is common to select part of the final layer’s output and input it into a newly created fully connected layer. However, it remains unclear which part of the final layer should be selected and what information each dimension of the layers holds. In this study, we comprehensively investigated the effectiveness and redundancy of token vectors, layers, and dimensions through BERT fine-tuning on GLUE tasks. The results showed that outputs other than the CLS vector in the final layer contain equivalent information, most tasks require only 2-3 dimensions, and while the contribution of lower layers decreases, there is little difference among higher layers. We also evaluated the impact of freezing pre-trained layers and conducted cross-fine-tuning, where fine-tuning is applied sequentially to different tasks. The findings suggest that hidden layers may change significantly during fine-tuning, BERT has considerable redundancy, enabling it to handle multiple tasks simultaneously, and its number of dimensions may be excessive.
zh
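"最终层只需 2-3 个维度即可完成多数任务"这一发现可以用一个纯 Python 玩具实验来体会(非论文实验,不涉及真实 BERT):构造 768 维"最终层向量",让类别信息只集中在其中两个维度,比较"全部维度"与"仅这两维"的分类准确率。维度编号与数据均为假设。

```python
import random

random.seed(0)
D, INFO = 768, (10, 42)   # 假设类别信息集中在第 10、42 维

def vec(label):
    v = [random.gauss(0, 1) for _ in range(D)]
    for d in INFO:
        v[d] += 3.0 if label == 1 else -3.0
    return v

data = [(vec(y), y) for y in ([0] * 50 + [1] * 50)]

def accuracy(dims):
    # 质心分类器:只看选定维度,离哪个类质心近判为哪类
    cents = {}
    for y in (0, 1):
        pts = [[v[d] for d in dims] for v, yy in data if yy == y]
        cents[y] = [sum(c) / len(c) for c in zip(*pts)]
    def dist(v, c):
        return sum((v[d] - cd) ** 2 for d, cd in zip(dims, c))
    hits = sum(min((0, 1), key=lambda k: dist(v, cents[k])) == y
               for v, y in data)
    return hits / len(data)

print(accuracy(list(range(D))), accuracy(list(INFO)))  # 全维度 vs 仅 2 维
```

当判别信息集中在少数维度上时,仅取这几维与使用全部 768 维的准确率相当,这正是论文所述冗余现象的一个最简版本。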

[NLP-28] Constraint Multi-class Positive and Unlabeled Learning for Distantly Supervised Named Entity Recognition

【速读】: 该论文试图解决远程监督命名实体识别(Distantly Supervised Named Entity Recognition, DS-NER)中因外部知识库自动标注数据的不完整性导致的高误报率问题。论文的关键解决方案是提出了一种名为“约束多类正样本与无标签学习”(Constraint Multi-class Positive and Unlabeled Learning, CMPU)的新方法,该方法在多个正样本类别的风险估计器中引入了一个约束因子,表明这种非负约束的风险估计器相较于以往基于传统正-无标签(PU)学习的方法,在有限正样本数据情况下具有更强的抗过拟合能力。此外,论文通过理论分析验证了CMPU方法的有效性,并通过在两个使用不同外部知识源标注的数据集上的实验进一步证明了其在DS-NER任务中的优越性能。

链接: https://arxiv.org/abs/2504.04963
作者: Yuzhe Zhang,Min Cen,Hong Zhang
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注: 28pages, 3 figures. First submitted in Oct. 2023

点击查看摘要

Abstract:Distantly supervised named entity recognition (DS-NER) has been proposed to exploit the automatically labeled training data by external knowledge bases instead of human annotations. However, it tends to suffer from a high false negative rate due to the inherent incompleteness. To address this issue, we present a novel approach called \textbfConstraint \textbfMulti-class \textbfPositive and \textbfUnlabeled Learning (CMPU), which introduces a constraint factor on the risk estimator of multiple positive classes. It suggests that the constraint non-negative risk estimator is more robust against overfitting than previous PU learning methods with limited positive data. Solid theoretical analysis on CMPU is provided to prove the validity of our approach. Extensive experiments on two benchmark datasets that were labeled using diverse external knowledge sources serve to demonstrate the superior performance of CMPU in comparison to existing DS-NER methods.
zh
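作为背景,可以先看经典二分类非负 PU(nnPU 风格)风险估计的最简写法:用 max(0, ·) 截断由无标签数据估出的负类风险,防止其为负而导致过拟合;CMPU 将类似的非负约束推广到多个正类,具体约束因子的形式请见原文。下列损失函数、类先验与打分均为本示例假设,并非论文实现。

```python
import math

PI_P = 0.3                        # 正类先验(假设已知)

def sigmoid_loss(score, label):   # label 取 +1 / -1 的 sigmoid 代理损失
    return 1.0 / (1.0 + math.exp(label * score))

def nn_pu_risk(pos_scores, unl_scores):
    r_p_pos = sum(sigmoid_loss(s, +1) for s in pos_scores) / len(pos_scores)
    r_p_neg = sum(sigmoid_loss(s, -1) for s in pos_scores) / len(pos_scores)
    r_u_neg = sum(sigmoid_loss(s, -1) for s in unl_scores) / len(unl_scores)
    # 无约束项 r_u_neg - PI_P * r_p_neg 可能为负,这里截断为非负
    return PI_P * r_p_pos + max(0.0, r_u_neg - PI_P * r_p_neg)

print(round(nn_pu_risk([2.0, 1.5, 3.0], [-1.0, 0.5, -2.0, 0.1]), 4))
```

当分类器在无标签数据上过拟合时,无约束估计会给出负风险并继续被最小化;非负截断使风险估计保持有界,这也是 CMPU 对多个正类施加约束因子所要达到的效果。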

[NLP-29] M-Prometheus: A Suite of Open Multilingual LLM Judges

【速读】: 该论文试图解决语言模型作为评委(LLM-as-a-judge)在多语言自动评估方法中存在的质量不均衡问题,特别是非英语语言的评估能力不足,这限制了具备更好多语言能力模型的发展。论文的关键解决方案在于引入M-Prometheus,这是一种开放权重的LLM评委集合,参数规模从3B到14B,能够提供多语言输出的直接评估和成对比较反馈。其有效性通过在覆盖超过20种语言的多语言奖励基准和涵盖4种语言对的文学机器翻译评估中的表现得以验证。此外,通过解码时的应用,M-Prometheus显著提升了三种测试语言生成输出的质量。最后,通过对多种因素的系统分析,论文确定了构建有效多语言评委的关键要素,包括基础模型的选择以及在原生多语言反馈数据而非翻译数据上的训练策略,并公开了模型、训练数据集和代码。

链接: https://arxiv.org/abs/2504.04953
作者: José Pombal,Dongkeun Yoon,Patrick Fernandes,Ian Wu,Seungone Kim,Ricardo Rei,Graham Neubig,André F. T. Martins
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The use of language models for automatically evaluating long-form text (LLM-as-a-judge) is becoming increasingly common, yet most LLM judges are optimized exclusively for English, with strategies for enhancing their multilingual evaluation capabilities remaining largely unexplored in the current literature. This has created a disparity in the quality of automatic evaluation methods for non-English languages, ultimately hindering the development of models with better multilingual capabilities. To bridge this gap, we introduce M-Prometheus, a suite of open-weight LLM judges ranging from 3B to 14B parameters that can provide both direct assessment and pairwise comparison feedback on multilingual outputs. M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs. Furthermore, M-Prometheus models can be leveraged at decoding time to significantly improve generated outputs across all 3 tested languages, showcasing their utility for the development of better multilingual models. Lastly, through extensive ablations, we identify the key factors for obtaining an effective multilingual judge, including backbone model selection and training on natively multilingual feedback data instead of translated data. We release our models, training dataset, and code.
zh

[NLP-30] A Llama walks into the Bar: Efficient Supervised Fine-Tuning for Legal Reasoning in the Multi-state Bar Exam

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在法律推理任务中因领域特定知识和推理过程复杂性带来的独特挑战。具体而言,研究关注如何通过有限的数据集(仅1,514道多州律师考试(Multi-state Bar Examination, MBE)问题)微调较小规模的语言模型(如Llama 2 7B和Llama 3 8B),以提升其法律问题回答的准确性。论文的关键解决方案在于采用一种基于结构化IRAC(Issue, Rule, Application, Conclusion)框架的知识蒸馏方法,将原始问题及其解答转化为结构化的推理形式,并进一步通过不同样本量的有监督微调(Supervised Fine-Tuning, SFT)策略优化模型性能。此外,研究还分析了选项选择偏差及其缓解措施,并探讨了提示类型、答案排序、响应格式及解码温度等变量对模型表现的影响。最终结果表明,即使在资源受限的情况下,领域特定的SFT仍能使某些模型配置接近人类基线水平。

链接: https://arxiv.org/abs/2504.04945
作者: Rean Fernandes,André Biedenkapp,Frank Hutter,Noor Awad
机构: Albert Ludwigs University Freiburg (阿尔伯特-路德维希斯弗莱堡大学); ELLIS Institute Tübingen (ELLIS图宾根研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: COLM 2025 preprint, 9 pages, 3 figures, 16 appendix pages

点击查看摘要

Abstract:Legal reasoning tasks present unique challenges for large language models (LLMs) due to the complexity of domain-specific knowledge and reasoning processes. This paper investigates how effectively smaller language models (Llama 2 7B and Llama 3 8B) can be fine-tuned with a limited dataset of 1,514 Multi-state Bar Examination (MBE) questions to improve legal question answering accuracy. We evaluate these models on the 2022 MBE questions licensed from JD Advising, the same dataset used in the ‘GPT-4 passes the Bar exam’ study. Our methodology involves collecting approximately 200 questions per legal domain across 7 domains. We distill the dataset using Llama 3 (70B) to transform explanations into a structured IRAC (Issue, Rule, Application, Conclusion) format as a guided reasoning process to see if it results in better performance over the non-distilled dataset. We compare the non-fine-tuned models against their supervised fine-tuned (SFT) counterparts, trained for different sample sizes per domain, to study the effect on accuracy and prompt adherence. We also analyse option selection biases and their mitigation following SFT. In addition, we consolidate the performance across multiple variables: prompt type (few-shot vs zero-shot), answer ordering (chosen-option first vs generated-explanation first), response format (Numbered list vs Markdown vs JSON), and different decoding temperatures. Our findings show that domain-specific SFT helps some model configurations achieve close to human baseline performance, despite limited computational resources and a relatively small dataset. We release both the gathered SFT dataset and the family of Supervised Fine-tuned (SFT) adapters optimised for MBE performance. This establishes a practical lower bound on resources needed towards achieving effective legal question answering in smaller LLMs.
zh

[NLP-31] How Is Generative AI Used for Persona Development?: A Systematic Review of 52 Research Articles

【速读】: 该论文旨在解决生成式 AI 在人物画像(Personas)开发中的挑战与不足。关键在于通过系统性分析现有研究,揭示当前在数据收集、分割、丰富化及评估等各阶段使用生成式 AI 的现状与局限性,特别是强调商业化闭源模型导致的文化单一性、评估方法的不足以及人机协作模式的欠缺。论文指出,为充分发挥生成式 AI 人物画像的潜力,需加强学术界与产业界的协同努力,并提出具体的研究方向以指导未来工作。

链接: https://arxiv.org/abs/2504.04927
作者: Danial Amin,Joni Salminen,Farhan Ahmed,Sonja M.H. Tervola,Sankalp Sethi,Bernard J. Jansen
机构: University of Vaasa(Vaasa, Finland); Aalto University(Espoo, Finland); University of Arizona(Tucson, USA); Qatar Computing Research Institute, Hamad Bin Khalifa University(Doha, Qatar)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Although Generative AI (GenAI) has the potential for persona development, many challenges must be addressed. This research systematically reviews 52 articles from 2022-2024, with important findings. First, closed commercial models are frequently used in persona development, creating a monoculture. Second, GenAI is used in various stages of persona development (data collection, segmentation, enrichment, and evaluation). Third, similar to other quantitative persona development techniques, there are major gaps in persona evaluation for AI-generated personas. Fourth, human-AI collaboration models are underdeveloped, despite human oversight being crucial for maintaining ethical standards. These findings imply that realizing the full potential of AI-generated personas will require substantial efforts across academia and industry. To that end, we provide a list of research avenues to inspire future work.
zh

[NLP-32] Collab-RAG : Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration

【速读】: 本文旨在解决 Retrieval-Augmented Generation (RAG) 系统在处理多跳问答任务时因无关上下文检索和复杂推理能力有限而导致的准确性不足问题。论文提出了一种名为 Collab-RAG 的协作训练框架,其关键是通过白盒小语言模型(SLM)与黑盒大语言模型(LLM)之间的相互增强来提升 RAG 的性能。具体而言,SLM 将复杂查询分解为更简单的子问题,从而提高检索的准确性并促进黑盒 LLM 进行更有效的推理;同时,黑盒 LLM 向 SLM 提供反馈信号以改进其分解能力。Collab-RAG 仅依赖于可负担得起的黑盒 LLM 的监督,而无需额外的前沿 LLM 知识蒸馏,却展现出对多种黑盒 LLM 强大的泛化能力。实验结果表明,Collab-RAG 在五个多跳问答数据集上的平均性能比现有的仅用黑盒模型的基线和 SLM 微调基线高出 1.8%-14.2%,特别是微调后的 3B SLM 在问题分解方面优于冻结的 32B LLM,凸显了 Collab-RAG 在提升复杂问题推理和检索效率方面的有效性。

链接: https://arxiv.org/abs/2504.04915
作者: Ran Xu,Wenqi Shi,Yuchen Zhuang,Yue Yu,Joyce C. Ho,Haoyu Wang,Carl Yang
机构: Department of Computer Science, Emory University (埃默里大学); University of Texas Southwestern Medical Center (德克萨斯大学西南医学中心); College of Computing, Georgia Institute of Technology (乔治亚理工学院); Department of Computer Science, SUNY Albany (纽约州立大学奥尔巴尼分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Work in progress. Code: this https URL

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems often struggle to handle multi-hop question-answering tasks accurately due to irrelevant context retrieval and limited complex reasoning capabilities. We introduce Collab-RAG, a collaborative training framework that leverages mutual enhancement between a white-box small language model (SLM) and a black-box large language model (LLM) for RAG. Specifically, the SLM decomposes complex queries into simpler sub-questions, thus enhancing the accuracy of the retrieval and facilitating more effective reasoning by the black-box LLM. Concurrently, the black-box LLM provides feedback signals to improve the SLM's decomposition capability. We observe that Collab-RAG relies solely on supervision from an affordable black-box LLM without additional distillation from frontier LLMs, yet demonstrates strong generalization across multiple black-box LLMs. Experimental evaluations across five multi-hop QA datasets demonstrate that Collab-RAG substantially outperforms existing black-box-only and SLM fine-tuning baselines by 1.8%-14.2% on average. In particular, our fine-tuned 3B SLM surpasses a frozen 32B LLM in question decomposition, highlighting the efficiency of Collab-RAG in improving reasoning and retrieval for complex questions. The code of Collab-RAG is available on this https URL.
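下面用一个极简的 Python 草图示意 Collab-RAG 的"分解-检索-作答"流程:小模型先把多跳问题拆成子问题,再逐个检索证据。其中 decompose 与 retrieve 均为玩具级占位实现(按 " and " 切分、按词重叠打分),并非论文中经过微调的 SLM 与真实检索器:

```python
# Toy sketch of the Collab-RAG loop: a small "white-box" model decomposes a
# multi-hop query into sub-questions; each sub-question drives retrieval, and a
# black-box reader would answer from the gathered evidence. The decomposer and
# retriever below are trivial stand-ins, not the paper's fine-tuned components.

def decompose(query: str) -> list[str]:
    # stand-in for the fine-tuned SLM: split on " and " as a toy heuristic
    return [q.strip() + "?" for q in query.rstrip("?").split(" and ")]

def retrieve(sub_q: str, corpus: dict[str, str]) -> str:
    # toy retriever: pick the document sharing the most words with the sub-question
    words = set(sub_q.lower().rstrip("?").split())
    return max(corpus.values(), key=lambda d: len(words & set(d.lower().split())))

def gather_evidence(query: str, corpus: dict[str, str]) -> list[tuple[str, str]]:
    # (sub-question, evidence) pairs that a black-box LLM would read to answer
    return [(sq, retrieve(sq, corpus)) for sq in decompose(query)]

corpus = {"d1": "Paris is the capital of France",
          "d2": "The Seine river flows through Paris"}
steps = gather_evidence("What is the capital of France and which river flows through it?",
                        corpus)
```

实际系统中,子问题分解由经黑盒 LLM 反馈信号微调的 3B SLM 完成,黑盒 LLM 再基于汇集的证据进行推理作答。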
zh

[NLP-33] Leveraging Large Language Models for Cost-Effective Multilingual Depression Detection and Severity Assessment

【速读】: 该论文旨在解决抑郁症早期检测困难的问题,主要由于其主观症状评估的局限性。论文通过利用大型语言模型(LLMs)的最新进展,探索更高效且成本效益更高的客观检测方法。研究的关键在于评估四种LLMs在临床访谈数据中检测抑郁症的表现,并进一步测试其在严重程度评估和知识增强场景中的性能。最终发现,DeepSeek V3是检测抑郁症最可靠且成本效益最高的模型,在零样本(zero-shot)和少样本(few-shot)场景中均表现优异,尤其零样本方式最为高效。尽管如此,模型在严重程度评估中与人类评估者的一致性较低,尤其是在轻度抑郁症的评估上。此外,模型在复杂诊断场景中对抑郁症的检测仍保持较高的曲线下面积(AUC)。这些结果表明DeepSeek V3在基于文本的抑郁症检测中具有显著的现实临床应用潜力,但同时也强调了进一步改进严重程度评估以及减轻潜在偏见以提升临床可靠性的必要性。

链接: https://arxiv.org/abs/2504.04891
作者: Longdi Xian,Jianzhang Ni,Mingzhu Wang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Depression is a prevalent mental health disorder that is difficult to detect early due to subjective symptom assessments. Recent advancements in large language models have offered efficient and cost-effective approaches for this objective. In this study, we evaluated the performance of four LLMs in depression detection using clinical interview data. We selected the best-performing model and further tested it in the severity evaluation scenario and knowledge-enhanced scenario. The robustness was evaluated in complex diagnostic scenarios using a dataset comprising 51,074 statements from six different mental disorders. We found that DeepSeek V3 is the most reliable and cost-effective model for depression detection, performing well in both zero-shot and few-shot scenarios, with zero-shot being the most efficient choice. The evaluation of severity showed low agreement with the human evaluator, particularly for mild depression. The model maintains consistently high AUCs for detecting depression in complex diagnostic scenarios. These findings highlight DeepSeek V3's strong potential for text-based depression detection in real-world clinical applications. However, they also underscore the need for further refinement in severity assessment and the mitigation of potential biases to enhance clinical reliability.
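摘要中报告的 AUC(曲线下面积)可以直接由正负样本得分的秩统计量计算:它等于随机抽取一对正负样本时,正样本得分高于负样本的概率(并列计半)。下面是一个与论文代码无关的自包含示意:

```python
def auc(pos_scores, neg_scores):
    # AUC as a rank statistic: P(score of a positive case > score of a negative
    # case), counting ties as half a win. O(n*m), fine for illustration.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

perfect = auc([0.9, 0.8], [0.1, 0.2])   # fully separable scores
chance = auc([0.5, 0.5], [0.5, 0.5])    # all tied: no discrimination
```

得分完全可分时 AUC 为 1.0,完全无区分能力时为 0.5,这正是摘要中"保持较高 AUC"这一说法的量化含义。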
zh

[NLP-34] SAFT: Structure-aware Transformers for Textual Interaction Classification

【速读】: 该论文旨在解决文本交互分类(TIC)任务中的两个关键挑战:一是现有方法因采用上下文无关的文本嵌入未能充分捕捉丰富的文本语义;二是忽视了文本交互网络(TINs)中二分图结构与节点异质性,导致分类性能受限。为应对这些问题,论文提出了一种名为SAFT的新架构。其关键在于通过语言模型和图神经网络的结合,有效融合文本和结构语义。具体而言,利用线图注意力机制(LGA)、门控注意单元(GAUs)以及预训练语言模型(PLMs)来建模交互级别和词级别信号,并通过代理词在迭代过程中实现上下文化的关联。此外,还开发了一种高效且理论基础扎实的方法,将局部和全局拓扑信息编码为结构嵌入,从而不仅将TINs的底层结构特征注入到文本交互表示中,还促进了图采样策略的设计。实验证明,SAFT在多个真实数据集上的分类准确性优于当前最先进的基线方法。

链接: https://arxiv.org/abs/2504.04861
作者: Hongtao Wang,Renchi Yang,Hewen Wang,Haoran Zheng,Jianliang Xu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Textual interaction networks (TINs) are an omnipresent data structure used to model the interplay between users and items on e-commerce websites, social networks, etc., where each interaction is associated with a text description. Classifying such textual interactions (TIC) finds extensive use in detecting spam reviews in e-commerce, fraudulent transactions in finance, and so on. Existing TIC solutions either (i) fail to capture the rich text semantics due to the use of context-free text embeddings, and/or (ii) disregard the bipartite structure and node heterogeneity of TINs, leading to compromised TIC performance. In this work, we propose SAFT, a new architecture that integrates language- and graph-based modules for the effective fusion of textual and structural semantics in the representation learning of interactions. In particular, line graph attention (LGA)/gated attention units (GAUs) and pretrained language models (PLMs) are capitalized on to model the interaction-level and token-level signals, which are further coupled via the proxy token in an iterative and contextualized fashion. Additionally, an efficient and theoretically-grounded approach is developed to encode the local and global topology information pertaining to interactions into structural embeddings. The resulting embeddings not only inject the structural features underlying TINs into the textual interaction encoding but also facilitate the design of graph sampling strategies. Extensive empirical evaluations on multiple real TIN datasets demonstrate the superiority of SAFT over the state-of-the-art baselines in TIC accuracy.
zh

[NLP-35] Discovering dynamical laws for speech gestures

【速读】: 该论文旨在解决认知科学中的一个基础挑战:揭示支配行为的动力学原理。具体而言,研究聚焦于语音产生的动力学机制,探索构成语言的小型认知单元如何通过高度可变且复杂的物理运动来实现。论文的关键在于利用稀疏符号回归算法从舌头和嘴唇的运动学数据中发现描述构音动作的符号方程模型,并通过解析技术和数值模拟验证这些模型。研究发现,二阶线性模型在大多数情况下能够提供高精度的拟合,但在约三分之一的情况下需要引入非线性力以更准确地表征构音动力学。这一结果支持了自主、非线性、二阶微分方程作为语音构音手势可行的动力学定律的假设。论文最后讨论了数据驱动模型发现的未来机遇与障碍,并展望了揭示语言、大脑及行为动力学原理的可能性。

链接: https://arxiv.org/abs/2504.04849
作者: Sam Kirkham
机构: 未知
类目: Computation and Language (cs.CL); Adaptation and Self-Organizing Systems (nlin.AO)
备注: Accepted for publication in ‘Cognitive Science’

点击查看摘要

Abstract:A fundamental challenge in the cognitive sciences is discovering the dynamics that govern behaviour. Take the example of spoken language, which is characterised by a highly variable and complex set of physical movements that map onto the small set of cognitive units that comprise language. What are the fundamental dynamical principles behind the movements that structure speech production? In this study, we discover models in the form of symbolic equations that govern articulatory gestures during speech. A sparse symbolic regression algorithm is used to discover models from kinematic data on the tongue and lips. We explore these candidate models using analytical techniques and numerical simulations, and find that a second-order linear model achieves high levels of accuracy, but a nonlinear force is required to properly model articulatory dynamics in approximately one third of cases. This supports the proposal that an autonomous, nonlinear, second-order differential equation is a viable dynamical law for articulatory gestures in speech. We conclude by identifying future opportunities and obstacles in data-driven model discovery and outline prospects for discovering the dynamical principles that govern language, brain and behaviour.
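摘要中的二阶线性模型本质上是朝向发音目标的带阻尼弹簧:x'' = -k·(x - T) - b·x'。下面用欧拉法做一个最小数值模拟,其中系数为示意取值,并非论文中拟合得到的参数:

```python
# Minimal Euler simulation of the second-order linear gesture model
# x'' = -k*(x - target) - b*x': a damped spring pulling the articulator from
# its start position toward the target. Coefficients are illustrative only.
def simulate(x0, target, k=8.0, b=6.0, dt=0.001, steps=5000):
    x, v = x0, 0.0
    for _ in range(steps):
        a = -k * (x - target) - b * v  # restoring force toward target, plus damping
        v += a * dt
        x += v * dt
    return x

final = simulate(x0=0.0, target=1.0)  # articulator settles near the target
```

论文发现约三分之一的情形还需在加速度项中加入非线性力,线性弹簧才不足以刻画其运动学。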
zh

[NLP-36] Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models

【速读】: 该论文试图解决量化推理模型在保持性能的同时降低推理成本的问题。解决方案的关键在于系统性地评估不同量化方法(权重、KV缓存、激活量化)在多种位宽下的效果,并通过实证研究揭示模型规模、模型来源以及任务难度对量化推理模型性能的影响。研究发现,无损量化可通过W8A8或W4A16实现,但更低位宽会带来显著的精度风险。此外,通过对模型规模或推理步骤进行策略性调整,可以有效提升性能。

链接: https://arxiv.org/abs/2504.04823
作者: Ruikang Liu,Yuxuan Sun,Manyi Zhang,Haoli Bai,Xianzhi Yu,Tiezheng Yu,Chun Yuan,Lu Hou
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in reasoning language models have demonstrated remarkable performance in complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. In this study, we conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families ranging from 1.5B to 70B parameters, and QwQ-32B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16 quantization, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling the model sizes or reasoning steps can effectively enhance the performance. All quantized models and codes will be open-sourced in this https URL.
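摘要中的 "W8A8"、"W4A16" 指权重/激活各自的量化位宽。下面给出对称 per-tensor int8 权重量化(即其中的 "W8")的最小示意:按最大绝对值确定缩放因子、取整到 int8、再反量化。真实方法(如 GPTQ、AWQ、SmoothQuant)远比这复杂,此处仅直观展示位宽与重建误差的关系:

```python
import numpy as np

def quantize_int8(w):
    # symmetric per-tensor quantization: one scale for the whole tensor
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale      # dequantize
max_err = float(np.abs(w - w_hat).max())  # bounded by ~scale/2 (rounding)
```

位宽越低,scale 越粗,逐元素误差上界随之增大,这正是摘要中"更低位宽带来显著精度风险"的直观来源。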
zh

[NLP-37] I only read it for the plot! Maturity Ratings Affect Fanfiction Style and Community Engagement

【速读】: 该论文试图研究不同成熟度评级(Maturity Ratings)的同人小说文本特征及其在不同粉丝群体中的变化,并探讨这些特征与读者参与度(Reader Engagement Metrics)之间的关系。论文的关键在于分析特定类型的同人小说(尤其是包含成人主题和明确场景的“explicit”作品)与其它成熟度评级作品相比所呈现的独特文本特征,以此揭示作者和读者动机的差异,并强调粉丝社区规范及行为对这些文化产品的影响。

链接: https://arxiv.org/abs/2504.04782
作者: Mia Jacobsen,Ross Deans Kristensen-McLachlan
机构: Center for Humanities Computing, Aarhus University (奥胡斯大学); Department of Linguistics, Cognitive Science, and Semiotics, Aarhus University (奥胡斯大学)
类目: Computation and Language (cs.CL)
备注: Accepted to the 5th International Conference on Natural Language Processing for Digital Humanities (NLP4DH 2025)

点击查看摘要

Abstract:We consider the textual profiles of different fanfiction maturity ratings, how they vary across fan groups, and how this relates to reader engagement metrics. Previous studies have shown that fanfiction writing is motivated by a combination of admiration for and frustration with the fan object. These findings emerge when looking at fanfiction as a whole, as well as when it is divided into subgroups, also called fandoms. However, maturity ratings are used to indicate the intended audience of the fanfiction, as well as whether the story includes mature themes and explicit scenes. Since these ratings can be used to filter readers and writers, they can also be seen as a proxy for different reader/writer motivations and desires. We find that explicit fanfiction in particular has a distinct textual profile when compared to other maturity ratings. These findings thus nuance our understanding of reader/writer motivations in fanfiction communities, and also highlights the influence of the community norms and fan behavior more generally on these cultural products.
zh

[NLP-38] Improving Multilingual Retrieval-Augmented Language Models through Dialectic Reasoning Argumentations

【速读】: 该论文旨在解决 Retrieval-augmented generation (RAG) 在增强大型语言模型 (LLMs) 访问丰富事实知识时面临的挑战,特别是在多语言检索场景下因异构知识可能引发冲突所带来的问题。论文的关键解决方案是引入 Dialectic-RAG (DRAG),这是一种由论证性解释 (Argumentative Explanations) 引导的模块化方法。DRAG 通过结构化的推理过程,系统性地评估检索到的信息,比较、对比并解决冲突的观点,从而使得 RAG 更具分析性、批判性和基础性。其核心在于从查询和相关文档集中选择并展示相关知识,以提供辩证解释,通过批判性权衡对立论点并过滤无关内容,明确得出最终答案。实验表明,DRAG 不仅提升了 RAG 方法的表现,还降低了计算开销,并增强了对知识扰动的鲁棒性。

链接: https://arxiv.org/abs/2504.04771
作者: Leonardo Ranaldi,Federico Ranaldi,Fabio Massimo Zanzotto,Barry Haddow,Alexandra Birch
机构: School of Informatics, University of Edinburgh (爱丁堡大学), UK; Department of Enterprise Engineering, University of Rome Tor Vergata (罗马大学托尔韦加塔校区), Italy
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is key to enhancing large language models (LLMs) to systematically access richer factual knowledge. Yet, using RAG brings intrinsic challenges, as LLMs must deal with potentially conflicting knowledge, especially in multilingual retrieval, where the heterogeneity of knowledge retrieved may deliver different outlooks. To make RAG more analytical, critical and grounded, we introduce Dialectic-RAG (DRAG), a modular approach guided by Argumentative Explanations, i.e., structured reasoning process that systematically evaluates retrieved information by comparing, contrasting, and resolving conflicting perspectives. Given a query and a set of multilingual related documents, DRAG selects and exemplifies relevant knowledge for delivering dialectic explanations that, by critically weighing opposing arguments and filtering extraneous content, clearly determine the final response. Through a series of in-depth experiments, we show the impact of our framework both as an in-context learning strategy and for constructing demonstrations to instruct smaller models. The final results demonstrate that DRAG significantly improves RAG approaches, requiring low-impact computational effort and providing robustness to knowledge perturbations.
zh

[NLP-39] Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs ACL2025

【速读】: 该论文试图解决的问题是评估大型语言模型(Large Language Models, LLMs)利用结构化语言表示形式中的上下文信息的能力。具体而言,研究关注通过Abstract Meaning Representation (AMR) 结构编码短上下文和长上下文对多种语言任务的影响。论文的关键解决方案在于探索在提示中加入原始语言上下文的AMR是否能够提升LLMs的性能,并发现这种增强对于长上下文任务(如SAMSum数据集中的对话摘要)尤为有效,可显著提高零样本任务的余弦相似度分数。然而,对于短上下文任务,此方法通常会降低模型性能。此外,实验表明这种改进在新推出的更大规模的LLMs中更为明显,而在较旧或较小的模型中则不显著。

链接: https://arxiv.org/abs/2504.04745
作者: Ankush Raut,Xiaofeng Zhu,Maria Leonor Pacheco
机构: University of Colorado Boulder (科罗拉多大学博尔德分校); Northwestern University (西北大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 23 figures. Submitted to XLLM @ ACL 2025

点击查看摘要

Abstract:This paper evaluates the ability of Large Language Models (LLMs) to leverage contextual information in the form of structured linguistic representations. Specifically, we examine the impact of encoding both short and long contexts using Abstract Meaning Representation (AMR) structures across a diverse set of language tasks. We perform our analysis using 8-bit quantized and instruction-tuned versions of Llama 3.1 (8B), Phi-3, and Mistral 7B. Our results indicate that, for tasks involving short contexts, augmenting the prompt with the AMR of the original language context often degrades the performance of the underlying LLM. However, for tasks that involve long contexts, such as dialogue summarization in the SAMSum dataset, this enhancement improves LLM performance, for example, by increasing the zero-shot cosine similarity score of Llama 3.1 from 66.2% to 76%. This improvement is more evident in the newer and larger LLMs, but does not extend to the older or smaller ones. In addition, we observe that LLMs can effectively reconstruct the original text from a linearized AMR, achieving a cosine similarity of 81.3% in the best-case scenario.
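摘要中的"余弦相似度"衡量两个向量方向的一致性,论文用它比较从线性化 AMR 重建的文本与原文的嵌入。下面是一个自包含的纯 Python 示意(真实评测使用的是句向量模型的嵌入,而非此处的玩具向量):

```python
import math

def cosine(u, v):
    # cosine similarity: dot(u, v) / (|u| * |v|), in [-1, 1]
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orth = cosine([1.0, 0.0], [0.0, 1.0])            # orthogonal vectors
```

摘要中 66.2%、76%、81.3% 等数值即此类分数在真实嵌入上的取值(通常以百分比报告)。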
zh

[NLP-40] TathyaNyaya and FactLegalLlama: Advancing Factual Judgment Prediction and Explanation in the Indian Legal Context

【速读】: 该论文旨在解决基于事实的判断预测与解释(Fact-based Judgment Prediction and Explanation, FJPE)领域中,如何利用高质量的事实数据开发稳健且透明的AI驱动决策工具的问题。论文的关键在于提出了TathyaNyaya这一面向印度法律语境的最大标注数据集,并结合FactLegalLlama这一经过指令微调的大型语言模型(Large Language Model, LLM)。通过在TathyaNyaya数据集上的微调,FactLegalLlama实现了预测准确性与上下文相关、连贯解释的有机结合,从而满足AI辅助法律系统对透明性和可解释性的需求。其方法的核心在于将Transformer用于二元判决预测,同时利用FactLegalLlama生成解释,构建了一个适用于印度法律领域的强大框架。这一方案的关键创新在于结合大规模、多样化事实数据与特定领域的模型优化,以提升预测性能和可解释性。

链接: https://arxiv.org/abs/2504.04737
作者: Shubham Kumar Nigam,Balaramamahanthi Deepak Patnaik,Shivam Mishra,Noel Shallum,Kripabandhu Ghosh,Arnab Bhattacharya
机构: IIT Kanpur (印度理工学院坎普尔); IISER Kolkata (印度科学教育与研究学院加尔各答); Symbiosis Law School Pune (浦那共生法学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In the landscape of Fact-based Judgment Prediction and Explanation (FJPE), reliance on factual data is essential for developing robust and realistic AI-driven decision-making tools. This paper introduces TathyaNyaya, the largest annotated dataset for FJPE tailored to the Indian legal context, encompassing judgments from the Supreme Court of India and various High Courts. Derived from the Hindi terms “Tathya” (fact) and “Nyaya” (justice), the TathyaNyaya dataset is uniquely designed to focus on factual statements rather than complete legal texts, reflecting real-world judicial processes where factual data drives outcomes. Complementing this dataset, we present FactLegalLlama, an instruction-tuned variant of the LLaMa-3-8B Large Language Model (LLM), optimized for generating high-quality explanations in FJPE tasks. Finetuned on the factual data in TathyaNyaya, FactLegalLlama integrates predictive accuracy with coherent, contextually relevant explanations, addressing the critical need for transparency and interpretability in AI-assisted legal systems. Our methodology combines transformers for binary judgment prediction with FactLegalLlama for explanation generation, creating a robust framework for advancing FJPE in the Indian legal domain. TathyaNyaya not only surpasses existing datasets in scale and diversity but also establishes a benchmark for building explainable AI systems in legal analysis. The findings underscore the importance of factual precision and domain-specific tuning in enhancing predictive performance and interpretability, positioning TathyaNyaya and FactLegalLlama as foundational resources for AI-assisted legal decision-making.
zh

[NLP-41] Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use

【速读】: 该论文试图解决传统强化学习方法在处理多步优化场景时的局限性问题,这些方法通常将任务视为单步问题,而忽视了复杂推理和具身任务中需要多步文本生成、推理及环境交互的需求。论文提出了一种名为Step-Wise Reinforcement Learning (SWiRL) 的方法,其关键是通过合成数据生成和逐步分解机制,将每个多步轨迹拆解为多个子轨迹,分别对应原始模型的每个动作,并对这些子轨迹进行合成数据过滤和强化学习优化。这种方法能够有效提升多步工具使用、问答以及数学推理等任务的表现,在GSM8K、HotPotQA、CofCA、MuSiQue和BeerQA等任务上相对准确率分别提升了21.5%、12.3%、14.8%、11.1%和15.3%,同时展现出跨任务的泛化能力。

链接: https://arxiv.org/abs/2504.04736
作者: Anna Goldie,Azalia Mirhoseini,Hao Zhou,Irene Cai,Christopher D. Manning
机构: Department of Computer Science, Stanford University (斯坦福大学); Google DeepMind (谷歌深思维)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement learning has been shown to improve the performance of large language models. However, traditional approaches like RLHF or RLAIF treat the problem as single-step. As focus shifts toward more complex reasoning and agentic tasks, language models must take multiple steps of text generation, reasoning and environment interaction before generating a solution. We propose a synthetic data generation and RL methodology targeting multi-step optimization scenarios. This approach, called Step-Wise Reinforcement Learning (SWiRL), iteratively generates multi-step reasoning and tool use data, and then learns from that data. It employs a simple step-wise decomposition that breaks each multi-step trajectory into multiple sub-trajectories corresponding to each action by the original model. It then applies synthetic data filtering and RL optimization on these sub-trajectories. We evaluated SWiRL on a number of multi-step tool use, question answering, and mathematical reasoning tasks. Our experiments show that SWiRL outperforms baseline approaches by 21.5%, 12.3%, 14.8%, 11.1%, and 15.3% in relative accuracy on GSM8K, HotPotQA, CofCA, MuSiQue, and BeerQA, respectively. Excitingly, the approach exhibits generalization across tasks: for example, training only on HotPotQA (text question-answering) improves zero-shot performance on GSM8K (a math dataset) by a relative 16.9%.
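SWiRL 的"逐步分解"可以用几行代码刻画:一条多步轨迹被拆成若干子轨迹,每条子轨迹由"截至当前步的上下文前缀"与"该步动作"组成。以下为玩具示例,轨迹内容是假设数据,并非论文的原始格式:

```python
# Step-wise decomposition: each (observation, action) step of a multi-step
# trajectory yields one sub-trajectory pairing the context prefix with the action.
def decompose_trajectory(trajectory):
    subs, context = [], []
    for obs, action in trajectory:
        context.append(obs)
        subs.append((tuple(context), action))  # prefix context -> this action
    return subs

traj = [("q: what is 2 + 3 * 4?", "call_calculator(3 * 4)"),
        ("tool: 12", "call_calculator(2 + 12)"),
        ("tool: 14", "answer(14)")]
subs = decompose_trajectory(traj)  # 3 sub-trajectories, one per action
```

随后论文对这些子轨迹做合成数据过滤,并在其上进行强化学习优化。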
zh

[NLP-42] T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models

【速读】: 该论文旨在解决小语言模型(small language models, sLMs)在测试时扩展计算资源情况下,能否可靠地进行自我验证的问题。尽管先前研究表明通过引入更大模型作为验证器可以有效提升性能,但自我验证的能力尚未被充分探索。论文的关键在于提出了一种名为Tool-integrated self-verification (T1) 的方法,它通过将依赖大量记忆的任务(如数值计算和事实核查)外包给外部工具(例如代码解释器)来减轻sLMs的记忆负担。理论分析表明,这种工具集成不仅减少了模型的记忆需求,还提升了测试时扩展的性能表现。实验结果进一步证明,在MATH基准测试中,采用T1后的Llama-3.2 1B模型在测试时扩展条件下优于显著更大的Llama-3.1 8B模型,并且该方法在数学及多领域知识密集型任务上均展现出良好的泛化能力。因此,论文强调了工具集成对于大幅提升小语言模型自我验证能力的巨大潜力。

链接: https://arxiv.org/abs/2504.04718
作者: Minki Kang,Jongwon Jeong,Jaewoong Cho
机构: KRAFTON( Krafton ); KAIST( 韩国科学技术院 )
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving self-verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably self-verify their outputs under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter. Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 generalizes effectively to both mathematical (MATH500) and multi-domain knowledge-intensive tasks (MMLU-Pro). Our findings highlight the potential of tool integration to substantially improve the self-verification abilities of sLMs.
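T1 的核心是把"重记忆"的验证步骤交给外部工具。下面的草图示意其中最简单的一类:从模型答案中抽取算式,交由 Python 解释器核验,而不是让小模型在文本中复查算术。抽取用的正则只是玩具占位,并非论文的实际流水线:

```python
import re

def verify_arithmetic(claim: str) -> bool:
    # pull "<expression> = <number>" out of the model's answer and let the
    # interpreter check it (the tool does the memorization-heavy step)
    m = re.search(r"([\d+\-*/(). ]+)=\s*(-?\d+(?:\.\d+)?)", claim)
    if not m:
        return False
    expr, stated = m.group(1), float(m.group(2))
    # eval is acceptable for a sketch; a real system would sandbox the tool call
    return abs(eval(expr) - stated) < 1e-9

ok = verify_arithmetic("So 17 * 24 = 408.")
bad = verify_arithmetic("So 17 * 24 = 418.")
```

验证器只需判断"工具结果与模型声称的结果是否一致",从而绕开小模型在数值计算、事实记忆上的短板。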
zh

[NLP-43] Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多轮交互(multi-turn interactions)中的评估与增强问题。随着LLMs在单轮任务处理能力上的显著提升,实际应用迫切需要其具备更复杂的多轮交互能力。论文的关键在于系统性地审视多轮对话中维持上下文一致性、连贯性、公平性和响应性的挑战,并通过组织现有基准数据集和提出多种增强方法来应对这些挑战。解决方案的关键包括模型中心策略(如上下文学习、有监督微调、强化学习及新架构设计)、外部集成方法(如记忆增强、基于检索的方法和知识图谱)以及基于代理的协作交互技术。此外,论文还探讨了开放性问题并提出了未来研究方向,以进一步提升LLMs在多轮交互中的鲁棒性和有效性。

链接: https://arxiv.org/abs/2504.04717
作者: Yubo Li,Xiaobin Shen,Xinyu Yao,Xueying Ding,Yidi Miao,Ramayya Krishnan,Rema Padman
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: After 136 days of meticulous preparation, we’re thrilled to finally share our comprehensive survey on llm multi-turn interactions with the community!

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have revolutionized their ability to handle single-turn tasks, yet real-world applications demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent advancements in evaluating and enhancing multi-turn interactions in LLMs. Focusing on task-specific scenarios, from instruction following in diverse domains such as math and coding to complex conversational engagements in roleplay, healthcare, education, and even adversarial jailbreak settings, we systematically examine the challenges of maintaining context, coherence, fairness, and responsiveness over prolonged dialogues. The paper organizes current benchmarks and datasets into coherent categories that reflect the evolving landscape of multi-turn dialogue evaluation. In addition, we review a range of enhancement methodologies under multi-turn settings, including model-centric strategies (contextual learning, supervised fine-tuning, reinforcement learning, and new architectures), external integration approaches (memory-augmented, retrieval-based methods, and knowledge graph), and agent-based techniques for collaborative interactions. Finally, we discuss open challenges and propose future directions for research to further advance the robustness and effectiveness of multi-turn interactions in LLMs. Related resources and papers are available at this https URL.
zh

[NLP-44] Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs

【速读】: 该论文旨在解决通过黑盒 API 访问的大规模语言模型(Large Language Models, LLMs)中模型替代检测的问题。随着用户基于广告宣传的能力(如模型大小和性能)付费,服务提供者可能通过暗中替换为成本更低、质量较差的替代模型来降低运营成本,这种不透明性破坏了公平性、侵蚀了信任,并使可靠的基准测试变得复杂化。由于黑盒性质限制了与模型的交互仅限于输入-输出查询,检测此类替换极具挑战性。

论文的关键在于探索现有验证技术的局限性,包括基于输出的统计测试、基准评估以及对数概率分析,并在多种现实攻击场景下(如模型量化、随机替代和基准规避)进行系统评估。研究发现,单纯依赖文本输出的方法在应对细微或自适应攻击时存在显著不足。尽管对数概率分析在可用时提供了更强的保证,但其可访问性通常受限。最后,论文讨论了基于硬件的解决方案(如可信执行环境,Trusted Execution Environments, TEEs)作为实现模型完整性的潜在途径,同时强调了安全、性能和提供商采用之间的权衡。相关代码已公开发布。

链接: https://arxiv.org/abs/2504.04715
作者: Will Cai,Tianneng Shi,Xuandong Zhao,Dawn Song
机构: University of California, Berkeley(加州大学伯克利分校)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The proliferation of Large Language Models (LLMs) accessed via black-box APIs introduces a significant trust challenge: users pay for services based on advertised model capabilities (e.g., size, performance), but providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs. This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking. Detecting such substitutions is difficult due to the black-box nature, typically limiting interaction to input-output queries. This paper formalizes the problem of model substitution detection in LLM APIs. We systematically evaluate existing verification techniques, including output-based statistical tests, benchmark evaluations, and log probability analysis, under various realistic attack scenarios like model quantization, randomized substitution, and benchmark evasion. Our findings reveal the limitations of methods relying solely on text outputs, especially against subtle or adaptive attacks. While log probability analysis offers stronger guarantees when available, its accessibility is often limited. We conclude by discussing the potential of hardware-based solutions like Trusted Execution Environments (TEEs) as a pathway towards provable model integrity, highlighting the trade-offs between security, performance, and provider adoption. Code is available at this https URL
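摘要提到的对数概率分析可以概括为:若 API 确实在服务所宣称的模型,它返回的逐 token 对数概率应与可信参考副本一致。下面用合成数字给出一个最小示意;真实审计需要真实的 API logprobs 与正式的统计检验,而非此处的固定阈值:

```python
def mean_abs_gap(ref_logprobs, api_logprobs):
    # average absolute per-token log-prob difference between reference and API
    return sum(abs(r - a) for r, a in zip(ref_logprobs, api_logprobs)) / len(ref_logprobs)

def looks_substituted(ref, obs, tol=0.05):
    # flag the provider if observed log-probs drift beyond tolerance
    return mean_abs_gap(ref, obs) > tol

ref = [-0.11, -2.30, -0.05, -1.61]        # trusted local copy of the model
faithful = [-0.11, -2.30, -0.05, -1.61]   # API serving the advertised model
swapped = [-0.45, -1.10, -0.90, -2.80]    # a cheaper model's distribution
```

这也解释了摘要中的局限:许多 API 根本不返回 logprobs,此时只能退回到基于文本输出的弱信号。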
zh

[NLP-45] Sequential-NIAH: A Needle-In-A-Haystack Benchmark for Extracting Sequential Needles from Long Contexts

【速读】: 本文旨在解决大型语言模型(Large Language Models, LLMs)在处理长上下文时提取嵌入信息的能力评估问题,特别是针对从长输入中检索与特定查询相关的有用信息的挑战。为了解决这一问题,论文引入了一个名为Sequential-NIAH的新基准,它专门设计用于评估LLMs从长上下文(长度从8K到128K个token)中提取顺序信息项(称为“针”)的能力。该基准包含合成、真实和开放领域问答三种类型的针生成管道,并提供了包含14,000个样本的数据集,其中2,000个样本用于测试。解决方案的关键在于开发了一个基于合成数据驱动的评估模型,该模型能够根据时间或逻辑顺序评估答案的正确性,在合成测试数据上的准确率达到99.49%,从而有效支持了对不同LLMs性能的精确测量。实验结果表明,即使是最优秀的模型也只能达到63.15%的最大准确性,这凸显了随着上下文长度和针数量增加所面临的日益增长的挑战以及现有技术改进的空间。此外,噪声鲁棒性实验验证了该基准的可靠性,使Sequential-NIAH成为提升LLMs长文本提取能力研究的重要参考工具。

链接: https://arxiv.org/abs/2504.04713
作者: Yifei Yu,Qian-Wen Zhang,Lingfeng Qiao,Di Yin,Fang Li,Jie Wang,Zengxi Chen,Suncong Zheng,Xiaolong Liang,Xing Sun
机构: Tencent YouTu Lab(腾讯优图实验室); Tencent(腾讯)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Evaluating the ability of large language models (LLMs) to handle extended contexts is critical, particularly for retrieving information relevant to specific queries embedded within lengthy inputs. We introduce Sequential-NIAH, a benchmark specifically designed to evaluate the capability of LLMs to extract sequential information items (known as needles) from long contexts. The benchmark comprises three types of needle generation pipelines: synthetic, real, and open-domain QA. It includes contexts ranging from 8K to 128K tokens in length, with a dataset of 14,000 samples (2,000 reserved for testing). To facilitate evaluation on this benchmark, we trained a synthetic data-driven evaluation model capable of evaluating answer correctness based on chronological or logical order, achieving an accuracy of 99.49% on synthetic test data. We conducted experiments on six well-known LLMs, revealing that even the best-performing model achieved a maximum accuracy of only 63.15%. Further analysis highlights the growing challenges posed by increasing context lengths and the number of needles, underscoring substantial room for improvement. Additionally, noise robustness experiments validate the reliability of the benchmark, making Sequential-NIAH an important reference for advancing research on long text extraction capabilities of LLMs.
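该基准按时间/逻辑顺序判定答案正确性,其核心可以归结为一个子序列检查:标准答案中的"针"序列必须按原顺序出现在模型输出中。下面是这一判定的最小示意(论文的评估模型基于合成数据训练,远比此复杂):

```python
def needles_in_order(gold, predicted):
    # gold needles must appear in `predicted` as a subsequence (order preserved);
    # `g in it` consumes the iterator, so later needles can only match later items
    it = iter(predicted)
    return all(g in it for g in gold)

ok = needles_in_order(["step1", "step2", "step3"],
                      ["step1", "noise", "step2", "step3"])
bad = needles_in_order(["step1", "step2"], ["step2", "step1"])
```

找全但顺序颠倒的输出会被判错,这正是该基准区别于普通 needle-in-a-haystack 测试之处。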
zh

[NLP-46] LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important

【速读】: 该论文旨在解决大语言模型长上下文推理过程中,Key-Value (KV) 缓存规模增大所带来的部署成本与任务精度之间平衡的挑战。现有方法主要依赖于注意力权重来淘汰非关键缓存标记,但这些方法通常需要对推理基础设施进行重大修改,并带来显著的计算开销。本文提出了一种名为 LagKV 的 KV 分配策略,其关键创新在于仅依赖 KV 对自身直接比较,而完全不依赖注意力机制。这种方法易于集成到主流推理平台,并在性能上可媲美其他复杂的 KV 压缩方法。实验结果表明,在 2 倍压缩比下,LagKV 方法几乎无性能损失,在 8 倍压缩比下仍保持约原模型 90% 的性能,特别是在 64 位数字密钥(passkey)检索任务中,相比基于注意力权重的方法 H₂O,LagKV 在相同压缩比下提升了超过 60% 的性能。

链接: https://arxiv.org/abs/2504.04704
作者: Manlai Liang,JiaMing Zhang,Xiong Li,Jinlong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The increasing size of the Key-Value (KV) cache during long-context inference with Large Language Models is the main obstacle to balancing deployment cost and task accuracy. To reduce the KV cache size in such scenarios, most previous efforts leveraged attention weights to evict non-critical cache tokens. But these methods carry a trade-off: they usually require major modification of the inference infrastructure and incur significant computation overhead. Based on the fact that large language models are autoregressive, we propose LagKV, a KV allocation strategy relying only on straightforward comparisons among the KV pairs themselves. It is a totally attention-free method that offers easy integration into mainstream inference platforms and performance comparable to other, more complicated KV compression methods. Results on LongBench and PasskeyRetrieval show that our approach achieves nearly zero loss at a 2x compression ratio and about 90% of the original model performance at 8x. Especially in the 64-digit passkey retrieval task, our method outperforms the attention-weight-based method H2O by over 60% at the same compression ratios. Our code is available at this https URL.
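"免注意力淘汰"的思路可以用一个方向性的玩具示意来体会:只用 KV 自身的比较给 token 打分,保留得分最高的一部分。下面用"key 向量相对滞后窗口均值的偏离程度"作为玩具评分;LagKV 的实际评分方式与此不同,这里仅展示不依赖注意力权重的整体形态:

```python
import numpy as np

def keep_mask(keys, lag=4, keep_ratio=0.5):
    # toy attention-free scoring: a token's score is how far its key deviates
    # from the mean key of a lagging window; keep the top `keep_ratio` fraction
    n = len(keys)
    scores = np.empty(n)
    for i in range(n):
        window = keys[max(0, i - lag):i] or keys[:1]  # fall back to first key
        scores[i] = np.linalg.norm(keys[i] - np.mean(window, axis=0))
    keep = np.argsort(scores)[-int(n * keep_ratio):]
    mask = np.zeros(n, dtype=bool)
    mask[keep] = True
    return mask

keys = [np.array([float(i), 1.0]) for i in range(8)]  # toy key vectors
mask = keep_mask(keys)  # half of the cache survives eviction
```

由于评分只读 KV 本身,这类方法无需改动注意力内核即可接入现有推理栈,这正是摘要强调"易于集成"的原因。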
zh

[NLP-47] Causal Retrieval with Semantic Consideration

【速读】: 该论文旨在解决现有信息检索(IR)系统在处理知识密集型领域(如生物医学和法律)任务时,仅依赖表面语义相似性而无法有效捕捉深层次因果关系的问题。论文的关键创新在于提出了一种名为CAWAI的检索模型,其训练目标包含语义匹配和因果关系建模两个方面。通过结合这两种目标,CAWAI能够更准确地理解查询意图,特别是涉及因果关系的复杂场景,并在大规模检索设置下展现出显著性能提升,同时具备跨科学领域问答任务的零样本泛化能力。

链接: https://arxiv.org/abs/2504.04700
作者: Hyunseo Shin,Wonseok Hwang
机构: University of Seoul (首尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have significantly enhanced the performance of conversational AI systems. To extend their capabilities to knowledge-intensive domains such as biomedical and legal fields, where accuracy is critical, LLMs are often combined with information retrieval (IR) systems to generate responses based on retrieved documents. However, for IR systems to effectively support such applications, they must go beyond simple semantic matching and accurately capture diverse query intents, including causal relationships. Existing IR models primarily focus on retrieving documents based on surface-level semantic similarity, overlooking deeper relational structures such as causality. To address this, we propose CAWAI, a retrieval model that is trained with dual objectives: semantic and causal relations. Our extensive experiments demonstrate that CAWAI outperforms various models on diverse causal retrieval tasks, especially under large-scale retrieval settings. We also show that CAWAI exhibits strong zero-shot generalization across scientific domain QA tasks.
zh

[NLP-48] R2Vul: Learning to Reason about Software Vulnerabilities with Reinforcement Learning and Structured Reasoning Distillation

【速读】: 该论文旨在解决现有大型语言模型(Large Language Models, LLMs)在软件漏洞检测(Software Vulnerability Detection, SVD)中推理能力不可靠的问题。具体而言,依赖于思维链(Chain-of-Thought, CoT)的方法难以提供相关且可操作的安全评估,同时有效SVD不仅需要生成连贯的推理,还需区分合理与看似合理但误导性的安全评估,这一方面在先前研究中被忽视。论文的关键解决方案是引入R2Vul方法,通过基于人工智能反馈的强化学习(Reinforcement Learning from AI Feedback, RLAIF)将结构化推理蒸馏到小型LLM中。这种方法使LLM能够生成具有行动力且可靠的安全意识推理,并显式学习区分有效评估与误导性评估。

链接: https://arxiv.org/abs/2504.04699
作者: Martin Weyssow,Chengran Yang,Junkai Chen,Yikun Li,Huihui Huang,Ratnadira Widyasari,Han Wei Ang,Frank Liauw,Eng Lieh Ouh,Lwin Khin Shar,David Lo
机构: Singapore Management University (新加坡管理大学); GovTech (政府科技局)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown promising performance in software vulnerability detection (SVD), yet their reasoning capabilities remain unreliable. Existing approaches relying on chain-of-thought (CoT) struggle to provide relevant and actionable security assessments. Additionally, effective SVD requires not only generating coherent reasoning but also differentiating between well-founded and misleading yet plausible security assessments, an aspect overlooked in prior work. To this end, we introduce R2Vul, a novel approach that distills structured reasoning into small LLMs using reinforcement learning from AI feedback (RLAIF). Through RLAIF, R2Vul enables LLMs to produce structured, security-aware reasoning that is actionable and reliable while explicitly learning to distinguish valid assessments from misleading ones. We evaluate R2Vul across five languages against SAST tools, CoT, instruction tuning, and classification-based baselines. Our results show that R2Vul with structured reasoning distillation enables a 1.5B student LLM to rival larger models while improving generalization to out-of-distribution vulnerabilities. Beyond model improvements, we contribute a large-scale, multilingual preference dataset featuring structured reasoning to support future research in SVD.
zh

[NLP-49] scAgent : Universal Single-Cell Annotation via a LLM Agent

【速读】: 该论文旨在解决通用细胞注释(universal cell annotation)的问题,即能够跨组织推广、发现新的细胞类型,并扩展到未见过的细胞类型。传统方法主要局限于特定组织内的固定数量细胞类型的注释,而未能有效应对跨组织的通用性挑战。论文的关键创新在于提出了一种基于大型语言模型(Large Language Models, LLMs)的通用细胞注释框架scAgent。scAgent不仅能够高效学习新细胞类型,还表现出在多种组织和细胞类型上的卓越性能,包括通用细胞类型注释、新型细胞发现以及对新细胞类型的扩展能力。

链接: https://arxiv.org/abs/2504.04698
作者: Yuren Mao,Yu Mi,Peigen Liu,Mengfei Zhang,Hanqing Liu,Yunjun Gao
机构: Zhejiang University (浙江大学); Harvard University (哈佛大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Cell type annotation is critical for understanding cellular heterogeneity. Based on single-cell RNA-seq data and deep learning models, good progress has been made in annotating a fixed number of cell types within a specific tissue. However, universal cell annotation, which can generalize across tissues, discover novel cell types, and extend to novel cell types, remains less explored. To fill this gap, this paper proposes scAgent, a universal cell annotation framework based on Large Language Models (LLMs). scAgent can identify cell types and discover novel cell types in diverse tissues; furthermore, it is data efficient to learn novel cell types. Experimental studies in 160 cell types and 35 tissues demonstrate the superior performance of scAgent in general cell-type annotation, novel cell discovery, and extensibility to novel cell type.
zh

[NLP-50] LEO-MINI: An Efficient Multimodal Large Language Model using Conditional Token Reduction and Mixture of Multi-Modal Experts

【速读】: 该论文旨在解决多模态大型语言模型(MLLMs)中视觉标记冗余导致的计算效率低下问题,同时避免因减少标记数量而削弱视觉推理能力。论文提出的解决方案关键在于两个方面:一是通过引入CoTR(一种新的标记约简模块),利用视觉标记、文本标记与可学习查询之间的相似性,显著减少视觉标记的数量;二是采用MMoE(多模态专家混合模块),通过一组LoRA专家和新型路由机制根据输入的文本和视觉标记动态切换,而非仅依赖隐藏状态,并结合通用LoRA专家以最小的计算开销扩展模型能力,同时利用一组视觉专家提取丰富的视觉特征。这些创新共同提升了模型的效率和性能。

链接: https://arxiv.org/abs/2504.04653
作者: Yimu Wang,Mozhgan Nasr Azadani,Sean Sedwards,Krzysztof Czarnecki
机构: University of Waterloo (滑铁卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Redundancy of visual tokens in multi-modal large language models (MLLMs) significantly reduces their computational efficiency. Recent approaches, such as resamplers and summarizers, have sought to reduce the number of visual tokens, but at the cost of visual reasoning ability. To address this, we propose LEO-MINI, a novel MLLM that significantly reduces the number of visual tokens and simultaneously boosts visual reasoning capabilities. For efficiency, LEO-MINI incorporates CoTR, a novel token reduction module to consolidate a large number of visual tokens into a smaller set of tokens, using the similarity between visual tokens, text tokens, and a compact learnable query. For effectiveness, to scale up the model’s ability with minimal computational overhead, LEO-MINI employs MMoE, a novel mixture of multi-modal experts module. MMoE employs a set of LoRA experts with a novel router to switch between them based on the input text and visual tokens instead of only using the input hidden state. MMoE also includes a general LoRA expert that is always activated to learn general knowledge for LLM reasoning. For extracting richer visual features, MMoE employs a set of vision experts trained on diverse domain-specific data. To demonstrate LEO-MINI’s improved efficiency and performance, we evaluate it against existing efficient MLLMs on various benchmark vision-language tasks.
zh

[NLP-51] Splits! A Flexible Dataset for Evaluating a Models Demographic Social Inference

【速读】: 该论文旨在解决如何从现实世界文本中归纳出不同人口统计学群体表达模式的可泛化理论这一挑战性问题。论文定义了一个名为“Group Theorization”的新任务,要求系统撰写能够区分不同人口统计学群体表达差异的理论。解决方案的关键在于构建了一个大规模数据集Splits!,该数据集通过将Reddit上中立主题(如体育、烹饪和电影)的帖子按人口统计学特征(如职业、宗教和种族)进行划分而创建,并提出了一种基于人工验证的简单评估框架,用于衡量方法生成关于群体表达理论的有效性。论文还公开了原始语料库和评估脚本,以帮助研究者评估方法在推断及可能误表征群体表达差异方面的表现。

链接: https://arxiv.org/abs/2504.04640
作者: Eylon Caplan,Tania Chakraborty,Dan Goldwasser
机构: Department of Computer Science (计算机科学系), Purdue University (普渡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review for COLM 2025

点击查看摘要

Abstract:Understanding how people of various demographics think, feel, and express themselves (collectively called group expression) is essential for social science and underlies the assessment of bias in Large Language Models (LLMs). While LLMs can effectively summarize group expression when provided with empirical examples, coming up with generalizable theories of how a group’s expression manifests in real-world text is challenging. In this paper, we define a new task called Group Theorization, in which a system must write theories that differentiate expression across demographic groups. We make available a large dataset on this task, Splits!, constructed by splitting Reddit posts by neutral topics (e.g. sports, cooking, and movies) and by demographics (e.g. occupation, religion, and race). Finally, we suggest a simple evaluation framework for assessing how effectively a method can generate ‘better’ theories about group expression, backed by human validation. We publicly release the raw corpora and evaluation scripts for Splits! to help researchers assess how methods infer–and potentially misrepresent–group differences in expression. We make Splits! and our evaluation module available at this https URL.
zh

[NLP-52] Ineffectiveness for Search and Undecidability of PCSP Meta-Problems

【速读】: 该论文旨在探究约束满足问题的承诺版本(Promise Constraint Satisfaction Problems, PCSPs)中搜索版本与决策版本是否等价这一开放性问题。目前大多数针对PCSP的已知算法仅解决其决策变体,而无法保证能有效解决搜索变体。论文的关键在于证明基于松弛整数规划的主流方法(如BLP、AIP及BLP+AIP)在将解转化为有效的搜索证书时,其难度可能等价于TFNP类中的任何问题,从而表明这些算法在搜索问题上的无效性。进一步地,论文通过代数方法找到了一些足以导致搜索问题无效的充分条件,并设计了适用于由最小对象刻画算法的工具,不仅用于证明搜索问题的无效性,还揭示了基于BLP、AIP和BLP+AIP的模板族是不可判定的。此外,利用相同技术分析了几种已知保证有限模板CSP可处理性的代数条件,证明了与循环多态(cyclic polymorphisms)和WNUs相关的若干元问题是对于PCSP不可判定的。

链接: https://arxiv.org/abs/2504.04639
作者: Alberto Larrauri
机构: 未知
类目: Computational Complexity (cs.CC); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:It is an open question whether the search and decision versions of promise CSPs are equivalent. Most known algorithms for PCSPs solve only their decision variant, and it is unknown whether they can be adapted to solve search as well. The main approaches, called BLP, AIP and BLP+AIP, handle a PCSP by finding a solution to a relaxation of some integer program. We prove that rounding those solutions to a proper search certificate can be as hard as any problem in the class TFNP. In other words, these algorithms are ineffective for search. Building on the algebraic approach to PCSPs, we find sufficient conditions that imply ineffectiveness for search. Our tools are tailored to algorithms that are characterized by minions in a suitable way, and can also be used to prove undecidability results for meta-problems. This way, we show that the families of templates solvable via BLP, AIP, and BLP+AIP are undecidable. Using the same techniques we also analyze several algebraic conditions that are known to guarantee the tractability of finite-template CSPs. We prove that several meta-problems related to cyclic polymorphisms and WNUs are undecidable for PCSPs. In particular, there is no algorithm deciding whether a finite PCSP template (1) admits a cyclic polymorphism, (2) admits a WNU.
zh

[NLP-53] Steering off Course: Reliability Challenges in Steering Language Models

【速读】: 该论文旨在解决现有语言模型(Language Models, LMs)引导方法在稳健性理解上的不足,这些方法虽作为微调的轻量级替代方案受到关注,但以往研究主要集中于少数模型,缺乏对广泛模型家族的系统评估。论文的关键在于通过扩展实验范围,测试多达36个来自14个家族的模型(参数规模从1.5B到70B),揭示了不同引导方法(DoLa、功能向量和任务向量)在有效性上的显著差异,并发现许多模型不仅未能提升性能,反而导致性能下降。进一步分析表明,这些方法背后的假设存在根本性缺陷,质疑了它们作为可扩展引导解决方案的可靠性。

链接: https://arxiv.org/abs/2504.04635
作者: Patrick Queiroz Da Silva,Hari Sethuraman,Dheeraj Rajagopal,Hannaneh Hajishirzi,Sachin Kumar
机构: The Ohio State University (俄亥俄州立大学); University of Washington (华盛顿大学); Fastino AI; Allen Institute for AI (艾伦人工智能研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Steering methods for language models (LMs) have gained traction as lightweight alternatives to fine-tuning, enabling targeted modifications to model activations. However, prior studies primarily report results on a few models, leaving critical gaps in understanding the robustness of these methods. In this work, we systematically examine three prominent steering methods – DoLa, function vectors, and task vectors. In contrast to the original studies, which evaluated a handful of models, we test up to 36 models belonging to 14 families, with sizes ranging from 1.5B to 70B parameters. Our experiments reveal substantial variability in the effectiveness of the steering approaches, with a large number of models showing no improvement and at times degradation in steering performance. Our analysis demonstrates fundamental flaws in the assumptions underlying these methods, challenging their reliability as scalable steering solutions.
zh

[NLP-54] DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition NAACL2025

【速读】: 该论文旨在解决远程监督命名实体识别(Distantly Supervised Named Entity Recognition, DS-NER)中存在的大量错误标注实例问题,这限制了其性能。现有工作大多通过构建复杂的模型来从噪声标签中学习以应对这一挑战,而本文提出了一种替代方案,即尝试清理标注数据以提高远程标签的质量,这种方法在NER任务中受到的关注较少。论文的关键在于提出了一种基于训练动态的标签清理方法,利用模型在训练过程中的行为特征来表征远程标注样本,并引入自动阈值估计策略定位远程标签中的错误。实验结果表明,使用经过直接移除识别出的错误标注精炼后的DS-NER数据集训练的模型,在F1分数上取得了3.18%到8.95%的显著提升,且该方法在四个数据集上优于众多先进的DS-NER方法。

链接: https://arxiv.org/abs/2504.04616
作者: Qi Zhang,Huitong Pan,Zhijia Chen,Longin Jan Latecki,Cornelia Caragea,Eduard Dragut
机构: Temple University (天普大学); University of Illinois Chicago (伊利诺伊大学芝加哥分校)
类目: Computation and Language (cs.CL)
备注: Accepted to NAACL2025-Findings

点击查看摘要

Abstract:Distantly Supervised Named Entity Recognition (DS-NER) has attracted attention due to its scalability and ability to automatically generate labeled data. However, distant annotation introduces many mislabeled instances, limiting its performance. Most of the existing work attempt to solve this problem by developing intricate models to learn from the noisy labels. An alternative approach is to attempt to clean the labeled data, thus increasing the quality of distant labels. This approach has received little attention for NER. In this paper, we propose a training dynamics-based label cleaning approach, which leverages the behavior of a model as training progresses to characterize the distantly annotated samples. We also introduce an automatic threshold estimation strategy to locate the errors in distant labels. Extensive experimental results demonstrate that: (1) models trained on our cleaned DS-NER datasets, which were refined by directly removing identified erroneous annotations, achieve significant improvements in F1-score, ranging from 3.18% to 8.95%; and (2) our method outperforms numerous advanced DS-NER approaches across four datasets.
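论文"利用训练动态刻画远程标注样本"的核心直觉可以用一个极简草图体会:跟踪每个样本在各训练轮次中模型对其给定标签的置信度,长期无法学会(置信度始终很低)的样本更可能是错标。以下置信度序列与"均值减一个标准差"的自动阈值均为虚构的简化,并非论文实际使用的统计量:

```python
import statistics

def flag_noisy_labels(confidence_history, threshold=None):
    # confidence_history: {example_id: [p(given label) at each epoch]}
    # Mean confidence across training is the "training dynamics" signal;
    # examples the model never becomes confident about are flagged.
    means = {ex: statistics.mean(h) for ex, h in confidence_history.items()}
    if threshold is None:
        # crude automatic threshold: one std below the average confidence
        vals = list(means.values())
        threshold = statistics.mean(vals) - statistics.pstdev(vals)
    return {ex for ex, m in means.items() if m < threshold}

history = {
    "ent_1": [0.55, 0.80, 0.93],   # learned steadily -> likely correct
    "ent_2": [0.50, 0.70, 0.90],
    "ent_3": [0.30, 0.25, 0.20],   # never learned -> likely mislabeled
}
print(flag_noisy_labels(history))  # {'ent_3'}
```

被标记的样本直接从远程标注数据中移除后再训练,即对应论文中"清洗后数据集"的做法。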
zh

[NLP-55] SECQUE: A Benchmark for Evaluating Real-World Financial Analysis Capabilities

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在金融分析任务评估中的标准化基准不足的问题。为了解决这一问题,论文提出了SECQUE,这是一个包含565个专家编写问题的综合基准,覆盖了SEC文件分析的四个关键类别:比较分析、比率计算、风险评估和财务洞察生成。解决方案的关键在于不仅构建了全面的基准数据集,还开发了SECQUE-Judge评估机制,该机制利用多个基于LLM的评委,与人工评估表现出高度一致性,从而提供了更可靠和客观的模型性能评估方法。

链接: https://arxiv.org/abs/2504.04596
作者: Noga Ben Yoash,Meni Brief,Oded Ovadia,Gil Shenderovitz,Moshik Mishaeli,Rachel Lemberg,Eitam Sheetrit
机构: Microsoft Industry AI (微软行业人工智能)
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注: Benchmark available at: this https URL

点击查看摘要

Abstract:We introduce SECQUE, a comprehensive benchmark for evaluating large language models (LLMs) in financial analysis tasks. SECQUE comprises 565 expert-written questions covering SEC filings analysis across four key categories: comparison analysis, ratio calculation, risk assessment, and financial insight generation. To assess model performance, we develop SECQUE-Judge, an evaluation mechanism leveraging multiple LLM-based judges, which demonstrates strong alignment with human evaluations. Additionally, we provide an extensive analysis of various models’ performance on our benchmark. By making SECQUE publicly available, we aim to facilitate further research and advancements in financial AI.
zh

[NLP-56] KnowsLM: A framework for evaluation of small language models for knowledge augmentation and humanised conversations

【速读】: 本文研究了在会话型人工智能(Conversational AI)领域,如何利用小到中等规模的语言模型(LLMs)生成简洁、上下文相关且类人对话这一复杂挑战。论文重点探讨了LoRA秩值、数据集规模以及提示前缀设计对知识保留与风格一致性的影响。关键在于对比分析了微调(fine-tuning)与RAG增强模型两种方法:微调能够提升流畅性并实现风格定制化,但在整合未见过的知识方面受限,特别是在数据量较小时;而RAG增强模型通过推理阶段引入外部文档,在分布外提示下的事实准确性表现更优,但其风格一致性不及微调。最终评估表明,微调更适合用于调整语气,而RAG则擅长实时知识扩充。因此,解决方案的关键在于根据具体需求权衡使用微调或RAG增强策略。

链接: https://arxiv.org/abs/2504.04569
作者: Chitranshu Harbola,Anupam Purwar
机构: AIGuruKul Foundation
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the evolving landscape of conversational AI, generating concise, context-aware, and human-like dialogue using small and medium-sized language models (LLMs) remains a complex challenge. This study investigates the influence of LoRA rank, dataset scale, and prompt prefix design on both knowledge retention and stylistic alignment. While fine-tuning improves fluency and enables stylistic customization, its ability to integrate unseen knowledge is constrained – particularly with smaller datasets. Conversely, RAG-augmented models, equipped to incorporate external documents at inference, demonstrated superior factual accuracy on out-of-distribution prompts, though they lacked the stylistic consistency achieved by fine-tuning. Evaluations by LLM-based judges across knowledge accuracy, conversational quality, and conciseness suggest that fine-tuning is best suited for tone adaptation, whereas RAG excels at real-time knowledge augmentation.
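RAG 在推理阶段引入外部文档的基本流程,可以用一个最小检索草图示意:这里用词袋余弦相似度代替真实系统中的稠密检索器,文档内容为虚构示例;检回的 top-k 文档随后会被拼接进模型提示词。

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    num = sum(v * b.get(t, 0) for t, v in a.items())
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, docs, k=1):
    # Bag-of-words retrieval stands in for the dense retriever a real
    # RAG pipeline would use at inference time.
    q = Counter(query.lower().split())
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]

docs = [
    "LoRA rank controls the capacity of the fine-tuned adapter",
    "RAG prepends retrieved documents to the prompt at inference",
]
print(retrieve("what happens at inference in RAG", docs, k=1))
```

这正是摘要中"RAG 在分布外提示下事实准确性更高"的机制来源:知识在推理时注入,而不是烘焙进权重。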
zh

[NLP-57] An Empirical Comparison of Text Summarization: A Multi-Dimensional Evaluation of Large Language Models

【速读】: 该论文旨在解决跨领域(如新闻、医学和商业)文本摘要任务中因信息过载而产生的挑战,通过评估17个大型语言模型(包括商业和开源模型)在不同维度上的摘要性能来解决此问题。论文的关键在于提出了一种新颖的多维框架,综合考量了事实一致性、语义相似性、词汇重叠度以及人类感知质量等指标,并结合效率因素对模型进行系统性评估。此外,研究还揭示了模型性能在不同数据集、输出长度及应用场景中的显著差异,强调了在事实准确性与感知质量之间存在的权衡关系。最终,论文基于全面的分析提供了面向不同实际应用需求的证据驱动型建议,从而指导模型选择以平衡准确性、效率和成本效益之间的复杂关系。

链接: https://arxiv.org/abs/2504.04534
作者: Anantharaman Janakiraman,Behnaz Ghoraani
机构: Florida Atlantic University (佛罗里达大西洋大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Text summarization is crucial for mitigating information overload across domains like journalism, medicine, and business. This research evaluates summarization performance across 17 large language models (OpenAI, Google, Anthropic, open-source) using a novel multi-dimensional framework. We assessed models on seven diverse datasets (BigPatent, BillSum, CNN/DailyMail, PubMed, SAMSum, WikiHow, XSum) at three output lengths (50, 100, 150 tokens) using metrics for factual consistency, semantic similarity, lexical overlap, and human-like quality, while also considering efficiency factors. Our findings reveal significant performance differences, with specific models excelling in factual accuracy (deepseek-v3), human-like quality (claude-3-5-sonnet), and processing efficiency/cost-effectiveness (gemini-1.5-flash, gemini-2.0-flash). Performance varies dramatically by dataset, with models struggling on technical domains but performing well on conversational content. We identified a critical tension between factual consistency (best at 50 tokens) and perceived quality (best at 150 tokens). Our analysis provides evidence-based recommendations for different use cases, from high-stakes applications requiring factual accuracy to resource-constrained environments needing efficient processing. This comprehensive approach enhances evaluation methodology by integrating quality metrics with operational considerations, incorporating trade-offs between accuracy, efficiency, and cost-effectiveness to guide model selection for specific applications.
zh

[NLP-58] Hessian of Perplexity for Large Language Models by PyTorch autograd (Open Source)

【速读】: 该论文试图解决大型语言模型(Large Language Model, LLM)全Hessian矩阵计算不可行的问题,因全Hessian矩阵规模巨大。论文的关键解决方案是利用PyTorch的autograd库精确计算LLM至少一部分Hessian矩阵,并通过多组海森-向量乘积(Hessian-Vector Product, HVP)来高效计算Hessian矩阵的完整对角线。这种方法为理解和分析LLMs的海森行为与结构提供了实用工具。

链接: https://arxiv.org/abs/2504.04520
作者: Ivan Ilin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 3 figures, open source code on GitHub

点击查看摘要

Abstract:Computing the full Hessian matrix (the matrix of second-order derivatives for an entire Large Language Model, LLM) is infeasible due to its sheer size. In this technical report, we aim to provide a comprehensive guide on how to accurately compute at least a small portion of the Hessian for LLMs using the PyTorch autograd library. We also demonstrate how to compute the full diagonal of the Hessian matrix using multiple samples of Hessian-vector products (HVPs). We hope that both this guide and the accompanying GitHub code will be valuable resources for practitioners and researchers interested in better understanding the behavior and structure of the Hessian in LLMs.
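按照摘要的思路,用 PyTorch autograd 的二阶求导即可得到海森-向量乘积(HVP),再用 Hutchinson 型估计得到海森对角线。下面是一个独立的小例子(非原报告代码,仅为原理示意):在二次函数 L(w)=½wᵀAw 上验证,其海森矩阵恰为 A。

```python
import torch

def hvp(loss_fn, params, v):
    # Hessian-vector product via double backward:
    # grad_w(<grad_w(L), v>) = H v
    p = params.detach().requires_grad_(True)
    loss = loss_fn(p)
    (g,) = torch.autograd.grad(loss, p, create_graph=True)
    (Hv,) = torch.autograd.grad((g * v).sum(), p)
    return Hv

def hessian_diagonal(loss_fn, params, n_samples=64):
    # Hutchinson-style estimate: E[z * (H z)] equals the Hessian
    # diagonal when z has Rademacher (+/-1) entries.
    est = torch.zeros_like(params)
    for _ in range(n_samples):
        z = torch.randint(0, 2, params.shape).float() * 2 - 1
        est += z * hvp(loss_fn, params, z)
    return est / n_samples

# Toy check: L(w) = 0.5 * w^T A w has Hessian A.
A = torch.tensor([[3.0, 0.0], [0.0, 5.0]])
loss_fn = lambda w: 0.5 * w @ A @ w
w = torch.tensor([1.0, -2.0])
print(hvp(loss_fn, w, torch.tensor([1.0, 0.0])))  # tensor([3., 0.])
```

对 LLM 而言,`params` 换成某层(或某子集)参数、`loss_fn` 换成困惑度损失,即可按同样方式取出海森矩阵的一小块或其对角线。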
zh

[NLP-59] Saliency-driven Dynamic Token Pruning for Large Language Models

【速读】: 本文针对大型语言模型(Large Language Models, LLMs)在长序列推理场景中的挑战展开研究,由于注意力机制的二次计算复杂度,LLMs在此类任务中面临显著的效率瓶颈。论文提出了一种新颖的基于显著性驱动的动态令牌剪枝框架(Saliency-driven Dynamic Token Pruning, SDTP),旨在根据输入上下文逐步且动态地剪枝冗余令牌。方案的关键在于设计了一个轻量级的显著性驱动预测模块,通过隐藏状态估计每个令牌的重要性得分,并将其集成到LLM的不同层中以分层剪枝冗余令牌。此外,提出了基于排序的优化策略以最小化显著性得分与预测重要性得分之间的排名差异。实验结果表明,该方法在多种模型和数据集上具有通用性,在剪枝65%输入令牌的情况下,可减少33%~47%的浮点运算(FLOPs),推理速度提升高达1.75倍,同时保持相近的性能表现。进一步研究表明,SDTP还可与KV缓存压缩方法结合实现更高效的压缩。

链接: https://arxiv.org/abs/2504.04514
作者: Yao Tao,Yehui Tang,Yun Wang,Mingjian Zhu,Hailin Hu,Yunhe Wang
机构: Huawei Noah’s Ark Lab. (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the recent success of large language models (LLMs), LLMs are particularly challenged in long-sequence inference scenarios due to the quadratic computational complexity of the attention mechanism. Inspired by the interpretability theory of feature attribution in neural network models, we observe that not all tokens have the same contribution. Based on this observation, we propose a novel token pruning framework, namely Saliency-driven Dynamic Token Pruning (SDTP), to gradually and dynamically prune redundant tokens based on the input context. Specifically, a lightweight saliency-driven prediction module is designed to estimate the importance score of each token from its hidden state; it is added to different layers of the LLM to hierarchically prune redundant tokens. Furthermore, a ranking-based optimization strategy is proposed to minimize the ranking divergence between the saliency score and the predicted importance score. Extensive experiments have shown that our framework is generalizable to various models and datasets. By hierarchically pruning 65% of the input tokens, our method reduces FLOPs by 33%~47% and achieves a speedup of up to 1.75× during inference, while maintaining comparable performance. We further demonstrate that SDTP can be combined with KV cache compression methods for further compression.
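"用轻量打分模块按隐藏状态估计 token 重要性、再按比例保留"的流程可以用下面的玩具草图体会(假设性示例:这里用一个固定权重的线性打分器代替 SDTP 学习到的显著性预测模块,权重与向量均为虚构):

```python
def prune_tokens(token_states, keep_ratio=0.35):
    # token_states: list of (token, hidden_vector). A fixed linear scorer
    # stands in for the learned saliency-prediction module.
    w = [0.5, -0.2, 1.0]  # made-up scorer weights
    scored = [(sum(x * wi for x, wi in zip(vec, w)), i, tok)
              for i, (tok, vec) in enumerate(token_states)]
    k = max(1, int(len(scored) * keep_ratio))
    top = sorted(scored, key=lambda t: t[0], reverse=True)[:k]
    # restore original token order among the survivors
    return [tok for _, _, tok in sorted(top, key=lambda t: t[1])]

states = [
    ("the",   [0.1, 0.0, 0.0]),
    ("black", [0.2, 0.0, 0.1]),
    ("cat",   [1.0, 0.0, 2.0]),
    ("sat",   [0.9, 0.0, 1.8]),
]
print(prune_tokens(states, keep_ratio=0.5))  # ['cat', 'sat']
```

在真实框架中,这样的打分器被插入 LLM 的多个层之间,使剪枝随深度逐级进行,而非一次性完成。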
zh

[NLP-60] Directed Graph-alignment Approach for Identification of Gaps in Short Answers

【速读】: 本文旨在解决学生答案中缺失项(称为“gap”)的自动识别问题。解决方案的关键在于将gap识别问题建模为学生答案与相应参考答案的有向图对齐过程。通过这种图对齐方法,能够在单词、短语或句子级别识别出这些gap,从而为形成性评估提供有价值的反馈。为了验证该方法的有效性,研究基于三个知名短答案评分数据集(UNT、SciEntsBank和Beetle)构建了包含标注gap的学生答案数据集,并采用传统机器学习任务中的评估指标来衡量gap识别性能。尽管不同数据集和答案类型上的表现有所差异,但整体结果显示出良好的潜力。

链接: https://arxiv.org/abs/2504.04473
作者: Archana Sahu,Plaban Kumar Bhowmick
机构: Centre for Educational Technology (教育技术中心), Indian Institute of Technology Kharagpur (印度理工学院克勒格布尔); G.S. Sanyal School of Telecommunications (Sanyal电信学院), Indian Institute of Technology Kharagpur (印度理工学院克勒格布尔)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 30 pages, 11 figures

点击查看摘要

Abstract:In this paper, we have presented a method for automatically identifying missing items, known as gaps, in student answers by comparing them against the corresponding model/reference answers. The gaps can be identified at the word, phrase or sentence level. The identified gaps are useful in providing feedback to students for formative assessment. The problem of gap identification has been modelled as an alignment of a pair of directed graphs representing a student answer and the corresponding model answer for a given question. To validate the proposed approach, gap-annotated student answers have been developed for answers drawn from three widely known datasets in the short answer grading domain, namely, University of North Texas (UNT), SciEntsBank, and Beetle; this gap-annotated dataset is available at: this https URL. Evaluation metrics used in traditional machine learning tasks have been adopted to evaluate the task of gap identification. Though the performance of the proposed approach varies across the datasets and the types of answers, overall the performance is observed to be promising.
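把 gap 识别理解为"参考答案图中在学生答案图里找不到对应的部分",可以用如下极简草图体会(假设性示例:这里把有向图简化为 (头, 关系, 尾) 三元组并做精确匹配;真实方法需要做图对齐并处理同义改写,示例中的三元组内容为虚构):

```python
def find_gaps(model_edges, student_edges):
    # Directed edges (head, relation, tail) from the model answer with
    # no counterpart in the student answer are reported as gaps, i.e.
    # content the student failed to mention.
    student = set(student_edges)
    return [e for e in model_edges if e not in student]

model = [("plant", "performs", "photosynthesis"),
         ("photosynthesis", "requires", "sunlight"),
         ("photosynthesis", "produces", "oxygen")]
student = [("plant", "performs", "photosynthesis"),
           ("photosynthesis", "requires", "sunlight")]
print(find_gaps(model, student))  # [('photosynthesis', 'produces', 'oxygen')]
```

输出的未匹配三元组即可直接转化为给学生的形成性反馈("你的答案漏掉了……")。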
zh

[NLP-61] An overview of model uncertainty and variability in LLM -based sentiment analysis. Challenges mitigation strategies and the role of explainability

【速读】: 该论文旨在解决基于大型语言模型(LLMs)的情感分析中模型变异性问题(Model Variability Problem, MVP),其表现为情感分类不一致、极化现象以及由随机推理机制、提示敏感性和训练数据偏差引起的不确定性。论文通过分析MVP的核心成因,并结合示例与案例研究展示其影响,同时探讨了解决方案的关键挑战与策略。其中,温度作为输出随机性驱动因素的作用被特别强调,而可解释性在提升模型透明度和用户信任方面的重要性也被着重指出。通过提供关于稳定性、可重复性和可信度的系统视角,论文致力于开发更可靠、可解释且稳健的情感分析模型,从而促进其在金融、医疗和政策制定等高风险领域的应用。

链接: https://arxiv.org/abs/2504.04462
作者: David Herrera-Poyatos,Carlos Peláez-González,Cristina Zuheros,Andrés Herrera-Poyatos,Virilo Tejedor,Francisco Herrera,Rosana Montes
机构: Department of Computer Science and Artificial Intelligence, Andalusian Institute of Data Science and Computational Intelligence (DaSCI), University of Granada, Spain.
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages and 3 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have significantly advanced sentiment analysis, yet their inherent uncertainty and variability pose critical challenges to achieving reliable and consistent outcomes. This paper systematically explores the Model Variability Problem (MVP) in LLM-based sentiment analysis, characterized by inconsistent sentiment classification, polarization, and uncertainty arising from stochastic inference mechanisms, prompt sensitivity, and biases in training data. We analyze the core causes of MVP, presenting illustrative examples and a case study to highlight its impact. In addition, we investigate key challenges and mitigation strategies, paying particular attention to the role of temperature as a driver of output randomness and emphasizing the crucial role of explainability in improving transparency and user trust. By providing a structured perspective on stability, reproducibility, and trustworthiness, this study helps develop more reliable, explainable, and robust sentiment analysis models, facilitating their deployment in high-stakes domains such as finance, healthcare, and policymaking, among others.
zh

[NLP-62] On the Spatial Structure of Mixture-of-Experts in Transformers ICLR2025

【速读】: 该论文试图解决的问题是:传统观点认为MoE (Mixture-of-Experts) 路由器主要依赖语义特征进行专家选择,但该研究挑战这一假设,探索位置信息在路由决策中的作用。关键解决方案在于通过广泛的实证分析证明位置令牌信息在路由中起重要作用,并提出现象学解释,讨论其对基于MoE架构的实际影响。

链接: https://arxiv.org/abs/2504.04444
作者: Daniel Bershatsky,Ivan Oseledets
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICLR 2025 Workshop on Sparsity in LLMs (SLLM)

点击查看摘要

Abstract:A common assumption is that MoE routers primarily leverage semantic features for expert selection. However, our study challenges this notion by demonstrating that positional token information also plays a crucial role in routing decisions. Through extensive empirical analysis, we provide evidence supporting this hypothesis, develop a phenomenological explanation of the observed behavior, and discuss practical implications for MoE-based architectures.
zh

[NLP-63] Pre-trained Language Models and Few-shot Learning for Medical Entity Extraction

【速读】: 该论文旨在解决医学文献中实体抽取任务的信息提取能力不足的问题。为应对医学文本的专业性和复杂性,论文通过对比不同预训练语言模型(BERT、BioBERT、PubMedBERT、ClinicalBERT)在医学命名实体识别任务中的性能,发现PubMedBERT在F1分数上达到最佳表现(88.8%),表明针对生物医学文献预训练的语言模型在医学领域更为有效。此外,研究分析了不同实体抽取方法(CRF、Span-based、Seq2Seq)的效果,结果显示Span-based方法在识别实体边界方面具有更高的准确性,在相同任务中达到F1分数88.6%。在低资源场景下,论文进一步探索了少量样本学习(Few-shot Learning)的应用,实验表明即使仅使用10个标注样本进行训练,模型仍可实现79.1%的F1分数,验证了Few-shot Learning在有限数据条件下的有效性。综上所述,论文的关键在于结合预训练语言模型与少量样本学习,以提升医学实体抽取的准确性。未来的研究可以考虑引入知识图谱和主动学习策略,以增强模型的泛化能力和稳定性,从而为医学自然语言处理提供更优解。

链接: https://arxiv.org/abs/2504.04385
作者: Xiaokai Wang,Guiran Liu,Binrong Zhu,Jacky He,Hongye Zheng,Hanlu Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study proposes a medical entity extraction method based on Transformer to enhance the information extraction capability of medical literature. Considering the professionalism and complexity of medical texts, we compare the performance of different pre-trained language models (BERT, BioBERT, PubMedBERT, ClinicalBERT) in medical entity extraction tasks. Experimental results show that PubMedBERT achieves the best performance (F1-score = 88.8%), indicating that a language model pre-trained on biomedical literature is more effective in the medical domain. In addition, we analyze the impact of different entity extraction methods (CRF, Span-based, Seq2Seq) and find that the Span-based approach performs best in medical entity extraction tasks (F1-score = 88.6%). It demonstrates superior accuracy in identifying entity boundaries. In low-resource scenarios, we further explore the application of Few-shot Learning in medical entity extraction. Experimental results show that even with only 10-shot training samples, the model achieves an F1-score of 79.1%, verifying the effectiveness of Few-shot Learning under limited data conditions. This study confirms that the combination of pre-trained language models and Few-shot Learning can enhance the accuracy of medical entity extraction. Future research can integrate knowledge graphs and active learning strategies to improve the model’s generalization and stability, providing a more effective solution for medical NLP research. Keywords- Natural Language Processing, medical named entity recognition, pre-trained language model, Few-shot Learning, information extraction, deep learning
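Span-based 抽取"枚举候选片段并逐一打分、由此直接判定实体边界"的流程,可以用一个玩具示例体会(此处用一个虚构的小词表代替学习到的片段打分器,仅为说明枚举与边界判定的思路):

```python
def enumerate_spans(tokens, max_len=3):
    # Span-based extraction scores every candidate span up to max_len
    # tokens; a tiny made-up gazetteer stands in for the learned scorer.
    gazetteer = {("aspirin",), ("heart", "attack")}
    spans = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            if tuple(t.lower() for t in tokens[i:j]) in gazetteer:
                spans.append((i, j, " ".join(tokens[i:j])))
    return spans

print(enumerate_spans("Aspirin may prevent a heart attack".split()))
# [(0, 1, 'Aspirin'), (4, 6, 'heart attack')]
```

与逐 token 的序列标注相比,这种逐片段判定的方式天然给出完整边界,这也与摘要中 Span-based 方法边界识别更准的结论相呼应。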
zh

[NLP-64] Retro-Search: Exploring Untaken Paths for Deeper and Efficient Reasoning

【速读】: 该论文旨在解决大型推理模型在知识蒸馏过程中产生的次优推理路径问题,这些路径通常存在过度切换思路、推理不足或过载以及产生退化响应的现象。论文的关键解决方案是引入了一种名为Retro-Search的算法,该算法受蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)启发,用于从大型推理模型中提炼更高质量的推理路径。Retro-Search通过回顾性地修订推理路径,发现更优且更短的推理轨迹,从而实现学生模型在保持较短、更快推理的同时提升其推理能力。此外,该方法支持两种应用场景:自改进(模型在其自身生成的经过Retro-Search修订的推理轨迹上进行微调)和弱到强改进(较小模型通过Retro-Search修订较大模型的推理轨迹)。研究结果表明,这种方法不仅能够有效优化推理路径长度与性能,还反驳了关于在大型模型时代搜索算法不再重要的观点,证明了即使对于前沿模型,算法创新仍具有重要意义。

链接: https://arxiv.org/abs/2504.04383
作者: Ximing Lu,Seungju Han,David Acuna,Hyunwoo Kim,Jaehun Jung,Shrimai Prabhumoye,Niklas Muennighoff,Mostofa Patwary,Mohammad Shoeybi,Bryan Catanzaro,Yejin Choi
机构: NVIDIA (英伟达); University of Washington (华盛顿大学); Stanford University (斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Code and data will be publicly released upon internal approval

点击查看摘要

Abstract:Large reasoning models exhibit remarkable reasoning capabilities via long, elaborate reasoning trajectories. Supervised fine-tuning on such reasoning traces, also known as distillation, can be a cost-effective way to boost reasoning capabilities of student models. However, empirical observations reveal that these reasoning trajectories are often suboptimal, switching excessively between different lines of thought, resulting in under-thinking, over-thinking, and even degenerate responses. We introduce Retro-Search, an MCTS-inspired search algorithm, for distilling higher quality reasoning paths from large reasoning models. Retro-Search retrospectively revises reasoning paths to discover better, yet shorter traces, which can then lead to student models with enhanced reasoning capabilities with shorter, thus faster inference. Our approach can enable two use cases: self-improvement, where models are fine-tuned on their own Retro-Search-ed thought traces, and weak-to-strong improvement, where a weaker model revises stronger model’s thought traces via Retro-Search. For self-improving, R1-distill-7B, fine-tuned on its own Retro-Search-ed traces, reduces the average reasoning length by 31.2% while improving performance by 7.7% across seven math benchmarks. For weak-to-strong improvement, we retrospectively revise R1-671B’s traces from the OpenThoughts dataset using R1-distill-32B as the Retro-Search-er, a model 20x smaller. Qwen2.5-32B, fine-tuned on this refined data, achieves performance comparable to R1-distill-32B, yielding an 11.3% reduction in reasoning length and a 2.4% performance improvement compared to fine-tuning on the original OpenThoughts data. Our work counters recently emergent viewpoints that question the relevance of search algorithms in the era of large reasoning models, by demonstrating that there are still opportunities for algorithmic advancements, even for frontier models.
zh

[NLP-65] PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多语言安全审核方面存在的两大核心问题:一是对少数语言(如英语、中文)的关注过于狭窄;二是对安全定义的范围有限,导致审核能力存在显著差距。为解决这些问题,论文提出了关键的解决方案——发布POLYGUARD,这是一个最先进的多语言安全模型,用于保护LLM生成内容的安全性。POLYGUARD的关键在于其训练数据集POLYGUARDMIX,该数据集包含来自17种语言(如中文、捷克语、英语、印地语)的191万样本,以及评估数据集POLYGUARDPROMPTS,包含2.9万个样本,这些样本通过结合自然发生的多语言人类-LLM交互数据与人工验证的机器翻译数据构建而成。通过广泛的基准测试,POLYGUARD在多个安全性和毒性评估指标上比现有最先进的开源权重和商业安全分类器高出5.5%,从而显著提升了多语言LLMs的安全性。

链接: https://arxiv.org/abs/2504.04377
作者: Priyanshu Kumar,Devansh Jain,Akhila Yerukola,Liwei Jiang,Himanshu Beniwal,Thomas Hartvigsen,Maarten Sap
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Truly multilingual safety moderation efforts for Large Language Models (LLMs) have been hindered by a narrow focus on a small set of languages (e.g., English, Chinese) as well as a limited scope of safety definition, resulting in significant gaps in moderation capabilities. To bridge these gaps, we release POLYGUARD, a new state-of-the-art multilingual safety model for safeguarding LLM generations, and the corresponding training and evaluation datasets. POLYGUARD is trained on POLYGUARDMIX, the largest multilingual safety training corpus to date containing 1.91M samples across 17 languages (e.g., Chinese, Czech, English, Hindi). We also introduce POLYGUARDPROMPTS, a high quality multilingual benchmark with 29K samples for the evaluation of safety guardrails. Created by combining naturally occurring multilingual human-LLM interactions and human-verified machine translations of an English-only safety dataset (WildGuardMix; Han et al., 2024), our datasets contain prompt-output pairs with labels of prompt harmfulness, response harmfulness, and response refusal. Through extensive evaluations across multiple safety and toxicity benchmarks, we demonstrate that POLYGUARD outperforms existing state-of-the-art open-weight and commercial safety classifiers by 5.5%. Our contributions advance efforts toward safer multilingual LLMs for all global users.
zh

[NLP-66] StyleRec: A Benchmark Dataset for Prompt Recovery in Writing Style Transformation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)的提示恢复(Prompt Recovery)问题,特别是针对风格迁移和改述任务中输入提示的重建。与传统的问答型提示恢复不同,本文聚焦于更复杂的风格转换场景。论文的关键在于通过引入一个由LLM辅助构建的数据集,并采用多种方法(如零样本、少量样本、越狱攻击、链式思维、微调以及一种新颖的规范提示回退策略)来测试和优化恢复性能。实验结果显示,单样本(one-shot)提示和微调方法表现最佳,同时揭示了现有句子相似性度量在评估提示恢复效果上的局限性。论文的主要贡献包括构建了一个基准数据集、全面探索了提示恢复策略,并指出了当前评估指标的不足,从而推动了无结构限制的通用提示恢复研究的发展。

链接: https://arxiv.org/abs/2504.04373
作者: Shenyang Liu,Yang Gao,Shaoyan Zhai,Liqiang Wang
机构: Department of Computer Science, University of Central Florida (中佛罗里达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 2024 IEEE International Conference on Big Data (BigData)

点击查看摘要

Abstract:Prompt Recovery, reconstructing prompts from the outputs of large language models (LLMs), has grown in importance as LLMs become ubiquitous. Most users access LLMs through APIs without internal model weights, relying only on outputs and logits, which complicates recovery. This paper explores a unique prompt recovery task focused on reconstructing prompts for style transfer and rephrasing, rather than typical question-answering. We introduce a dataset created with LLM assistance, ensuring quality through multiple techniques, and test methods like zero-shot, few-shot, jailbreak, chain-of-thought, fine-tuning, and a novel canonical-prompt fallback for poor-performing cases. Our results show that one-shot and fine-tuning yield the best outcomes but highlight flaws in traditional sentence similarity metrics for evaluating prompt recovery. Contributions include (1) a benchmark dataset, (2) comprehensive experiments on prompt recovery strategies, and (3) identification of limitations in current evaluation metrics, all of which advance general prompt recovery research, where the structure of the input prompt is unrestricted.
zh

[NLP-67] DDPT: Diffusion-Driven Prompt Tuning for Large Language Model Code Generation ICSE

【速读】: 该论文旨在解决基于大型语言模型(Large Language Models, LLMs)的代码生成中,生成代码质量高度依赖于提示(prompt)结构与组成的问题。由于精心设计高质量提示是一项需要专业知识和技能的任务,论文提出了一种名为扩散驱动提示调优(Diffusion-Driven Prompt Tuning, DDPT)的新方法,以实现从高斯噪声生成最优提示嵌入的自动化提示工程。解决方案的关键在于利用基于扩散的优化抽象出最优提示嵌入的方向向量,并通过LLMs提供的代码生成损失,在训练过程中帮助扩散模型捕捉最优提示嵌入的分布;在采样阶段,训练好的扩散模型能够构建从噪声分布到最优分布的路径,从而有效提升代码生成的提示优化效果。

链接: https://arxiv.org/abs/2504.04351
作者: Jinyang Li,Sangwon Hyun,M. Ali Babar
机构: University of Adelaide (阿德莱德大学)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ICSE CAIN 2025

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation. However, the quality of the generated code is heavily dependent on the structure and composition of the prompts used. Crafting high-quality prompts is a challenging task that requires significant knowledge and skills of prompt engineering. To advance the automation support for the prompt engineering for LLM-based code generation, we propose a novel solution Diffusion-Driven Prompt Tuning (DDPT) that learns how to generate optimal prompt embedding from Gaussian Noise to automate the prompt engineering for code generation. We evaluate the feasibility of diffusion-based optimization and abstract the optimal prompt embedding as a directional vector toward the optimal embedding. We use the code generation loss given by the LLMs to help the diffusion model capture the distribution of optimal prompt embedding during training. The trained diffusion model can build a path from the noise distribution to the optimal distribution at the sampling phase. The evaluation results demonstrate that DDPT improves prompt optimization for code generation.
zh

[NLP-68] Compression Laws for Large Language Models

【速读】: 该论文试图解决语言模型(Language Models, LMs)在压缩后的性能退化问题,研究如何通过模型压缩技术在资源受限的场景下有效部署大规模预训练语言模型(Large Language Models, LLMs)。论文的关键在于揭示压缩比与模型性能之间的关系,并强调恢复微调(recovery fine-tuning)的重要性。研究表明,虽然测试集交叉熵损失随压缩比呈二次增长,但下游任务性能仅线性下降;恢复微调可使压缩模型的测试损失改善高达55%,并在高压缩比(如90%)下实现推理速度提升60%,从而平衡性能退化与计算效率之间的权衡。这一发现为实际应用中利用模型压缩技术提供了实用指导,尤其适用于较大规模模型。

链接: https://arxiv.org/abs/2504.04342
作者: Ayan Sengupta,Siddhant Chaudhary,Tanmoy Chakraborty
机构: 未知
类目: Computation and Language (cs.CL)
备注: 16 pages, 11 figures, 6 tables

点击查看摘要

Abstract:We introduce compression laws for large language models (LLMs). While recent scaling laws have sought to understand how LLMs scale with respect to model size, pre-training data, and computational resources, we focus on understanding how model compression affects the performance of a pre-trained LLM on downstream tasks. We empirically examine the effects of structured model compression on LLMs through over 1000 experiments across eight models with sizes ranging from 0.5B to 14B parameters. Our findings indicate that the test cross-entropy loss increases quadratically with the compression ratio, whereas performance on downstream tasks declines only linearly. Our study emphasizes the importance of recovery fine-tuning, showing that the test loss of compressed LLMs can improve by up to 55% with recovery fine-tuning. At higher compression ratios (up to 90%), compressed LLMs demonstrate a speed increase of 60% during inference compared to their uncompressed counterparts, compensating for the performance degradation at this level. However, for smaller models ( \le 7B ), the computational gains are limited, peaking at just 35%. We conclude that model compression can be highly beneficial for larger models, especially when a smaller model within the same computational budget is not available. These insights provide practical guidelines for applying model compression techniques when adopting LLMs in real-life applications in resource-constrained settings.
zh
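The two reported trends (quadratic growth of test loss, linear decline of downstream scores) can be checked on data by simple curve fitting. The numbers below are synthetic placeholders, not the paper's measurements:

```python
import numpy as np

# Synthetic stand-ins for the paper's measurements: compression ratio r,
# test cross-entropy loss (reported to grow ~quadratically in r), and a
# downstream task score (reported to decline ~linearly in r).
ratios = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 0.9])
test_loss = 2.0 + 1.5 * ratios**2
task_score = 0.70 - 0.25 * ratios

# Fit the two trends; on real data one would compare goodness-of-fit of
# linear vs. quadratic forms rather than assume them.
loss_fit = np.polyfit(ratios, test_loss, deg=2)    # coefficients of a*r^2 + b*r + c
score_fit = np.polyfit(ratios, task_score, deg=1)  # coefficients of m*r + b
print(np.round(loss_fit, 3), np.round(score_fit, 3))
```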

[NLP-69] Generative Large Language Models Trained for Detecting Errors in Radiology Reports

【速读】: 该论文试图解决放射学报告中错误检测的问题。解决方案的关键在于利用生成式大语言模型(Generative Large Language Models, LLMs),通过在合成数据和真实世界数据(如MIMIC-CXR数据库)上的微调(fine-tuning),显著提升了对四种类型错误(否定错误、左右混淆、时间间隔变化和转录错误)的检测能力。研究采用了多种提示策略(零样本提示、少量样本提示)以及模型评估方法(F1分数、95%置信区间、t检验),并通过放射科医生的进一步验证,证明了经过优化的Llama-3-70B-Instruct模型在错误检测任务中的卓越性能。

链接: https://arxiv.org/abs/2504.04336
作者: Cong Sun,Kurt Teichman,Yiliang Zhou,Brian Critelli,David Nauheim,Graham Keir,Xindi Wang,Judy Zhong,Adam E Flanders,George Shih,Yifan Peng
机构: Weill Cornell Medicine (威尔康奈尔医学院); Thomas Jefferson University (托马斯杰斐逊大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this retrospective study, a dataset was constructed with two parts. The first part included 1,656 synthetic chest radiology reports generated by GPT-4 using specified prompts, with 828 being error-free synthetic reports and 828 containing errors. The second part included 614 reports: 307 error-free reports between 2011 and 2016 from the MIMIC-CXR database and 307 corresponding synthetic reports with errors generated by GPT-4 on the basis of these MIMIC-CXR reports and specified prompts. All errors were categorized into four types: negation, left/right, interval change, and transcription errors. Then, several models, including Llama-3, GPT-4, and BiomedBERT, were refined using zero-shot prompting, few-shot prompting, or fine-tuning strategies. Finally, the performance of these models was evaluated using the F1 score, 95% confidence interval (CI) and paired-sample t-tests on our constructed dataset, with the prediction results further assessed by radiologists. Using zero-shot prompting, the fine-tuned Llama-3-70B-Instruct model achieved the best performance with the following F1 scores: 0.769 for negation errors, 0.772 for left/right errors, 0.750 for interval change errors, 0.828 for transcription errors, and 0.780 overall. In the real-world evaluation phase, two radiologists reviewed 200 randomly selected reports output by the model. Of these, 99 were confirmed to contain errors detected by the models by both radiologists, and 163 were confirmed to contain model-detected errors by at least one radiologist. Generative LLMs, fine-tuned on synthetic and MIMIC-CXR radiology reports, greatly enhanced error detection in radiology reports.
zh
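For reference, the per-category F1 metric reported above is just the harmonic mean of precision and recall. The counts in this sketch are hypothetical, not the study's:

```python
from collections import namedtuple

Counts = namedtuple("Counts", "tp fp fn")  # true positives, false positives, false negatives

def f1_score(c):
    """Harmonic mean of precision and recall for one error category."""
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical detection counts per radiology-report error type.
counts = {"negation": Counts(90, 10, 30), "left/right": Counts(75, 15, 30)}
scores = {name: round(f1_score(c), 3) for name, c in counts.items()}
print(scores["negation"])  # 0.818
```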

[NLP-70] Hallucination Detection using Multi-View Attention Features

【速读】: 该论文旨在解决大型语言模型输出中的token级别幻觉检测问题。论文的关键在于从注意力矩阵中提取特征,这些特征分别描述了每个token接收的平均注意力(帮助识别某些token是否过于有影响力或被忽略)、每个token接收的注意力多样性(揭示注意力是否偏向特定子集)以及生成过程中一个token关注的tokens多样性(指示模型是否引用了狭窄或广泛的上下文信息)。这些特征被输入到基于Transformer的分类器中,以实现token级别的分类,从而识别出幻觉片段。实验结果表明,所提出的方法在具有较长输入上下文的任务(如数据到文本生成和摘要任务)中优于强大的基准方法。

链接: https://arxiv.org/abs/2504.04335
作者: Yuya Ogasa,Yuki Arase
机构: The University of Osaka (大阪大学); Institute of Science Tokyo (东京科学大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study tackles token-level hallucination detection in outputs of large language models. Previous studies revealed that attention exhibits irregular patterns when hallucination occurs. Inspired by this, we extract features from the attention matrix that provide complementary views of (a) the average attention each token receives, which helps identify whether certain tokens are overly influential or ignored, (b) the diversity of attention each token receives, which reveals whether attention is biased toward specific subsets, and (c) the diversity of tokens a token attends to during generation, which indicates whether the model references a narrow or broad range of information. These features are input to a Transformer-based classifier to conduct token-level classification to identify hallucinated spans. Experimental results indicate that the proposed method outperforms strong baselines on hallucination detection with longer input contexts, i.e., data-to-text and summarization tasks.
zh
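The three attention views described above can be prototyped in a few lines. The sketch below is an illustrative approximation; the paper's exact feature definitions and its Transformer classifier are not reproduced here, and the function name is ours:

```python
import numpy as np

def attention_features(A, eps=1e-12):
    """Illustrative token-level features from one attention matrix A
    (rows = queries, cols = keys, rows sum to 1). The paper's exact
    definitions may differ; this only mirrors the three views (a)-(c)."""
    received = A.mean(axis=0)                          # (a) average attention each token receives
    col = A / (A.sum(axis=0, keepdims=True) + eps)     # who attends to each key, normalized
    recv_div = -(col * np.log(col + eps)).sum(axis=0)  # (b) diversity of attention received (entropy)
    sent_div = -(A * np.log(A + eps)).sum(axis=1)      # (c) diversity of tokens attended to
    return received, recv_div, sent_div

# 3-token example with a row-stochastic attention matrix.
A = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
received, recv_div, sent_div = attention_features(A)
print(received)  # every token receives 1/3 on average in this symmetric example
```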

[NLP-71] IMPersona: Evaluating Individual Level LM Impersonation

【速读】: 该论文试图解决语言模型在模仿特定个体写作风格和个人知识方面的仿真能力问题。论文的关键在于引入了一个名为 IMPersona 的框架,通过监督微调(Supervised Fine-Tuning)以及受记忆启发的分层检索系统(Hierarchical Memory-Inspired Retrieval System),验证了即使是规模较小的开源模型(如 Llama-3.1-8B-Instruct),也能够以令人担忧的程度实现个性化仿真。实验结果显示,在盲测对话中,参与者有 44.44% 的概率将集成记忆的微调模型误认为人类,而基于最佳提示方法的结果仅为 25.00%。这些结果促使研究者提出相应的检测方法与防御策略,从而探讨个性化语言模型在隐私、安全及实际应用中的潜在影响与风险。

链接: https://arxiv.org/abs/2504.04332
作者: Quan Shi,Carlos Jimenez,Stephen Dong,Brian Seo,Caden Yao,Adam Kelch,Karthik Narasimhan
机构: Princeton Language and Intelligence (PLI), Princeton University (普林斯顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 9 pages main

点击查看摘要

Abstract:As language models achieve increasingly human-like capabilities in conversational text generation, a critical question emerges: to what extent can these systems simulate the characteristics of specific individuals? To evaluate this, we introduce IMPersona, a framework for evaluating LMs at impersonating specific individuals’ writing style and personal knowledge. Using supervised fine-tuning and a hierarchical memory-inspired retrieval system, we demonstrate that even modestly sized open-source models, such as Llama-3.1-8B-Instruct, can achieve impersonation abilities at concerning levels. In blind conversation experiments, participants (mis)identified our fine-tuned models with memory integration as human in 44.44% of interactions, compared to just 25.00% for the best prompting-based approach. We analyze these results to propose detection methods and defense strategies against such impersonation attempts. Our findings raise important questions about both the potential applications and risks of personalized language models, particularly regarding privacy, security, and the ethical deployment of such technologies in real-world contexts.
zh

[NLP-72] Constructing the Truth: Text Mining and Linguistic Networks in Public Hearings of Case 03 of the Special Jurisdiction for Peace (JEP)

【速读】: 本文旨在解决哥伦比亚特别和平司法管辖机构(JEP)第三案中,针对所谓“假阳性”事件(false positives,即被伪报为作战伤亡的法外杀戮)所引发的复杂叙事模式识别与系统化分析问题。论文提出了一种基于自然语言处理与语义共现模型的创新方法,通过构建skipgram网络并分析其模块性,识别出反映区域性及程序状态差异的主题聚类,从而揭示受害、责任归属及承认等动态过程中的实证证据。该研究的关键在于采用计算方法,将司法与非司法真相的建构相结合,并为其他过渡正义案件提供可复制的工具。研究立足于真相、正义、赔偿与不再重犯的核心支柱,倡导对争议记忆进行批判性且深入的解读。

链接: https://arxiv.org/abs/2504.04325
作者: Juan Sosa,Alejandro Urrego,Cesar Prieto,Emma J. Camargo-Díaz
机构: Universidad Nacional de Colombia (哥伦比亚国立大学)
类目: Computation and Language (cs.CL); Applications (stat.AP); Methodology (stat.ME)
备注: 48 pages, in Spanish language, 11 tablas, 24 figures

点击查看摘要

Abstract:Case 03 of the Special Jurisdiction for Peace (JEP), focused on the so-called false positives in Colombia, represents one of the most harrowing episodes of the Colombian armed conflict. This article proposes an innovative methodology based on natural language analysis and semantic co-occurrence models to explore, systematize, and visualize narrative patterns present in the public hearings of victims and appearing parties. By constructing skipgram networks and analyzing their modularity, the study identifies thematic clusters that reveal regional and procedural status differences, providing empirical evidence on dynamics of victimization, responsibility, and acknowledgment in this case. This computational approach contributes to the collective construction of both judicial and extrajudicial truth, offering replicable tools for other transitional justice cases. The work is grounded in the pillars of truth, justice, reparation, and non-repetition, proposing a critical and in-depth reading of contested memories.
zh
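As a minimal illustration of the network-construction step, the sketch below counts skipgram co-occurrences within a small window; community detection and modularity analysis, as in the study, would then run on this weighted edge list. The tokens and window size are hypothetical:

```python
from collections import Counter

def skipgram_edges(tokens, window=2):
    """Count undirected skipgram co-occurrence edges within a +/-window context.

    A minimal sketch of building a co-occurrence network from a transcript;
    the study's actual pipeline (Spanish hearing transcripts, linguist
    annotation, modularity analysis) is far richer.
    """
    edges = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            edges[tuple(sorted((w, tokens[j])))] += 1
    return edges

tokens = "verdad justicia reparacion verdad memoria".split()
edges = skipgram_edges(tokens, window=2)
print(edges[("justicia", "verdad")])  # 2
```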

[NLP-73] Balancing Complexity and Informativeness in LLM -Based Clustering: Finding the Goldilocks Zone

【速读】: 该论文旨在解决短文本聚类中信息性与可解释性之间的权衡问题,传统评估指标往往忽视这一折衷关系。为了解决此问题,论文通过量化信息性和认知简约性之间的权衡来探究最优聚类数量,并利用大语言模型(Large Language Models, LLMs)生成聚类名称,结合语义密度、信息论和聚类准确性进行评估。关键在于发现了一个“恰到好处”的区域,在此区域内聚类保持区分度的同时仍具有良好的可解释性,最终确定了16至22个聚类的最优范围,这一结果与词汇分类中的语言效率相一致。

链接: https://arxiv.org/abs/2504.04314
作者: Justin Miller,Tristram Alexander
机构: University of Sydney (悉尼大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
备注: 12 pages, 4 figures, 2 tables

点击查看摘要

Abstract:The challenge of clustering short text data lies in balancing informativeness with interpretability. Traditional evaluation metrics often overlook this trade-off. Inspired by linguistic principles of communicative efficiency, this paper investigates the optimal number of clusters by quantifying the trade-off between informativeness and cognitive simplicity. We use large language models (LLMs) to generate cluster names and evaluate their effectiveness through semantic density, information theory, and clustering accuracy. Our results show that Gaussian Mixture Model (GMM) clustering on embeddings generated by a LLM, increases semantic density compared to random assignment, effectively grouping similar bios. However, as clusters increase, interpretability declines, as measured by a generative LLM’s ability to correctly assign bios based on cluster names. A logistic regression analysis confirms that classification accuracy depends on the semantic similarity between bios and their assigned cluster names, as well as their distinction from alternatives. These findings reveal a “Goldilocks zone” where clusters remain distinct yet interpretable. We identify an optimal range of 16-22 clusters, paralleling linguistic efficiency in lexical categorization. These insights inform both theoretical models and practical applications, guiding future research toward optimising cluster interpretability and usefulness.
zh
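A toy version of the clustering step can be sketched with a Gaussian Mixture Model over embeddings and a sweep of candidate cluster counts. Here BIC stands in for the paper's interpretability-based selection criterion, and the "embeddings" are synthetic blobs rather than LLM outputs:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for bio embeddings: three well-separated 2-D groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(30, 2)) for c in (-2.0, 0.0, 2.0)])

# Sweep candidate cluster counts and score each fit with BIC; the paper's
# "Goldilocks zone" reflects the same trade-off, but is measured with
# semantic-density and LLM-based interpretability metrics instead of BIC.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in (2, 3, 4, 5)}
best_k = min(bics, key=bics.get)
print(best_k)
```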

[NLP-74] CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization

【速读】: 该论文试图解决的问题是生成式人工智能(Generative AI)在组合优化(Combinatorial Optimization, CO)领域的应用潜力尚未被充分探索,且缺乏全面的基准测试来系统性评估其性能。论文的关键解决方案是引入了一个名为CO-Bench的基准套件,包含来自广泛领域和复杂度级别的36个真实世界组合优化问题,并提供结构化的问题形式和精心策划的数据,以支持对大型语言模型(LLM)代理的严格评估。通过将多种代理框架与现有手工设计的算法进行对比评估,揭示了当前方法的优势和局限性,并指出了未来研究的有前景方向。

链接: https://arxiv.org/abs/2504.04310
作者: Weiwei Sun,Shengyu Feng,Shanda Li,Yiming Yang
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although LLM-based agents have attracted significant attention in domains such as software engineering and machine learning research, their role in advancing combinatorial optimization (CO) remains relatively underexplored. This gap underscores the need for a deeper understanding of their potential in tackling structured, constraint-intensive problems-a pursuit currently limited by the absence of comprehensive benchmarks for systematic investigation. To address this, we introduce CO-Bench, a benchmark suite featuring 36 real-world CO problems drawn from a broad range of domains and complexity levels. CO-Bench includes structured problem formulations and curated data to support rigorous investigation of LLM agents. We evaluate multiple agent frameworks against established human-designed algorithms, revealing key strengths and limitations of current approaches and identifying promising directions for future research. CO-Bench is publicly available at this https URL.
zh

[NLP-75] Gating is Weighting: Understanding Gated Linear Attention through In-context Learning

【速读】: 该论文试图解决的问题是如何提升Gated Linear Attention (GLA) 模型在情境学习 (in-context learning) 中的能力,并理解其背后的机制。论文的关键解决方案在于揭示GLA模型可以通过多层结构实现一类具有数据相关权重的加权预条件梯度下降 (Weighted Preconditioned Gradient Descent, WPGD) 算法,并通过门控机制和输入数据诱导出这些权重,从而控制单个标记对预测的贡献。进一步地,论文引入一个多任务提示的数据模型来刻画学习WPGD算法的优化景观,并在温和条件下证明了全局最小值的存在性和唯一性(仅相差尺度因子)。最终,这些发现被用于探索GLA模型的优化景观,阐明门控如何促进上下文感知学习,并在理论上说明其相对于标准线性注意力的优势。

链接: https://arxiv.org/abs/2504.04308
作者: Yingcong Li,Davoud Ataee Tarzanagh,Ankit Singh Rawat,Maryam Fazel,Samet Oymak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Linear attention methods offer a compelling alternative to softmax attention due to their efficiency in recurrent decoding. Recent research has focused on enhancing standard linear attention by incorporating gating while retaining its computational benefits. Such Gated Linear Attention (GLA) architectures include competitive models such as Mamba and RWKV. In this work, we investigate the in-context learning capabilities of the GLA model and make the following contributions. We show that a multilayer GLA can implement a general class of Weighted Preconditioned Gradient Descent (WPGD) algorithms with data-dependent weights. These weights are induced by the gating mechanism and the input, enabling the model to control the contribution of individual tokens to prediction. To further understand the mechanics of this weighting, we introduce a novel data model with multitask prompts and characterize the optimization landscape of learning a WPGD algorithm. Under mild conditions, we establish the existence and uniqueness (up to scaling) of a global minimum, corresponding to a unique WPGD solution. Finally, we translate these findings to explore the optimization landscape of GLA and shed light on how gating facilitates context-aware learning and when it is provably better than vanilla linear attention.
zh
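The gated recurrence at the heart of GLA-style models can be sketched as a per-step decay on a running state. This is a deliberately simplified scalar-gate version; Mamba, RWKV, and the paper's WPGD analysis use richer, data-dependent gating:

```python
import numpy as np

def gated_linear_attention(Q, K, V, G):
    """Minimal gated linear attention recurrence (illustrative, single head).

    S_t = g_t * S_{t-1} + v_t k_t^T,   o_t = S_t q_t,
    where g_t in [0, 1] is a scalar forget gate per step. The gate controls
    how much each past token still contributes to the prediction.
    """
    T, d = Q.shape
    S = np.zeros((d, d))
    out = np.zeros((T, d))
    for t in range(T):
        S = G[t] * S + np.outer(V[t], K[t])
        out[t] = S @ Q[t]
    return out

rng = np.random.default_rng(0)
T, d = 4, 3
Q, K, V = rng.normal(size=(3, T, d))
out_nogate = gated_linear_attention(Q, K, V, np.ones(T))   # gate=1: vanilla linear attention
out_forget = gated_linear_attention(Q, K, V, np.zeros(T))  # gate=0: only the current token
```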

[NLP-76] Dynamic Hedging Strategies in Derivatives Markets with LLM -Driven Sentiment and News Analytics IJCNN2025

【速读】: 该论文旨在解决衍生品市场中基于固定策略的风险管理效果受限的问题,特别是在波动性和市场情绪对投资表现有显著影响的情境下。论文的关键创新在于提出了一种结合大型语言模型(Large Language Models, LLMs)的新型框架,用于通过文本数据(如新闻文章、社交媒体和财务报告)进行情感分析和新闻舆情解析,以指导动态套期保值决策。该方案的核心在于利用LLMs实时捕捉反映当前市场状况的关键情感指标,并据此动态调整套期保值策略,从而实现优于传统静态方法的风险调整收益。

链接: https://arxiv.org/abs/2504.04295
作者: Jie Yang,Yiqiu Tang,Yongjie Li,Lihua Zhang,Haoran Zhang
机构: The Chinese University of Hong Kong (香港中文大学); Columbia University (哥伦比亚大学); University of Utah (犹他大学); University of California San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
备注: Accepted by IJCNN 2025

点击查看摘要

Abstract:Dynamic hedging strategies are essential for effective risk management in derivatives markets, where volatility and market sentiment can greatly impact performance. This paper introduces a novel framework that leverages large language models (LLMs) for sentiment analysis and news analytics to inform hedging decisions. By analyzing textual data from diverse sources like news articles, social media, and financial reports, our approach captures critical sentiment indicators that reflect current market conditions. The framework allows for real-time adjustments to hedging strategies, adapting positions based on continuous sentiment signals. Backtesting results on historical derivatives data reveal that our dynamic hedging strategies achieve superior risk-adjusted returns compared to conventional static approaches. The incorporation of LLM-driven sentiment analysis into hedging practices presents a significant advancement in decision-making processes within derivatives trading. This research showcases how sentiment-informed dynamic hedging can enhance portfolio management and effectively mitigate associated risks.
zh

[NLP-77] Cross-Asset Risk Management: Integrating LLM s for Real-Time Monitoring of Equity Fixed Income and Currency Markets IJCNN2025

【速读】: 该论文旨在解决跨资产类别风险管理中的实时监控与动态风险评估问题。传统方法在处理多源异构数据时存在局限性,难以快速捕捉市场变化并提供全局视角。论文的关键解决方案在于引入基于大型语言模型(Large Language Models, LLMs)的跨资产风险管理框架,通过整合多元数据源(如金融市场信号、新闻文本和市场报告),利用LLMs强大的文本解析能力,实现对金融市场的全面理解和风险机会识别。该框架通过实时数据分析和先进预测技术,显著提升了市场趋势预测的准确性,并增强了金融机构在复杂市场环境下的风险应对能力,从而推动了金融稳定性的提升。

链接: https://arxiv.org/abs/2504.04292
作者: Jie Yang,Yiqiu Tang,Yongjie Li,Lihua Zhang,Haoran Zhang
机构: The Chinese University of Hong Kong (香港中文大学); Columbia University (哥伦比亚大学); University of Utah (犹他大学); University of Utah (犹他大学); University of California San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
备注: Accepted by IJCNN 2025

点击查看摘要

Abstract:Large language models (LLMs) have emerged as powerful tools in the field of finance, particularly for risk management across different asset classes. In this work, we introduce a Cross-Asset Risk Management framework that utilizes LLMs to facilitate real-time monitoring of equity, fixed income, and currency markets. This innovative approach enables dynamic risk assessment by aggregating diverse data sources, ultimately enhancing decision-making processes. Our model effectively synthesizes and analyzes market signals to identify potential risks and opportunities while providing a holistic view of asset classes. By employing advanced analytics, we leverage LLMs to interpret financial texts, news articles, and market reports, ensuring that risks are contextualized within broader market narratives. Extensive backtesting and real-time simulations validate the framework, showing increased accuracy in predicting market shifts compared to conventional methods. The focus on real-time data integration enhances responsiveness, allowing financial institutions to manage risks adeptly under varying market conditions and promoting financial stability through the advanced application of LLMs in risk analysis.
zh

[NLP-78] Could AI Trace and Explain the Origins of AI-Generated Images and Text?

【速读】: 该论文试图解决的问题是如何系统性地检测和区分由不同大型语言模型(Large Language Models, LLMs)和大型多模态模型(Large Multimodal Models, LMMs)生成的内容,并分析其在一般用途与恶意用途中的差异。此外,论文还关注当前生成式 AI 系统(如 GPT-4o)是否能够解释伪造内容为何被归因于特定的生成模型,以及是否存在相应的基准来评估这一能力。

解决方案的关键在于构建了一个名为 AI-FAKER 的综合性多模态数据集,包含超过 280,000 个样本,涵盖了多种 LLMs 和 LMMs,同时包括一般性和恶意性的应用场景。通过这一数据集,研究揭示了两个关键发现:一是 AI 作者身份检测不仅依赖于生成输出本身,还取决于模型原始训练意图;二是 GPT-4o 在分析 OpenAI 自家模型(如 DALL-E 和 GPT-4o 本身)所生产的内容时,提供了高度一致但较不具体的解释。

链接: https://arxiv.org/abs/2504.04279
作者: Hongchao Fang,Can Qin,Ran Xu,Feng Liu,Yixin Liu,Lichao Sun,Dongwon Lee,Lifu Huang,Wenpeng Yin
机构: Penn State University (宾夕法尼亚州立大学); Salesforce (赛富时); Drexel University (德雷塞尔大学); Lehigh University (里海大学); UC Davis (加州大学戴维斯分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:AI-generated content is becoming increasingly prevalent in the real world, leading to serious ethical and societal concerns. For instance, adversaries might exploit large multimodal models (LMMs) to create images that violate ethical or legal standards, while paper reviewers may misuse large language models (LLMs) to generate reviews without genuine intellectual effort. While prior work has explored detecting AI-generated images and texts, and occasionally tracing their source models, there is a lack of a systematic and fine-grained comparative study. Important dimensions–such as AI-generated images vs. text, fully vs. partially AI-generated images, and general vs. malicious use cases–remain underexplored. Furthermore, whether AI systems like GPT-4o can explain why certain forged content is attributed to specific generative models is still an open question, with no existing benchmark addressing this. To fill this gap, we introduce AI-FAKER, a comprehensive multimodal dataset with over 280,000 samples spanning multiple LLMs and LMMs, covering both general and malicious use cases for AI-generated images and texts. Our experiments reveal two key findings: (i) AI authorship detection depends not only on the generated output but also on the model’s original training intent; and (ii) GPT-4o provides highly consistent but less specific explanations when analyzing content produced by OpenAI’s own models, such as DALL-E and GPT-4o itself.
zh

[NLP-79] Beyond the Hype: Embeddings vs. Prompting for Multiclass Classification Tasks

【速读】: 该论文试图解决传统分类方法在人工智能时代是否变得无关紧要的问题,并探索在多类别分类任务中,是否存在预测模型整体性能优于基于大型语言模型(LLM)提示的方法的情形。研究以Thumbtack客户提供的家庭服务项目描述(包含文本和图像)为数据源,构建基于嵌入向量(embeddings-based)的Softmax模型来预测每个问题描述对应的专家类别(如杂工、浴室改造等)。解决方案的关键在于采用基于嵌入向量的分类方法,该方法通过高效利用专有数据集,在准确性(比提示方法高49.5%)、校准性、延迟时间以及成本方面均显著优于基于提示的LLM方法,同时提供更可靠的置信信号用于实际部署中的上下文用户体验。此外,嵌入向量方法在处理图像和文本时分别快14倍和81倍,并且在实际部署条件下成本可降低至其十分之一。最终通过A/B测试验证了离线分析结果的一致性,表明在可以利用专有数据集的多类别分类任务中,基于嵌入向量的方法能够带来明确的优势。

链接: https://arxiv.org/abs/2504.04277
作者: Marios Kokkodis,Richard Demsyn-Jones,Vijay Raghavan
机构: Thumbtack
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Are traditional classification approaches irrelevant in this era of AI hype? We show that there are multiclass classification problems where predictive models holistically outperform LLM prompt-based frameworks. Given text and images from home-service project descriptions provided by Thumbtack customers, we build embeddings-based softmax models that predict the professional category (e.g., handyman, bathroom remodeling) associated with each problem description. We then compare against prompts that ask state-of-the-art LLM models to solve the same problem. We find that the embeddings approach outperforms the best LLM prompts in terms of accuracy, calibration, latency, and financial cost. In particular, the embeddings approach has 49.5% higher accuracy than the prompting approach, and its superiority is consistent across text-only, image-only, and text-image problem descriptions. Furthermore, it yields well-calibrated probabilities, which we later use as confidence signals to provide contextualized user experience during deployment. On the contrary, prompting scores are overly uninformative. Finally, the embeddings approach is 14 and 81 times faster than prompting in processing images and text respectively, while under realistic deployment assumptions, it can be up to 10 times cheaper. Based on these results, we deployed a variation of the embeddings approach, and through A/B testing we observed performance consistent with our offline analysis. Our study shows that for multiclass classification problems that can leverage proprietary datasets, an embeddings-based approach may yield unequivocally better results. Hence, scientists, practitioners, engineers, and business leaders can use our study to go beyond the hype and consider appropriate predictive models for their classification use cases.
zh
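A minimal version of the embeddings-based approach is a softmax (multinomial logistic) classifier over precomputed embeddings. The features and labels below are synthetic stand-ins for Thumbtack's proprietary data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic "embeddings" of project descriptions, one blob per category
# (e.g., handyman / bathroom remodeling / other); real embeddings would
# come from a text or image encoder.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 3.0], [3.0, 0.0], [-3.0, -3.0]])
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in centers])
y = np.repeat([0, 1, 2], 50)

clf = LogisticRegression(max_iter=1000).fit(X, y)  # softmax over categories
probs = clf.predict_proba(centers)                 # class probabilities usable as confidence signals
print(clf.score(X, y))  # near-perfect on this easy synthetic data
```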

[NLP-80] negativas: a prototype for searching and classifying sentential negation in speech data

【速读】: 该论文试图解决巴西葡萄牙语中“não”在不同位置表达否定(NEG1、NEG2、NEG3)时因频率差异导致的研究结果主观且难以推广的问题。为应对这一挑战,论文的关键解决方案是开发了一款名为“negativas”的自动化工具,用于识别和分类转录数据中的NEG1、NEG2和NEG3结构。该工具通过分析标注数据集、应用自然语言处理技术创建代码、运行工具以及评估准确性等四个阶段实现,并达到了93%的成功率。尽管如此,工具在处理低频的NEG2结构时仍存在一定局限性。

链接: https://arxiv.org/abs/2504.04275
作者: Túlio Sousa de Gois,Paloma Batista Cardoso
机构: Federal University of Sergipe (塞尔希培联邦大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Negation is a universal feature of natural languages. In Brazilian Portuguese, the most commonly used negation particle is não, which can scope over nouns or verbs. When it scopes over a verb, não can occur in three positions: pre-verbal (NEG1), double negation (NEG2), or post-verbal (NEG3), e.g., não gosto, não gosto não, gosto não (“I do not like it”). From a variationist perspective, these structures are different forms of expressing negation. Pragmatically, they serve distinct communicative functions, such as politeness and modal evaluation. Despite their grammatical acceptability, these forms differ in frequency. NEG1 dominates across Brazilian regions, while NEG2 and NEG3 appear more rarely, suggesting its use is contextually restricted. This low-frequency challenges research, often resulting in subjective, non-generalizable interpretations of verbal negation with não. To address this, we developed negativas, a tool for automatically identifying NEG1, NEG2, and NEG3 in transcribed data. The tool’s development involved four stages: i) analyzing a dataset of 22 interviews from the Falares Sergipanos database, annotated by three linguists, ii) creating a code using natural language processing (NLP) techniques, iii) running the tool, iv) evaluating accuracy. Inter-annotator consistency, measured using Fleiss’ Kappa, was moderate (0.57). The tool identified 3,338 instances of não, classifying 2,085 as NEG1, NEG2, or NEG3, achieving a 93% success rate. However, negativas has limitations. NEG1 accounted for 91.5% of identified structures, while NEG2 and NEG3 represented 7.2% and 1.2%, respectively. The tool struggled with NEG2, sometimes misclassifying instances as overlapping structures (NEG1/NEG2/NEG3).
zh
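摘要中用 Fleiss' Kappa 衡量三位语言学家标注的一致性(0.57,中等)。下面用纯 Python 给出该系数的最小计算示例;矩阵数据为虚构的玩具数据,并非论文的标注数据:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa。ratings: 行 = 样本, 列 = 类别,
    单元格 = 将该样本标为该类别的标注者数(每行总和相同)。"""
    n_items = len(ratings)
    n_raters = sum(ratings[0])           # 每个样本的标注者数
    n_total = n_items * n_raters

    # 所有标注落入各类别的比例
    p_cat = [sum(row[j] for row in ratings) / n_total
             for j in range(len(ratings[0]))]

    # 每个样本上,意见一致的标注者对的比例
    p_item = [(sum(c * c for c in row) - n_raters)
              / (n_raters * (n_raters - 1)) for row in ratings]

    p_bar = sum(p_item) / n_items        # 观察一致率
    p_exp = sum(p * p for p in p_cat)    # 期望(随机)一致率
    return (p_bar - p_exp) / (1 - p_exp)

# 玩具示例:4 个样本、3 位标注者、3 个类别(如 NEG1/NEG2/NEG3)
ratings = [[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1]]
print(round(fleiss_kappa(ratings), 3))  # → 0.268
```

Kappa 用期望一致率修正观察一致率;按常见解读,0.41–0.60 属于中等一致。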

[NLP-81] Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models

【速读】: 该论文旨在解决多语言语言模型(MLMs)在不同语言中对语义等价提示(semantically equivalent prompts)提供不一致响应的问题。研究表明,尽管先前的研究指出了这一跨语言不一致性问题,但其根本原因尚未被充分探索。论文通过机制可解释性方法发现,MLMs 在大部分层中以语言无关的概念空间(language-independent concept space)存储知识,仅在最终层才过渡到语言特定的空间(language-specific spaces)。这种过渡失败常常导致目标语言中的错误预测,即使其他语言的回答是正确的。为缓解此不一致性问题,论文提出了一种线性捷径方法(linear shortcut method),通过绕过最终层的计算来提升预测准确性和跨语言一致性。解决方案的关键在于通过引入线性捷径方法,在不显著增加模型复杂度的前提下,有效改善了多语言语言模型的跨语言一致性。

链接: https://arxiv.org/abs/2504.04264
作者: Mingyang Wang,Heike Adel,Lukas Lange,Yihong Liu,Ercong Nie,Jannik Strötgen,Hinrich Schütze
机构: Bosch Center for Artificial Intelligence (博世人工智能中心), Germany; LMU Munich (慕尼黑大学), Germany; Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心), Germany; Hochschule der Medien (斯图加特媒体大学), Germany; Karlsruhe University of Applied Sciences (卡尔斯鲁厄应用技术大学), Germany
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multilingual language models (MLMs) store factual knowledge across languages but often struggle to provide consistent responses to semantically equivalent prompts in different languages. While previous studies point out this cross-lingual inconsistency issue, the underlying causes remain unexplored. In this work, we use mechanistic interpretability methods to investigate cross-lingual inconsistencies in MLMs. We find that MLMs encode knowledge in a language-independent concept space through most layers, and only transition to language-specific spaces in the final layers. Failures during the language transition often result in incorrect predictions in the target language, even when the answers are correct in other languages. To mitigate this inconsistency issue, we propose a linear shortcut method that bypasses computations in the final layers, enhancing both prediction accuracy and cross-lingual consistency. Our findings shed light on the internal mechanisms of MLMs and provide a lightweight, effective strategy for producing more consistent factual outputs.
zh

[NLP-82] Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models

【速读】: 该论文试图从机制角度探究大型语言模型(Large Language Models, LLMs)中心智理论(Theory-of-Mind, ToM)能力的涌现,并聚焦于极端稀疏参数模式的作用。研究提出了一种新方法来识别与ToM相关的敏感参数,发现仅扰动这些参数中的0.001%即可显著削弱ToM性能,同时影响上下文定位和语言理解能力。关键在于揭示这些敏感参数与位置编码模块(尤其是使用旋转位置嵌入,Rotary Position Embedding, RoPE的模型)之间的紧密联系,其扰动会破坏上下文处理所需的主导频率激活;此外,还表明这些参数通过调节位置编码下查询与键之间的角度影响注意力机制。因此,论文通过分析这些参数与LLMs核心架构组件的交互,为理解LLMs如何获得社会推理能力提供了新的见解,并促进了AI可解释性与认知科学的结合。

链接: https://arxiv.org/abs/2504.04238
作者: Yuheng Wu,Wentao Guo,Zirui Liu,Heng Ji,Zhaozhuo Xu,Denghui Zhang
机构: Department of Electrical Engineering, Stanford University (斯坦福大学); Department of Computer Science, Princeton University (普林斯顿大学); Department of Computer Science & Engineering, University of Minnesota Twin Cities (明尼苏达大学双城分校); Department of Computer Science, University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Department of Computer Science, Stevens Institute of Technology (史蒂文斯理工学院); School of Business, Stevens Institute of Technology (史蒂文斯理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper investigates the emergence of Theory-of-Mind (ToM) capabilities in large language models (LLMs) from a mechanistic perspective, focusing on the role of extremely sparse parameter patterns. We introduce a novel method to identify ToM-sensitive parameters and reveal that perturbing as little as 0.001% of these parameters significantly degrades ToM performance while also impairing contextual localization and language understanding. To understand this effect, we analyze their interaction with core architectural components of LLMs. Our findings demonstrate that these sensitive parameters are closely linked to the positional encoding module, particularly in models using Rotary Position Embedding (RoPE), where perturbations disrupt dominant-frequency activations critical for contextual processing. Furthermore, we show that perturbing ToM-sensitive parameters affects LLM’s attention mechanism by modulating the angle between queries and keys under positional encoding. These insights provide a deeper understanding of how LLMs acquire social reasoning abilities, bridging AI interpretability with cognitive science. Our results have implications for enhancing model alignment, mitigating biases, and improving AI systems designed for human interaction.
zh
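摘要指出,扰动 ToM 敏感参数会调制 RoPE 位置编码下 query 与 key 之间的夹角。下面用二维(复数表示)的玩具例子演示 RoPE 的核心性质:旋转后位置 m 的 query 与位置 n 的 key 之间的夹角只依赖相对位移 m - n(示意性代码,非论文实现):

```python
import cmath

def rope_rotate(vec, pos, theta):
    """对一个二维 (query/key) 分量对施加 RoPE:以复数表示,旋转 pos * theta。"""
    return vec * cmath.exp(1j * pos * theta)

def angle_between(a, b):
    """两个二维向量(复数表示)之间的夹角。"""
    return abs(cmath.phase(a / b))

theta = 0.1
q, k = complex(1.0, 0.5), complex(0.8, -0.3)

# 位置 m 的 q 与位置 n 的 k 之间的夹角只依赖 m - n
a1 = angle_between(rope_rotate(q, 7, theta), rope_rotate(k, 4, theta))
a2 = angle_between(rope_rotate(q, 12, theta), rope_rotate(k, 9, theta))
print(round(a1, 6) == round(a2, 6))  # 两处相对位移均为 3 → True
```

这一性质意味着注意力得分只随相对位置变化;若某个频率分量的 θ 被扰动破坏,query 与 key 的相对夹角也随之失真,这与摘要中"扰动破坏主导频率激活"的发现相呼应。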

[NLP-83] A Perplexity and Menger Curvature-Based Approach for Similarity Evaluation of Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在数据和模型使用过程中可能引发的版权侵权及不道德实践问题,特别是因轻微修改现有模型而冒充新模型开发所导致的模型抄袭和所有权侵犯等问题。论文的关键解决方案在于提出了一种新颖的度量方法,用于量化LLMs之间的相似性,该方法利用了困惑度曲线(perplexity curves)以及Menger曲率差异。通过综合实验验证,此方法不仅表现出优于基线方法的性能,还展示了其在不同模型和领域中的通用性,并能够有效检测模型复制行为,从而保护LLMs的原创性和完整性。

链接: https://arxiv.org/abs/2504.04216
作者: Yuantao Zhang,Zhankui Yang
机构: National University of Singapore (新加坡国立大学); National Supercomputing Center in Shenzhen (深圳国家超级计算中心)
类目: Computation and Language (cs.CL)
备注: 13 pages

点击查看摘要

Abstract:The rise of Large Language Models (LLMs) has brought about concerns regarding copyright infringement and unethical practices in data and model usage. For instance, slight modifications to existing LLMs may be used to falsely claim the development of new models, leading to issues of model copying and violations of ownership rights. This paper addresses these challenges by introducing a novel metric for quantifying LLM similarity, which leverages perplexity curves and differences in Menger curvature. Comprehensive experiments validate the performance of our methodology, demonstrating its superiority over baseline methods and its ability to generalize across diverse models and domains. Furthermore, we highlight the capability of our approach in detecting model replication through simulations, emphasizing its potential to preserve the originality and integrity of LLMs. Code is available at this https URL.
zh
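论文的相似度度量基于困惑度曲线上的 Menger 曲率差异。三点的 Menger 曲率定义为 4·三角形面积 / 三条边长之积(三点共线时为 0,否则等于外接圆半径的倒数)。下面是一个示意性实现,困惑度曲线为虚构数据,并非论文代码:

```python
import math

def menger_curvature(p1, p2, p3):
    """三个二维点的 Menger 曲率:4 * 面积 / (|p1p2| * |p2p3| * |p1p3|)。"""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    area2 = abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))  # 2 * 面积
    d12, d23, d13 = math.dist(p1, p2), math.dist(p2, p3), math.dist(p1, p3)
    if d12 * d23 * d13 == 0:
        return 0.0
    return 2 * area2 / (d12 * d23 * d13)

def curvature_profile(curve):
    """折线每个内部点处的曲率,例如按 (位置, 困惑度) 采样的困惑度曲线。"""
    return [menger_curvature(curve[i - 1], curve[i], curve[i + 1])
            for i in range(1, len(curve) - 1)]

# 两个假想模型的"困惑度曲线"(第二条是第一条的整体平移)
curve_a = [(0, 5.0), (1, 3.0), (2, 2.5), (3, 2.4)]
curve_b = [(0, 5.1), (1, 3.1), (2, 2.6), (3, 2.5)]
diff = sum(abs(a - b) for a, b in zip(curvature_profile(curve_a),
                                      curvature_profile(curve_b)))
print(diff < 1e-6)  # 平移不改变曲率 → True
```

直觉上,两个模型若只是彼此的轻微改动,其困惑度曲线的形状(曲率序列)会高度相近,曲率差异因此可作为模型相似度的信号。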

[NLP-84] Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability

【速读】: 该论文旨在解决压缩大型语言模型过程中安全性下降的问题。研究发现,尽管已有大量工作从安全性角度探索模型压缩,但安全对齐的模型在压缩后往往丧失部分可信性。为应对这一挑战,论文提出了一种基于可解释性的新方法,通过分析拒绝机制(refusal mechanisms)来评估压缩模型的安全性,并进一步利用可解释性分析的结果,设计了一种轻量级且计算高效的方案,在不牺牲模型性能和实用性的情况下提升压缩模型的安全性。关键在于结合机制解析与创新性评估框架,实现对压缩模型安全性的增强。

链接: https://arxiv.org/abs/2504.04215
作者: Vishnu Kabir Chhabra,Mohammad Mahdi Khalili
机构: The Ohio State University (俄亥俄州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid growth of large language models has spurred significant interest in model compression as a means to enhance their accessibility and practicality. While extensive research has explored model compression through the lens of safety, findings suggest that safety-aligned models often lose elements of trustworthiness post-compression. Simultaneously, the field of mechanistic interpretability has gained traction, with notable discoveries, such as the identification of a single direction in the residual stream mediating refusal behaviors across diverse model architectures. In this work, we investigate the safety of compressed models by examining the mechanisms of refusal, adopting a novel interpretability-driven perspective to evaluate model safety. Furthermore, leveraging insights from our interpretability analysis, we propose a lightweight, computationally efficient method to enhance the safety of compressed models without compromising their performance or utility.
zh
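摘要提到已有工作发现残差流中存在一条介导拒绝行为的单一方向。下面示意"方向消融"的基本操作:把激活在该方向上的分量投影掉,使其与该方向正交。向量均为虚构;实际工作中该方向常由"拒绝/非拒绝"两组提示的激活差估计,此处仅演示投影这一步:

```python
import math

def project_out(x, d):
    """从激活 x 中去除沿方向 d 的分量:x' = x - (x . d_hat) d_hat。"""
    norm = math.sqrt(sum(v * v for v in d))
    d_hat = [v / norm for v in d]
    coeff = sum(xv * dv for xv, dv in zip(x, d_hat))
    return [xv - coeff * dv for xv, dv in zip(x, d_hat)]

# 玩具残差流激活与一个假设的"拒绝方向"
x = [1.0, 2.0, -0.5, 3.0]
d = [0.0, 1.0, 0.0, 1.0]

x_ablated = project_out(x, d)
# 消融后,激活与拒绝方向正交
dot = sum(a * b for a, b in zip(x_ablated, d))
print(abs(dot) < 1e-9)  # → True
```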

[NLP-85] Adaptive Elicitation of Latent Information Using Natural Language

【速读】: 该论文旨在解决在自然语言环境中,利用大型语言模型(Large Language Models, LLMs)和现有微调算法进行信息收集时缺乏战略性机制以有效减少关于潜在实体不确定性的问题。论文的关键解决方案在于提出了一种自适应信息收集框架,该框架通过主动降低潜在实体的不确定性来发展有效的信息收集策略。为了实现这一目标,框架采用了元学习的语言模型对未来观测进行模拟,并通过自回归前向仿真量化新问题如何减少认识论不确定性(epistemic uncertainty),从而制定出复杂且高效的自然语言信息收集策略,选择最具信息量的后续查询。实验结果表明,该方法在识别关键未知因素和改进下游预测方面优于基线方法。

链接: https://arxiv.org/abs/2504.04204
作者: Jimmy Wang,Thomas Zollo,Richard Zemel,Hongseok Namkoong
机构: Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Eliciting information to reduce uncertainty about a latent entity is a critical task in many application domains, e.g., assessing individual student learning outcomes, diagnosing underlying diseases, or learning user preferences. Though natural language is a powerful medium for this purpose, large language models (LLMs) and existing fine-tuning algorithms lack mechanisms for strategically gathering information to refine their own understanding of the latent entity. To harness the generalization power and world knowledge of LLMs in developing effective information-gathering strategies, we propose an adaptive elicitation framework that actively reduces uncertainty on the latent entity. Since probabilistic modeling of an abstract latent entity is difficult, our framework adopts a predictive view of uncertainty, using a meta-learned language model to simulate future observations and enable scalable uncertainty quantification over complex natural language. Through autoregressive forward simulation, our model quantifies how new questions reduce epistemic uncertainty, enabling the development of sophisticated information-gathering strategies to choose the most informative next queries. In experiments on the 20 questions game, dynamic opinion polling, and adaptive student assessment, our method consistently outperforms baselines in identifying critical unknowns and improving downstream predictions, illustrating the promise of strategic information gathering in natural language settings.
zh
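该框架通过模拟未来观测来量化新问题能减少多少认识论不确定性。下面用离散信念上的期望信息增益(熵减)给出一个玩具示例,说明"选择最具信息量的下一问"这一思路;代码为示意,并非论文的实现:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def expected_info_gain(prior, likelihoods):
    """问一个是/否问题带来的期望熵减。
    prior: 对各假设的信念; likelihoods: P(回答为"是" | 假设)。"""
    p_yes = sum(p * l for p, l in zip(prior, likelihoods))
    p_no = 1 - p_yes
    post_yes = ([p * l / p_yes for p, l in zip(prior, likelihoods)]
                if p_yes > 0 else prior)
    post_no = ([p * (1 - l) / p_no for p, l in zip(prior, likelihoods)]
               if p_no > 0 else prior)
    return entropy(prior) - (p_yes * entropy(post_yes) + p_no * entropy(post_no))

# 对 4 个候选假设的均匀信念(如 20 questions 中的候选实体)
prior = [0.25] * 4
split_half = [1, 1, 0, 0]   # 把假设对半切分的问题
split_none = [1, 1, 1, 1]   # 答案已被确定的问题
print(expected_info_gain(prior, split_half))  # → 1.0
print(expected_info_gain(prior, split_none))  # → 0.0
```

对半切分的问题带来 1 比特信息,答案已知的问题带来 0 比特;贪心地选增益最大的问题即得到类似二分查找的提问策略。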

[NLP-86] GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models

【速读】: 该论文试图解决现有大型语言模型(Large Language Models, LLMs)评估框架过度聚焦于英语及少数高资源语言的问题,导致其在多语言和低资源场景下的真实性能被忽视。为填补这一空白,论文提出GlotEval,这是一个轻量级的多语言评估框架。其关键在于支持多种任务(包括机器翻译、文本分类、摘要生成等),覆盖数十至数百种语言,并通过一致的多语言基准测试、语言特定的提示模板以及非英语中心的机器翻译方法,实现对模型在多样化语言环境中优势与不足的精确诊断。

链接: https://arxiv.org/abs/2504.04155
作者: Hengyu Luo,Zihao Li,Joseph Attieh,Sawal Devkota,Ona de Gibert,Shaoxiong Ji,Peiqin Lin,Bhavani Sai Praneeth Varma Mantina,Ananda Sreenidhi,Raúl Vázquez,Mengjie Wang,Samea Yusofi,Jörg Tiedemann
机构: University of Helsinki (赫尔辛基大学), Finland; Technical University of Darmstadt (达姆施塔特工业大学), Germany; University of Munich (慕尼黑大学), Germany
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are advancing at an unprecedented pace globally, with regions increasingly adopting these models for applications in their primary language. Evaluation of these models in diverse linguistic environments, especially in low-resource languages, has become a major challenge for academia and industry. Existing evaluation frameworks are disproportionately focused on English and a handful of high-resource languages, thereby overlooking the realistic performance of LLMs in multilingual and lower-resource scenarios. To address this gap, we introduce GlotEval, a lightweight framework designed for massively multilingual evaluation. Supporting seven key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, and intrinsic evaluation), spanning over dozens to hundreds of languages, GlotEval highlights consistent multilingual benchmarking, language-specific prompt templates, and non-English-centric machine translation. This enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts. A multilingual translation case study demonstrates GlotEval’s applicability for multilingual and language-specific evaluations.
zh

[NLP-87] Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLM s Across Languages and Resources

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在不同语言间的性能显著差异问题,主要关注如何通过持续预训练(Continual Pretraining, CPT)技术缓解高资源语言受益而低资源语言被边缘化的情况。论文的关键在于系统性评估了多种CPT配置的效果,包括单语、双语以及包含代码增强的数据策略,并揭示了三种主要发现:双语CPT虽提升多语言分类性能但可能引发语言混合问题;加入编程代码数据可一致提高多语言分类准确性,尤其对低资源语言有益,但会轻微降低生成质量;语言分类对跨语言迁移的影响与传统认知存在较大偏差,不同类型语言(利他、自私、停滞)表现出复杂且条件依赖的行为模式。这些发现强调了多语言表示学习的复杂性,并指出未来需要更系统的语言分类研究以指导更具普适性的多语言CPT策略。

链接: https://arxiv.org/abs/2504.04152
作者: Zihao Li,Shaoxiong Ji,Hengyu Luo,Jörg Tiedemann
机构: University of Helsinki (赫尔辛基大学); Technical University of Darmstadt (达姆施塔特工业大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit significant disparities in performance across languages, primarily benefiting high-resource languages while marginalizing underrepresented ones. Continual Pretraining (CPT) has emerged as a promising approach to address this imbalance, although the relative effectiveness of monolingual, bilingual, and code-augmented data strategies remains unclear. This study systematically evaluates 36 CPT configurations involving three multilingual base models, across 30+ languages categorized as altruistic, selfish, and stagnant, spanning various resource levels. Our findings reveal three major insights: (1) Bilingual CPT improves multilingual classification but often causes language mixing issues during generation. (2) Including programming code data during CPT consistently enhances multilingual classification accuracy, particularly benefiting low-resource languages, but introduces a trade-off by slightly degrading generation quality. (3) Contrary to prior work, we observe substantial deviations from language classifications according to their impact on cross-lingual transfer: Languages classified as altruistic often negatively affect related languages, selfish languages show conditional and configuration-dependent behavior, and stagnant languages demonstrate surprising adaptability under certain CPT conditions. These nuanced interactions emphasize the complexity of multilingual representation learning, underscoring the importance of systematic studies on generalizable language classification to inform future multilingual CPT strategies.
zh

[NLP-88] STEP: Staged Parameter-Efficient Pre-training for Large Language Models NAACL2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)预训练过程中因模型参数规模庞大而导致的显著内存挑战。论文提出了一种名为STaged parameter-Efficient Pre-training(STEP)的方法,其关键是将参数高效微调技术与模型扩展相结合,在保持等效性能的同时,将最大内存需求减少了高达53.9%,相较于传统的预训练方法。此外,经过指令微调后,STEP训练出的模型在下游任务中的表现与传统预训练模型相当。

链接: https://arxiv.org/abs/2504.04151
作者: Kazuki Yano,Takumi Ito,Jun Suzuki
机构: Tohoku University (东北大学); Langsmith Inc. (朗斯密斯股份有限公司); RIKEN; NII LLMC (国立情报学研究所大型语言模型中心)
类目: Computation and Language (cs.CL)
备注: Accepted to NAACL 2025 Main

点击查看摘要

Abstract:Pre-training large language models (LLMs) faces significant memory challenges due to the large size of model parameters. We introduce STaged parameter-Efficient Pre-training (STEP), which integrates parameter-efficient tuning techniques with model growth. We conduct experiments on pre-training LLMs of various sizes and demonstrate that STEP achieves up to a 53.9% reduction in maximum memory requirements compared to vanilla pre-training while maintaining equivalent performance. Furthermore, we show that the model by STEP performs comparably to vanilla pre-trained models on downstream tasks after instruction tuning.
zh

[NLP-89] Reasoning on Multiple Needles In A Haystack

【速读】: 本文旨在解决基于记忆的回答(Memory-based Answering)问题,具体而言,针对长上下文问答任务中大型语言模型(Large Language Models, LLMs)在处理多跳推理(multi-hop reasoning)时面临的性能下降问题。论文指出,现有方法存在两个主要不足:一是未能解决模型直接依据内部知识作答(而非基于上下文检索)的问题;二是无法解释或缓解随着上下文长度增加导致的准确性下降现象。为了解决这些问题,论文的关键创新在于将思考过程分解为检索(retrieval)与推理(reasoning)两个阶段,并引入了一种反思机制(reflection mechanism),通过多轮扩展增强模型的推理能力。此外,作者利用生成的迭代思考过程训练模型,以减轻性能退化问题。最后,论文展示了这种基于检索与反思的能力在数学推理场景中的应用效果,显著提升了GPT-4o在AIME2024竞赛中的表现。

链接: https://arxiv.org/abs/2504.04150
作者: Yidong Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Needle In A Haystack (NIAH) task has been widely used to evaluate the long-context question-answering capabilities of Large Language Models (LLMs). However, its reliance on simple retrieval limits its effectiveness. To address this limitation, recent studies have introduced the Multiple Needles In A Haystack Reasoning (MNIAH-R) task, which incorporates supporting documents (Multiple needles) of multi-hop reasoning tasks into a distracting context (Haystack). Despite this advancement, existing approaches still fail to address the issue of models providing direct answers from internal knowledge, and they do not explain or mitigate the decline in accuracy as context length increases. In this paper, we tackle the memory-based answering problem by filtering out direct-answer questions, and we reveal that performance degradation is primarily driven by the reduction in the length of the thinking process as the input length increases. Building on this insight, we decompose the thinking process into retrieval and reasoning stages and introduce a reflection mechanism for multi-round extension. We also train a model using the generated iterative thinking process, which helps mitigate the performance degradation. Furthermore, we demonstrate the application of this retrieval-reflection capability in mathematical reasoning scenarios, improving GPT-4o’s performance on AIME2024.
zh

[NLP-90] My Life in Artificial Intelligence: People anecdotes and some lessons learnt

【速读】: 该论文并非致力于解决具体的技术问题,而是通过作者四十年在人工智能(尤其是自然语言处理)领域的研究与教育经历,探讨行业与学术界的选择及其背后的历史和发展脉络。论文的关键在于通过个人故事和历史回顾,为年轻的研究人员提供启发和参考,帮助他们在当前人工智能快速发展的背景下做出适合自身的职业与生活选择。文中强调好奇心与时代背景对研究方向的影响,并通过跨文化、跨国界的工作经验展示多样化的视角。

链接: https://arxiv.org/abs/2504.04142
作者: Kees van Deemter
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 34 pages

点击查看摘要

Abstract:In this very personal workography, I relate my 40-year experiences as a researcher and educator in and around Artificial Intelligence (AI), more specifically Natural Language Processing. I describe how curiosity, and the circumstances of the day, led me to work in both industry and academia, and in various countries, including The Netherlands (Amsterdam, Eindhoven, and Utrecht), the USA (Stanford), England (Brighton), Scotland (Aberdeen), and China (Beijing and Harbin). People and anecdotes play a large role in my story; the history of AI forms its backdrop. I focus on things that might be of interest to (even) younger colleagues, given the choices they face in their own work and life at a time when AI is finally emerging from the shadows.
zh

[NLP-91] Cognitive Debiasing Large Language Models for Decision-Making

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在支持决策任务(如金融、医疗和法律领域中的个人会话助手)时,因认知偏差(Cognitive Biases)导致输出准确性下降的问题。现有认知偏差缓解策略假设输入提示仅包含单一类型的认知偏差,因此在实际场景中表现不佳,尤其是在存在多种偏差的情况下。为填补这一空白,论文提出了一种名为“自我去偏”(self-debiasing)的方法,通过迭代优化提示来增强LLMs的可靠性。该方法的关键在于其三步迭代过程:偏差确定(Bias Determination)、偏差分析(Bias Analysis)和认知去偏(Cognitive Debiasing),从而系统性地减轻提示中的潜在认知偏差。实验结果表明,“自我去偏”方法在无偏差、单一偏差和多偏差设置下的平均准确率优于先进的提示工程方法和现有的认知去偏技术。

链接: https://arxiv.org/abs/2504.04141
作者: Yougang Lyu,Shijie Ren,Yue Feng,Zihan Wang,Zhumin Chen,Zhaochun Ren,Maarten de Rijke
机构: University of Amsterdam (阿姆斯特丹大学); Shandong University (山东大学); University of Birmingham (伯明翰大学); Leiden University (莱顿大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown potential in supporting decision-making applications, particularly as personal conversational assistants in the financial, healthcare, and legal domains. While prompt engineering strategies have enhanced the capabilities of LLMs in decision-making, cognitive biases inherent to LLMs present significant challenges. Cognitive biases are systematic patterns of deviation from norms or rationality in decision-making that can lead to the production of inaccurate outputs. Existing cognitive bias mitigation strategies assume that input prompts contain (exactly) one type of cognitive bias and therefore fail to perform well in realistic settings where there may be any number of biases. To fill this gap, we propose a cognitive debiasing approach, called self-debiasing, that enhances the reliability of LLMs by iteratively refining prompts. Our method follows three sequential steps – bias determination, bias analysis, and cognitive debiasing – to iteratively mitigate potential cognitive biases in prompts. Experimental results on finance, healthcare, and legal decision-making tasks, using both closed-source and open-source LLMs, demonstrate that the proposed self-debiasing method outperforms both advanced prompt engineering methods and existing cognitive debiasing techniques in average accuracy under no-bias, single-bias, and multi-bias settings.
zh
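论文的自我去偏方法按"偏差确定 → 偏差分析 → 认知去偏"三步迭代优化提示。下面用一个占位的 llm 可调用对象勾勒该循环的骨架;其中的指令措辞和 fake_llm 均为演示用假设,并非论文的实际提示词或实现:

```python
def self_debias(prompt, llm, max_rounds=3):
    """按论文的三步迭代优化提示:偏差确定 -> 偏差分析 -> 认知去偏。
    llm 是一个占位的可调用对象(指令 -> 文本)。"""
    for _ in range(max_rounds):
        # 第 1 步:偏差确定 —— 提示中是否含有认知偏差?
        verdict = llm(f"Does this prompt contain any cognitive bias? "
                      f"Answer 'yes' or 'no'.\nPrompt: {prompt}")
        if verdict.strip().lower().startswith("no"):
            break
        # 第 2 步:偏差分析 —— 指出存在哪些偏差
        analysis = llm(f"List the cognitive biases present in this prompt.\n"
                       f"Prompt: {prompt}")
        # 第 3 步:认知去偏 —— 重写提示以移除这些偏差
        prompt = llm(f"Rewrite the prompt to remove these biases: {analysis}\n"
                     f"Prompt: {prompt}")
    return prompt

# 演示用的假 LLM:把 'obviously' 视为过度自信式框架偏差
def fake_llm(instruction):
    if instruction.startswith("Does this prompt"):
        return "yes" if "obviously" in instruction else "no"
    if instruction.startswith("List the"):
        return "overconfidence framing ('obviously')"
    # 重写步骤:删去带偏差的词
    tail = instruction.rsplit("Prompt: ", 1)[1]
    return tail.replace("obviously ", "")

print(self_debias("Stock X will obviously rise; should I buy?", fake_llm))
```

循环在"偏差确定"判定无偏差时提前终止,因此无偏差提示会原样返回,这对应论文中 no-bias 设置下不应损失性能的要求。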

[NLP-92] Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary

【速读】: 该论文旨在解决法律文本中句边界检测(Sentence Boundary Detection)的高精度与高吞吐量处理需求,特别是在大规模应用场景如尽职调查、电子取证和法律研究中。法律文档通常包含专业引用、缩写及复杂的句法结构,这些特性使得通用句边界检测工具难以胜任。论文提出的解决方案包括两个库:NUPunkt 和 CharBoundary。其中,NUPunkt 通过纯 Python 实现,在保持高吞吐量(每秒处理 1000 万字符)的同时实现了 91.1% 的精确度,相较通用工具提升了 29%-32%;CharBoundary 则提供了可调的精确度与召回率权衡,其大模型在所有测试方法中达到最高的 F1 分数(0.782)。关键在于 NUPunkt 和 CharBoundary 针对法律领域特性的优化设计,能够有效减少上下文碎片化,提升检索增强生成系统中的推理质量。

链接: https://arxiv.org/abs/2504.04131
作者: Michael J Bommarito,Daniel Martin Katz,Jillian Bommarito
机构: ALEA Institute; Illinois Tech - Chicago Kent Law (芝加哥肯特法学院); Bucerius Law School (布塞留斯法律科学学院); Stanford CodeX (斯坦福代码X)
类目: Computation and Language (cs.CL)
备注: 12 pages, 5 figures, 6 tables

点击查看摘要

Abstract:We present NUPunkt and CharBoundary, two sentence boundary detection libraries optimized for high-precision, high-throughput processing of legal text in large-scale applications such as due diligence, e-discovery, and legal research. These libraries address the critical challenges posed by legal documents containing specialized citations, abbreviations, and complex sentence structures that confound general-purpose sentence boundary detectors. Our experimental evaluation on five diverse legal datasets comprising over 25,000 documents and 197,000 annotated sentence boundaries demonstrates that NUPunkt achieves 91.1% precision while processing 10 million characters per second with modest memory requirements (432 MB). CharBoundary models offer balanced and adjustable precision-recall tradeoffs, with the large model achieving the highest F1 score (0.782) among all tested methods. Notably, NUPunkt provides a 29-32% precision improvement over general-purpose tools while maintaining exceptional throughput, processing multi-million document collections in minutes rather than hours. Both libraries run efficiently on standard CPU hardware without requiring specialized accelerators. NUPunkt is implemented in pure Python with zero external dependencies, while CharBoundary relies only on scikit-learn and optional ONNX runtime integration for optimized performance. Both libraries are available under the MIT license, can be installed via PyPI, and can be interactively tested at this https URL. These libraries address critical precision issues in retrieval-augmented generation systems by preserving coherent legal concepts across sentences, where each percentage improvement in precision yields exponentially greater reductions in context fragmentation, creating cascading benefits throughout retrieval pipelines and significantly enhancing downstream reasoning quality. 
zh
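法律文本分句的难点在于 "Smith v. Jones"、"U.S." 这类缩写与引证中的句点。下面是一个体现"精确度优先"思路的玩具分句器:只在句点后接大写字母处切分,且前一个词命中法律缩写表时拒绝切分。它与 NUPunkt/CharBoundary 的实际实现和 API 无关,缩写表也只是演示用假设:

```python
import re

# 假设的常见法律缩写:其后的句点不视为句子结束
LEGAL_ABBREVS = {"v.", "no.", "cf.", "inc.", "corp.", "u.s.", "f.3d"}

def split_sentences(text):
    """玩具级高精确度分句器:在 '. ' 后接大写字母处切分,
    除非句点前的词是已知法律缩写。"""
    sentences, start = [], 0
    for m in re.finditer(r"\.\s+(?=[A-Z])", text):
        token = text[start:m.start() + 1].split()[-1].lower()
        if token in LEGAL_ABBREVS:
            continue                      # 如 'Smith v. Jones' 保持完整
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    sentences.append(text[start:].strip())
    return sentences

text = "Smith v. Jones, 550 U.S. 544, controls here. The motion is denied."
print(split_sentences(text))
```

这种"宁可不切、不可错切"的偏置正是摘要中精确度换吞吐量/召回率权衡的一个缩影:少一次错误切分,检索增强生成管道中就少一处法律概念被割裂的上下文碎片。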

[NLP-93] PEIRCE: Unifying Material and Formal Reasoning via LLM -Driven Neuro-Symbolic Refinement

【速读】: 该论文旨在解决人工智能领域中材料推理(material inference)与形式推理(formal inference)有效集成的挑战。材料推理关注论证的合理性和上下文相关性,而形式推理则侧重于逻辑和结构的有效性。尽管大型语言模型(Large Language Models, LLMs)在材料推理方面表现出强大能力,但其推理缺乏形式上的严谨性和可验证性。同时,LLMs在自然语言处理方面的优势为其作为自然语言与形式语言之间桥梁提供了可能性,从而为结合这两种推理模式创造了新机会。

论文提出的关键解决方案是PEIRCE框架,这是一个神经符号框架,通过迭代猜想-批评过程统一材料推理和形式推理。在这个框架中,LLMs负责生成自然语言和形式语言中的候选解决方案,然后通过与外部批评模型的交互进行评估和优化。这些批评模型包括符号证明器(symbolic provers),用于评估形式有效性;以及软评估器(soft evaluators),从语言学和认识论维度(如合理性、连贯性和简洁性)衡量生成论证的质量。虽然PEIRCE是一个通用框架,但论文展示了它在自然语言解释生成领域的应用能力,这一场景天然需要材料充分性和形式正确性的双重保证。

链接: https://arxiv.org/abs/2504.04110
作者: Xin Quan,Marco Valentino,Danilo S. Carvalho,Dhairya Dalal,André Freitas
机构: University of Manchester (曼彻斯特大学); Idiap Research Institute (Idiap 研究所); National Biomarker Centre, CRUK-MI, University of Manchester (国家生物标志物中心,CRUK-MI,曼彻斯特大学); University of Galway (戈尔韦大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Demo paper. Work in progress

点击查看摘要

Abstract:A persistent challenge in AI is the effective integration of material and formal inference - the former concerning the plausibility and contextual relevance of arguments, while the latter focusing on their logical and structural validity. Large Language Models (LLMs), by virtue of their extensive pre-training on large textual corpora, exhibit strong capabilities in material inference. However, their reasoning often lacks formal rigour and verifiability. At the same time, LLMs’ linguistic competence positions them as a promising bridge between natural and formal languages, opening up new opportunities for combining these two modes of reasoning. In this paper, we introduce PEIRCE, a neuro-symbolic framework designed to unify material and formal inference through an iterative conjecture-criticism process. Within this framework, LLMs play the central role of generating candidate solutions in natural and formal languages, which are then evaluated and refined via interaction with external critique models. These critiques include symbolic provers, which assess formal validity, as well as soft evaluators that measure the quality of the generated arguments along linguistic and epistemic dimensions such as plausibility, coherence, and parsimony. While PEIRCE is a general-purpose framework, we demonstrate its capabilities in the domain of natural language explanation generation - a setting that inherently demands both material adequacy and formal correctness.
zh

[NLP-94] A Benchmark for End-to-End Zero-Shot Biomedical Relation Extraction with LLM s: Experiments with OpenAI Models

【速读】: 该论文试图解决零样本(zero-shot)方法在生物医学关系抽取(biomedical relation extraction, RE)任务中的性能未知问题,即评估大型生成式语言模型(Generative Large Language Models, LLMs)在未经过任务特定微调的情况下,是否能够在广泛的生物医学关系抽取数据集上表现出高精度。论文的关键解决方案在于使用OpenAI的GPT-4-turbo和其推理模型o1,通过两种方式利用其JSON生成能力实现端到端的关系抽取:一是定义明确的关系结构模式,二是从提示语言中推断结构。这种方法展示了大语言模型在复杂生物医学关系抽取任务中的潜力,并提供了接近微调方法的零样本性能,同时降低了数据标注和领域专业知识的需求,但也存在处理多关系实例表现不佳及边界模糊文本提及出错的局限性。

链接: https://arxiv.org/abs/2504.04083
作者: Aviv Brokman,Xuguang Ai,Yuhang Jiang,Shashank Gupta,Ramakanth Kavuluru
机构: University of Kentucky (肯塔基大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Objective: Zero-shot methodology promises to cut down on costs of dataset annotation and domain expertise needed to make use of NLP. Generative large language models trained to align with human goals have achieved high zero-shot performance across a wide variety of tasks. As of yet, it is unclear how well these models perform on biomedical relation extraction (RE). To address this knowledge gap, we explore patterns in the performance of OpenAI LLMs across a diverse sampling of RE tasks. Methods: We use OpenAI GPT-4-turbo and their reasoning model o1 to conduct end-to-end RE experiments on seven datasets. We use the JSON generation capabilities of GPT models to generate structured output in two ways: (1) by defining an explicit schema describing the structure of relations, and (2) using a setting that infers the structure from the prompt language. Results: Our work is the first to study and compare the performance of the GPT-4 and o1 for the end-to-end zero-shot biomedical RE task across a broad array of datasets. We found the zero-shot performances to be proximal to that of fine-tuned methods. The limitations of this approach are that it performs poorly on instances containing many relations and errs on the boundaries of textual mentions. Conclusion: Recent large language models exhibit promising zero-shot capabilities in complex biomedical RE tasks, offering competitive performance with reduced dataset curation and NLP modeling needs at the cost of increased computing, potentially increasing medical community accessibility. Addressing the limitations we identify could further boost reliability. 
The code, data, and prompts for all our experiments are publicly available: this https URL
zh
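论文通过定义显式的关系结构模式让 GPT 模型生成结构化 JSON 输出。下面示意如何解析并最低限度校验此类输出;其中 head/relation/tail 的字段命名是演示用假设,并非论文使用的实际模式,模型输出也以手写 JSON 模拟:

```python
import json

REQUIRED_FIELDS = ("head", "relation", "tail")

def parse_relations(llm_output):
    """解析模型输出的 JSON 关系列表,丢弃字段缺失或为空的条目。"""
    data = json.loads(llm_output)
    relations = data.get("relations", [])
    return [r for r in relations
            if all(isinstance(r.get(f), str) and r[f] for f in REQUIRED_FIELDS)]

# 模拟某个生物医学句子的零样本模型输出
raw = json.dumps({
    "relations": [
        {"head": "metformin", "relation": "treats", "tail": "type 2 diabetes"},
        {"head": "aspirin", "relation": "interacts_with"},  # 字段不全 -> 丢弃
    ]
})
print(parse_relations(raw))  # 只保留字段完整的关系
```

这类事后校验正对应摘要指出的失败模式:实例中关系较多或文本提及边界模糊时,模型输出容易缺字段或越界,显式模式加校验能把这类错误显式暴露出来。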

[NLP-95] Collaboration and Controversy Among Experts: Rumor Early Detection by Tuning a Comment Generator SIGIR2025

【速读】: 该论文旨在解决社交网络中谣言早期检测(Rumor Early Detection, RED)的问题,特别是在早期传播阶段因用户评论有限而导致现有方法表现不佳的挑战。论文的关键创新在于通过生成更多类人化的评论来扩展训练数据,从而提升模型在有限语义信息条件下的性能。具体而言,解决方案的核心包括:(1) 设计了一个结合专家协作与争议模拟的评论生成器,并提出一种新的RED框架CAMERED;(2) 在生成语言模型中引入混合专家结构及新型路由网络以实现专家协作;(3) 构建知识丰富的合成数据集并设计对抗学习策略以使生成评论的风格与真实评论保持一致;(4) 提出互争议融合模块整合生成与原始评论。实验结果表明,CAMERED在性能上超越了当前最先进的RED基线模型,验证了其有效性。

链接: https://arxiv.org/abs/2504.04076
作者: Bing Wang,Bingrui Zhao,Ximing Li,Changchun Li,Wanfu Gao,Shengsheng Wang
机构: College of Computer Science and Technology, Jilin University (吉林大学计算机科学与技术学院)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: 11 pages, 5 figures. Accepted by SIGIR 2025. Code: this https URL

点击查看摘要

Abstract:Over the past decade, social media platforms have been key in spreading rumors, leading to significant negative impacts. To counter this, the community has developed various Rumor Detection (RD) algorithms to automatically identify them using user comments as evidence. However, these RD methods often fail in the early stages of rumor propagation when only limited user comments are available, leading the community to focus on a more challenging topic named Rumor Early Detection (RED). Typically, existing RED methods learn from limited semantics in early comments. However, our preliminary experiment reveals that the RED models always perform best when the number of training and test comments is consistent and extensive. This inspires us to address the RED issue by generating more human-like comments to support this hypothesis. To implement this idea, we tune a comment generator by simulating expert collaboration and controversy and propose a new RED framework named CAMERED. Specifically, we integrate a mixture-of-expert structure into a generative language model and present a novel routing network for expert collaboration. Additionally, we synthesize a knowledgeable dataset and design an adversarial learning strategy to align the style of generated comments with real-world comments. We further integrate generated and original comments with a mutual controversy fusion module. Experimental results show that CAMERED outperforms state-of-the-art RED baseline models and generation methods, demonstrating its effectiveness.
zh

[NLP-96] VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

【速读】: 该论文旨在解决实时语音交互中高效且高质量生成语音语言模型(Speech LLMs)的问题。为实现这一目标,论文提出了一种可扩展且与模型无关的训练框架,以支持高效率和低延迟的语音处理任务。论文的关键创新在于引入多令牌预测(Multi-Token Prediction, MTP),这是一种专为语音LLMs优化的新方法,通过同时提升生成速度和质量,突破了传统单令牌预测(Next-Token Prediction, NTP)的局限性。这一解决方案的核心在于结合高效的训练策略与创新的预测机制,使VocalNet系列模型在显著减少训练数据的情况下,仍能超越主流通用大型语言模型(Omni LLMs)以及现有开源语音LLMs的性能表现。

链接: https://arxiv.org/abs/2504.04060
作者: Yuhao Wang,Heyang Liu,Ziyang Cheng,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang
机构: Shanghai Jiao Tong University (上海交通大学); Ant Group (蚂蚁集团); Wuhan University (武汉大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We propose VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework for real-time voice interaction. Departing from the conventional next-token prediction (NTP), we introduce multi-token prediction (MTP), a novel approach optimized for speech LLMs that simultaneously improves generation speed and quality. Experiments show that VocalNet outperforms mainstream Omni LLMs despite using significantly less training data, while also surpassing existing open-source speech LLMs by a substantial margin. To support reproducibility and community advancement, we will open-source all model weights, inference code, training data, and framework implementations upon publication.

[NLP-97] FISH-Tuning: Enhancing PEFT Methods with Fisher Information

【速读】: 该论文旨在解决现有 Parameter-Efficient Fine-Tuning (PEFT) 方法在与 Fisher Induced Sparse uncHanging (FISH) Mask 结合时未被充分探索的问题。解决方案的关键在于提出了一种名为 FISH-Tuning 的新方法,将 FISH Mask 整合到基于添加(如 LoRA)和基于重新参数化(如 Adapters)的 PEFT 方法中。通过利用 Fisher 信息选择这些方法中的关键参数,FISH-Tuning 在不增加额外内存开销或推理延迟的情况下实现了更优的性能。实验结果表明,FISH-Tuning 在多种数据集和预训练模型上的表现始终优于具有相同可训练参数比例的传统 PEFT 方法。
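
FISH Mask 的核心思想(用梯度平方近似对角 Fisher 信息,只选取信息量最大的少量参数参与微调)可以用如下 numpy 草图示意(玩具数据,非论文原实现):

```python
import numpy as np

def fish_mask(grads, keep_ratio=0.1):
    # 用样本梯度平方的均值近似对角 Fisher 信息,保留信息量最大的参数
    fisher = np.mean(np.square(grads), axis=0)      # [n_params]
    k = max(1, int(keep_ratio * fisher.size))
    top = np.argsort(fisher)[-k:]                   # Fisher 值最大的 k 个参数
    mask = np.zeros_like(fisher, dtype=bool)
    mask[top] = True
    return mask

rng = np.random.default_rng(0)
grads = rng.normal(0.0, 0.01, size=(16, 20))  # 16 个样本、20 个参数的玩具梯度
grads[:, 3] += 5.0                            # 让 3 号参数始终"重要"
mask = fish_mask(grads, keep_ratio=0.1)
print(int(mask.sum()), bool(mask[3]))  # 2 True
```

FISH-Tuning 的做法相当于把这样的掩码套用到 LoRA、Adapter 等模块的参数上,从而在相同可训练参数比例下优先更新关键参数。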

链接: https://arxiv.org/abs/2504.04050
作者: Kang Xue,Ming Dong,Xinhui Tu,Tingting He
机构: Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning (湖北省人工智能与智能学习重点实验室); National Language Resources Monitoring and Research Center for Network Media (国家语言资源监测与研究网络媒体中心); School of Computer, Central China Normal University (华中师范大学计算机学院), Wuhan, China (中国武汉)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid growth in the parameter size of Large Language Models (LLMs) has led to the development of Parameter-Efficient Fine-Tuning (PEFT) methods to alleviate the computational costs of fine-tuning. Among these, Fisher Induced Sparse uncHanging (FISH) Mask is a selection-based PEFT technique that identifies a subset of pre-trained parameters for fine-tuning based on approximate Fisher information. However, the integration of FISH Mask with other PEFT methods, such as LoRA and Adapters, remains underexplored. In this paper, we propose FISH-Tuning, a novel approach that incorporates FISH Mask into addition-based and reparameterization-based PEFT methods, including LoRA, Adapters, and their variants. By leveraging Fisher information to select critical parameters within these methods, FISH-Tuning achieves superior performance without additional memory overhead or inference latency. Experimental results across various datasets and pre-trained models demonstrate that FISH-Tuning consistently outperforms the vanilla PEFT methods with the same proportion of trainable parameters.

[NLP-98] SyLeR: A Framework for Explicit Syllogistic Legal Reasoning in Large Language Models

【速读】: 该论文旨在解决现有大型语言模型(Large Language Models, LLMs)在法律问答中无法进行显式三段论推理的问题,其答案通常隐式且无结构,缺乏可解释性和可信度。为克服这一局限,论文提出了一种名为SyLeR的新框架,其关键是结合树状分层检索机制与两阶段微调过程:首先通过有监督微调预热建立基础的三段论推理理解,然后利用具有结构感知奖励机制的强化学习优化模型生成多样化、逻辑严谨且结构良好的推理路径。

链接: https://arxiv.org/abs/2504.04042
作者: Kepu Zhang,Weijie Yu,Zhongxiang Sun,Jun Xu
机构: Gaoling School of Artificial Intelligence, Renmin University of China (高瓴人工智能学院, 中国人民大学); School of Information Technology and Management, University of International Business and Economics (信息与管理学院, 对外经济贸易大学); unknown
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Syllogistic reasoning is a fundamental aspect of legal decision-making, enabling logical conclusions by connecting general legal principles with specific case facts. Although existing large language models (LLMs) can generate responses to legal questions, they fail to perform explicit syllogistic reasoning, often producing implicit and unstructured answers that lack explainability and trustworthiness. To address this limitation, we propose SyLeR, a novel framework that empowers LLMs to engage in explicit syllogistic legal reasoning. SyLeR integrates a tree-structured hierarchical retrieval mechanism to effectively combine relevant legal statutes and precedent cases, forming comprehensive major premises. This is followed by a two-stage fine-tuning process: supervised fine-tuning warm-up establishes a foundational understanding of syllogistic reasoning, while reinforcement learning with a structure-aware reward mechanism refines the ability of the model to generate diverse logically sound and well-structured reasoning paths. We conducted extensive experiments across various dimensions, including in-domain and cross-domain user groups (legal laypersons and practitioners), multiple languages (Chinese and French), and different LLM backbones (legal-specific and open-domain LLMs). The results show that SyLeR significantly improves response accuracy and consistently delivers explicit, explainable, and trustworthy legal reasoning.

[NLP-99] myNER: Contextualized Burmese Named Entity Recognition with Bidirectional LSTM and fastText Embeddings via Joint Training with POS Tagging

【速读】: 该论文旨在解决缅甸语(Burmese)等低资源语言中命名实体识别(NER)研究匮乏的问题,主要由于缺乏公开可用的标注数据集。为应对这一挑战,论文提出了myNER,一个包含7标签标注方案的新型词级NER语料库,并结合词性标注(POS)以提供额外的句法信息。解决方案的关键在于通过引入myNER语料库,评估多种NER模型(如条件随机场CRF、双向长短期记忆网络BiLSTM-CRF及其与fastText嵌入的组合),并揭示上下文化词嵌入以及与POS联合训练的重要性,从而显著提升模型性能。实验结果表明,传统CRF联合任务模型使用fastText嵌入作为特征取得了最佳效果。
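
摘要中同时报告了加权 F1(约 0.98)与宏 F1(约 0.74),差距主要来自标签分布不均衡。下面用一个极简示例说明这两种 F1 的计算方式(计数为虚构,仅作说明):

```python
def f1(tp, fp, fn):
    # 单类别 F1:精确率与召回率的调和平均
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_and_weighted_f1(per_class):
    # per_class: {标签: (tp, fp, fn)};support = tp + fn
    f1s = {c: f1(*cnt) for c, cnt in per_class.items()}
    support = {c: tp + fn for c, (tp, fp, fn) in per_class.items()}
    total = sum(support.values())
    macro = sum(f1s.values()) / len(f1s)                      # 各类平均
    weighted = sum(f1s[c] * support[c] / total for c in f1s)  # 按频次加权
    return macro, weighted

# 玩具 NER 计数:高频的 "O" 标签表现好,低频实体标签表现差
counts = {"O": (900, 10, 10), "B-PER": (5, 5, 5)}
macro, weighted = macro_and_weighted_f1(counts)
print(round(macro, 3), round(weighted, 3))  # 0.745 0.984
```

这说明在类别极不均衡的序列标注任务中,加权 F1 很高而宏 F1 偏低是正常现象,二者应一并报告。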

链接: https://arxiv.org/abs/2504.04038
作者: Kaung Lwin Thant,Kwankamol Nongpong,Ye Kyaw Thu,Thura Aung,Khaing Hsu Wai,Thazin Myint Oo
机构: 未知
类目: Computation and Language (cs.CL)
备注: 7 pages, 2 figures, 5 tables, to be published in the proceedings of IEEE ICCI-2025

点击查看摘要

Abstract:Named Entity Recognition (NER) involves identifying and categorizing named entities within textual data. Despite its significance, NER research has often overlooked low-resource languages like Myanmar (Burmese), primarily due to the lack of publicly available annotated datasets. To address this, we introduce myNER, a novel word-level NER corpus featuring a 7-tag annotation scheme, enriched with Part-of-Speech (POS) tagging to provide additional syntactic information. Alongside the corpus, we conduct a comprehensive evaluation of NER models, including Conditional Random Fields (CRF), Bidirectional LSTM (BiLSTM)-CRF, and their combinations with fastText embeddings in different settings. Our experiments reveal the effectiveness of contextualized word embeddings and the impact of joint training with POS tagging, demonstrating significant performance improvements across models. The traditional CRF joint-task model with fastText embeddings as a feature achieved the best result, with a 0.9818 accuracy and 0.9811 weighted F1 score with 0.7429 macro F1 score. BiLSTM-CRF with fine-tuned fastText embeddings gets the best result of 0.9791 accuracy and 0.9776 weighted F1 score with 0.7395 macro F1 score.

[NLP-100] OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLM s

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在软件开发领域应用受限的问题,主要原因是高质量、公开可用的监督微调(Supervised Fine-Tuning, SFT)数据集匮乏,尤其是针对编码任务定制的数据集。为了解决这一问题,论文的关键解决方案是引入了一个名为OpenCodeInstruct的新数据集,该数据集包含500万个多样化样本,每个样本涵盖编程问题、解决方案、测试用例、执行反馈以及LLM生成的质量评估。通过使用此数据集对多种基础模型(如LLaMA和Qwen)进行微调,并覆盖多个规模(1B+、3B+和7B+),论文展示了显著的性能提升,尤其是在主流基准测试(如HumanEval、MBPP、LiveCodeBench和BigCodeBench)中的表现。

链接: https://arxiv.org/abs/2504.04030
作者: Wasi Uddin Ahmad,Aleksander Ficek,Mehrzad Samadi,Jocelyn Huang,Vahid Noroozi,Somshubra Majumdar,Boris Ginsburg
机构: NVIDIA (NVIDIA)
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Large Language Models (LLMs) have transformed software development by enabling code generation, automated debugging, and complex reasoning. However, their continued advancement is constrained by the scarcity of high-quality, publicly available supervised fine-tuning (SFT) datasets tailored for coding tasks. To bridge this gap, we introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments. We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset. Comprehensive evaluations on popular benchmarks (HumanEval, MBPP, LiveCodeBench, and BigCodeBench) demonstrate substantial performance improvements achieved by SFT with OpenCodeInstruct. We also present a detailed methodology encompassing seed data curation, synthetic instruction and solution generation, and filtering.

[NLP-101] Rethinking Reflection in Pre-Training

【速读】: 该论文试图研究语言模型在解决复杂问题时反思自身推理能力的起源和发展过程。解决方案的关键在于通过在思维链中引入刻意错误,并测试模型是否能够识别并纠正这些错误以得出正确答案,从而验证模型的自我校正能力。研究发现,这种自我校正能力在预训练阶段早期即已显现,并随预训练进程稳步提升。例如,在包含4万亿令牌的预训练数据集上训练的OLMo2-7B模型展示了其在六项自我反思任务中的自我校正能力。
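
论文的实验范式是:向思维链中注入刻意错误,再检验模型能否纠错并得到正确答案。下面用一个算术思维链的玩具管线示意这一评测流程(其中 `self_correct` 用确定性重算代替真实语言模型的"反思",仅为说明评测结构):

```python
def make_cot(a, b, c):
    # 构造 (a + b) * c 的标准思维链
    s1 = a + b
    return [f"{a} + {b} = {s1}", f"{s1} * {c} = {s1 * c}"], s1 * c

def inject_error(steps, delta=1):
    # 向第一个中间结果注入刻意错误(对应论文的对抗性思维链)
    lhs, val = steps[0].rsplit("= ", 1)
    return [f"{lhs}= {int(val) + delta}"] + steps[1:]

def self_correct(steps):
    # 用确定性重算代替真实模型的"反思":逐步重推并修复不一致
    fixed, carry = [], None
    for step in steps:
        lhs, _claimed = step.split(" = ")
        x, op, y = lhs.split()
        x = carry if carry is not None and op == "*" else int(x)
        val = x + int(y) if op == "+" else x * int(y)
        fixed.append(f"{x} {op} {y} = {val}")
        carry = val
    return fixed, carry

steps, answer = make_cot(2, 3, 4)
corrupted = inject_error(steps)        # "2 + 3 = 5" 被改为 "2 + 3 = 6"
_, recovered = self_correct(corrupted)
print(recovered == answer)  # True:纠错后仍得到 20
```

论文评测的正是被测语言模型能否扮演这里 `self_correct` 的角色:识别被注入的错误并仍然得到正确答案。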

链接: https://arxiv.org/abs/2504.04022
作者: Essential AI:Darsh J Shah,Peter Rushton,Somanshu Singla,Mohit Parmar,Kurt Smith,Yash Vanjani,Ashish Vaswani,Adarsh Chaluvaraju,Andrew Hojel,Andrew Ma,Anil Thomas,Anthony Polloreno,Ashish Tanwer,Burhan Drak Sibai,Divya S Mansingka,Divya Shivaprasad,Ishaan Shah,Karl Stratos,Khoi Nguyen,Michael Callahan,Michael Pust,Mrinal Iyer,Philip Monk,Platon Mazarakis,Ritvik Kapila,Saurabh Srivastava,Tim Romanski
机构: Essential AI (Essential AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A language model’s ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier - during the model’s pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks.

[NLP-102] Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

【速读】: 该论文旨在解决如何在多智能体协作环境中有效模拟多样化的人类团队行为的问题。由于通过大规模用户研究获取多样化行为数据存在实际、伦理和操作上的限制,单纯依赖真实数据并不现实,因此需要合成模型来表征多种人类行为。近年来,基于大语言模型(Large Language Models, LLMs)的代理已被证明能够在社交场景中表现出类似人类的行为,但实现多样化的代理行为通常需要大量人工设计提示的工作。另一方面,质量多样性(Quality Diversity, QD)优化方法已被证明能够生成多样化的强化学习(Reinforcement Learning, RL)代理行为。论文的关键解决方案在于结合QD优化与LLM驱动的代理,通过迭代搜索生成多样化团队行为的提示。研究首先通过一项包含54名参与者的实验验证了人类在此任务域中展现出多样化的协调和沟通行为,随后表明所提出的方法不仅能有效重现人类团队数据中的趋势,还能捕捉到不通过收集大量数据难以观察到的行为模式。这项工作强调了将QD与LLM驱动代理相结合作为研究多智能体协作中团队策略和沟通方式的有效工具。
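
QD(如 MAP-Elites)与提示搜索结合的基本循环可以用下面的玩具草图示意:按行为描述子划分格子,每个格子只保留适应度最高的提示。其中 `fitness`、`behavior` 与变异方式都是为演示而假设的替身,真实场景中应由 LLM 驱动的多智能体 rollout 给出:

```python
import random

def behavior(prompt):
    # 玩具二维行为描述子:(冗长度分桶, 是否含礼貌用语)
    words = prompt.split()
    return (min(len(words) // 3, 3), int("please" in words))

def fitness(prompt):
    # 真实评估(如 LLM 团队 rollout 的回报)的替身:奖励词汇多样性
    return len(set(prompt.split()))

def mutate(prompt, rng):
    extras = ["please", "coordinate", "quickly", "share", "plan"]
    return prompt + " " + rng.choice(extras)

def map_elites(seed_prompt, iters=200, seed=0):
    rng = random.Random(seed)
    archive = {behavior(seed_prompt): (fitness(seed_prompt), seed_prompt)}
    for _ in range(iters):
        _, parent = rng.choice(list(archive.values()))
        child = mutate(parent, rng)
        cell = behavior(child)
        if cell not in archive or fitness(child) > archive[cell][0]:
            archive[cell] = (fitness(child), child)  # 每个格子保留精英
    return archive

archive = map_elites("you are a teammate")
print(len(archive))  # 多个格子被填满,对应多样化的提示行为
```

档案(archive)中每个格子的精英提示即对应一种不同的团队行为风格,这正是 QD 区别于单目标优化之处。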

链接: https://arxiv.org/abs/2504.03991
作者: Siddharth Srikanth,Varun Bhatt,Boshen Zhang,Werner Hager,Charles Michael Lewis,Katia P. Sycara,Aaquib Tabrez,Stefanos Nikolaidis
机构: Thomas Lord Department of Computer Science, University of Southern California (南加州大学); School of Computing and Information, University of Pittsburgh (匹兹堡大学); Robotics Institute, Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Understanding how humans collaborate and communicate in teams is essential for improving human-agent teaming and AI-assisted decision-making. However, relying solely on data from large-scale user studies is impractical due to logistical, ethical, and practical constraints, necessitating synthetic models of multiple diverse human behaviors. Recently, agents powered by Large Language Models (LLMs) have been shown to emulate human-like behavior in social settings. But, obtaining a large set of diverse behaviors requires manual effort in the form of designing prompts. On the other hand, Quality Diversity (QD) optimization has been shown to be capable of generating diverse Reinforcement Learning (RL) agent behavior. In this work, we combine QD optimization with LLM-powered agents to iteratively search for prompts that generate diverse team behavior in a long-horizon, multi-step collaborative environment. We first show, through a human-subjects experiment (n=54 participants), that humans exhibit diverse coordination and communication behavior in this domain. We then show that our approach can effectively replicate trends from human teaming data and also capture behaviors that are not easily observed without collecting large amounts of data. Our findings highlight the combination of QD and LLM-powered agents as an effective tool for studying teaming and communication strategies in multi-agent collaboration.

[NLP-103] Structured Extraction of Process Structure Properties Relationships in Materials Science

【速读】: 本文旨在解决通过大型语言模型(Large Language Models, LLMs)从数百万篇学术论文的非结构化文本中进行材料发现时所面临的挑战,尤其是在通用LLMs难以有效处理特定领域查询的情况下。论文的关键解决方案是引入了一种新的标注方案,用于从科学文献中提取通用的工艺-结构-性能关系,并通过在人类标注数据上的微调来实现有效的结构化知识提取。研究以一个包含128个摘要的数据集验证了该方法的有效性,这些摘要来自两个不同的领域:高温材料(领域I)和模拟材料微观结构中的不确定性量化(领域II)。此外,通过开发基于MatBERT的条件随机场(Conditional Random Field, CRF)模型与经过微调的LLM(OpenAI的GPT-4o)进行对比实验,证明了微调LLMs在领域I中显著提升了实体抽取性能,而当加入领域II的额外样本后,MatBERT-CRF模型的表现可与GPT-4o相媲美。这表明所提出的标注方案具有提取结构化知识的潜力,并突显了两种建模方法的互补优势。
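
论文的标注方案旨在从文献中抽取通用的"工艺-结构-性能"关系;这类结构化抽取目标可以用一个简单的数据结构草图来说明(字段划分为假设,示例内容为虚构):

```python
from dataclasses import dataclass

@dataclass
class PSPRelation:
    # 一条"工艺-结构-性能"关系的假设性结构化表示(示例内容为虚构)
    process: str            # 工艺实体
    structure: str          # 结构实体
    material_property: str  # 性能实体
    evidence: str           # 摘要中支撑该关系的原文片段

rel = PSPRelation(
    process="heat treatment at 1200 C",
    structure="equiaxed grain microstructure",
    material_property="improved creep resistance",
    evidence="heat treatment ... equiaxed grains ... improved creep resistance",
)
print(rel.material_property)  # improved creep resistance
```

无论是 MatBERT-CRF 还是微调后的 GPT-4o,最终输出的都是可映射到这类结构的实体与关系。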

链接: https://arxiv.org/abs/2504.03979
作者: Amit K Verma,Zhisong Zhang,Junwon Seo,Robin Kuo,Runbo Jiang,Emma Strubell,Anthony D Rollett
机构: Computational Engineering Division, Lawrence Livermore National Laboratory (劳伦斯利弗莫尔国家实验室); Language Technologies Institute, School of Computer Science, Carnegie Mellon University (卡内基梅隆大学); Materials Science and Engineering, Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci); Information Retrieval (cs.IR)
备注: 16 pages, 3 figures, 13 table

点击查看摘要

Abstract:With the advent of large language models (LLMs), the vast unstructured text within millions of academic papers is increasingly accessible for materials discovery, although significant challenges remain. While LLMs offer promising few- and zero-shot learning capabilities, particularly valuable in the materials domain where expert annotations are scarce, general-purpose LLMs often fail to address key materials-specific queries without further adaptation. To bridge this gap, fine-tuning LLMs on human-labeled data is essential for effective structured knowledge extraction. In this study, we introduce a novel annotation schema designed to extract generic process-structure-properties relationships from scientific literature. We demonstrate the utility of this approach using a dataset of 128 abstracts, with annotations drawn from two distinct domains: high-temperature materials (Domain I) and uncertainty quantification in simulating materials microstructure (Domain II). Initially, we developed a conditional random field (CRF) model based on MatBERT, a domain-specific BERT variant, and evaluated its performance on Domain I. Subsequently, we compared this model with a fine-tuned LLM (GPT-4o from OpenAI) under identical conditions. Our results indicate that fine-tuning LLMs can significantly improve entity extraction performance over the BERT-CRF baseline on Domain I. However, when additional examples from Domain II were incorporated, the performance of the BERT-CRF model became comparable to that of the GPT-4o model. These findings underscore the potential of our schema for structured knowledge extraction and highlight the complementary strengths of both modeling approaches.

[NLP-104] VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models CVPR2025

【速读】: 该论文试图解决视频-文本复合性理解中的细粒度时间对齐问题,特别是在连续多事件视频中的对齐挑战。现有基准主要关注静态图像-文本或孤立单事件视频的复合性,而忽视了复杂的时间对齐需求。为应对这一挑战,论文提出构建了两个新的基准(ActivityNet-Comp 和 YouCook2-Comp),并通过引入细微的时间扰动(如重新排序、动作词替换、部分描述和组合干扰)来创建具有挑战性的负样本。论文的关键解决方案是设计了一种分层成对偏好损失函数,该函数通过强化时间精确对齐的正样本对,并逐步惩罚受干扰的负样本对,从而促进细粒度复合学习。此外,为了缓解密集标注视频数据的稀缺性,论文还提出了通过拼接短片段来模拟多事件序列的预训练策略。这些方法共同构成了一个全面的框架,用于评估和提升视觉语言模型在时间连贯性视频-文本对齐中的性能。
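
论文提出的分层成对偏好损失,其直觉是:时间扰动越严重的负样本,应被要求与正样本拉开越大的分数差。下面是按该直觉写的极简示意(具体加权形式为假设,并非论文原公式):

```python
import numpy as np

def hierarchical_pref_loss(pos_score, neg_scores, levels, base_margin=0.1):
    # 扰动等级越高的负样本,要求与正样本的分差越大(margin 随等级递增)
    losses = []
    for s_neg, lvl in zip(neg_scores, levels):
        margin = base_margin * lvl
        losses.append(max(0.0, margin - (pos_score - s_neg)))
    return float(np.mean(losses))

# 正样本对得分 0.9;三个负样本的扰动等级分别为 1、2、3
loss = hierarchical_pref_loss(0.9, [0.85, 0.7, 0.6], levels=[1, 2, 3])
print(round(loss, 4))  # 0.0167
```

示例中只有轻度扰动的负样本(0.85)未被拉开足够分差而产生损失,重度扰动的负样本已满足更大的 margin,这正是"分层惩罚"的效果。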

链接: https://arxiv.org/abs/2504.03970
作者: Dahun Kim,AJ Piergiovanni,Ganesh Mallya,Anelia Angelova
机构: Google DeepMind
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: CVPR 2025, project page at this https URL

点击查看摘要

Abstract:We introduce VideoComp, a benchmark and learning framework for advancing video-text compositionality understanding, aimed at improving vision-language models (VLMs) in fine-grained temporal alignment. Unlike existing benchmarks focused on static image-text compositionality or isolated single-event videos, our benchmark targets alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g. ActivityNet-Captions, YouCook2), we construct two compositional benchmarks, ActivityNet-Comp and YouCook2-Comp. We create challenging negative samples with subtle temporal disruptions such as reordering, action word replacement, partial captioning, and combined disruptions. These benchmarks comprehensively test models’ compositional sensitivity across extended, cohesive video-text sequences. To improve model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and gradually penalizes increasingly disrupted ones, encouraging fine-grained compositional learning. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences. We evaluate video-text foundational models and large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in compositionality. Overall, our work provides a comprehensive framework for evaluating and enhancing model capabilities in achieving fine-grained, temporally coherent video-text alignment.

[NLP-105] Clinical ModernBERT: An efficient and long context encoder for biomedical text

【速读】: 该论文旨在解决生物医学和临床领域自然语言处理(NLP)任务中语义表示不足的问题。为应对这一挑战,论文提出Clinical ModernBERT,一种基于Transformer的预训练模型,其关键创新在于结合大规模生物医学文献、临床笔记及医学本体数据进行训练,并引入PubMed摘要、MIMIC IV临床数据以及带有文本描述的医学代码。此外,该模型继承并优化了ModernBERT的架构,包括旋转位置嵌入(RoPE)、Flash Attention及上下文长度扩展至8,192个标记等技术,以适应生物医学和临床领域的特定需求。这些改进使Clinical ModernBERT能够生成富含语义的长上下文表示,从而在临床NLP基准测试中表现出色。
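
摘要中提到的旋转位置嵌入(RoPE)可以用几行 numpy 展示其关键性质:旋转后两个位置向量的点积只依赖相对位置差。下面是一个简化实现(采用前半/后半维度配对的写法,仅作原理示意):

```python
import numpy as np

def rope(x, base=10000.0):
    # x: [seq_len, dim](dim 为偶数);对每个二维子空间按位置旋转
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) * 2.0 / dim)   # 每对维度的角频率
    angles = np.outer(np.arange(seq_len), freqs)     # [seq_len, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

rng = np.random.default_rng(0)
u = rng.normal(size=(1, 4))
x = np.tile(u, (8, 1))                        # 同一内容放在 8 个不同位置
rx = rope(x)
dots = [rx[i] @ rx[i + 2] for i in range(5)]  # 固定相对偏移 2
print(np.allclose(dots, dots[0]))  # True:注意力得分只取决于相对位置差
```

正是这种相对位置性质,使 RoPE 能较自然地配合 8,192 token 的长上下文训练与外推。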

链接: https://arxiv.org/abs/2504.03964
作者: Simon A. Lee,Anthony Wu,Jeffrey N. Chiang
机构: Department of Computational Medicine (计算医学系); UCLA (加州大学洛杉矶分校); Department of Computational Medicine & Neurosurgery (计算医学与神经外科系); UCLA (加州大学洛杉矶分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Manuscript writeup corresponding to the Clinical ModernBERT pre-trained encoder ( this https URL )

点击查看摘要

Abstract:We introduce Clinical ModernBERT, a transformer based encoder pretrained on large scale biomedical literature, clinical notes, and medical ontologies, incorporating PubMed abstracts, MIMIC IV clinical data, and medical codes with their textual descriptions. Building on ModernBERT, the current state-of-the-art natural language text encoder featuring architectural upgrades such as rotary positional embeddings (RoPE), Flash Attention, and an extended context length of up to 8,192 tokens, our model adapts these innovations specifically for biomedical and clinical domains. Clinical ModernBERT excels at producing semantically rich representations tailored for long context tasks. We validate this both by analyzing its pretrained weights and through empirical evaluation on a comprehensive suite of clinical NLP benchmarks.

[NLP-106] Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking

【速读】: 该论文旨在解决推理密集型文档排序任务中小型语言模型训练效率低且性能不足的问题。现有方法通常依赖昂贵的人工标注或大型黑盒语言模型,而该研究提出了一种新颖的方法,结合知识蒸馏与强化学习优化,通过利用网络数据和教师大语言模型(Teacher LLM)自动生成高质量的带相关性解释的训练样本,从而避免昂贵的人工标注需求。方案的关键在于将文档排序建模为一个强化学习问题,并通过激励显式推理能力来训练一个仅含30亿参数的小型语言模型,使其在BRIGHT基准测试中达到最先进的性能,同时显著减少参数量(仅为其他方法的约1/20)。此外,论文证明了在推理阶段生成解释而非直接预测相关性分数,能够使较小的语言模型实现更有效的推理。这种自监督方法为现代信息检索系统提供了可扩展且可解释的解决方案。
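
把文档排序建模为强化学习问题时,需要一个能对整个排序打分的标量奖励;信息检索中常用的 NDCG 即可充当这类奖励(论文的具体奖励设计未必如此,此处仅作示意):

```python
import math

def dcg(rels):
    # 折损累计增益:位置越靠后,增益折损越大
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

def ndcg_at_k(ranked_rels, k=10):
    # NDCG@k:归一化后可作为排序策略的标量奖励
    ideal = sorted(ranked_rels, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1, 0])   # 理想排序
swapped = ndcg_at_k([0, 2, 1, 3])   # 最相关文档被排到末位
print(round(perfect, 3), round(swapped, 3))  # 1.0 0.641
```

强化学习优化的目标即是让策略(这里是生成解释后再排序的小模型)产生的排序获得尽量高的这类奖励。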

链接: https://arxiv.org/abs/2504.03947
作者: Chris Samarinas,Hamed Zamani
机构: University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present a novel approach for training small language models for reasoning-intensive document ranking that combines knowledge distillation with reinforcement learning optimization. While existing methods often rely on expensive human annotations or large black-box language models, our methodology leverages web data and a teacher LLM to automatically generate high-quality training examples with relevance explanations. By framing document ranking as a reinforcement learning problem and incentivizing explicit reasoning capabilities, we train a compact 3B parameter language model that achieves state-of-the-art performance on the BRIGHT benchmark. Our model ranks third on the leaderboard while using substantially fewer parameters than other approaches, outperforming models that are over 20 times larger. Through extensive experiments, we demonstrate that generating explanations during inference, rather than directly predicting relevance scores, enables more effective reasoning with smaller language models. The self-supervised nature of our method offers a scalable and interpretable solution for modern information retrieval systems.

[NLP-107] Language Models Are Implicitly Continuous ICLR2025

【速读】: 该论文试图解决的问题是传统语言模型基于离散序列建模语言的局限性,以及探索神经网络连续表示在语言理解中的潜在机制。论文的关键解决方案在于揭示Transformer基线语言模型隐式学习将句子表示为定义在连续输入空间上的连续时间函数的现象,并进一步形式化扩展Transformer以捕捉输入与输出空间的时间与空间连续性。这一发现挑战了对大型语言模型(LLMs)理解语言的传统解释,具有重要的语言学与工程学意义。

链接: https://arxiv.org/abs/2504.03933
作者: Samuele Marro,Davide Evangelista,X. Angelo Huang,Emanuele La Malfa,Michele Lombardi,Michael Wooldridge
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Published at ICLR 2025

点击查看摘要

Abstract:Language is typically modelled with discrete sequences. However, the most successful approaches to language modelling, namely neural networks, are continuous and smooth function approximators. In this work, we show that Transformer-based language models implicitly learn to represent sentences as continuous-time functions defined over a continuous input space. This phenomenon occurs in most state-of-the-art Large Language Models (LLMs), including Llama2, Llama3, Phi3, Gemma, Gemma2, and Mistral, and suggests that LLMs reason about language in ways that fundamentally differ from humans. Our work formally extends Transformers to capture the nuances of time and space continuity in both input and output space. Our results challenge the traditional interpretation of how LLMs understand language, with several linguistic and engineering implications.

[NLP-108] YaleNLP @ PerAnsSumm 2025: Multi-Perspective Integration via Mixture-of-Agents for Enhanced Healthcare QA Summarization NAACL2025 ALT

【速读】: 该论文旨在解决医疗社区问答论坛中自动化摘要生成的挑战,由于每个问题存在多角度的用户回答,导致摘要生成困难。为应对这一挑战,研究提出了识别不同回答中的视角并生成全面答案的方法。解决方案的关键在于采用两种互补范式:一是基于训练的方法,通过QLoRA微调LLaMA-3.3-70B-Instruct模型;二是基于代理的方法,包括使用前沿大型语言模型(LLaMA-3.3-70B-Instruct和GPT-4o)进行零样本和少样本提示,以及利用多层反馈聚合组合多种模型输出的代理集合(Mixture-of-Agents, MoA)框架。通过这些方法,研究显著提升了模型在视角识别和基于视角的摘要生成任务上的性能。

链接: https://arxiv.org/abs/2504.03932
作者: Dongsuk Jang,Alan Li,Arman Cohan
机构: Department of Computer Science, Yale University (耶鲁大学); Interdisciplinary Program for Bioengineering, Seoul National University (首尔国立大学); Integrated Major in Innovative Medical Science, Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL)
备注: Paper accepted at CL4HEALTH @ NAACL 2025: Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics

点击查看摘要

Abstract:Automated summarization of healthcare community question-answering forums is challenging due to diverse perspectives presented across multiple user responses to each question. The PerAnsSumm Shared Task was therefore proposed to tackle this challenge by identifying perspectives from different answers and then generating a comprehensive answer to the question. In this study, we address the PerAnsSumm Shared Task using two complementary paradigms: (i) a training-based approach through QLoRA fine-tuning of LLaMA-3.3-70B-Instruct, and (ii) agentic approaches including zero- and few-shot prompting with frontier LLMs (LLaMA-3.3-70B-Instruct and GPT-4o) and a Mixture-of-Agents (MoA) framework that leverages a diverse set of LLMs by combining outputs from multi-layer feedback aggregation. For perspective span identification/classification, GPT-4o zero-shot achieves an overall score of 0.57, substantially outperforming the 0.40 score of the LLaMA baseline. With a 2-layer MoA configuration, we were able to improve LLaMA performance up by 28 percent to 0.51. For perspective-based summarization, GPT-4o zero-shot attains an overall score of 0.42 compared to 0.28 for the best LLaMA zero-shot, and our 2-layer MoA approach boosts LLaMA performance by 32 percent to 0.37. Furthermore, in few-shot setting, our results show that the sentence-transformer embedding-based exemplar selection provides more gain than manually selected exemplars on LLaMA models, although the few-shot prompting is not always helpful for GPT-4o. The YaleNLP team’s approach ranked the overall second place in the shared task.

[NLP-109] Adaptation of Large Language Models NAACL2025

【速读】: 该论文旨在解决通用大语言模型(LLMs)在应对特定领域任务和动态变化环境时能力不足的问题。具体而言,尽管通用LLMs在多种任务上表现出强大的泛化能力,但在金融、医疗等专业化领域以及代码生成等特定任务中表现欠佳,且其静态特性限制了其适应快速变化世界的能力,同时其庞大的模型规模也增加了部署难度与成本。因此,论文聚焦于如何通过适配技术提升LLMs的灵活性与实用性,以满足工业界服务于目标用户群体的需求以及学术界对轻量级但高效LLMs的研究需求。

论文的关键解决方案在于提出了一种全面的LLM适配技术综述,从数据视角与模型视角出发介绍适配方法,并探讨了不同于其他技术的评估指标与基准设定方式。随后,论文将适配技术分为两大类:第一类是参数化知识适配,重点在于更新LLMs内部的参数化知识;第二类是半参数化知识适配,目标是通过检索增强生成(RAG)和基于代理系统等技术改进LLMs对外部知识或工具的利用效率。此外,还特别强调了实时适配技术的重要性,例如模型编辑,使LLMs能够在生产环境中动态更新。这些方法共同构成了克服现有LLMs局限性的核心策略。

链接: https://arxiv.org/abs/2504.03931
作者: Zixuan Ke,Yifei Ming,Shafiq Joty
机构: Salesforce AI Research (Salesforce AI 研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Tutorial Proposal for NAACL2025

点击查看摘要

Abstract:This tutorial on adaptation of LLMs is designed to address the growing demand for models that go beyond the static capabilities of generic LLMs by providing an overview of dynamic, domain-specific, and task-adaptive LLM adaptation techniques. While general LLMs have demonstrated strong generalization across a variety of tasks, they often struggle to perform well in specialized domains such as finance, healthcare, and code generation for underrepresented languages. Additionally, their static nature limits their ability to evolve with the changing world, and they are often extremely large in size, making them impractical and costly to deploy at scale. As a result, the adaptation of LLMs has drawn much attention since the birth of LLMs and is of core importance, both for industry, which focuses on serving its targeted users, and academia, which can greatly benefit from small but powerful LLMs. To address this gap, this tutorial aims to provide an overview of the LLM adaptation techniques. We start with an introduction to LLM adaptation, from both the data perspective and the model perspective. We then emphasize how the evaluation metrics and benchmarks are different from other techniques. After establishing the problems, we explore various adaptation techniques. We categorize adaptation techniques into two main families. The first is parametric knowledge adaptation, which focuses on updating the parametric knowledge within LLMs. Additionally, we will discuss real-time adaptation techniques, including model editing, which allows LLMs to be updated dynamically in production environments. The second kind of adaptation is semi-parametric knowledge adaptation, where the goal is to update LLM parameters to better leverage external knowledge or tools through techniques like retrieval-augmented generation (RAG) and agent-based systems.

[NLP-110] CliME: Evaluating Multimodal Climate Discourse on Social Media and the Climate Alignment Quotient (CAQ)

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在理解与气候相关语境及生成气候话语方面的有效性问题,特别是评估其是否能够传播可信的解决方案或传播未经证实的主张。为了解决这一问题,论文提出了两个关键创新:首先,构建了一个名为CliME(Climate Change Multimodal Evaluation)的全新多模态数据集,包含来自Twitter和Reddit的2579个帖子,涵盖了幽默 meme 和怀疑论等多样化内容;其次,引入了一种新的评估指标 Climate Alignment Quotient (CAQ),包含五个维度(Articulation, Evidence, Resonance, Transition, Specificity),并结合三个分析视角(Actionability, Criticality, Justice)来系统性地评估LLMs生成的气候话语表现。这些方法旨在更全面地揭示LLMs在应对复杂气候议题中的优势与不足。
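
CAQ 的五个维度最终需要汇总为一个标量以便横向比较模型;下面给出一个均匀加权的汇总草图(论文实际的聚合方式可能不同,权重与分值均为假设):

```python
def climate_alignment_quotient(scores, weights=None):
    # 五个 CAQ 维度的加权汇总;此处默认均匀加权,仅作示意
    dims = ["Articulation", "Evidence", "Resonance", "Transition", "Specificity"]
    weights = weights or {d: 1.0 / len(dims) for d in dims}
    missing = set(dims) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] * weights[d] for d in dims)

caq = climate_alignment_quotient({
    "Articulation": 0.8, "Evidence": 0.6, "Resonance": 0.7,
    "Transition": 0.5, "Specificity": 0.4,
})
print(round(caq, 2))  # 0.6
```

保留按维度的原始分数同样重要:论文的结论(Actionability 维度普遍偏弱)正是从分维度结果而非总分中得出的。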

链接: https://arxiv.org/abs/2504.03906
作者: Abhilekh Borah,Hasnat Md Abdullah,Kangda Wei,Ruihong Huang
机构: Manipal University Jaipur ( Manipal大学斋浦尔 ); Texas A&M University ( 德州农工大学 )
类目: Computation and Language (cs.CL)
备注: 16 pages, 9 figures

点击查看摘要

Abstract:The rise of Large Language Models (LLMs) has raised questions about their ability to understand climate-related contexts. Though climate change dominates social media, analyzing its multimodal expressions is understudied, and current tools have failed to determine whether LLMs amplify credible solutions or spread unsubstantiated claims. To address this, we introduce CliME (Climate Change Multimodal Evaluation), a first-of-its-kind multimodal dataset, comprising 2579 Twitter and Reddit posts. The benchmark features a diverse collection of humorous memes and skeptical posts, capturing how these formats distill complex issues into viral narratives that shape public opinion and policy discussions. To systematically evaluate LLM performance, we present the Climate Alignment Quotient (CAQ), a novel metric comprising five distinct dimensions: Articulation, Evidence, Resonance, Transition, and Specificity. Additionally, we propose three analytical lenses: Actionability, Criticality, and Justice, to guide the assessment of LLM-generated climate discourse using CAQ. Our findings, based on the CAQ metric, indicate that while most evaluated LLMs perform relatively well in Criticality and Justice, they consistently underperform on the Actionability axis. Among the models evaluated, Claude 3.7 Sonnet achieves the highest overall performance. We publicly release our CliME dataset and code to foster further research in this domain.

[NLP-111] Do LLM Evaluators Prefer Themselves for a Reason ?

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)作为自动评估器时可能存在的自我偏好(self-preference)问题,即LLMs倾向于偏好自身生成的回答。研究的核心问题是厘清这种自我偏好是否本质上是有害的,还是仅仅反映了更强大的模型确实能够产生更优的输出。由于先前研究多依赖主观任务,难以区分有害的自我偏好(即偏好客观上较差的响应)与合理的自我偏好(即偏好真正优越的响应),因此亟需一种客观的方法来解决这一问题。

为了解决上述问题,论文的关键方案在于采用可验证的基准测试(如数学推理、事实性知识、代码生成等)进行评估,这些任务允许客观地确定正确答案(ground-truth)。通过这种方式,研究可以在不同的模型家族(例如Llama、Qwen、Gemma、Mistral、Phi、GPT、DeepSeek)上开展大规模实验,并在控制的评估条件下分析自我偏好的性质。研究结果揭示了三个关键洞见:首先,生成能力更强的模型作为评估器的表现也更佳;其次,即使在较强的模型中,有害的自我偏好仍然存在,尤其是在其生成性能较差的情况下;最后,在推理阶段通过扩展链式思维(Chain-of-Thought)等策略可以有效减少有害的自我偏好。这些发现为基于LLMs的评估提供了更细致的理解,并为提高其可靠性提供了实用性的指导。
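
在可验证基准上,"有害"与"合理"的自我偏好可以被客观区分:只需把评审模型选中的自家答案与标准答案比对。下面的小函数给出这一判定逻辑的示意(接口与字段命名均为假设):

```python
def classify_self_preference(choice, own, other, correct):
    # choice: 评审模型的选择("own" 表示偏好自家答案)
    if choice != "own":
        return "none"
    own_ok, other_ok = own == correct, other == correct
    if own_ok and not other_ok:
        return "legitimate"   # 合理的自我偏好:自家答案客观更优
    if other_ok and not own_ok:
        return "harmful"      # 有害的自我偏好:偏好了客观更差的答案
    return "tie"              # 两者同对或同错,无法据此区分

print(classify_self_preference("own", own="42", other="41", correct="42"))  # legitimate
print(classify_self_preference("own", own="41", other="42", correct="42"))  # harmful
```

这正是论文选择数学推理、事实性知识、代码生成等有标准答案任务的原因:没有 `correct` 这一项,上述区分就无从谈起。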

链接: https://arxiv.org/abs/2504.03846
作者: Wei-Lin Chen,Zhepei Wei,Xinyu Zhu,Shi Feng,Yu Meng
机构: University of Virginia (弗吉尼亚大学); George Washington University (乔治华盛顿大学)
类目: Computation and Language (cs.CL)
备注: Preprint. 31 pages

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used as automatic evaluators in applications such as benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses, a tendency often intensifying with model size and capability. This raises a critical question: Is self-preference detrimental, or does it simply reflect objectively superior outputs from more capable models? Disentangling these has been challenging due to the usage of subjective tasks in previous studies. To address this, we investigate self-preference using verifiable benchmarks (mathematical reasoning, factual knowledge, code generation) that allow objective ground-truth assessment. This enables us to distinguish harmful self-preference (favoring objectively worse responses) from legitimate self-preference (favoring genuinely superior ones). We conduct large-scale experiments under controlled evaluation conditions across diverse model families (e.g., Llama, Qwen, Gemma, Mistral, Phi, GPT, DeepSeek). Our findings reveal three key insights: (1) Better generators are better judges – LLM evaluators’ accuracy strongly correlates with their task performance, and much of the self-preference in capable models is legitimate. (2) Harmful self-preference persists, particularly when evaluator models perform poorly as generators on specific task instances. Stronger models exhibit more pronounced harmful bias when they err, though such incorrect generations are less frequent. (3) Inference-time scaling strategies, such as generating a long Chain-of-Thought before evaluation, effectively reduce the harmful self-preference. These results provide a more nuanced understanding of LLM-based evaluation and practical insights for improving its reliability.
zh

[NLP-112] Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?

【速读】: 该论文试图解决在迭代训练循环中人类数据特性如何影响分布偏移(distribution shift)动态的问题。解决方案的关键在于通过系统分析不同数据集的人类数据属性,揭示这些属性如何影响分布偏移的速度与方向。研究首先验证了分布偏移动态在不同人类数据集之间存在显著差异,并发现数据质量在Twitter数据集中会影响偏移速度,但在Reddit数据集中不显著。进一步分析表明,词汇多样性较高的文本会导致更大的偏移,而语义多样性较高的文本则会减小偏移的危害。此外,研究还发现政治偏见的变化类型(如偏见减少、放大或反转)依赖于人类数据的真实分布倾向。总体而言,论文强调迭代微调的后果高度依赖于训练所基于的人类数据特性,这表明来自互联网不同部分(如GitHub、Reddit)的数据可能经历不同的分布偏移类型。

链接: https://arxiv.org/abs/2504.03814
作者: Grgur Kovač,Jérémy Perez,Rémy Portelas,Peter Ford Dominey,Pierre-Yves Oudeyer
机构: Flowers TEAM, INRIA (Flowers团队,法国国家信息与自动化研究所); Ubisoft La Forge (育碧La Forge实验室); INSERM UMR 1093-CAPS (法国国家健康与医学研究院); Robot Cognition Laboratory, Institute Marey (机器人认知实验室,马雷研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly contributing to the creation of content on the Internet. This creates a feedback loop as subsequent generations of models will be trained on this generated, synthetic data. This phenomenon is receiving increasing interest, in particular because previous studies have shown that it may lead to distribution shift - models misrepresent and forget the true underlying distributions of human data they are expected to approximate (e.g. resulting in a drastic loss of quality). In this study, we study the impact of human data properties on distribution shift dynamics in iterated training loops. We first confirm that the distribution shift dynamics greatly vary depending on the human data by comparing four datasets (two based on Twitter and two on Reddit). We then test whether data quality may influence the rate of this shift. We find that it does on the twitter, but not on the Reddit datasets. We then focus on a Reddit dataset and conduct a more exhaustive evaluation of a large set of dataset properties. This experiment associated lexical diversity with larger, and semantic diversity with smaller detrimental shifts, suggesting that incorporating text with high lexical (but limited semantic) diversity could exacerbate the degradation of generated text. We then focus on the evolution of political bias, and find that the type of shift observed (bias reduction, amplification or inversion) depends on the political lean of the human (true) distribution. Overall, our work extends the existing literature on the consequences of recursive fine-tuning by showing that this phenomenon is highly dependent on features of the human data on which training occurs. This suggests that different parts of internet (e.g. GitHub, Reddit) may undergo different types of shift depending on their properties.
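作为参考,摘要中提到的"词汇多样性"可以用最简单的 type-token ratio 近似(这只是一个示意性代理指标,论文实际采用的统计量可能不同;"语义多样性"则通常需要句向量等表征,此处不展开):

```python
def lexical_diversity(texts):
    # 词汇多样性的最简代理:type-token ratio = 去重词数 / 总词数
    tokens = [w for t in texts for w in t.split()]
    return len(set(tokens)) / max(len(tokens), 1)
```

该比值越高,文本的词汇越分散;按论文结论,这类文本在迭代训练中可能带来更大的有害偏移。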
zh

[NLP-113] What Large Language Models Do Not Talk About: An Empirical Study of Moderation and Censorship Practices

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在信息获取中的内容审核实践未被充分探索的问题,特别是当LLMs被用于政治话题时,其拒绝回答或遗漏信息的程度及方式。论文的关键在于区分硬审查(hard censorship)与软审查(soft censorship),并通过分析14个来自不同国家和地区的顶尖LLMs在联合国官方六种语言下的表现,揭示这些模型如何根据提供者的目标受众调整其审查策略,主要体现为硬审查或软审查,而非两者同时存在。研究强调了增加公开可用LLMs在意识形态和地理上的多样性,以及提高模型审核策略透明度的重要性,以支持用户的知情选择。所有数据均公开可获取。

链接: https://arxiv.org/abs/2504.03803
作者: Sander Noels,Guillaume Bied,Maarten Buyl,Alexander Rogiers,Yousra Fettach,Jefrey Lijffijt,Tijl De Bie
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 17 pages, 38 pages in total including appendix; 5 figures, 22 figures in appendix

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed as gateways to information, yet their content moderation practices remain underexplored. This work investigates the extent to which LLMs refuse to answer or omit information when prompted on political topics. To do so, we distinguish between hard censorship (i.e., generated refusals, error messages, or canned denial responses) and soft censorship (i.e., selective omission or downplaying of key elements), which we identify in LLMs’ responses when asked to provide information on a broad range of political figures. Our analysis covers 14 state-of-the-art models from Western countries, China, and Russia, prompted in all six official United Nations (UN) languages. Our analysis suggests that although censorship is observed across the board, it is predominantly tailored to an LLM provider’s domestic audience and typically manifests as either hard censorship or soft censorship (though rarely both concurrently). These findings underscore the need for ideological and geographic diversity among publicly available LLMs, and greater transparency in LLM moderation strategies to facilitate informed user choices. All data are made freely available.
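硬审查与软审查的区分可以用一个启发式标注器来示意:硬审查表现为模板化拒答,软审查表现为作答但遗漏关键要素(以下拒答标记与规则纯属示意,并非论文的实际标注流程):

```python
REFUSAL_MARKERS = ("i cannot", "i can't", "unable to provide", "无法回答")

def censorship_type(response, required_facts):
    # 启发式标注:hard = 模板化拒答;soft = 作答但遗漏关键要素;none = 要素齐全
    text = response.lower()
    if any(m in text for m in REFUSAL_MARKERS):
        return "hard"
    missing = [f for f in required_facts if f.lower() not in text]
    if missing:
        return "soft"
    return "none"
```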
zh

[NLP-114] Entropy-Based Block Pruning for Efficient Large Language Models

【速读】: 该论文旨在解决大型Transformer语言模型在实际部署中因计算和存储需求快速增长而面临的挑战。论文的关键在于提出了一种基于熵的剪枝策略(entropy-based pruning strategy),通过利用熵作为衡量计算块信息丰富度的更有效指标,替代传统的余弦相似性方法。实验结果表明,该方法在减小模型规模的同时保持了较高的准确性,为高效模型部署提供了有前景的方向。

链接: https://arxiv.org/abs/2504.03794
作者: Liangwei Yang,Yuhui Xu,Juntao Tan,Doyen Sahoo,Silvio Savarese,Caiming Xiong,Huan Wang,Shelby Heinecke
机构: Salesforce AI Research (Salesforce AI 研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 8 figures

点击查看摘要

Abstract:As large language models continue to scale, their growing computational and storage demands pose significant challenges for real-world deployment. In this work, we investigate redundancy within Transformer-based models and propose an entropy-based pruning strategy to enhance efficiency while maintaining performance. Empirical analysis reveals that the entropy of hidden representations decreases in the early blocks but progressively increases across most subsequent blocks. This trend suggests that entropy serves as a more effective measure of information richness within computation blocks. Unlike cosine similarity, which primarily captures geometric relationships, entropy directly quantifies uncertainty and information content, making it a more reliable criterion for pruning. Extensive experiments demonstrate that our entropy-based pruning approach surpasses cosine similarity-based methods in reducing model size while preserving accuracy, offering a promising direction for efficient model deployment.
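摘要中的基于熵的剪枝思路可以用几行代码示意:按各计算块隐藏表示的熵排序,熵低者信息量少、优先成为剪枝候选(以下为带假设的最简草图,直方图估熵与 keep_ratio 等细节均为示意,并非论文实现):

```python
import numpy as np

def block_entropy(hidden, bins=64):
    # 用直方图粗略估计某个计算块隐藏激活的香农熵
    counts, _ = np.histogram(hidden.ravel(), bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def rank_blocks_for_pruning(block_outputs, keep_ratio=0.75):
    # 熵越低的块信息量越少,优先列为剪枝候选
    ents = [block_entropy(h) for h in block_outputs]
    order = np.argsort(ents)              # 按熵升序
    n_prune = int(len(block_outputs) * (1 - keep_ratio))
    return sorted(order[:n_prune].tolist())
```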
zh

[NLP-115] Sample, Don't Search: Rethinking Test-Time Alignment for Language Models

【速读】: 该论文旨在解决在计算资源受限或模型权重私有无法微调的情况下,如何通过增加测试时计算来提升语言模型性能的问题。现有基于奖励模型(Reward Model, RM)的测试时搜索方法在计算规模增大时质量往往下降,主要原因是这些方法过度优化了本质上不完美的奖励代理。论文提出了一种新的测试时对齐方法QAlign,其关键在于利用近期Markov链Monte Carlo(MCMC)文本生成领域的进展,在不修改底层模型甚至无需访问logits的情况下,随着测试时计算资源的扩展,使输出更好地对齐,最终收敛到为每个单独提示采样最优对齐分布。实验表明,QAlign在数学推理基准(如GSM8K和GSM-Symbolic)上的表现优于现有的测试时计算方法,并且在更现实的Tulu 3偏好数据集训练的奖励模型下,于多个数据集上超越了直接偏好优化(DPO)、最佳-n(best-of-n)、多数投票(majority voting)以及加权多数投票(weighted majority voting)。

链接: https://arxiv.org/abs/2504.03790
作者: Gonçalo Faria,Noah A. Smith
机构: University of Washington (华盛顿大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). A practical solution to aligning language models at test time using additional computation without degradation, our approach expands the limits of the capability that can be obtained from off-the-shelf language models without further training.
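QAlign 借鉴了文本生成领域的 MCMC 方法。下面用一个在离散候选池上的玩具 Metropolis 采样器示意其核心思想:随着测试时计算增加,采样分布逼近 p(y) ∝ exp(reward(y)/β)(纯示意,候选池、提议分布等均为假设,并非论文算法本身):

```python
import math
import random

def qalign_style_sampler(candidates, reward_fn, beta=1.0, steps=300, seed=0):
    # 在离散候选池上做 Metropolis 采样,目标分布 p(y) ∝ exp(reward(y)/beta)
    rng = random.Random(seed)
    cur = rng.choice(candidates)
    cur_r = reward_fn(cur)
    visits = {}
    for _ in range(steps):
        prop = rng.choice(candidates)          # 对称的均匀提议分布
        r = reward_fn(prop)
        if math.log(rng.random() + 1e-12) < (r - cur_r) / beta:
            cur, cur_r = prop, r               # Metropolis 接受准则
        visits[cur] = visits.get(cur, 0) + 1
    return max(visits, key=visits.get)          # 返回驻留最久的候选
```

与 best-of-n 直接取奖励最高者不同,该采样过程收敛到对齐分布本身,而非对不完美奖励的极值点。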
zh

[NLP-116] Do “New Snow Tablets” Contain Snow? Large Language Models Over-Rely on Names to Identify Ingredients of Chinese Drugs

【速读】: 该论文旨在解决中药(TCM)药物成分识别任务中现有大型语言模型(LLMs)表现不佳的问题。研究发现,当前的中药专用LLMs主要依赖药物名称,而非系统性药理知识,并表现出对不熟悉配方的错误解读、过度使用无关草药以及未能正确理解验证任务等局限性。为克服这些限制,论文提出了一种基于检索增强生成(Retrieval Augmented Generation, RAG)的方法,专注于成分名称的处理。通过在220种中药制剂上的实验验证,该方法显著提升了成分验证任务的准确性,从约50%提升至82%。因此,该解决方案的关键在于结合检索与生成技术,以弥补现有LLMs缺乏系统性药理知识的不足。

链接: https://arxiv.org/abs/2504.03786
作者: Sifan Li,Yujun Cai,Bryan Hooi,Nanyun Peng,Yiwei Wang
机构: Liaoning University (辽宁大学); University of Queensland (昆士兰大学); National University of Singapore (新加坡国立大学); University of California, Los Angeles (加州大学洛杉矶分校); University of California, Merced (加州大学默塞德分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Traditional Chinese Medicine (TCM) has seen increasing adoption in healthcare, with specialized Large Language Models (LLMs) emerging to support clinical applications. A fundamental requirement for these models is accurate identification of TCM drug ingredients. In this paper, we evaluate how general and TCM-specialized LLMs perform when identifying ingredients of Chinese drugs. Our systematic analysis reveals consistent failure patterns: models often interpret drug names literally, overuse common herbs regardless of relevance, and exhibit erratic behaviors when faced with unfamiliar formulations. LLMs also fail to understand the verification task. These findings demonstrate that current LLMs rely primarily on drug names rather than possessing systematic pharmacological knowledge. To address these limitations, we propose a Retrieval Augmented Generation (RAG) approach focused on ingredient names. Experiments across 220 TCM formulations show our method significantly improves accuracy from approximately 50% to 82% in ingredient verification tasks. Our work highlights critical weaknesses in current TCM-specific LLMs and offers a practical solution for enhancing their clinical reliability.
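论文提出的面向成分名称的 RAG 思路可以用几行代码示意:先检索配方再回答,而不是按药名字面臆测(以下知识库条目仅为演示用途,不是权威药典数据):

```python
TCM_KB = {  # 极简示意知识库(条目内容为示例,非权威数据)
    "新雪片": ["石膏", "磁石", "栀子", "竹叶卷心"],
}

def verify_ingredient(drug, ingredient, kb=TCM_KB):
    # RAG 式验证:先检索配方记录,再判断成分是否存在
    formulation = kb.get(drug)
    if formulation is None:
        return "unknown"  # 未检索到配方时弃答,避免凭药名臆造
    return "yes" if ingredient in formulation else "no"
```

例如"新雪片"并不含"雪",检索后即可正确回答 no,而字面解读药名的模型容易答错。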
zh

[NLP-117] FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling

【速读】: 该论文旨在解决在大语言模型推理中解耦(prefill, P 和 decode, D)阶段的 KV 缓存传输延迟问题。传统方法中,块状调用方式和不连续的 KV 缓存内存分配导致频繁调用传输内核,同时固定的角色分配容易引发计算不平衡。论文的关键解决方案是提出了一种名为 FlowKV 的新型解耦推理框架,通过优化 KV 缓存传输将平均传输延迟降低了 96%,并引入了负载感知调度器以实现请求调度的平衡以及 PD 节点的灵活分配,从而最大化硬件资源利用率,在多种场景下显著提升了系统峰值吞吐量,并加速推理性能达 15.2%-48.9%。

链接: https://arxiv.org/abs/2504.03775
作者: Weiqing Li,Guochao Jiang,Xiangyong Ding,Zhangcheng Tao,Chuzhan Hao,Chenfeng Xu,Yuewei Zhang,Hao Wang
机构: Alibaba Cloud Computing (阿里云)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Disaggregated inference has become an essential framework that separates the prefill (P) and decode (D) stages in large language model inference to improve throughput. However, the KV cache transfer faces significant delays between prefill and decode nodes. The block-wise calling method and discontinuous KV cache memory allocation increase the number of calls to the transmission kernel. Additionally, existing frameworks often fix the roles of P and D nodes, leading to computational imbalances. In this paper, we propose FlowKV, a novel disaggregated inference framework, which reduces the average transmission latency of KV cache by 96%, from 0.944s to 0.053s, almost eliminating the transfer time relative to the total request latency by optimizing the KV cache transfer. FlowKV introduces the Load-Aware Scheduler for balanced request scheduling and flexible PD node allocation. This design maximizes hardware resource utilization, achieving peak system throughput across various scenarios, including normal, computational imbalance, and extreme overload conditions. Experimental results demonstrate that FlowKV significantly accelerates inference by 15.2%-48.9% on LongBench dataset compared to the baseline and supports applications with heterogeneous GPUs.
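负载感知调度的基本思想可以用一个最小堆草图示意:每个请求路由到当前负载最小的 prefill 节点和 decode 节点(这是对 Load-Aware Scheduler 思想的假设性简化,节点建模与代价估计均为示意):

```python
import heapq

def load_aware_schedule(requests, prefill_nodes, decode_nodes):
    # 玩具版负载感知调度:每个请求分配给当前累计负载最小的 P/D 节点
    p_heap = [(0, n) for n in prefill_nodes]; heapq.heapify(p_heap)
    d_heap = [(0, n) for n in decode_nodes]; heapq.heapify(d_heap)
    plan = []
    for req_id, cost in requests:
        pl, pn = heapq.heappop(p_heap)
        dl, dn = heapq.heappop(d_heap)
        plan.append((req_id, pn, dn))
        heapq.heappush(p_heap, (pl + cost, pn))   # 更新节点累计负载
        heapq.heappush(d_heap, (dl + cost, dn))
    return plan
```

相比固定 P/D 角色的静态分配,这种按负载动态路由能缓解计算不均衡。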
zh

[NLP-118] TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images

【速读】: 该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)在处理顶部视角(top-down)图像理解任务中的能力不足问题。这一领域长期以来受到有限的多样化数据集以及收集此类数据挑战的制约,而顶部视角图像能够提供明确的空间概览并增强场景上下文理解,在自动驾驶导航、航拍成像和空间规划等任务中具有重要价值。论文的关键解决方案是提出了TDBench,这是一个针对顶部视角图像理解任务的全面基准测试平台。TDBench通过整合公共顶部视角数据集和高质量模拟图像构建而成,涵盖多样化的现实世界与合成场景,并包含十个维度的图像理解视觉问答对。此外,论文还通过四个常见但研究较少的实际场景案例研究进一步验证了其有效性。通过评估现有VLM的优势与局限性,论文期望TDBench能为未来相关研究提供指导和启发。

链接: https://arxiv.org/abs/2504.03748
作者: Kaiyuan Hou,Minghui Zhao,Lilin Xu,Yuang Fan,Xiaofan Jiang
机构: Columbia University (哥伦比亚大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid emergence of Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling applications in scene comprehension and visual reasoning. While these models have been primarily evaluated and developed for front-view image understanding, their capabilities in interpreting top-down images have received limited attention, partly due to the scarcity of diverse top-down datasets and the challenges in collecting such data. In contrast, top-down vision provides explicit spatial overviews and improved contextual understanding of scenes, making it particularly valuable for tasks like autonomous navigation, aerial imaging, and spatial planning. In this work, we address this gap by introducing TDBench, a comprehensive benchmark for VLMs in top-down image understanding. TDBench is constructed from public top-down view datasets and high-quality simulated images, including diverse real-world and synthetic scenarios. TDBench consists of visual question-answer pairs across ten evaluation dimensions of image understanding. Moreover, we conduct four case studies that commonly happen in real-world scenarios but are less explored. By revealing the strengths and limitations of existing VLM through evaluation results, we hope TDBench to provide insights for motivating future research. Project homepage: this https URL
zh

[NLP-119] A Unified Virtual Mixture-of-Experts Framework: Enhanced Inference and Hallucination Mitigation in Single-Model System

【速读】: 该论文旨在解决生成式模型(Generative Models)在小规模架构中容易产生幻觉(hallucinations),即生成非事实或误导性内容的问题,这限制了其实际应用。为了解决这一挑战,论文提出了一种统一的虚拟混合专家(Virtual Mixture-of-Experts, MoE)融合策略,在不增加参数量的前提下提升小规模Qwen 1.5 0.5B模型的推理性能并缓解幻觉现象。关键在于利用多个领域特定的专家提示(可调节数量)从不同视角引导模型,并结合基于均值和标准差的统计异常截断策略过滤异常高概率预测,同时通过嵌入空间噪声注入增强输出多样性。此外,采用固定投票机制而非动态门控网络以明确评估各模块贡献,避免额外干扰因素。理论分析表明,该方法通过降低输出方差抑制幻觉现象。

链接: https://arxiv.org/abs/2504.03739
作者: Mingyan Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generative models, such as GPT and BERT, have significantly improved performance in tasks like text generation and summarization. However, hallucinations (where models generate non-factual or misleading content) are especially problematic in smaller-scale architectures, limiting their real-world applications. In this paper, we propose a unified Virtual Mixture-of-Experts (MoE) fusion strategy that enhances inference performance and mitigates hallucinations in a single Qwen 1.5 0.5B model without increasing the parameter count. Our method leverages multiple domain-specific expert prompts (with the number of experts being adjustable) to guide the model from different perspectives. We apply a statistical outlier truncation strategy based on the mean and standard deviation to filter out abnormally high probability predictions, and we inject noise into the embedding space to promote output diversity. To clearly assess the contribution of each module, we adopt a fixed voting mechanism rather than a dynamic gating network, thereby avoiding additional confounding factors. We provide detailed theoretical derivations from both statistical and ensemble learning perspectives to demonstrate how our method reduces output variance and suppresses hallucinations. Extensive ablation experiments on dialogue generation tasks show that our approach significantly improves inference accuracy and robustness in small models. Additionally, we discuss methods for evaluating the orthogonality of virtual experts and outline the potential for future work involving dynamic expert weight allocation using gating networks.
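其中的两个模块——基于均值和标准差的异常截断、固定投票机制——可以用如下草图示意(截断阈值系数 k 与投票形式均为假设性示意,非论文给出的具体取值):

```python
import numpy as np
from collections import Counter

def truncate_outliers(values, k=1.0):
    # 统计异常截断:将超过 mean + k*std 的异常高预测值截断到阈值
    mu, sd = values.mean(), values.std()
    return np.minimum(values, mu + k * sd)

def fixed_vote(expert_token_choices):
    # 固定投票机制:对各虚拟专家的候选输出做简单多数表决(无门控网络)
    return Counter(expert_token_choices).most_common(1)[0][0]
```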
zh

[NLP-120] Misaligned Roles Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots

【速读】: 本文旨在解决多模态语言模型(Multimodal Language Models, MMLMs)在经过后训练对齐(post-training alignment)以防止有害内容生成时存在的局限性。尽管现有方法主要关注助手角色的对齐,但忽视了用户角色的对齐,并且局限于固定的特殊标记输入提示结构,导致模型在输入偏离预期时容易受到攻击。为此,论文引入了一种名为角色模态攻击(Role-Modality Attacks, RMAs)的新类别对抗攻击,这种攻击通过混淆用户与助手的角色以及改变图像标记的位置来诱发有害输出。不同于以往修改查询内容的攻击方式,RMAs专注于操纵输入结构而不改变查询本身。论文系统性地评估了这些攻击在多种视觉语言模型(Vision Language Models, VLMs)上的八个不同设置下的表现,证明它们可以组合形成更强的对抗性提示,并通过残差流中负拒绝方向投影增加的现象验证了其有效性,这一特性与之前成功的攻击一致。

针对上述问题,论文的关键解决方案是提出一种对抗性训练方法,使模型能够抵抗输入提示扰动。具体而言,通过在一系列有害和良性提示上进行训练,所有提示都使用不同的RMA设置进行了扰动,从而使模型不再对角色混淆和模态操作攻击敏感,而是专注于输入提示结构中的查询内容,从而有效降低攻击成功率(Attack Success Rate, ASR),同时保持模型的整体实用性。

链接: https://arxiv.org/abs/2504.03735
作者: Erfan Shayegani,G M Shahariar,Sara Abdali,Lei Yu,Nael Abu-Ghazaleh,Yue Dong
机构: University of California, Riverside(加州大学河滨分校); Microsoft Applied Sciences Group(微软应用科学组); University of Toronto(多伦多大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal Language Models (MMLMs) typically undergo post-training alignment to prevent harmful content generation. However, these alignment stages focus primarily on the assistant role, leaving the user role unaligned, and stick to a fixed input prompt structure of special tokens, leaving the model vulnerable when inputs deviate from these expectations. We introduce Role-Modality Attacks (RMA), a novel class of adversarial attacks that exploit role confusion between the user and assistant and alter the position of the image token to elicit harmful outputs. Unlike existing attacks that modify query content, RMAs manipulate the input structure without altering the query itself. We systematically evaluate these attacks across multiple Vision Language Models (VLMs) on eight distinct settings, showing that they can be composed to create stronger adversarial prompts, as also evidenced by their increased projection in the negative refusal direction in the residual stream, a property observed in prior successful attacks. Finally, for mitigation, we propose an adversarial training approach that makes the model robust against input prompt perturbations. By training the model on a range of harmful and benign prompts all perturbed with different RMA settings, it loses its sensitivity to Role Confusion and Modality Manipulation attacks and is trained to only pay attention to the content of the query in the input prompt structure, effectively reducing Attack Success Rate (ASR) while preserving the model’s general utility.
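角色-模态攻击只操纵输入结构而不改动查询内容,其对抗训练所需的结构扰动可以用下面的草图示意(swap_prob 等参数与消息表示方式均为本文假设的简化形式):

```python
import random

def perturb_roles(messages, swap_prob=0.5, seed=0):
    # RMA 式结构扰动:按概率交换 user/assistant 角色标签,消息内容不变
    rng = random.Random(seed)
    out = []
    for role, content in messages:
        if rng.random() < swap_prob:
            role = "assistant" if role == "user" else "user"
        out.append((role, content))
    return out
```

对一批良性与有害提示施加此类扰动后再训练,模型便被迫只关注查询内容而非输入结构。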
zh

[NLP-121] CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward

【速读】: 该论文试图解决传统强化学习中基于二元准确性奖励(0/1 accuracy reward)导致的学习效率低下问题。为了解决这一问题,论文提出了一种名为模糊组相对策略奖励(Fuzzy Group Relative Policy Reward, FGRPR)的新框架,其关键在于将组相对策略优化(Group Relative Policy Optimization, GRPO)与模糊奖励函数相结合。模糊奖励模型通过提供更精细的激励机制,鼓励输出更加精确的结果。实验表明,在标准0/1准确性奖励下,GRPO的表现不如监督微调(SFT),而FGRPR在多个领域内数据集上超越了包括GPT-4o、LLaMA2(90B)和SFT在内的基线模型,并在目标值较大时于跨领域数据集上表现出优于SFT的性能,这得益于其模糊奖励函数对更接近正确答案的输出赋予更高的奖励。这种方法适用于对答案精度要求较高的任务场景。

链接: https://arxiv.org/abs/2504.03724
作者: Zhiqiang Wang,Pengbin Feng,Yanbin Lin,Shuzhang Cai,Zongao Bian,Jinghua Yan,Xingquan Zhu
机构: Florida Atlantic University (佛罗里达大西洋大学); Amazon.com, Inc. (亚马逊公司); University of Texas at Dallas (德克萨斯大学达拉斯分校); Georgia Institute of Technology (乔治亚理工学院); University of Utah (犹他大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 11 pages, 6 figures and 4 tables

点击查看摘要

Abstract:We propose Fuzzy Group Relative Policy Reward (FGRPR), a novel framework that integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward function to enhance learning efficiency. Unlike the conventional binary 0/1 accuracy reward, our fuzzy reward model provides nuanced incentives, encouraging more precise outputs. Experimental results demonstrate that GRPO with a standard 0/1 accuracy reward underperforms compared to supervised fine-tuning (SFT). In contrast, FGRPR, applied to Qwen2.5-VL(3B and 7B), surpasses all baseline models, including GPT4o, LLaMA2(90B), and SFT, across five in-domain datasets. On an out-of-domain dataset, FGRPR achieves performance comparable to SFT but excels when target values are larger, as its fuzzy reward function assigns higher rewards to closer approximations. This approach is broadly applicable to tasks where the precision of the answer is critical. Code and data: this https URL
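模糊奖励与 0/1 奖励的差别可以用一个针对计数任务的最简奖励函数示意(衰减形式为线性相对误差,仅作概念演示,论文采用的具体函数可能不同):

```python
def fuzzy_reward(pred, target):
    # 模糊奖励:随相对误差线性衰减,近似正确的输出也能获得部分奖励
    if target == 0:
        return 1.0 if pred == 0 else 0.0
    rel_err = abs(pred - target) / abs(target)
    return max(0.0, 1.0 - rel_err)
```

例如真实人数为 100 时,预测 95 在 0/1 奖励下得 0,而模糊奖励给出 0.95,从而为"更接近"提供梯度信号。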
zh

[NLP-122] Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在实际应用中稳定性不足的问题,这一领域目前研究较少。尽管LLMs已在广泛使用,但缺乏对其在多种扰动条件下稳定性的严格评估方法。为填补这一空白,论文提出了一种基于信息几何统计方法的新颖稳定性度量,该度量具备理想的不变性特性,能够有效分析模型对参数和输入扰动的敏感性。关键在于通过引入这种稳定性度量,并结合广泛的实验验证其在识别显著参数、检测输入图像中的脆弱区域以及嵌入标记的关键维度方面的有效性,同时证明了该框架在模型合并过程中提升鲁棒性的能力。

链接: https://arxiv.org/abs/2504.03714
作者: Runpeng Dai,Run Yang,Fan Zhou,Hongtu Zhu
机构: Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA (北卡罗来纳大学教堂山分校生物统计系); Department of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China (上海财经大学统计与管理学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) and Vision-Language Models (VLMs) have become essential to general artificial intelligence, exhibiting remarkable capabilities in task understanding and problem-solving. However, the real-world reliability of these models critically depends on their stability, which remains an underexplored area. Despite their widespread use, rigorous studies examining the stability of LLMs under various perturbations are still lacking. In this paper, we address this gap by proposing a novel stability measure for LLMs, inspired by statistical methods rooted in information geometry. Our measure possesses desirable invariance properties, making it well-suited for analyzing model sensitivity to both parameter and input perturbations. To assess the effectiveness of our approach, we conduct extensive experiments on models ranging in size from 1.5B to 13B parameters. Our results demonstrate the utility of our measure in identifying salient parameters and detecting vulnerable regions in input images or critical dimensions in token embeddings. Furthermore, leveraging our stability framework, we enhance model robustness during model merging, leading to improved performance.
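稳定性度量的直觉——模型输出对微小扰动的敏感程度——可以用一个经验探针示意(这只是对论文信息几何度量的粗略替身,扰动规模与统计方式均为假设):

```python
import numpy as np

def perturbation_sensitivity(f, x, eps=1e-3, trials=32, seed=0):
    # 经验稳定性探针:小幅随机扰动输入后,输出偏移的平均幅度 / 扰动尺度
    rng = np.random.default_rng(seed)
    base = f(x)
    shifts = []
    for _ in range(trials):
        noise = rng.normal(scale=eps, size=x.shape)
        shifts.append(np.linalg.norm(f(x + noise) - base))
    return float(np.mean(shifts)) / eps
```

该数值越大,模型(或函数)对该输入越敏感,对应论文中用于定位脆弱区域的思路。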
zh

[NLP-123] Prot42: a Novel Family of Protein Language Models for Target-aware Protein Binder Generation

【速读】: 该论文旨在解决传统蛋白质工程方法因复杂性和资源密集性而限制下一代生物技术与治疗创新的问题。解决方案的关键在于引入了一种基于蛋白质语言模型(Protein Language Models, pLMs)的新方法,通过预训练大量无标注蛋白质序列的Prot42模型家族,利用先进的自回归解码器-only架构,从进化、结构和功能层面捕获深层洞见,仅依赖语言信息即可显著提升计算蛋白质设计的能力。Prot42模型能够处理长达8,192个氨基酸的序列,突破常规限制,实现大型蛋白质及复杂多域序列的精确建模,并在生成高亲和力蛋白结合剂和序列特异性DNA结合蛋白方面展现出强大的实际应用能力。

链接: https://arxiv.org/abs/2504.04453
作者: Mohammad Amaan Sayeed,Engin Tekin,Maryam Nadeem,Nancy A. ElNaker,Aahan Singh,Natalia Vassilieva,Boulbaba Ben Amor
机构: Inception Institute of Artificial Intelligence (初创人工智能研究所), Abu Dhabi, UAE; Cerebras Systems, Sunnyvale, CA, USA
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Unlocking the next generation of biotechnology and therapeutic innovation demands overcoming the inherent complexity and resource-intensity of conventional protein engineering methods. Recent GenAI-powered computational techniques often rely on the availability of the target protein’s 3D structures and specific binding sites to generate high-affinity binders, constraints exhibited by models such as AlphaProteo and RFdiffusion. In this work, we explore the use of Protein Language Models (pLMs) for high-affinity binder generation. We introduce Prot42, a novel family of Protein Language Models (pLMs) pretrained on vast amounts of unlabeled protein sequences. By capturing deep evolutionary, structural, and functional insights through an advanced auto-regressive, decoder-only architecture inspired by breakthroughs in natural language processing, Prot42 dramatically expands the capabilities of computational protein design based on language only. Remarkably, our models handle sequences up to 8,192 amino acids, significantly surpassing standard limitations and enabling precise modeling of large proteins and complex multi-domain sequences. Demonstrating powerful practical applications, Prot42 excels in generating high-affinity protein binders and sequence-specific DNA-binding proteins. Our innovative models are publicly available, offering the scientific community an efficient and precise computational toolkit for rapid protein engineering.
zh

计算机视觉

[CV-0] CREA: A Collaborative Multi-Agent Framework for Creative Content Generation with Diffusion Models

【速读】: 该论文旨在解决人工智能图像创意编辑的核心挑战:不仅需要生成视觉上有吸引力的内容,还需具备对图像施加新颖、富有表现力且艺术性强的变换的能力。与基于直接提示的传统编辑任务不同,创意图像编辑需要一种自主且迭代的方法来平衡原创性、连贯性和艺术意图。为此,论文提出了一种名为CREA的新颖多智能体协作框架,该框架模仿人类的创意过程,其关键在于让一组专门化的AI代理动态协作,完成图像的概念化、生成、评价与优化。广泛的定性和定量评估表明,CREA在多样性、语义对齐和创意转换方面显著优于现有方法。通过将创造力构建为动态的、具备自主性的过程,CREA重新定义了AI与艺术的交汇点。

链接: https://arxiv.org/abs/2504.05306
作者: Kavana Venkatesh,Connor Dunlop,Pinar Yanardag
机构: Virginia Tech (弗吉尼亚理工学院暨州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project URL: this https URL

点击查看摘要

Abstract:Creativity in AI imagery remains a fundamental challenge, requiring not only the generation of visually compelling content but also the capacity to add novel, expressive, and artistically rich transformations to images. Unlike conventional editing tasks that rely on direct prompt-based modifications, creative image editing demands an autonomous, iterative approach that balances originality, coherence, and artistic intent. To address this, we introduce CREA, a novel multi-agent collaborative framework that mimics the human creative process. Our framework leverages a team of specialized AI agents who dynamically collaborate to conceptualize, generate, critique, and enhance images. Through extensive qualitative and quantitative evaluations, we demonstrate that CREA significantly outperforms state-of-the-art methods in diversity, semantic alignment, and creative transformation. By structuring creativity as a dynamic, agentic process, CREA redefines the intersection of AI and art, paving the way for autonomous AI-driven artistic exploration, generative design, and human-AI co-creation. To the best of our knowledge, this is the first work to introduce the task of creative editing.
zh

[CV-1] URECA: Unique Region Caption Anything

【速读】:该论文致力于解决现有区域级图像描述方法在多粒度场景下难以生成独特且一致描述的问题,这限制了其实际应用价值。为满足详细区域级理解的需求,论文引入了URECA数据集,这是一个专为多粒度区域描述设计的大规模数据集。与专注于显著物体的先前数据集不同,URECA通过包含多样化的物体、部件及背景元素,确保区域与描述之间具有唯一且一致的映射关系。论文的关键解决方案在于构建了一个分阶段的数据整理流水线,在每个阶段逐步优化区域选择与描述生成,同时利用多模态大语言模型(Multimodal Large Language Models, MLLMs)提升描述的独特性和语义丰富性。此外,论文提出的URECA模型通过简单但有效的方式修改现有MLLMs,保留关键的空间属性如位置和形状,进一步增强了描述的精细程度与语义深度,并通过动态掩码建模和高分辨率掩码编码器提升了描述的独特性。实验表明,URECA不仅在URECA数据集上达到当前最优性能,还能够在现有的区域级描述基准测试中表现良好。

链接: https://arxiv.org/abs/2504.05305
作者: Sangbeom Lim,Junwan Kim,Heeji Yoon,Jaewoo Jung,Seungryong Kim
机构: Korea University (韩国大学); Yonsei University (延世大学); KAIST AI (韩国科学技术院 AI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL Code: this https URL

点击查看摘要

Abstract:Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions across multi-granularity, limiting their real-world applicability. To address the need for detailed region-level understanding, we introduce URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. Unlike prior datasets that focus primarily on salient objects, URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements. Central to this is a stage-wise data curation pipeline, where each stage incrementally refines region selection and caption generation. By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity. Building upon this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions. URECA maintains essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions. Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness. Experiments show that URECA achieves state-of-the-art performance on URECA dataset and generalizes well to existing region-level captioning benchmarks.
zh

[CV-2] Gaussian Mixture Flow Matching Models

【速读】:该论文旨在解决扩散模型(Diffusion Models)在少量采样步数下的性能不足以及在无分类器引导(Classifier-Free Guidance, CFG)下容易产生过饱和颜色的问题。为了解决这些问题,论文提出了一种新颖的高斯混合流匹配(Gaussian Mixture Flow Matching, GMFlow)模型。其关键创新在于不再直接预测均值,而是通过学习动态高斯混合(GM)参数来捕捉多模态流速分布,并采用KL散度损失函数进行优化。此外,论文还引入了一种新的概率引导方案以缓解CFG中的过饱和问题,并开发了基于解析去噪分布和速度场的GM-SDE/ODE求解器,从而实现精确的少量步数采样。这些改进显著提升了生成质量,在ImageNet 256×256数据集上仅需6个采样步数即可达到0.942的Precision。

链接: https://arxiv.org/abs/2504.05304
作者: Hansheng Chen,Kai Zhang,Hao Tan,Zexiang Xu,Fujun Luan,Leonidas Guibas,Gordon Wetzstein,Sai Bi
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL

点击查看摘要

Abstract:Diffusion models approximate the denoising distribution as a Gaussian and predict its mean, whereas flow matching models reparameterize the Gaussian mean as flow velocity. However, they underperform in few-step sampling due to discretization error and tend to produce over-saturated colors under classifier-free guidance (CFG). To address these limitations, we propose a novel Gaussian mixture flow matching (GMFlow) model: instead of predicting the mean, GMFlow predicts dynamic Gaussian mixture (GM) parameters to capture a multi-modal flow velocity distribution, which can be learned with a KL divergence loss. We demonstrate that GMFlow generalizes previous diffusion and flow matching models where a single Gaussian is learned with an L2 denoising loss. For inference, we derive GM-SDE/ODE solvers that leverage analytic denoising distributions and velocity fields for precise few-step sampling. Furthermore, we introduce a novel probabilistic guidance scheme that mitigates the over-saturation issues of CFG and improves image generation quality. Extensive experiments demonstrate that GMFlow consistently outperforms flow matching baselines in generation quality, achieving a Precision of 0.942 with only 6 sampling steps on ImageNet 256×256.
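预测高斯混合速度分布后,最简单的确定性采样方式是取混合分布的期望速度做 ODE 积分。下面给出一个一维玩具草图(predict_gm 接口与一步 Euler 积分均为示意,并非论文的 GM-ODE 求解器本身):

```python
import numpy as np

def gm_mean_velocity(weights, means):
    # 高斯混合速度分布的期望速度(一阶矩),权重自动归一化
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(np.dot(w, np.asarray(means, dtype=float)))

def euler_ode_step(x, t, dt, predict_gm):
    # 用混合分布的期望速度做一步确定性 ODE 积分(few-step 采样的最简近似)
    weights, means = predict_gm(x, t)
    return x + dt * gm_mean_velocity(weights, means)
```

单高斯流匹配等价于混合分量数为 1 的特例;多分量则能表达多模态的速度分布。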
zh
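GMFlow的核心差异在于把"预测单一均值 + L2损失"替换为"预测高斯混合参数 + 负对数似然(KL型)损失"。下面用NumPy给出该损失的一个最小示意(非论文官方实现,K个分量的权重、均值、对数方差均为假设的模型输出):

```python
import numpy as np

def gm_nll(weights, means, log_vars, target):
    """目标流速 target 在对角高斯混合下的负对数似然。
    GMFlow 式的模型可最小化这类损失来学习多模态速度分布;
    传统流匹配相当于 K=1 的特例,退化为 L2 损失。"""
    weights = np.asarray(weights, dtype=float)    # (K,)
    means = np.asarray(means, dtype=float)        # (K, D)
    log_vars = np.asarray(log_vars, dtype=float)  # (K, D)
    target = np.asarray(target, dtype=float)      # (D,)
    diff2 = (target - means) ** 2
    # 各分量的对角高斯对数密度
    comp_logp = -0.5 * np.sum(
        log_vars + diff2 / np.exp(log_vars) + np.log(2 * np.pi), axis=1)
    logp = comp_logp + np.log(weights)
    m = logp.max()                                # log-sum-exp 防止数值下溢
    return -(m + np.log(np.exp(logp - m).sum()))
```

对一个双峰的目标速度,双分量混合的NLL明显低于取"折中均值"的单高斯,这正是多模态流速分布相对单均值预测的收益所在。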

[CV-3] InteractVLM: 3D Interaction Reasoning from 2D Foundational Models CVPR2025

【速读】:本文旨在解决从单张自然场景图像中估计人体与物体之间三维接触点的问题,以实现准确的人体-物体联合三维重建。由于遮挡、深度模糊以及物体形状多样性等因素,这一任务极具挑战性。传统方法依赖于昂贵的动作捕捉系统或繁琐的手动标注来获取三维接触数据,这限制了其可扩展性和泛化能力。为了解决这些问题,InteractVLM利用大型视觉语言模型(Vision-Language Models, VLMs)的广泛视觉知识,并通过少量三维接触数据进行微调。然而,直接应用这些模型并不容易,因为它们仅在二维空间中进行推理,而人体与物体的接触本质上是三维的。为此,论文提出了一种新颖的Render-Localize-Lift模块:(1)通过多视角渲染将三维人体和物体表面嵌入到二维空间;(2)训练一种新的多视角定位模型(MV-Loc)以推断二维中的接触点;(3)将这些结果提升至三维空间。此外,还引入了一个新任务——语义人体接触估计,其中人体接触预测明确依赖于物体语义信息,从而实现更丰富的交互建模。InteractVLM在接触点估计方面优于现有方法,并且能够从野外图像中促进三维重建。代码和模型可在提供的链接中获取。关键在于Render-Localize-Lift模块及其结合多视角推理和三维提升的能力,以及语义人体接触估计任务的设计。

链接: https://arxiv.org/abs/2504.05303
作者: Sai Kumar Dwivedi,Dimitrije Antić,Shashank Tripathi,Omid Taheri,Cordelia Schmid,Michael J. Black,Dimitrios Tzionas
机构: Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所), Tübingen, Germany; University of Amsterdam (UvA) (阿姆斯特丹大学), the Netherlands; Inria (法国国家信息与自动化研究所), École normale supérieure, CNRS, PSL Research University, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes. Existing methods rely on 3D contact annotations collected via expensive motion-capture systems or tedious manual labeling, limiting scalability and generalization. To overcome this, InteractVLM harnesses the broad visual knowledge of large Vision-Language Models (VLMs), fine-tuned with limited 3D contact data. However, directly applying these models is non-trivial, as they reason only in 2D, while human-object contact is inherently 3D. Thus we introduce a novel Render-Localize-Lift module that: (1) embeds 3D body and object surfaces in 2D space via multi-view rendering, (2) trains a novel multi-view localization model (MV-Loc) to infer contacts in 2D, and (3) lifts these to 3D. Additionally, we propose a new task called Semantic Human Contact estimation, where human contact predictions are conditioned explicitly on object semantics, enabling richer interaction modeling. InteractVLM outperforms existing work on contact estimation and also facilitates 3D reconstruction from an in-the-wild image. Code and models are available at this https URL.
zh

[CV-4] S4M: Boosting Semi-Supervised Instance Segmentation with SAM

【速读】:该论文致力于解决半监督实例分割中因标注数据有限导致的精确实例定位困难问题。当前基于教师-学生框架的方法受限于伪标签质量的不可靠性,而Segment Anything Model (SAM) 虽具备强大的多粒度分割能力,但直接应用于该任务会面临类别无关预测及过分割等挑战。论文的关键解决方案在于精心设计了一种新颖的知识蒸馏方法,将SAM的精确局部化能力有效整合进半监督实例分割框架,同时保持语义识别不受影响。此外,通过引入伪标签优化及针对优化后伪标签的专用数据增强策略,进一步提升了性能,最终实现了最先进的性能表现,并通过全面的实验与消融研究验证了所提方法的有效性。

链接: https://arxiv.org/abs/2504.05301
作者: Heeji Yoon,Heeseong Shin,Eunbeen Hong,Hyunwook Choi,Hansang Cho,Daun Jeong,Seungryong Kim
机构: KAIST AI (KAIST AI); Korea University (韩国大学); Samsung Electro-Mechanics (三星电子机械)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semi-supervised instance segmentation poses challenges due to limited labeled data, causing difficulties in accurately localizing distinct object instances. Current teacher-student frameworks still suffer from performance constraints due to unreliable pseudo-label quality stemming from limited labeled data. While the Segment Anything Model (SAM) offers robust segmentation capabilities at various granularities, directly applying SAM to this task introduces challenges such as class-agnostic predictions and potential over-segmentation. To address these complexities, we carefully integrate SAM into the semi-supervised instance segmentation framework, developing a novel distillation method that effectively captures the precise localization capabilities of SAM without compromising semantic recognition. Furthermore, we incorporate pseudo-label refinement as well as a specialized data augmentation with the refined pseudo-labels, resulting in superior performance. We establish state-of-the-art performance, and provide comprehensive experiments and ablation studies to validate the effectiveness of our proposed approach.
zh

[CV-5] SmolVLM: Redefining small and efficient multimodal models

【速读】:该论文旨在解决大型视觉语言模型(Vision-Language Models, VLMs)因计算资源需求过高而难以部署于移动和边缘设备的问题。传统的小型VLM通常沿袭大型模型的设计选择,如复杂的图像标记化策略,导致GPU内存使用效率低下,限制了其在实际设备上的应用。论文的关键解决方案在于引入SmolVLM系列紧凑型多模态模型,通过系统性探索架构配置、高效的标记化策略以及精心策划的训练数据,实现了低计算开销下的高性能表现。关键设计选择包括战略性架构优化、激进但高效的标记化方法以及高质量的训练数据筛选,使得SmolVLM能够在极小的内存占用下(例如SmolVLM-256M模型仅需不到1GB GPU内存)超越规模大300倍的Idefics-80B模型,同时其最大版本(2.2B参数)在GPU内存消耗仅为领先模型一半的情况下达到最先进的性能水平。此外,SmolVLM展示了出色的视频理解能力,证明了其在多模态任务中的实用性和能效优势。

链接: https://arxiv.org/abs/2504.05299
作者: Andrés Marafioti,Orr Zohar,Miquel Farré,Merve Noyan,Elie Bakouch,Pedro Cuenca,Cyril Zakka,Loubna Ben Allal,Anton Lozhkov,Nouamane Tazi,Vaibhav Srivastav,Joshua Lochner,Hugo Larcher,Mathieu Morlon,Lewis Tunstall,Leandro von Werra,Thomas Wolf
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.
zh

[CV-6] One-Minute Video Generation with Test-Time Training CVPR2025

【速读】:该论文试图解决长上下文视频生成的问题,具体而言是如何从文本故事板生成一分钟长度的视频。传统Transformer模型在处理长上下文时因自注意力机制效率低下而受限,而替代方案如Mamba层则难以表达复杂的多场景故事。论文的关键解决方案是引入Test-Time Training (TTT) 层,其隐藏状态本身可以是神经网络,从而具备更强的表达能力。通过将TTT层加入预训练的Transformer模型中,模型能够生成连贯性更高且能讲述复杂故事的视频,在人类评估中比基线方法(如Mamba、Gated DeltaNet和滑动窗口注意力层)高出34个Elo点。

链接: https://arxiv.org/abs/2504.05298
作者: Karan Dalal,Daniel Koceja,Gashon Hussein,Jiarui Xu,Yue Zhao,Youjin Song,Shihao Han,Ka Chun Cheung,Jan Kautz,Carlos Guestrin,Tatsunori Hashimoto,Sanmi Koyejo,Yejin Choi,Yu Sun,Xiaolong Wang
机构: NVIDIA (英伟达); Stanford University (斯坦福大学); UCSD (加州大学圣地亚哥分校); UC Berkeley (加州大学伯克利分校); UT Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks, therefore more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards. For proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have only experimented with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories. Sample videos, code and annotations are available at: this https URL
zh
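TTT层的关键思想是:隐藏状态本身是一个小模型,推理时对每个token先在自监督损失上做一步梯度更新,再产生输出。以下是一个极简机制示意(隐藏状态为线性模型,自监督任务取为重建输入;与论文的视频生成模型无关,仅用于说明"测试时训练"这一机制):

```python
import numpy as np

def ttt_layer(tokens, lr=0.1):
    """TTT 层的极简示意:隐藏状态是一个线性模型 W,
    对每个 token 先在自监督重建损失 ||Wx - x||^2 上做一步梯度更新,
    再用更新后的 W 产生该 token 的输出。"""
    d = tokens.shape[1]
    W = np.zeros((d, d))
    outputs = []
    for x in tokens:
        pred = W @ x
        grad = 2.0 * np.outer(pred - x, x)   # 重建损失对 W 的梯度
        W = W - lr * grad                    # 测试时训练:更新隐藏状态
        outputs.append(W @ x)                # 用更新后的隐藏状态计算输出
    return np.stack(outputs)
```

由于隐藏状态是可学习的模型而非固定向量,其表达能力随序列长度持续提升:对重复出现的模式,后面token的重建误差会明显低于开头的token。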

[CV-7] Let it Snow! Animating Static Gaussian Scenes With Dynamic Weather Effects

【速读】:该论文致力于解决在静态3D高斯点云场景中引入自然交互动态元素的挑战。解决方案的关键在于提出了一种新颖的混合框架,结合高斯粒子表示法,通过将静态3D高斯点云映射到基于粒子的表示,并使用物质点方法(Material Point Method, MPM)模拟动态粒子的运动,同时引入专门的碰撞处理技术以正确处理动态元素与静态场景之间的相互作用。此外,该方法还设计了特定的外观参数,用于支持多种物理逼真的天气效果(如降雪、降雨、雾和沙尘暴)以及掉落物体的模拟。实验表明,该方法在视觉质量和物理真实性方面显著优于现有方法。

链接: https://arxiv.org/abs/2504.05296
作者: Gal Fiebelman,Hadar Averbuch-Elor,Sagie Benaim
机构: The Hebrew University of Jerusalem (希伯来大学耶路撒冷); Cornell University (康奈尔大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Project webpage: this https URL

点击查看摘要

Abstract:3D Gaussian Splatting has recently enabled fast and photorealistic reconstruction of static 3D scenes. However, introducing dynamic elements that interact naturally with such static scenes remains challenging. Accordingly, we present a novel hybrid framework that combines Gaussian-particle representations for incorporating physically-based global weather effects into static 3D Gaussian Splatting scenes, correctly handling the interactions of dynamic elements with the static scene. We follow a three-stage process: we first map static 3D Gaussians to a particle-based representation. We then introduce dynamic particles and simulate their motion using the Material Point Method (MPM). Finally, we map the simulated particles back to the Gaussian domain while introducing appearance parameters tailored for specific effects. To correctly handle the interactions of dynamic elements with the static scene, we introduce specialized collision handling techniques. Our approach supports a variety of weather effects, including snowfall, rainfall, fog, and sandstorms, and can also support falling objects, all with physically plausible motion and appearance. Experiments demonstrate that our method significantly outperforms existing approaches in both visual quality and physical realism.
zh
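论文三阶段流程(高斯→粒子→MPM模拟→映射回高斯)中,核心是"动态粒子运动 + 与静态场景的碰撞处理"。下面用一个玩具循环示意这两步(仅含重力与平面地面碰撞;真实实现使用MPM与完整场景几何,以下代码纯属概念性简化):

```python
import numpy as np

def simulate_snowfall(positions, steps=100, dt=0.05, gravity=-9.8, ground=0.0):
    """玩具版动态粒子模拟:粒子受重力下落,
    碰撞规则将其约束在静态场景几何(此处为平面地面)之上。"""
    pos = positions.astype(float).copy()   # (N, 3),z 轴向上
    vel = np.zeros_like(pos)
    for _ in range(steps):
        vel[:, 2] += gravity * dt          # 重力积分
        pos += vel * dt
        # 与静态场景的碰撞处理
        hit = pos[:, 2] < ground
        pos[hit, 2] = ground
        vel[hit] = 0.0                     # 雪粒落地后停留(粘附)
    return pos
```

将"平面地面"换成静态3D高斯场景重建出的表面,即得到论文中动态天气元素与静态场景正确交互的基本骨架。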

[CV-8] AnomalousNet: A Hybrid Approach with Attention U-Nets and Change Point Detection for Accurate Characterization of Anomalous Diffusion in Video Data

【速读】:该论文致力于解决在短且噪声较大的视频数据中,通过粒子轨迹准确估计异常扩散指数和扩散系数的问题,特别是当这些数据导致轨迹不完整且异质性较高时,传统统计方法面临显著挑战。论文的关键解决方案在于提出了一种数据驱动的方法,该方法整合了粒子跟踪、基于注意力机制的U-Net架构以及变点检测算法,不仅能够高精度地推断异常扩散参数,还能识别时间上的状态转换,即使在噪声存在和时间分辨率有限的情况下亦如此。这种方法在第2届异常扩散(AnDi)挑战赛的视频任务基准测试中表现优异,位列顶级提交之列。

链接: https://arxiv.org/abs/2504.05271
作者: Yusef Ahsini,Marc Escoto,J. Alberto Conejero
机构: Instituto Universitario de Matemática Pura y Aplicada (IUMPA), Universitat Politècnica de València (瓦伦西亚理工大学); Centro de Investigación en Gestión e Ingeniería de Producción (CIGIP), Universitat Politècnica de València (瓦伦西亚理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 20 pages, 9 figures

点击查看摘要

Abstract:Anomalous diffusion occurs in a wide range of systems, including protein transport within cells, animal movement in complex habitats, pollutant dispersion in groundwater, and nanoparticle motion in synthetic materials. Accurately estimating the anomalous diffusion exponent and the diffusion coefficient from the particle trajectories is essential to distinguish between sub-diffusive, super-diffusive, or normal diffusion regimes. These estimates provide a deeper insight into the underlying dynamics of the system, facilitating the identification of particle behaviors and the detection of changes in diffusion states. However, analyzing short and noisy video data, which often yield incomplete and heterogeneous trajectories, poses a significant challenge for traditional statistical approaches. We introduce a data-driven method that integrates particle tracking, an attention U-Net architecture, and a change-point detection algorithm to address these issues. This approach not only infers the anomalous diffusion parameters with high accuracy but also identifies temporal transitions between different states, even in the presence of noise and limited temporal resolution. Our methodology demonstrated strong performance in the 2nd Anomalous Diffusion (AnDi) Challenge benchmark within the top submissions for video tasks.
zh
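论文要估计的异常扩散指数 α 定义于 MSD(τ) ∼ K·τ^α:α<1 为亚扩散,α≈1 为正常扩散,α>1 为超扩散。作为与深度模型对照的经典统计基线,可在对数-对数空间对MSD斜率做线性拟合(示意代码,非挑战赛官方评测实现):

```python
import numpy as np

def estimate_alpha(trajectory, max_lag=20):
    """经典基线:由 MSD(tau) ~ K * tau**alpha
    在 log-log 空间的拟合斜率估计异常扩散指数 alpha。"""
    traj = np.asarray(trajectory, dtype=float)   # (T, D) 粒子轨迹
    lags = np.arange(1, max_lag + 1)
    # 各时间间隔下的均方位移
    msd = np.array([np.mean(np.sum((traj[l:] - traj[:-l]) ** 2, axis=1))
                    for l in lags])
    slope, _ = np.polyfit(np.log(lags), np.log(msd), 1)
    return slope
```

匀速直线(弹道)轨迹给出 α≈2 的超扩散极限,布朗运动给出 α≈1;正是这种拟合在短、噪声轨迹上的失效,促使论文转向注意力U-Net等数据驱动方法。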

[CV-9] From Sparse Signal to Smooth Motion: Real-Time Motion Generation with Rolling Prediction Models CVPR’25

【速读】:该论文试图解决在扩展现实(XR)中利用不可靠输入信号生成流畅全身运动的问题。传统方法依赖于来自动作控制器的空间稀疏且始终开启的输入信号,而许多XR应用倾向于采用基于视觉的手部跟踪以减少用户摩擦并提升沉浸感。然而,手部跟踪信号相较于控制器不够精确,甚至可能长时间缺失。为应对这些不可靠输入,论文提出了滚动预测模型(Rolling Prediction Model, RPM),这是一种在线实时方法,可以从时空稀疏的输入信号中生成平滑的全身运动。RPM的关键在于不仅能够生成与输入匹配的准确运动(即跟踪模式),还能在输入缺失时生成合理的运动(即合成模式),并且能够在跟踪与合成模式之间实现无缝过渡。此外,为了验证处理噪声和缺失输入的重要性,论文发布了GORP数据集,这是首个来自商用虚拟现实(VR)头显的真实稀疏输入数据集,并提供了高质量的身体运动地面真实值。通过在合成数据和GORP上的基准测试,展示了如何利用现实数据集弥合实际应用中的差距。

链接: https://arxiv.org/abs/2504.05265
作者: German Barquero,Nadine Bertsch,Manojkumar Marramreddy,Carlos Chacón,Filippo Arcadu,Ferran Rigual,Nicky Sijia He,Cristina Palmero,Sergio Escalera,Yuting Ye,Robin Kips
机构: Meta Reality Labs (Meta 实验室); Universitat de Barcelona (巴塞罗那大学); Computer Vision Center (计算机视觉中心); King’s College London (伦敦国王学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in CVPR’25. Webpage: this https URL

点击查看摘要

Abstract:In extended reality (XR), generating full-body motion of the users is important to understand their actions, drive their virtual avatars for social interaction, and convey a realistic sense of presence. While prior works focused on spatially sparse and always-on input signals from motion controllers, many XR applications opt for vision-based hand tracking for reduced user friction and better immersion. Compared to controllers, hand tracking signals are less accurate and can even be missing for an extended period of time. To handle such unreliable inputs, we present Rolling Prediction Model (RPM), an online and real-time approach that generates smooth full-body motion from temporally and spatially sparse input signals. Our model generates 1) accurate motion that matches the inputs (i.e., tracking mode) and 2) plausible motion when inputs are missing (i.e., synthesis mode). More importantly, RPM generates seamless transitions from tracking to synthesis, and vice versa. To demonstrate the practical importance of handling noisy and missing inputs, we present GORP, the first dataset of realistic sparse inputs from a commercial virtual reality (VR) headset with paired high quality body motion ground truth. GORP provides 14 hours of VR gameplay data from 28 people using motion controllers (spatially sparse) and hand tracking (spatially and temporally sparse). We benchmark RPM against the state of the art on both synthetic data and GORP to highlight how we can bridge the gap for real-world applications with a realistic dataset and by handling unreliable input signals. Our code, pretrained models, and GORP dataset are available in the project webpage.
zh

[CV-10] Explaining Low Perception Model Competency with High-Competency Counterfactuals

【速读】:该论文试图解决的问题是如何解释图像分类模型在预测中缺乏置信度的原因,而不仅仅是表明其不确定性水平。传统方法多关注模型决策的解释,但对模型为何缺乏信心的研究较少。论文提出利用反事实(Counterfactual)图像来揭示低模型能力(即广义预测不确定性的一种形式)的具体原因。

解决方案的关键在于开发五种新颖的反事实图像生成方法:Image Gradient Descent (IGD)、Feature Gradient Descent (FGD)、Autoencoder Reconstruction (Reco)、Latent Gradient Descent (LGD) 和 Latent Nearest Neighbors (LNN)。通过在两个包含已知低模型能力原因的数据集上的评估,发现Reco、LGD和LNN是最有前景的反事实生成方法。进一步研究显示,在预训练的Multimodal Large Language Model (MLLM) 中结合这些反事实图像能够显著提升语言模型生成准确解释的能力,从而证明了反事实图像在解释低感知模型能力方面的实用价值。

链接: https://arxiv.org/abs/2504.05254
作者: Sara Pohland,Claire Tomlin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:There exist many methods to explain how an image classification model generates its decision, but very little work has explored methods to explain why a classifier might lack confidence in its prediction. As there are various reasons the classifier might lose confidence, it would be valuable for this model to not only indicate its level of uncertainty but also explain why it is uncertain. Counterfactual images have been used to visualize changes that could be made to an image to generate a different classification decision. In this work, we explore the use of counterfactuals to offer an explanation for low model competency–a generalized form of predictive uncertainty that measures confidence. Toward this end, we develop five novel methods to generate high-competency counterfactual images, namely Image Gradient Descent (IGD), Feature Gradient Descent (FGD), Autoencoder Reconstruction (Reco), Latent Gradient Descent (LGD), and Latent Nearest Neighbors (LNN). We evaluate these methods across two unique datasets containing images with six known causes for low model competency and find Reco, LGD, and LNN to be the most promising methods for counterfactual generation. We further evaluate how these three methods can be utilized by pre-trained Multimodal Large Language Models (MLLMs) to generate language explanations for low model competency. We find that the inclusion of a counterfactual image in the language model query greatly increases the ability of the model to generate an accurate explanation for the cause of low model competency, thus demonstrating the utility of counterfactual images in explaining low perception model competency.
zh
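论文的五种方法中,Image Gradient Descent (IGD) 的思路最直接:沿能力评分对输入图像的梯度做上升,得到"高能力"反事实图像,其与原图的差异即解释了模型为何缺乏信心。下面用有限差分和一个玩具能力函数给出概念性示意(能力函数、参数均为示例假设,真实方法在可微的能力估计器上用反向传播求梯度):

```python
import numpy as np

def image_gradient_descent(image, competency_fn, lr=0.5, steps=50, eps=1e-4):
    """IGD 思路示意:沿(数值)梯度方向修改输入图像,
    使能力评分上升,得到高能力反事实图像。"""
    x = image.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(x)
        for idx in np.ndindex(x.shape):       # 中心差分估计逐像素梯度
            d = np.zeros_like(x)
            d[idx] = eps
            grad[idx] = (competency_fn(x + d) - competency_fn(x - d)) / (2 * eps)
        x += lr * grad                        # 对能力评分做梯度上升
    return x

# 玩具能力函数:图像越接近参考亮斑,"模型能力"越高(纯属示例假设)
reference = np.ones((2, 2))
def competency(img):
    return -float(np.sum((img - reference) ** 2))

dark_patch = np.zeros((2, 2))
counterfactual = image_gradient_descent(dark_patch, competency)
```

原图与反事实图之差(此处是"亮度不足")就是可交给MLLM解释低能力成因的视觉证据。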

[CV-11] Contour Integration Underlies Human-Like Vision

【速读】:该论文试图解决深度学习模型在泛化至新输入分布时仍落后于人类的问题,特别是针对轮廓整合(contour integration)这一人类视觉的标志性能力。现有基准未能通过多种受控条件分析模型的具体失败点。为解决此问题,研究设计了一个实验,测试在不同程度物体碎片化下的物体识别性能,并发现大多数测试模型(超过1,000个)在小规模训练数据集下表现远低于人类,仅当训练数据集规模达到约50亿时,模型性能才接近人类。解决方案的关键在于揭示了人类视觉中的“整合偏向”(integration bias),即偏好识别由方向性片段组成的物体而非无方向性片段。研究进一步表明,具备这种偏向的模型在任务中表现更优,且此偏向随训练数据集规模增加而增强。通过训练模型以模拟轮廓整合能力,不仅提升了模型的形状偏向(shape bias),还可能揭示了大规模数据驱动学习中用于物体识别的核心机制。

链接: https://arxiv.org/abs/2504.05253
作者: Ben Lonnqvist,Elsa Scialom,Abdulkadir Gokce,Zehra Merchant,Michael H. Herzog,Martin Schrimpf
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite the tremendous success of deep learning in computer vision, models still fall behind humans in generalizing to new input distributions. Existing benchmarks do not investigate the specific failure points of models by analyzing performance under many controlled conditions. Our study systematically dissects where and why models struggle with contour integration – a hallmark of human vision – by designing an experiment that tests object recognition under various levels of object fragmentation. Humans (n=50) perform at high accuracy, even with few object contours present. This is in contrast to models which exhibit substantially lower sensitivity to increasing object contours, with most of the over 1,000 models we tested barely performing above chance. Only at very large scales (~5B training dataset size) do models begin to approach human performance. Importantly, humans exhibit an integration bias – a preference towards recognizing objects made up of directional fragments over directionless fragments. We find that not only do models that share this property perform better at our task, but that this bias also increases with model training dataset size, and training models to exhibit contour integration leads to high shape bias. Taken together, our results suggest that contour integration is a hallmark of object vision that underlies object recognition performance, and may be a mechanism learned from data at scale.
zh

[CV-12] xture2LoD3: Enabling LoD3 Building Reconstruction With Panoramic Images CVPR

【速读】:该论文致力于解决LoD3(Level of Detail 3)建筑重建这一长期存在的挑战,主要难点在于传统对象导向建模范式对地表参照、封闭几何结构、立面语义以及低多边形表示的严格要求,与非结构化网格导向模型形成对比。论文提出的方法名为Texture2LoD3,其核心解决方案是利用3D建筑模型先验知识和街景全景图像,通过将低细节建筑模型作为平面目标来纠正街景全景图像的正射投影,并结合对精确纹理化低层次建筑表面的分割,从而实现满足LoD3要求的重建,包括地表参照、封闭几何结构和低多边形表示。此外,在缺乏LoD3验证数据的情况下,论文还引入了ReLoD3数据集,实验表明该方法可提升立面分割精度11%,并替代昂贵的手动投影过程。关键创新点在于结合建筑模型先验和图像处理技术,实现高效且低成本的LoD3建筑重建。

链接: https://arxiv.org/abs/2504.05249
作者: Wenzhao Tang,Weihang Li,Xiucheng Liang,Olaf Wysocki,Filip Biljecki,Christoph Holst,Boris Jutzi
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for CVPRW '25

点击查看摘要

Abstract:Despite recent advancements in surface reconstruction, Level of Detail (LoD) 3 building reconstruction remains an unresolved challenge. The main issue pertains to the object-oriented modelling paradigm, which requires georeferencing, watertight geometry, facade semantics, and low-poly representation – contrasting unstructured mesh-oriented models. In Texture2LoD3, we introduce a novel method leveraging the ubiquity of 3D building model priors and panoramic street-level images, enabling the reconstruction of LoD3 building models. We observe that prior low-detail building models can serve as valid planar targets for ortho-rectifying street-level panoramic images. Moreover, deploying segmentation on accurately textured low-level building surfaces supports maintaining essential georeferencing, watertight geometry, and low-poly representation for LoD3 reconstruction. In the absence of LoD3 validation data, we additionally introduce the ReLoD3 dataset, on which we experimentally demonstrate that our method leads to improved facade segmentation accuracy by 11% and can replace costly manual projections. We believe that Texture2LoD3 can scale the adoption of LoD3 models, opening applications in estimating building solar potential or enhancing autonomous driving simulations. The project website, code, and data are available here: this https URL.
zh

[CV-13] Federated Learning for Medical Image Classification: A Comprehensive Benchmark

【速读】:该论文旨在解决联邦学习在医疗影像分析中的优化算法适用性与性能提升问题。当前研究多集中于自然图像的小范围数据集,缺乏针对医疗场景的充分验证与比较实验。论文的关键在于通过全面评估多种先进的联邦学习算法在多个医疗影像数据集上的分类模型表现,并提出一种结合去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPM)的生成技术与标签平滑策略的数据增强方法,以显著提高联邦学习在医疗影像分类任务中的性能。该方案不仅提供了系统性能指标(如通信成本和计算效率)的评估,还为未来的研究和应用提供了基准与指导。

链接: https://arxiv.org/abs/2504.05238
作者: Zhekai Zhou,Guibo Luo,Mingzhi Chen,Zhenyu Weng,Yuesheng Zhu
机构: School of Electronic and Computer Engineering, Peking University (北京大学电子与计算机工程学院); Shien-Ming Wu School of Intelligent Engineering, South China University of Technology (华南理工大学温子铭智能工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:The federated learning paradigm is well-suited for the field of medical image analysis, as it can effectively cope with machine learning on isolated multicenter data while protecting the privacy of participating parties. However, current research on optimization algorithms in federated learning often focuses on limited datasets and scenarios, primarily centered around natural images, with insufficient comparative experiments in medical contexts. In this work, we conduct a comprehensive evaluation of several state-of-the-art federated learning algorithms in the context of medical imaging. We conduct a fair comparison of classification models trained using various federated learning algorithms across multiple medical imaging datasets. Additionally, we evaluate system performance metrics, such as communication cost and computational efficiency, while considering different federated learning architectures. Our findings show that medical imaging datasets pose substantial challenges for current federated learning optimization algorithms. No single algorithm consistently delivers optimal performance across all medical federated learning scenarios, and many optimization algorithms may underperform when applied to these datasets. Our experiments provide a benchmark and guidance for future research and application of federated learning in medical imaging contexts. Furthermore, we propose an efficient and robust method that combines generative techniques using denoising diffusion probabilistic models with label smoothing to augment datasets, widely enhancing the performance of federated learning on classification tasks across various medical imaging datasets. Our code will be released on GitHub, offering a reliable and comprehensive benchmark for future federated learning studies in medical imaging.
zh
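此类基准中最常见的参照算法是FedAvg:服务器按客户端本地样本量加权平均各方的模型参数。论文提出的标签平滑增强也可以简单示意。以下均为概念性代码,非论文官方实现:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg 服务器端聚合:按本地数据量加权平均各客户端的逐层参数。
    client_weights: 每个客户端一个"逐层参数数组"列表。"""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    n_layers = len(client_weights[0])
    return [sum(c * w[k] for c, w in zip(coeffs, client_weights))
            for k in range(n_layers)]

def smooth_labels(onehot, eps=0.1):
    """标签平滑:将 one-hot 目标软化为 (1 - eps) * y + eps / K。"""
    onehot = np.asarray(onehot, dtype=float)
    k = onehot.shape[-1]
    return onehot * (1 - eps) + eps / k
```

按样本量加权意味着数据量大的中心对全局模型影响更大,这也是医疗多中心数据异质性会给此类优化算法带来挑战的原因之一。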

[CV-14] Mapping biodiversity at very-high resolution in Europe

【速读】:该论文旨在解决高分辨率生物多样性制图的挑战,特别是在欧洲大陆尺度上的物种分布预测、生物多样性指标生成及栖息地分类问题。传统方法在处理跨物种依赖关系建模、异构存在-缺失数据的偏差感知训练以及多源遥感数据的大规模推断方面存在局限性。为了解决这些问题,论文提出了一种级联多模态管道(cascading multimodal pipeline),其关键是结合深度物种分布模型(deep-SDM)和基于Transformer的语言模型Pl@ntBERT。其中,deep-SDM利用遥感数据、气候时间序列和物种观测数据,在50×50米的空间分辨率下预测物种组成;Pl@ntBERT则用于基于这些预测结果生成生物多样性指标地图并进行栖息地分类。通过这一框架,论文实现了从洲际尺度的物种分布图、生物多样性指标图到栖息地地图的生成,提供了精细的生态学洞见。

链接: https://arxiv.org/abs/2504.05231
作者: César Leblanc,Lukas Picek,Benjamin Deneu,Pierre Bonnet,Maximilien Servajean,Rémi Palard,Alexis Joly
机构: INRIA; WSL; CIRAD; LIRMM
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper describes a cascading multimodal pipeline for high-resolution biodiversity mapping across Europe, integrating species distribution modeling, biodiversity indicators, and habitat classification. The proposed pipeline first predicts species compositions using a deep-SDM, a multimodal model trained on remote sensing, climate time series, and species occurrence data at 50x50m resolution. These predictions are then used to generate biodiversity indicator maps and classify habitats with Pl@ntBERT, a transformer-based LLM designed for species-to-habitat mapping. With this approach, continental-scale species distribution maps, biodiversity indicator maps, and habitat maps are produced, providing fine-grained ecological insights. Unlike traditional methods, this framework enables joint modeling of interspecies dependencies, bias-aware training with heterogeneous presence-absence data, and large-scale inference from multi-source remote sensing inputs.
zh

[CV-15] A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text?

【速读】:该论文试图解决医学影像分析中文本-图像预训练模型面临的标注数据稀缺以及难以有效编码细粒度医学概念的问题。解决方案的关键在于回归到监督式的单模态预训练(unimodal pre-training),利用细粒度标签替代现有的多模态方法,以更好地整合异构数据源并提升模型性能。研究还探讨了新的方法来更有效地结合细粒度标签与噪声文本监督。

链接: https://arxiv.org/abs/2504.05227
作者: Julio Silva-Rodríguez,Jose Dolz,Ismail Ben Ayed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IPMI 2025. Code and weights: this https URL

点击查看摘要

Abstract:Vision-language pre-training has recently gained popularity as it allows learning rich feature representations using large-scale data sources. This paradigm has quickly made its way into the medical image analysis community. In particular, there is an impressive amount of recent literature developing vision-language models for radiology. However, the available medical datasets with image-text supervision are scarce, and medical concepts are fine-grained, involving expert knowledge that existing vision-language models struggle to encode. In this paper, we propose to take a prudent step back from the literature and revisit supervised, unimodal pre-training, using fine-grained labels instead. We conduct an extensive comparison demonstrating that unimodal pre-training is highly competitive and better suited to integrating heterogeneous data sources. Our results also question the potential of recent vision-language models for open-vocabulary generalization, which have been evaluated using optimistic experimental settings. Finally, we study novel alternatives to better integrate fine-grained labels and noisy text supervision.
zh

[CV-16] Reinforced Multi-teacher Knowledge Distillation for Efficient General Image Forgery Detection and Localization AAAI2025

【速读】:该论文旨在解决图像伪造检测与定位(Image Forgery Detection and Localization, IFDL)任务中,现有方法在处理现实场景中经多样化篡改操作处理的伪造图像时效果不佳的问题。论文的关键在于提出了一种名为Reinforced Multi-teacher Knowledge Distillation (Re-MTKD) 的框架,其核心包括基于Encoder-Decoder结构的Cue-Net模型,以及针对三种主要图像伪造类型(复制-移动、拼接、修补)分别训练的三个Cue-Net教师模型。通过自知识蒸馏的方式,这些教师模型共同指导学生模型的学习,并结合Reinforced Dynamic Teacher Selection (Re-DTS) 策略动态分配教师权重,以实现对不同篡改痕迹共性与特性的有效学习。实验结果表明,该方法在多种新兴数据集上的性能优于当前其他先进方法。

链接: https://arxiv.org/abs/2504.05224
作者: Zeqin Yu,Jiangqun Ni,Jian Zhang,Haoyi Deng,Yuzhen Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published to AAAI2025 (Oral)

点击查看摘要

Abstract:Image forgery detection and localization (IFDL) is of vital importance as forged images can spread misinformation that poses potential threats to our daily lives. However, previous methods still struggled to effectively handle forged images processed with diverse forgery operations in real-world scenarios. In this paper, we propose a novel Reinforced Multi-teacher Knowledge Distillation (Re-MTKD) framework for the IFDL task, structured around an encoder-decoder ConvNeXt-UperNet along with an Edge-Aware Module, named Cue-Net. First, three Cue-Net models are separately trained for the three main types of image forgeries, i.e., copy-move, splicing, and inpainting, which then serve as the multi-teacher models to train the target student model with Cue-Net through self-knowledge distillation. A Reinforced Dynamic Teacher Selection (Re-DTS) strategy is developed to dynamically assign weights to the involved teacher models, which facilitates specific knowledge transfer and enables the student model to effectively learn both the common and specific natures of diverse tampering traces. Extensive experiments demonstrate that, compared with other state-of-the-art methods, the proposed method achieves superior performance on several recently emerged datasets comprised of various kinds of image forgeries.
zh
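Re-DTS 的核心是按教师质量动态加权三个伪造类型教师(复制-移动、拼接、修补)给学生的软目标。下面用softmax权重给出一个简化示意(真实方法通过强化学习得到选择策略;此处的质量分数仅为占位假设):

```python
import numpy as np

def weighted_soft_target(teacher_probs, teacher_scores, temperature=1.0):
    """把多个教师的预测按质量分数加权合成一个软目标。
    teacher_probs: (T, ...) 各教师的预测概率图;
    teacher_scores: (T,) 各教师的质量分数(Re-DTS 选择策略的简化替身)。"""
    scores = np.asarray(teacher_scores, dtype=float) / temperature
    scores = scores - scores.max()                    # 数值稳定
    weights = np.exp(scores) / np.exp(scores).sum()   # 对教师做 softmax
    probs = np.asarray(teacher_probs, dtype=float)
    return np.tensordot(weights, probs, axes=1)       # 沿教师维加权求和
```

当某个教师(例如对应当前样本伪造类型的教师)分数远高于其他教师时,软目标几乎完全由它决定,这近似了"针对特定篡改痕迹选择特定教师"的效果。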

[CV-17] An ensemble deep learning approach to detect tumors on Mohs micrographic surgery slides

【速读】:该论文旨在解决Mohs显微手术(Mohs micrographic surgery, MMS)中切除高风险非黑色素瘤皮肤癌时,由于术中组织病理学检查耗时、费力且需要高度专业化而带来的挑战。论文的关键解决方案是开发一种基于深度学习的模型,用于检测Mohs切片中的基底细胞癌(BCC)及伪影。解决方案的核心在于利用U-Net架构训练模型以分割切片中的肿瘤和非肿瘤区域,并进一步对分割后的图像块进行分类,从而实现对整个切片图像(WSI)的预测。在分割阶段,肿瘤和非肿瘤区域的Dice评分分别为0.70和0.67,AUC评分分别为0.98和0.96;在肿瘤分类阶段,基于图像块的检测AUC达到0.98,基于切片的检测AUC为0.91。最终,该研究展示了能够高效检测Mohs切片中肿瘤与非肿瘤区域的人工智能系统,可辅助Mohs外科医生和皮肤病理学家做出更精准的决策。

链接: https://arxiv.org/abs/2504.05219
作者: Abdurrahim Yilmaz,Serra Atilla Aydin,Deniz Temur,Furkan Yuceyalcin,Berkin Deniz Kahya,Rahmetullah Varol,Ozay Gokoz,Gulsum Gencoglan,Huseyin Uvet,Gonca Elcin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 14 pages, 2 figures

点击查看摘要

Abstract:Mohs micrographic surgery (MMS) is the gold standard technique for removing high risk nonmelanoma skin cancer; however, intraoperative histopathological examination demands significant time, effort, and professionality. The objective of this study is to develop a deep learning model to detect basal cell carcinoma (BCC) and artifacts on Mohs slides. A total of 731 Mohs slides from 51 patients with BCCs were used in this study, with 91 containing tumor and 640 without tumor which was defined as non-tumor. The dataset was employed to train U-Net based models that segment tumor and non-tumor regions on the slides. The segmented patches were classified as tumor, or non-tumor to produce predictions for whole slide images (WSIs). For the segmentation phase, the deep learning model success was measured using a Dice score with 0.70 and 0.67 value, area under the curve (AUC) score with 0.98 and 0.96 for tumor and non-tumor, respectively. For the tumor classification, an AUC of 0.98 for patch-based detection, and AUC of 0.91 for slide-based detection was obtained on the test dataset. We present an AI system that can detect tumors and non-tumors in Mohs slides with high success. Deep learning can aid Mohs surgeons and dermatopathologists in making more accurate decisions.
zh
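摘要以 Dice 与 AUC 衡量分割质量。下面用纯 Python 给出二值掩膜 Dice 系数的最小示意(函数名与示例掩膜均为演示用假设,与原论文实现无关):

```python
def dice_score(pred, target):
    """Dice = 2|P∩T| / (|P| + |T|),输入为展平的 0/1 掩膜列表。"""
    inter = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2 * inter / total if total else 1.0

# 两个 5 像素的玩具掩膜:交集 2 像素,各自前景 3 像素 → 2*2/(3+3) ≈ 0.667
print(dice_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

AUC 则基于连续预测分数的排序统计,计算方式不同于这种逐像素重叠度量。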

[CV-18] Correcting Class Imbalances with Self-Training for Improved Universal Lesion Detection and Tagging

【速读】:该论文旨在解决CT研究中通用病灶检测与标记(ULDT)的挑战,特别是在肿瘤负荷评估和随时间跟踪病灶状态(生长/缩小)方面。由于缺乏完全标注的数据以及类别不平衡的问题,现有方法难以有效发展。为了解决这些问题,论文提出了一种基于自训练的解决方案。关键在于先用DeepLesion数据集的有限子集(11.5%)训练一个VFNet模型,随后通过自训练过程将从更大的未见数据子集中挖掘出的新病灶候选纳入训练集,并经多轮迭代更新模型。此外,通过引入类别平衡策略,如对挖掘出的病灶样本进行上采样以及采用可变阈值策略,使4 FP处的灵敏度较不做类别平衡的自训练提升6.5%(72% vs. 78.5%),并最终对全部8种病灶类别实现了灵敏度的提升或保持。

链接: https://arxiv.org/abs/2504.05207
作者: Alexander Shieh,Tejas Sudharshan Mathai,Jianfei Liu,Angshuman Paul,Ronald M. Summers
机构: Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Radiology and Imaging Sciences, Clinical Center, National Institutes of Health (国立卫生研究院), Bethesda MD, USA; Indian Institute of Technology (印度理工学院), Jodhpur, Rajasthan, India
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published at SPIE Medical Imaging 2023

点击查看摘要

Abstract:Universal lesion detection and tagging (ULDT) in CT studies is critical for tumor burden assessment and tracking the progression of lesion status (growth/shrinkage) over time. However, a lack of fully annotated data hinders the development of effective ULDT approaches. Prior work used the DeepLesion dataset (4,427 patients, 10,594 studies, 32,120 CT slices, 32,735 lesions, 8 body part labels) for algorithmic development, but this dataset is not completely annotated and contains class imbalances. To address these issues, in this work, we developed a self-training pipeline for ULDT. A VFNet model was trained on a limited 11.5% subset of DeepLesion (bounding boxes + tags) to detect and classify lesions in CT studies. Then, it identified and incorporated novel lesion candidates from a larger unseen data subset into its training set, and self-trained itself over multiple rounds. Multiple self-training experiments were conducted with different threshold policies to select predicted lesions with higher quality and cover the class imbalances. We discovered that direct self-training improved the sensitivities of over-represented lesion classes at the expense of under-represented classes. However, upsampling the lesions mined during self-training along with a variable threshold policy yielded a 6.5% increase in sensitivity at 4 FP in contrast to self-training without class balancing (72% vs 78.5%) and a 11.7% increase compared to the same self-training policy without upsampling (66.8% vs 78.5%). Furthermore, we show that our results either improved or maintained the sensitivity at 4FP for all 8 lesion classes.
zh
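上述"按阈值挖掘伪标签 + 对少数类上采样"的类别平衡思路,可以用如下玩具代码勾勒(类名、阈值与数据均为演示用假设,并非 DeepLesion 的真实类别):

```python
from collections import Counter

def mine_and_balance(preds, thresholds, target_count):
    """按各类别置信度阈值筛选预测病灶,再把每个类别
    通过复制上采样到 target_count 条,缓解类别不平衡。"""
    mined = [(c, s) for c, s in preds if s >= thresholds.get(c, 0.5)]
    by_class = {}
    for c, s in mined:
        by_class.setdefault(c, []).append((c, s))
    balanced = []
    for c, items in by_class.items():
        reps = -(-target_count // len(items))  # 向上取整的复制次数
        balanced.extend((items * reps)[:target_count])
    return balanced

preds = [("liver", 0.9), ("liver", 0.8), ("bone", 0.95), ("bone", 0.4)]
out = mine_and_balance(preds, {"liver": 0.7, "bone": 0.9}, 2)
print(Counter(c for c, _ in out))  # liver 与 bone 各 2 条
```

真实流水线中"可变阈值"会随自训练轮次调整,这里简化为固定的每类阈值。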

[CV-19] 3D Universal Lesion Detection and Tagging in CT with Self-Training

【速读】:该论文试图解决放射科医生在计算机断层扫描(CT)研究中进行病灶定位、分类和大小测量这一繁琐任务的问题,并提出了一种通用病灶检测与标记(ULDT)的方法来同时减轻病灶测量的负担并支持肿瘤负荷评估。然而,现有的ULDT方法依赖于公开的DeepLesion数据集,该数据集不仅未能提供病灶的完整三维(3D)范围,还表现出严重的类别不平衡。为了解决这些问题,论文的关键方案是一种自训练流程,用于检测3D病灶并根据其发生的解剖部位对其进行标记。具体而言,通过使用DeepLesion数据集的一个显著受限的30%子集训练一个VFNet模型来进行二维病灶检测与标记,然后将二维病灶上下文扩展到三维,并将挖掘出的三维病灶提议重新集成到基线训练数据中以实现多轮模型再训练。这种自训练过程使VFNet模型能够从自己的预测中学习,从而实现在三维空间中检测并标记病灶。结果表明,与使用整个DeepLesion数据集的现有方法相比,我们的方法在有限的30%数据子集上达到了相似的平均敏感性(46.9% vs. 46.8%),并且首次实现了联合检测三维病灶及按解剖部位标签进行标记的能力。

链接: https://arxiv.org/abs/2504.05201
作者: Jared Frazier,Tejas Sudharshan Mathai,Jianfei Liu,Angshuman Paul,Ronald M. Summers
机构: Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Radiology and Imaging Sciences, Clinical Center, National Institutes of Health (国立卫生研究院), Bethesda MD, USA; Indian Institute of Technology (印度理工学院), Jodhpur, Rajasthan, India
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published at SPIE Medical Imaging 2023

点击查看摘要

Abstract:Radiologists routinely perform the tedious task of lesion localization, classification, and size measurement in computed tomography (CT) studies. Universal lesion detection and tagging (ULDT) can simultaneously help alleviate the cumbersome nature of lesion measurement and enable tumor burden assessment. Previous ULDT approaches utilize the publicly available DeepLesion dataset, however it does not provide the full volumetric (3D) extent of lesions and also displays a severe class imbalance. In this work, we propose a self-training pipeline to detect 3D lesions and tag them according to the body part they occur in. We used a significantly limited 30% subset of DeepLesion to train a VFNet model for 2D lesion detection and tagging. Next, the 2D lesion context was expanded into 3D, and the mined 3D lesion proposals were integrated back into the baseline training data in order to retrain the model over multiple rounds. Through the self-training procedure, our VFNet model learned from its own predictions, detected lesions in 3D, and tagged them. Our results indicated that our VFNet model achieved an average sensitivity of 46.9% at [0.125:8] false positives (FP) with a limited 30% data subset in comparison to the 46.8% of an existing approach that used the entire DeepLesion dataset. To our knowledge, we are the first to jointly detect lesions in 3D and tag them according to the body part label.
zh
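"在 [0.125:8] 个假阳性(FP)下的平均灵敏度"属于 FROC 式评估。下面给出单个 FP 预算下灵敏度计算的简化示意(检测结果为虚构数据,仅演示计算逻辑):

```python
def sensitivity_at_fp(detections, num_lesions, num_scans, fp_budget):
    """按置信度从高到低累积检测结果,在每扫描平均假阳性数
    不超过 fp_budget 的前提下,报告能达到的最高灵敏度。"""
    tp = fp = 0
    best = 0.0
    for score, is_tp in sorted(detections, key=lambda d: -d[0]):
        if is_tp:
            tp += 1
        else:
            fp += 1
        if fp / num_scans <= fp_budget:
            best = tp / num_lesions
    return best

# (置信度, 是否真阳性);2 个真实病灶、1 次扫描、预算 1 FP/扫描
dets = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
print(sensitivity_at_fp(dets, num_lesions=2, num_scans=1, fp_budget=1))  # → 1.0
```

论文中的指标是在多个 FP 预算点([0.125:8])上对该灵敏度取平均。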

[CV-20] Training state-of-the-art pathology foundation models with orders of magnitude less data

【速读】:该论文旨在解决计算病理学领域中基于视觉基础模型(Vision Foundation Models, FMs)在训练数据规模、模型性能以及利用病理图像信息效率方面的局限性问题。论文的关键解决方案在于对标准DINOv2框架引入文献中的多项改进,并结合病理学领域的特定图像处理技术优化基础模型的训练,同时引入训练后(post-training)在更高分辨率图像上微调的流程,以增强嵌入表示的信息丰富度。此外,研究表明,即便所用全切片图像(WSIs)数量比其他先进模型少最多两个数量级,所提出的基础模型仍能在下游任务中取得相当或更优的性能;仅用TCGA(1.2万张WSI)训练的模型即超过多数现有基础模型,平均水平与目前第二佳的Virchow2持平。这表明用于训练病理基础模型的模型与算法仍有巨大提升潜力。

链接: https://arxiv.org/abs/2504.05186
作者: Mikhail Karasikov,Joost van Doorn,Nicolas Känzig,Melis Erdal Cesur,Hugo Mark Horlings,Robert Berke,Fei Tang,Sebastian Otálora
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:The field of computational pathology has recently seen rapid advances driven by the development of modern vision foundation models (FMs), typically trained on vast collections of pathology images. Recent studies demonstrate that increasing the training data set and model size and integrating domain-specific image processing techniques can significantly enhance the model’s performance on downstream tasks. Building on these insights, our work incorporates several recent modifications to the standard DINOv2 framework from the literature to optimize the training of pathology FMs. We also apply a post-training procedure for fine-tuning models on higher-resolution images to further enrich the information encoded in the embeddings. We present three novel pathology FMs trained on up to two orders of magnitude fewer WSIs than those used to train other state-of-the-art FMs while demonstrating a comparable or superior performance on downstream tasks. Even the model trained on TCGA alone (12k WSIs) outperforms most existing FMs and, on average, matches Virchow2, the second-best FM published to date. This suggests that there still remains a significant potential for further improving the models and algorithms used to train pathology FMs to take full advantage of the vast data collections.
zh

[CV-21] MSA-UNet3+: Multi-Scale Attention UNet3+ with New Supervised Prototypical Contrastive Loss for Coronary DSA Image Segmentation

【速读】:该论文旨在解决冠状动脉数字减影血管造影(Digital Subtraction Angiography, DSA)图像分割中因低对比度、噪声、结构重叠、类内方差高及类别不平衡等问题导致的精确血管边界提取困难。为克服这些挑战,论文提出了一种多尺度注意力增强的UNet3+架构——MSA-UNet3+。其关键创新在于结合了多尺度膨胀瓶颈模块(Multi-Scale Dilated Bottleneck, MSD-Bottleneck)与上下文注意融合模块(Contextual Attention Fusion Module, CAFM),不仅提升了多尺度特征提取能力,还保留了细粒度细节并增强了上下文理解。此外,论文引入了一种新的监督原型对比损失函数(Supervised Prototypical Contrastive Loss, SPCL),通过关注难分类的背景样本,缓解类别不平衡和类内方差高的问题。实验结果表明,MSA-UNet3+在私有冠状动脉DSA数据集上的性能优于现有方法,显著提升了分割精度及相关指标。

链接: https://arxiv.org/abs/2504.05184
作者: Rayan Merghani Ahmed,Adnan Iltaf,Bin Li,Shoujun Zhou
机构: SIAT(深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress

点击查看摘要

Abstract:The accurate segmentation of coronary Digital Subtraction Angiography (DSA) images is essential for diagnosing and treating coronary artery diseases. Despite advances in deep learning-based segmentation, challenges such as low contrast, noise, overlapping structures, high intra-class variance, and class imbalance limit precise vessel delineation. To overcome these limitations, we propose the MSA-UNet3+: a Multi-Scale Attention enhanced UNet3+ architecture for coronary DSA image segmentation. The framework combined Multi-Scale Dilated Bottleneck (MSD-Bottleneck) with Contextual Attention Fusion Module (CAFM), which not only enhances multi-scale feature extraction but also preserve fine-grained details, and improve contextual understanding. Furthermore, we propose a new Supervised Prototypical Contrastive Loss (SPCL), which combines supervised and prototypical contrastive learning to minimize class imbalance and high intra-class variance by focusing on hard-to-classified background samples. Experiments carried out on a private coronary DSA dataset demonstrate that MSA-UNet3+ outperforms state-of-the-art methods, achieving a Dice coefficient of 87.73%, an F1-score of 87.78%, and significantly reduced Average Surface Distance (ASD) and Average Contour Distance (ACD). The developed framework provides clinicians with precise vessel segmentation, enabling accurate identification of coronary stenosis and supporting informed diagnostic and therapeutic decisions. The code will be released at the following GitHub profile link this https URL.
zh
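SPCL 中的"原型"通常指每个类别特征的均值向量。下面给出类原型计算这一基本构件的纯 Python 示意(完整损失还涉及温度系数与相似度度量,此处从略;特征与标签均为假设数据):

```python
def class_prototypes(features, labels):
    """原型 = 每个类别特征向量的均值,是原型式对比学习的基本构件。"""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
print(class_prototypes(feats, [0, 0, 1]))  # → {0: [2.0, 0.0], 1: [0.0, 2.0]}
```

有了原型后,对比损失再将样本拉近本类原型、推离他类原型,并重点关注难分类的背景样本。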

[CV-22] The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation

【速读】:该论文致力于解决运动表达视频分割(Motion Expression Video Segmentation)问题,其目标是根据输入的运动描述精确分割视频中的对象,尤其关注多对象及复杂运动表达场景。与传统的指代视频对象分割(Referring Video Object Segmentation, RVOS)不同,该任务更具挑战性。为应对这一挑战,论文提出了一种简单而有效的推理优化方法,以充分挖掘大规模多模态模型(Large Multimodal Models, LMMs)在指代视频分割中的潜力。方案的关键在于:首先采用Sa2VA作为基线模型,该模型是一种用于图像和视频密集接地理解的统一LMM;其次,在推理过程中均匀采样视频帧以增强模型对整个视频的理解能力;最后,通过集成多个专家模型的结果来缓解单一模型错误预测的问题。这一系列措施使该方法在MeViS测试集上达到了61.98%的JF分数,并在CVPR 2025第4届PVUW Challenge MeViS赛道中排名第一。

链接: https://arxiv.org/abs/2504.05178
作者: Hao Fang,Runmin Cong,Xiankai Lu,Zhiyang Chen,Wei Zhang
机构: Shandong University (山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Motion expression video segmentation is designed to segment objects in accordance with the input motion expressions. In contrast to the conventional Referring Video Object Segmentation (RVOS), it places emphasis on motion as well as multi-object expressions, making it more arduous. Recently, Large Multimodal Models (LMMs) have begun to shine in RVOS due to their powerful vision-language perception capabilities. In this work, we propose a simple and effective inference optimization method to fully unleash the potential of LMMs in referring video segmentation. Firstly, we use Sa2VA as our baseline, which is a unified LMM for dense grounded understanding of both images and videos. Secondly, we uniformly sample the video frames during the inference process to enhance the model’s understanding of the entire video. Finally, we integrate the results of multiple expert models to mitigate the erroneous predictions of a single model. Our solution achieved 61.98% JF on the MeViS test set and ranked 1st place in the 4th PVUW Challenge MeViS Track at CVPR 2025.
zh
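推理时对视频帧均匀采样是该方案中简单但关键的一步,可以按"每段取中点"实现(索引策略为合理假设,原文未给出具体公式):

```python
def uniform_sample(num_frames, k):
    """在 [0, num_frames) 上均匀选取 k 个帧索引:
    把视频等分为 k 段,取每段的中点帧。"""
    if k >= num_frames:
        return list(range(num_frames))
    step = num_frames / k
    return [int(step * i + step / 2) for i in range(k)]

print(uniform_sample(10, 5))   # → [1, 3, 5, 7, 9]
print(uniform_sample(100, 4))  # → [12, 37, 62, 87]
```

相比只取视频开头的连续帧,这种采样能让模型看到整个视频的运动过程,契合运动表达分割的需求。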

[CV-23] SSLFusion: Scale Space Aligned Latent Fusion Model for Multimodal 3D Object Detection AAAI2025

【速读】:该论文致力于解决多模态三维物体检测中因二维图像特征与三维点云特征在尺度和空间信息上的不匹配所导致的挑战。现有方法通常在单一阶段聚合多模态特征,难以有效整合不同尺度和模态间的特征,从而限制了检测精度。此外,基于Query-Key-Value (QKV) 的交叉注意力操作虽然有助于通过捕获非局部上下文推断物体的位置和存在性,但其计算复杂度较高。为应对这些挑战,论文提出了一种名为SSLFusion的新颖模型,其关键在于包含三个组成部分:尺度对齐融合策略(SAF)、三维到二维空间对齐模块(SAM)以及潜在跨模态融合模块(LFM)。其中,SAF 通过在多个级别上聚合图像和点云特征来缓解模态间的尺度不匹配;SAM 利用三维坐标信息融入二维图像特征以减少图像与点云特征之间的模态差距;LFM 在潜在空间中捕捉跨模态非局部上下文,避免使用QKV-based注意力操作,从而降低计算复杂度。实验结果表明,SSLFusion在KITTI和DENSE数据集上的表现优于现有最先进的方法,并在KITTI测试集的中等难度级别上获得了2.15%的绝对3D AP提升。

链接: https://arxiv.org/abs/2504.05170
作者: Bonan Ding,Jin Xie,Jing Nie,Jiale Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Multimodal 3D object detection based on deep neural networks has indeed made significant progress. However, it still faces challenges due to the misalignment of scale and spatial information between features extracted from 2D images and those derived from 3D point clouds. Existing methods usually aggregate multimodal features at a single stage. However, leveraging multi-stage cross-modal features is crucial for detecting objects of various scales. Therefore, these methods often struggle to integrate features across different scales and modalities effectively, thereby restricting the accuracy of detection. Additionally, the time-consuming Query-Key-Value-based (QKV-based) cross-attention operations often utilized in existing methods aid in reasoning the location and existence of objects by capturing non-local contexts. However, this approach tends to increase computational complexity. To address these challenges, we present SSLFusion, a novel Scale Space Aligned Latent Fusion Model, consisting of a scale-aligned fusion strategy (SAF), a 3D-to-2D space alignment module (SAM), and a latent cross-modal fusion module (LFM). SAF mitigates scale misalignment between modalities by aggregating features from both images and point clouds across multiple levels. SAM is designed to reduce the inter-modal gap between features from images and point clouds by incorporating 3D coordinate information into 2D image features. Additionally, LFM captures cross-modal non-local contexts in the latent space without utilizing the QKV-based attention operations, thus mitigating computational complexity. Experiments on the KITTI and DENSE datasets demonstrate that our SSLFusion outperforms state-of-the-art methods. Our approach obtains an absolute gain of 2.15% in 3D AP, compared with the state-of-art method GraphAlign on the moderate level of the KITTI test set.
zh
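SAM 要把三维坐标信息注入二维图像特征,前提是能把点云中的点投影到像平面。下面是标准针孔相机投影的几何示意(内参为假设值,具体对齐方式以原文为准):

```python
def project_to_image(point, fx, fy, cx, cy):
    """把相机坐标系下的 3D 点 (x, y, z) 按针孔模型
    投影为像素坐标 (u, v):u = fx*x/z + cx, v = fy*y/z + cy。"""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

# 焦距 100、主点 (50, 50) 的玩具内参
print(project_to_image((1.0, 2.0, 2.0), 100, 100, 50, 50))  # → (100.0, 150.0)
```

投影得到像素位置后,即可把对应 3D 坐标拼接或编码进该位置的 2D 图像特征,缩小模态间差距。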

[CV-24] Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion

【速读】:该论文旨在解决现有统一图像融合方法在处理多源图像信息整合时存在的两个主要问题:一是将所有融合任务视为单一问题导致忽略任务特定特性,从而限制整体性能;二是现有通用图像融合方法依赖显式的任务识别以适配不同任务,在推理阶段限制了模型对未见任务的泛化能力。为了解决这些问题,论文提出了一种名为“TITA”的新颖统一图像融合框架,其关键是动态平衡任务不变交互(Task-invariant Interaction)与任务特定适应(Task-specific Adaptation)。具体而言,通过引入增强像素注意力(Interaction-enhanced Pixel Attention, IPA)模块来提升像素级交互以提取更好的多源互补信息,同时利用基于操作的自适应融合(Operation-based Adaptive Fusion, OAF)模块根据任务属性动态调整操作权重。此外,还采用了快速自适应多任务优化(Fast Adaptive Multitask Optimization, FAMO)策略以缓解联合训练过程中跨任务梯度冲突的影响。实验结果表明,TITA 不仅在三种图像融合场景中实现了与专用方法竞争的性能,而且展示了对未见融合任务的强大泛化能力。

链接: https://arxiv.org/abs/2504.05164
作者: Xingyu Hu,Junjun Jiang,Chenyang Wang,Kui Jiang,Xianming Liu,Jiayi Ma
机构: Harbin Institute of Technology (哈尔滨工业大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unified image fusion aims to integrate complementary information from multi-source images, enhancing image quality through a unified framework applicable to diverse fusion tasks. While treating all fusion tasks as a unified problem facilitates task-invariant knowledge sharing, it often overlooks task-specific characteristics, thereby limiting the overall performance. Existing general image fusion methods incorporate explicit task identification to enable adaptation to different fusion tasks. However, this dependence during inference restricts the model’s generalization to unseen fusion tasks. To address these issues, we propose a novel unified image fusion framework named “TITA”, which dynamically balances both Task-invariant Interaction and Task-specific Adaptation. For task-invariant interaction, we introduce the Interaction-enhanced Pixel Attention (IPA) module to enhance pixel-wise interactions for better multi-source complementary information extraction. For task-specific adaptation, the Operation-based Adaptive Fusion (OAF) module dynamically adjusts operation weights based on task properties. Additionally, we incorporate the Fast Adaptive Multitask Optimization (FAMO) strategy to mitigate the impact of gradient conflicts across tasks during joint training. Extensive experiments demonstrate that TITA not only achieves competitive performance compared to specialized methods across three image fusion scenarios but also exhibits strong generalization to unseen fusion tasks.
zh
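OAF"按任务属性动态调整算子权重"的思想,可以用对候选融合算子做 softmax 加权的标量玩具例子说明(候选算子的选取与权重来源均为演示用假设):

```python
import math

def adaptive_fuse(a, b, logits):
    """对若干候选融合算子(此处以 max、均值、求和为例)
    按 softmax(logits) 加权求和;真实模型中算子作用在特征图上,
    logits 由任务属性动态预测。"""
    ops = [max(a, b), (a + b) / 2, a + b]
    exp = [math.exp(l) for l in logits]
    z = sum(exp)
    return sum(e / z * o for e, o in zip(exp, ops))

# logits 全零时权重相同,结果退化为三个算子输出的平均
print(adaptive_fuse(1.0, 3.0, [0.0, 0.0, 0.0]))  # ≈ 3.0
```

推理时无需显式任务标识:权重由输入自身预测,这正是该框架对未见任务仍可泛化的原因。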

[CV-25] PanoDreamer: Consistent Text to 360-Degree Scene Generation CVPR2025

【速读】:该论文旨在解决从文本描述或参考图像自动生成高质量且几何一致的完整3D场景的问题,尤其是在超出参考图像视场范围较大时,当前方法常面临低质量纹理和不一致3D结构的挑战。为应对这些难题,论文提出了一种名为PanoDreamer的新框架,其关键在于结合大型语言模型与warp-refine(变形-精化)流水线,首先通过文本和图像控制生成一组初始图像并合成全景图,将其提升至3D形成初始点云,随后利用多种方法从不同视角生成一致的附加图像以扩展和精化点云,并最终采用3D高斯点溅射技术构建高质量的几何一致3D场景。

链接: https://arxiv.org/abs/2504.05152
作者: Zhexiao Xiong,Zhang Chen,Zhong Li,Yi Xu,Nathan Jacobs
机构: OPPO US Research Center (OPPO美国研究院); Washington University in St. Louis (圣路易斯华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025 Workshop on Computer Vision for Metaverse

点击查看摘要

Abstract:Automatically generating a complete 3D scene from a text description, a reference image, or both has significant applications in fields like virtual reality and gaming. However, current methods often generate low-quality textures and inconsistent 3D structures. This is especially true when extrapolating significantly beyond the field of view of the reference image. To address these challenges, we propose PanoDreamer, a novel framework for consistent, 3D scene generation with flexible text and image control. Our approach employs a large language model and a warp-refine pipeline, first generating an initial set of images and then compositing them into a 360-degree panorama. This panorama is then lifted into 3D to form an initial point cloud. We then use several approaches to generate additional images, from different viewpoints, that are consistent with the initial point cloud and expand/refine the initial point cloud. Given the resulting set of images, we utilize 3D Gaussian Splatting to create the final 3D scene, which can then be rendered from different viewpoints. Experiments demonstrate the effectiveness of PanoDreamer in generating high-quality, geometrically consistent 3D scenes.
zh
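把 360 度全景"提升"为 3D 点云,核心是等距柱状(equirectangular)投影像素到单位方向向量的映射;配合每个像素的深度即可得到点的位置。下面是该几何关系的示意(与 PanoDreamer 的具体实现无关):

```python
import math

def panorama_ray(u, v, width, height):
    """把等距柱状全景图的像素 (u, v) 映射为单位 3D 方向向量:
    横轴对应经度 [-pi, pi),纵轴对应纬度 [pi/2, -pi/2]。"""
    lon = (u / width) * 2 * math.pi - math.pi
    lat = math.pi / 2 - (v / height) * math.pi
    return (math.cos(lat) * math.sin(lon),
            math.sin(lat),
            math.cos(lat) * math.cos(lon))

# 全景图正中央的像素对应正前方 (0, 0, 1)
print(panorama_ray(50, 25, 100, 50))
```

沿该方向乘以估计深度,即得该像素对应的 3D 点;对全图像素执行便构成初始点云。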

[CV-26] Stereo-LiDAR Fusion by Semi-Global Matching With Discrete Disparity-Matching Cost and Semidensification

【速读】:本文提出了一种实时、免学习(non-learning)的深度估计方法,旨在融合LiDAR数据与立体相机输入以提高深度估计的精度与效率。为实现这一目标,方案的关键在于三个核心技术:结合离散视差匹配成本(Discrete Disparity-matching Cost, DDC)的半全局匹配(Semi-Global Matching, SGM)立体视觉算法、LiDAR视差的半密集化处理以及结合立体图像与LiDAR数据的一致性检查。这些组件均设计为可在GPU上并行化运行,以确保实时性能。在KITTI数据集上的评估表明,该方法的误差率为2.79%,优于先前最先进的实时立体-LiDAR融合方法(3.05%)。此外,实验验证了该方法在不同点云密度、天气条件及室内环境中的高适应性。我们认为,此方法的实时性和免学习特性使其在机器人与自动化领域具有很高的实用价值。

链接: https://arxiv.org/abs/2504.05148
作者: Yasuhiro Yao,Ryoichi Ishikawa,Takeshi Oishi
机构: Institute of Industrial Science, University of Tokyo (东京大学产业技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 8 figures, 7 tables

点击查看摘要

Abstract:We present a real-time, non-learning depth estimation method that fuses Light Detection and Ranging (LiDAR) data with stereo camera input. Our approach comprises three key techniques: Semi-Global Matching (SGM) stereo with Discrete Disparity-matching Cost (DDC), semidensification of LiDAR disparity, and a consistency check that combines stereo images and LiDAR data. Each of these components is designed for parallelization on a GPU to realize real-time performance. When it was evaluated on the KITTI dataset, the proposed method achieved an error rate of 2.79%, outperforming the previous state-of-the-art real-time stereo-LiDAR fusion method, which had an error rate of 3.05%. Furthermore, we tested the proposed method in various scenarios, including different LiDAR point densities, varying weather conditions, and indoor environments, to demonstrate its high adaptability. We believe that the real-time and non-learning nature of our method makes it highly practical for applications in robotics and automation.
zh
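其中"结合立体图像与 LiDAR 数据的一致性检查"与经典的左右一致性检查同属一类思路:某个位置的视差只有在另一来源/另一视角回查相符时才被保留。下面用一维视差演示该检查的骨架(容差与数据均为假设):

```python
def consistency_check(disp_left, disp_right, tol=1):
    """一维简化的左右一致性检查:左图位置 x 的视差 d 应与
    右图对应位置 x - d 处的视差相差不超过 tol,否则判为无效(None)。"""
    out = []
    for x, d in enumerate(disp_left):
        xr = x - d
        ok = 0 <= xr < len(disp_right) and abs(disp_right[xr] - d) <= tol
        out.append(d if ok else None)
    return out

# 前两个位置映射到图像之外,被判无效;第三个位置通过检查
print(consistency_check([2, 2, 2], [2, 2, 2]))  # → [None, None, 2]
```

这种检查的每个位置相互独立,天然适合论文强调的 GPU 并行化。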

[CV-27] EffOWT: Transfer Visual Language Models to Open-World Tracking Efficiently and Effectively

【速读】:该论文旨在解决将视觉语言模型(Visual Language Models, VLMs)迁移到开放世界跟踪任务(Open-World Tracking, OWT)时微调策略面临的两难:全量微调会带来过高的参数和内存开销,而零样本策略则性能不佳。为此,论文提出了EffOWT方法,其关键是在冻结的VLM主干网络之外构建一个小型、独立的可学习侧网络,仅对侧网络执行反向传播,从而满足效率需求。此外,EffOWT通过Transformer与CNN的混合结构增强侧网络,提升模型在OWT领域的性能,并对MLP实施稀疏交互以显著减少参数更新和内存成本。这些方法使EffOWT在未知类别的跟踪指标OWTA上取得5.5%的绝对提升,而相比全量微调仅需更新1.3%的参数,并节省36.4%的内存。

链接: https://arxiv.org/abs/2504.05141
作者: Bingyang Wang,Kaer Huang,Bin Li,Yiqiang Yan,Lihe Zhang,Huchuan Lu,You He
机构: Dalian University of Technology (大连理工大学); Lenovo (联想); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Open-World Tracking (OWT) aims to track every object of any category, which requires the model to have strong generalization capabilities. Trackers can improve their generalization ability by leveraging Visual Language Models (VLMs). However, challenges arise with the fine-tuning strategies when VLMs are transferred to OWT: full fine-tuning results in excessive parameter and memory costs, while the zero-shot strategy leads to sub-optimal performance. To solve the problem, EffOWT is proposed for efficiently transferring VLMs to OWT. Specifically, we build a small and independent learnable side network outside the VLM backbone. By freezing the backbone and only executing backpropagation on the side network, the model’s efficiency requirements can be met. In addition, EffOWT enhances the side network by proposing a hybrid structure of Transformer and CNN to improve the model’s performance in the OWT field. Finally, we implement sparse interactions on the MLP, thus reducing parameter updates and memory costs significantly. Thanks to the proposed methods, EffOWT achieves an absolute gain of 5.5% on the tracking metric OWTA for unknown categories, while only updating 1.3% of the parameters compared to full fine-tuning, with a 36.4% memory saving. Other metrics also demonstrate obvious improvement.
zh

[CV-28] BoxSeg: Quality-Aware and Peer-Assisted Learning for Box-supervised Instance Segmentation

【速读】:该论文旨在解决仅使用边界框(bounding box)标注实现实例分割(instance segmentation)的问题。现有的方法主要通过教师-学生框架获取高质量的伪掩膜(pseudo masks),但这些方法仍面临伪掩膜质量参差不齐、噪声影响显著等问题。为解决上述挑战,论文提出了一种名为BoxSeg的新框架,其关键在于两个创新模块:Quality-Aware Module (QAM) 和 Peer-assisted Copy-paste (PC)。其中,QAM 利用质量感知的多掩膜补全机制提升伪掩膜质量并更有效地评估掩膜质量,从而减轻噪声掩膜的影响;而 PC 模块借鉴 Peer-Assisted Learning 的思想,在高质量伪掩膜的指导下进一步优化低质量掩膜的质量。理论分析与实验结果表明,这两个模块在提升实例分割性能方面具有显著效果,并且能够推广到其他相关模型的改进中。

链接: https://arxiv.org/abs/2504.05137
作者: Jinxiang Lai,Wenlong Wu,Jiawei Zhan,Jian Li,Bin-Bin Gao,Jun Liu,Jie Zhang,Song Guo
机构: HKUST, Hong Kong, China (香港科技大学); DJI, China (大疆创新); Tencent, China (腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Box-supervised instance segmentation methods aim to achieve instance segmentation with only box annotations. Recent methods have demonstrated the effectiveness of acquiring high-quality pseudo masks under the teacher-student framework. Building upon this foundation, we propose a BoxSeg framework involving two novel and general modules named the Quality-Aware Module (QAM) and the Peer-assisted Copy-paste (PC). The QAM obtains high-quality pseudo masks and better measures the mask quality to help reduce the effect of noisy masks, by leveraging the quality-aware multi-mask complementation mechanism. The PC imitates Peer-Assisted Learning to further improve the quality of the low-quality masks with the guidance of the obtained high-quality pseudo masks. Theoretical and experimental analyses demonstrate the proposed QAM and PC are effective. Extensive experimental results show the superiority of our BoxSeg over the state-of-the-art methods, and illustrate the QAM and PC can be applied to improve other models.
zh
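QAM 的"质量感知多掩膜互补"可以理解为按质量加权的逐像素投票:高质量候选掩膜的话语权更大,从而压制噪声掩膜。下面是一个极简示意(掩膜与质量分数均为虚构,真实机制请以原文为准):

```python
def fuse_masks(masks, qualities):
    """逐像素按质量加权投票融合多张候选伪掩膜;
    加权得分达到质量总和一半即判为前景。"""
    half = sum(qualities) / 2
    fused = []
    for pix in zip(*masks):
        score = sum(q * m for m, q in zip(pix, qualities))
        fused.append(1 if score >= half else 0)
    return fused

# 三张 2 像素候选掩膜,质量分别为 0.9 / 0.5 / 0.1
masks = [[1, 0], [1, 1], [0, 1]]
print(fuse_masks(masks, [0.9, 0.5, 0.1]))  # → [1, 0]
```

注意第二个像素:虽有两张掩膜投了前景票,但因其质量低而被否决,体现了"质量感知"的作用。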

[CV-29] DA2Diff: Exploring Degradation-aware Adaptive Diffusion Priors for All-in-One Weather Restoration

【速读】:该论文旨在解决在多种恶劣天气条件下图像复原的挑战,特别是针对复杂且多样化的退化模式,现有统一模型难以有效处理多天气去除的问题。论文的关键创新在于提出了一种名为DA2Diff的新扩散范式,其核心是引入了降质感知的自适应先验。具体而言,通过在CLIP空间中利用可学习的提示(learnable prompts)与图像的相似性约束来捕捉特定于不同天气的降质特征,并将这些提示整合到扩散模型中,以实现对多种天气类型的有效复原。此外,为了进一步增强模型对复杂退化的适应能力,论文设计了一个动态专家选择调制器,利用动态天气感知路由灵活分配不同的复原专家,从而使扩散模型能够自适应地恢复多样化的退化场景。实验结果验证了DA2Diff方法在定量和定性评估中的优越性能。

链接: https://arxiv.org/abs/2504.05135
作者: Jiamei Xiong,Xuefeng Yan,Yongzhen Wang,Wei Zhao,Xiao-Ping Zhang,Mingqiang Wei
机构: School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (南京航空航天大学计算机科学与技术学院); Collaborative Innovation Center of Novel Software Technology and Industrialization (新型软件技术产业化协同创新中心); College of Computer Science and Technology, Anhui University of Technology (安徽工业大学计算机科学与技术学院); Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image restoration under adverse weather conditions is a critical task for many vision-based applications. Recent all-in-one frameworks that handle multiple weather degradations within a unified model have shown potential. However, the diversity of degradation patterns across different weather conditions, as well as the complex and varied nature of real-world degradations, pose significant challenges for multiple weather removal. To address these challenges, we propose an innovative diffusion paradigm with degradation-aware adaptive priors for all-in-one weather restoration, termed DA2Diff. It is a new exploration that applies CLIP to perceive degradation-aware properties for better multi-weather restoration. Specifically, we deploy a set of learnable prompts to capture degradation-aware representations by the prompt-image similarity constraints in the CLIP space. By aligning the snowy/hazy/rainy images with snow/haze/rain prompts, each prompt contributes to different weather degradation characteristics. The learned prompts are then integrated into the diffusion model via the designed weather specific prompt guidance module, making it possible to restore multiple weather types. To further improve the adaptiveness to complex weather degradations, we propose a dynamic expert selection modulator that employs a dynamic weather-aware router to flexibly dispatch varying numbers of restoration experts for each weather-distorted image, allowing the diffusion model to restore diverse degradations adaptively. Experimental results substantiate the favorable performance of DA2Diff over state-of-the-arts in quantitative and qualitative evaluation. Source code will be available after acceptance.
zh
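在 CLIP 空间里"以提示-图像相似度约束学习降质表示",其雏形是用余弦相似度把退化图像匹配到最相近的天气提示。玩具版本如下(嵌入向量与提示名均为假设):

```python
import math

def cosine(a, b):
    """两个向量的余弦相似度。"""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def best_prompt(img_emb, prompt_embs):
    """把图像嵌入路由到余弦相似度最高的天气提示。"""
    return max(prompt_embs, key=lambda k: cosine(img_emb, prompt_embs[k]))

prompts = {"rain": [1.0, 0.1], "snow": [0.0, 1.0]}
print(best_prompt([1.0, 0.0], prompts))  # → rain
```

论文中可学习提示通过这种相似度约束训练后,再经提示引导模块注入扩散模型,并由动态路由分派复原专家。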

[CV-30] Balancing Robustness and Efficiency in Embedded DNNs Through Activation Function Selection

【速读】:该论文试图解决机器学习嵌入式系统在安全关键应用(如航空航天和自动驾驶)中因软错误引起的鲁棒性问题。随着现代电子器件对背景辐射敏感性的增加,深度神经网络(DNNs)对软错误的抗扰性不仅依赖于目标设备技术,还与模型结构以及参数的数值表示和算术精度密切相关。论文指出,压缩技术(如剪枝和量化)虽然可以降低内存占用和计算复杂度,但会改变模型结构和表示方式,从而影响软错误的鲁棒性。此外,激活函数的选择虽常被忽视,但它不仅影响模型的准确性和可训练性,还影响其可压缩性和抗误差能力。

解决方案的关键在于探索使用有界激活函数(bounded activation functions)来增强模型对参数扰动的鲁棒性,同时评估这些激活函数对模型准确性、可压缩性以及计算负载的影响。研究采用了一种与技术无关的方法,并聚焦于为高光谱图像语义分割设计的编码器-解码器卷积模型,具体实验基于AMD-Xilinx的KV260 SoM平台展开。

链接: https://arxiv.org/abs/2504.05119
作者: Jon Gutiérrez Zaballa,Koldo Basterretxea,Javier Echanobe
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Machine learning-based embedded systems for safety-critical applications, such as aerospace and autonomous driving, must be robust to perturbations caused by soft errors. As transistor geometries shrink and voltages decrease, modern electronic devices become more susceptible to background radiation, increasing the concern about failures produced by soft errors. The resilience of deep neural networks (DNNs) to these errors depends not only on target device technology but also on model structure and the numerical representation and arithmetic precision of their parameters. Compression techniques like pruning and quantization, used to reduce memory footprint and computational complexity, alter both model structure and representation, affecting soft error robustness. In this regard, although often overlooked, the choice of activation functions (AFs) impacts not only accuracy and trainability but also compressibility and error resilience. This paper explores the use of bounded AFs to enhance robustness against parameter perturbations, while evaluating their effects on model accuracy, compressibility, and computational load with a technology-agnostic approach. We focus on encoder-decoder convolutional models developed for semantic segmentation of hyperspectral images with application to autonomous driving systems. Experiments are conducted on an AMD-Xilinx’s KV260 SoM.
zh
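有界激活函数之所以可能提升抗软错误能力,直观原因是它对大幅参数/输入扰动会饱和,而无界激活会把扰动原样传给后续层。下面用 tanh 与 ReLU 对比同一扰动经激活后的输出变化(仅为标量层面的直观演示,并非论文的实验设置):

```python
import math

def output_shift(act, x, delta):
    """同一扰动 delta 经激活函数 act 后引起的输出变化量。"""
    return abs(act(x + delta) - act(x))

relu = lambda v: max(0.0, v)

# tanh 有界:位翻转级别的大扰动只引起有界的输出变化;
# ReLU 无界:扰动被线性地传播下去
print(output_shift(math.tanh, 2.0, 100.0))  # < 1
print(output_shift(relu, 2.0, 100.0))       # → 100.0
```

这正是论文考察有界激活函数时的取舍核心:以可能的精度/可压缩性代价换取对参数扰动的鲁棒性。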

[CV-31] ABCDWaveNet: Advancing Robust Road Ponding Detection in Fog through Dynamic Frequency-Spatial Synergy

【速读】:该论文旨在解决雾天条件下路面积水检测的可靠性问题,这是先进驾驶辅助系统(ADAS)面临的一项长期挑战。论文提出了一种名为ABCDWaveNet的新框架,通过动态频率-空间协同作用实现鲁棒的积水检测。其核心解决方案的关键在于结合动态卷积以适应不同能见度下的自适应特征提取,并利用基于小波的模块增强频率-空间特征协同性,显著提升了对抗雾干扰的鲁棒性。在此基础上,ABCDWaveNet进一步捕捉多尺度结构与上下文信息,并采用自适应注意力耦合门(AACG)融合全局与局部特征,从而提高检测精度。这一创新性的动态频率-空间协同机制是解决该问题的核心所在。

链接: https://arxiv.org/abs/2504.05112
作者: Ronghui Zhang,Dakang Lyu,Tengfei Li,Yunfan Wu,Ujjal Manandhar,Benfei Wang,Junzhou Chen,Bolin Gao,Danwei Wang,Yiqiu Tan
机构: Sun Yat-sen University (中山大学); Nanyang Technological University (南洋理工大学); Tsinghua University (清华大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Road ponding presents a significant threat to vehicle safety, particularly in adverse fog conditions, where reliable detection remains a persistent challenge for Advanced Driver Assistance Systems (ADAS). To address this, we propose ABCDWaveNet, a novel deep learning framework leveraging Dynamic Frequency-Spatial Synergy for robust ponding detection in fog. The core of ABCDWaveNet achieves this synergy by integrating dynamic convolution for adaptive feature extraction across varying visibilities with a wavelet-based module for synergistic frequency-spatial feature enhancement, significantly improving robustness against fog interference. Building on this foundation, ABCDWaveNet captures multi-scale structural and contextual information, subsequently employing an Adaptive Attention Coupling Gate (AACG) to adaptively fuse global and local features for enhanced accuracy. To facilitate realistic evaluations under combined adverse conditions, we introduce the Foggy Low-Light Puddle dataset. Extensive experiments demonstrate that ABCDWaveNet establishes new state-of-the-art performance, achieving significant Intersection over Union (IoU) gains of 3.51%, 1.75%, and 1.03% on the Foggy-Puddle, Puddle-1000, and our Foggy Low-Light Puddle datasets, respectively. Furthermore, its processing speed of 25.48 FPS on an NVIDIA Jetson AGX Orin confirms its suitability for ADAS deployment. These findings underscore the effectiveness of the proposed Dynamic Frequency-Spatial Synergy within ABCDWaveNet, offering valuable insights for developing proactive road safety solutions capable of operating reliably in challenging weather conditions.
zh
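文中以 IoU(交并比)衡量积水分割的提升。二值掩膜 IoU 的计算如下(示例掩膜为演示用假设):

```python
def iou(pred, target):
    """IoU = |P∩T| / |P∪T|,输入为展平的 0/1 掩膜列表。"""
    inter = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return inter / union if union else 1.0

# 交集 2 像素、并集 4 像素 → 0.5
print(iou([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

正因 IoU 对边界误差敏感,雾天条件下 1~3.5 个百分点的 IoU 提升已属显著改进。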

[CV-32] Climplicit: Climatic Implicit Embeddings for Global Ecological Tasks ICLR2025

【速读】:该论文旨在解决气候数据深度学习在宏观生态学应用中的可及性问题:存储、计算与技术门槛使深度学习社区之外的科研人员难以采用相关方法。为此,论文提出了名为Climplicit的时空地理位置编码器,经预训练后可为地球上任意位置生成隐式的气候表示。解决方案的关键在于绕过下载原始气候栅格数据与训练特征提取器的需求,将磁盘占用降低约1000倍,并显著降低下游任务的计算需求,从而使深度学习方法更易在宏观生态学研究中推广应用。

链接: https://arxiv.org/abs/2504.05089
作者: Johannes Dollinger,Damien Robert,Elena Plekhanova,Lukas Drees,Jan Dirk Wegner
机构: EcoVision Lab, DM3L, University of Zurich (生态视觉实验室, DM3L, 苏黎世大学); Swiss Federal Research Institute WSL (瑞士联邦森林、雪与景观研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published as a workshop paper at “Tackling Climate Change with Machine Learning”, ICLR 2025

点击查看摘要

Abstract:Deep learning on climatic data holds potential for macroecological applications. However, its adoption remains limited among scientists outside the deep learning community due to storage, compute, and technical expertise barriers. To address this, we introduce Climplicit, a spatio-temporal geolocation encoder pretrained to generate implicit climatic representations anywhere on Earth. By bypassing the need to download raw climatic rasters and train feature extractors, our model uses x1000 fewer disk space and significantly reduces computational needs for downstream tasks. We evaluate our Climplicit embeddings on biomes classification, species distribution modeling, and plant trait regression. We find that linear probing our Climplicit embeddings consistently performs better or on par with training a model from scratch on downstream tasks and overall better than alternative geolocation encoding models.
zh

[CV-33] Content-Distortion High-Order Interaction for Blind Image Quality Assessment

【速读】:该论文旨在解决现有无参考图像质量评估(No-Reference Image Quality Assessment, NR-IQA)方法未能有效捕捉图像内容与失真之间复杂交互关系的问题,这一不足限制了其准确感知图像质量的能力。论文的关键在于提出了一种名为CoDI-IQA(Content-Distortion高阶交互用于NR-IQA)的新方法,通过在分层交互框架内聚合局部失真和全局内容特征,并引入渐进感知交互模块(Progressive Perception Interaction Module, PPIM),显式模拟内容和失真单独及共同对图像质量的影响,实现高阶交互建模。此方案通过内部交互、粗粒度交互和细粒度交互的融合,确保模型能够正确表示潜在的交互模式,并采用多级特征融合策略以维持交互稳定性。

链接: https://arxiv.org/abs/2504.05076
作者: Shuai Liu,Qingyu Mao,Chao Li,Jiacong Chen,Fanyang Meng,Yonghong Tian,Yongsheng Liang
机构: College of Applied Technology, Shenzhen University (深圳大学应用技术学院); College of Big Data and Internet, Shenzhen Technology University (深圳技术大学大数据与互联网学院); College of Electronic and Information Engineering, Shenzhen University (深圳大学电子与信息工程学院); School of Electronics and Information Engineering, Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳)电子与信息工程学院); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages (main text: 14 pages + appendix: 5 pages), 9 figures, 23 tables. In submission

点击查看摘要

Abstract:The content and distortion are widely recognized as the two primary factors affecting the visual quality of an image. While existing No-Reference Image Quality Assessment (NR-IQA) methods have modeled these factors, they fail to capture the complex interactions between content and distortions. This shortfall impairs their ability to accurately perceive quality. To confront this, we analyze the key properties required for interaction modeling and propose a robust NR-IQA approach termed CoDI-IQA (Content-Distortion high-order Interaction for NR-IQA), which aggregates local distortion and global content features within a hierarchical interaction framework. Specifically, a Progressive Perception Interaction Module (PPIM) is proposed to explicitly simulate how content and distortions independently and jointly influence image quality. By integrating internal interaction, coarse interaction, and fine interaction, it achieves high-order interaction modeling that allows the model to properly represent the underlying interaction patterns. To ensure sufficient interaction, multiple PPIMs are employed to hierarchically fuse multi-level content and distortion features at different granularities. We also tailor a training strategy suited for CoDI-IQA to maintain interaction stability. Extensive experiments demonstrate that the proposed method notably outperforms the state-of-the-art methods in terms of prediction accuracy, data efficiency, and generalization ability.
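以下 NumPy 草图仅示意“内容-失真高阶交互”的基本思想(内部交互=逐元素自门控,粗粒度交互=外积二阶项,细粒度交互=相似度加权),并非论文 PPIM 模块的真实实现;特征与读出权重均为随机假设:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8
content = rng.normal(size=d)            # 全局内容特征(假设已由主干网络提取)
distort = rng.normal(size=d)            # 局部失真特征(假设)

c_in = content * np.tanh(content)       # 内部交互:各自的逐元素自门控
d_in = distort * np.tanh(distort)

coarse = np.outer(c_in, d_in).ravel()   # 粗粒度交互:外积捕捉二阶(高阶)交互项
fine = softmax(c_in * d_in) * d_in      # 细粒度交互:逐元素相似度加权失真特征

feat = np.concatenate([coarse, fine])   # (72,)
quality = feat @ rng.normal(size=feat.size)   # 线性读出 -> 质量分数(示意)
print(feat.shape, float(quality))
```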
zh

[CV-34] PvNeXt: Rethinking Network Design and Temporal Motion for Point Cloud Video Recognition ICLR2025

【速读】:该论文旨在解决点云视频感知任务中现有4D表示学习方法存在的计算冗余问题。这些方法通常依赖于迭代处理和密集查询操作,虽然能够有效捕捉时间特征,但导致了显著的计算开销。为了解决这一问题,论文提出了一种名为PvNeXt的框架,其关键在于通过个性化的一次性查询操作实现高效且有效的点云视频识别。具体而言,PvNeXt包含两个核心模块:Motion Imitator(运动模仿器)和Single-Step Motion Encoder(单步运动编码器)。Motion Imitator用于捕获点云序列中的时间动态特性,并生成与每一帧对应的虚拟运动;而Single-Step Motion Encoder执行一次性查询操作,将每帧点云与其对应的虚拟运动帧关联起来,从而从点云序列中提取运动线索并捕捉整个序列的时间动态。通过这两个模块的结合,PvNeXt实现了针对每一帧的个性化一次性查询,从而避免了帧特定的循环和密集查询过程。多项基准实验验证了所提方法的有效性。

链接: https://arxiv.org/abs/2504.05075
作者: Jie Wang,Tingfa Xu,Lihe Ding,Xinjie Zhang,Long Bai,Jianan Li
机构: Beijing Institute of Technology (北京理工大学); The Chinese University of Hong Kong (香港中文大学); Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2025

点击查看摘要

Abstract:Point cloud video perception has become an essential task for the realm of 3D vision. Current 4D representation learning techniques typically engage in iterative processing coupled with dense query operations. Although effective in capturing temporal features, this approach leads to substantial computational redundancy. In this work, we propose a framework, named as PvNeXt, for effective yet efficient point cloud video recognition, via personalized one-shot query operation. Specifically, PvNeXt consists of two key modules, the Motion Imitator and the Single-Step Motion Encoder. The former module, the Motion Imitator, is designed to capture the temporal dynamics inherent in sequences of point clouds, thus generating the virtual motion corresponding to each frame. The Single-Step Motion Encoder performs a one-step query operation, associating point cloud of each frame with its corresponding virtual motion frame, thereby extracting motion cues from point cloud sequences and capturing temporal dynamics across the entire sequence. Through the integration of these two modules, PvNeXt enables personalized one-shot queries for each frame, effectively eliminating the need for frame-specific looping and intensive query processes. Extensive experiments on multiple benchmarks demonstrate the effectiveness of our method.
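摘要中“单步查询”的核心是:以每帧点特征为 query、对应的虚拟运动帧为 key/value,做一次前馈交叉注意力即可提取运动线索,而无需逐帧迭代循环。下面是一个假设性的 NumPy 草图(点特征为随机数,虚拟运动帧用加噪特征代替真实的运动模仿器输出):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, N, d = 4, 32, 16                      # 帧数、每帧点数、特征维度(示意)
frames = rng.normal(size=(T, N, d))      # 各帧点云特征(假设)
# "虚拟运动帧":真实方法由运动模仿器生成,此处用加噪特征占位
virtual_motion = frames + 0.1 * rng.normal(size=(T, N, d))

# 单步查询:每帧做一次交叉注意力,无迭代细化
out = np.empty_like(frames)
for t in range(T):
    attn = softmax(frames[t] @ virtual_motion[t].T / np.sqrt(d))  # (N, N)
    out[t] = attn @ virtual_motion[t]
print(out.shape)  # (4, 32, 16)
```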
zh

[CV-35] LDGNet: A Lightweight Difference Guiding Network for Remote Sensing Change Detection

【速读】:该论文旨在解决现有变化检测(Change Detection, CD)方法在追求更高精度的同时,计算成本和参数规模增加导致难以实现实时轻量级处理的问题。论文的关键在于提出了一种轻量级差异引导网络(LDGNet),通过利用绝对差异图像引导光学遥感变化检测。解决方案的核心包括两个创新模块:首先,提出了差异引导模块(Difference Guiding Module, DGM),通过多尺度特征增强轻量级主干网络的特征表示能力;其次,设计了基于视觉状态空间模型(Visual State Space Model, VSSM)的差异感知动态融合模块(Difference-Aware Dynamic Fusion, DADF),用于轻量级长距离依赖建模,同时强化变化语义提取并抑制噪声和背景干扰。这些设计使方法在保持极低计算成本(仅3.43M参数和1.12G浮点运算)的情况下,达到了与当前主流方法相当或更优的性能。

链接: https://arxiv.org/abs/2504.05062
作者: Chenfeng Xu
机构: China University of Mining and Technology (中国矿业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rapid advancement of deep learning, the field of change detection (CD) in remote sensing imagery has achieved remarkable progress. Existing change detection methods primarily focus on achieving higher accuracy with increased computational costs and parameter sizes, leaving the development of lightweight methods for rapid real-world processing an underexplored challenge. To address this challenge, we propose a Lightweight Difference Guiding Network (LDGNet), leveraging absolute difference image to guide optical remote sensing change detection. First, to enhance the feature representation capability of the lightweight backbone network, we propose the Difference Guiding Module (DGM), which leverages multi-scale features extracted from the absolute difference image to progressively influence the original image encoder at each layer, thereby reinforcing feature extraction. Second, we propose the Difference-Aware Dynamic Fusion (DADF) module with Visual State Space Model (VSSM) for lightweight long-range dependency modeling. The module first uses feature absolute differences to guide VSSM’s global contextual modeling of change regions, then employs difference attention to dynamically fuse these long-range features with feature differences, enhancing change semantics while suppressing noise and background. Extensive experiments on multiple datasets demonstrate that our method achieves comparable or superior performance to current state-of-the-art (SOTA) methods requiring several times more computation, while maintaining only 3.43M parameters and 1.12G FLOPs.
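下面用 NumPy 草图示意“绝对差异图引导”的最小流程:计算双时相影像的绝对差异图,构造多尺度版本,并以 1+差异 作为门控逐级增强编码特征。编码器特征在此直接用影像及其下采样代替,属示意性假设:

```python
import numpy as np

def avg_pool2(x):
    # 2x2 平均池化,用于构造多尺度差异图
    h, w = x.shape
    return x[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
t1 = rng.random((8, 8))                  # 时相1影像(单波段示意)
t2 = t1.copy()
t2[2:5, 2:5] += 0.8                      # 时相2:人为加入"变化区域"

diff = np.abs(t1 - t2)                   # 绝对差异图
feats = [t2, avg_pool2(t2)]              # 假设的两级编码器特征(此处直接用影像代替)
guides = [diff, avg_pool2(diff)]
# DGM 思想示意:差异图作为门控逐级加权编码特征,放大变化区域响应
guided = [f * (1 + g) for f, g in zip(feats, guides)]
print(guided[0].shape, guided[1].shape)
```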
zh

[CV-36] CMaP-SAM: Contraction Mapping Prior for SAM-driven Few-shot Segmentation

【速读】:该论文致力于解决Few-shot Segmentation (FSS) 中两个关键问题:一是未能充分利用查询图像中的结构相关性;二是将连续位置先验转换为离散点提示时存在显著信息损失。为应对这些挑战,论文提出了一种名为CMaP-SAM的新框架,其核心解决方案包括三个关键组件:(1) 收缩映射模块,通过像素级结构相似性优化位置先验,将其形式化为具有收敛保证的Banach收缩映射,从而生成保留参考图像语义引导和查询图像结构相关性的收敛先验;(2) 自适应分布对齐模块,连接连续先验与SAM的二值掩码提示编码器;(3) 前景-背景解耦细化架构,用于生成精确的最终分割掩码。实验结果表明,CMaP-SAM在PASCAL-5^i数据集上达到71.1 mIoU,在COCO-20^i数据集上达到56.1 mIoU,实现了最先进的性能。

链接: https://arxiv.org/abs/2504.05049
作者: Shuai Chen,Fanman Meng,Haoran Wei,Chenhao Wu,Qingbo Wu,Linfeng Xu,Hongliang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 figures

点击查看摘要

Abstract:Few-shot segmentation (FSS) aims to segment new classes using few annotated images. While recent FSS methods have shown considerable improvements by leveraging Segment Anything Model (SAM), they face two critical limitations: insufficient utilization of structural correlations in query images, and significant information loss when converting continuous position priors to discrete point prompts. To address these challenges, we propose CMaP-SAM, a novel framework that introduces contraction mapping theory to optimize position priors for SAM-driven few-shot segmentation. CMaP-SAM consists of three key components: (1) a contraction mapping module that formulates position prior optimization as a Banach contraction mapping with convergence guarantees. This module iteratively refines position priors through pixel-wise structural similarity, generating a converged prior that preserves both semantic guidance from reference images and structural correlations in query images; (2) an adaptive distribution alignment module bridging continuous priors with SAM’s binary mask prompt encoder; and (3) a foreground-background decoupled refinement architecture producing accurate final segmentation masks. Extensive experiments demonstrate CMaP-SAM’s effectiveness, achieving state-of-the-art performance with 71.1 mIoU on PASCAL-5^i and 56.1 mIoU on COCO-20^i datasets.
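摘要中的收缩映射模块可以抽象为一个带收敛保证的不动点迭代:取行随机的像素级结构相似度矩阵 S 与压缩系数 alpha<1,映射 T(p) = alpha*S@p + (1-alpha)*p0 在无穷范数下是 Banach 收缩,因而迭代必收敛到唯一不动点。以下为假设性的 NumPy 数值验证(S 与初始先验 p0 均为随机构造,并非论文的真实相似度定义):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16                                        # 查询图像展平后的像素数(示意)
S = rng.random((n, n))
S /= S.sum(axis=1, keepdims=True)             # 行随机 -> 无穷范数算子范数为 1
p0 = rng.random(n)                            # 来自参考图像的初始位置先验(假设)

alpha = 0.6                                   # 压缩系数 < 1,保证 Banach 收缩

def T(p):
    # 收缩映射:在结构相似度传播与初始先验之间插值
    return alpha * (S @ p) + (1 - alpha) * p0

p = p0.copy()
for it in range(100):
    prev, p = p, T(p)
    if np.linalg.norm(p - prev) < 1e-10:      # 相邻迭代几乎不变 -> 已收敛
        break
print(it)
```

收敛后的 p 即为兼顾参考先验与查询图结构相关性的细化位置先验,可再离散化为 SAM 的提示输入。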
zh

[CV-37] MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond

【速读】:该论文旨在解决现有运动捕捉(MoCap)方法主要关注视觉相似性而忽视物理逼真的问题,导致下游任务如驱动虚拟人在3D场景或人形机器人在现实世界中出现时间漂移、抖动、滑动、穿透等现象,并且全局轨迹准确性较差。论文从人体与物理世界相互作用的角度重新审视了运动捕捉,并探索了压力信号的作用。

解决方案的关键在于构建了一个包含压力、RGB和光学传感器的大规模运动捕捉数据集MotionPRO,以及提出了一种结合小核解码器和长短时记忆注意力模块的网络来利用压力信号进行姿态和轨迹估计。此外,通过在相机轴上施加正交相似性和在垂直轴上施加全身接触约束来增强跨注意策略,以融合压力和RGB特征图。实验表明,融合压力信号不仅显著提高了客观指标的性能,还能够在3D场景中合理驱动虚拟人类(SMPL),并且将物理感知引入人形机器人可以使其执行更精确和稳定的动作,这对具身人工智能的发展具有重要意义。

链接: https://arxiv.org/abs/2504.05046
作者: Shenghao Ren,Yi Lu,Jiayi Huang,Jiayi Zhao,He Zhang,Tao Yu,Qiu Shen,Xun Cao
机构: School of Electronic Science and Engineering, Nanjing University (南京大学), China; Key Laboratory of Optoelectronic Devices and Systems with Extreme Performances of MOE, Nanjing University (南京大学), China; BNRist, Tsinghua University (清华大学), Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing human Motion Capture (MoCap) methods mostly focus on the visual similarity while neglecting the physical plausibility. As a result, downstream tasks such as driving virtual human in 3D scene or humanoid robots in real world suffer from issues such as timing drift and jitter, spatial problems like sliding and penetration, and poor global trajectory accuracy. In this paper, we revisit human MoCap from the perspective of interaction between human body and physical world by exploring the role of pressure. Firstly, we construct a large-scale human Motion capture dataset with Pressure, RGB and Optical sensors (named MotionPRO), which comprises 70 volunteers performing 400 types of motion, encompassing a total of 12.4M pose frames. Secondly, we examine both the necessity and effectiveness of the pressure signal through two challenging tasks: (1) pose and trajectory estimation based solely on pressure: We propose a network that incorporates a small kernel decoder and a long-short-term attention module, and prove that pressure could provide accurate global trajectory and plausible lower body pose. (2) pose and trajectory estimation by fusing pressure and RGB: We impose constraints on orthographic similarity along the camera axis and whole-body contact along the vertical axis to enhance the cross-attention strategy to fuse pressure and RGB feature maps. Experiments demonstrate that fusing pressure with RGB features not only significantly improves performance in terms of objective metrics, but also plausibly drives virtual humans (SMPL) in 3D scene. Furthermore, we demonstrate that incorporating physical perception enables humanoid robots to perform more precise and stable actions, which is highly beneficial for the development of embodied artificial intelligence. Project page is available at: this https URL
zh

[CV-38] InstructionBench: An Instructional Video Understanding Benchmark

【速读】:该论文试图解决视频大型语言模型(Video-LLMs)在理解教学视频方面的不足,特别是在增强教学内容可访问性方面的重要研究空白。论文的关键解决方案是引入InstructionBench,这是一个针对教学视频理解设计的基准测试集。InstructionBench挑战模型在具有严格步骤流程的教学视频中的高级时间推理能力,并通过GPT-4生成开放问答和多项选择题对粗粒度事件级和细粒度对象级推理进行评估。论文采用过滤策略排除仅依赖常识知识的问题,专注于视觉感知与分析,最终构建了一个包含来自700多段视频的5000个问题的数据集。此外,作者开发了一个包含近2.5k视频和超过19k问答对的教学视频数据集,利用自动化数据生成框架,以丰富社区的研究资源。

链接: https://arxiv.org/abs/2504.05040
作者: Haiwan Wei,Yitian Yuan,Xiaohan Lan,Wei Ke,Lin Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite progress in video large language models (Video-LLMs), research on instructional video understanding, crucial for enhancing access to instructional content, remains insufficient. To address this, we introduce InstructionBench, an Instructional video understanding Benchmark, which challenges models’ advanced temporal reasoning within instructional videos characterized by their strict step-by-step flow. Employing GPT-4, we formulate Q&A pairs in open-ended and multiple-choice formats to assess both Coarse-Grained event-level and Fine-Grained object-level reasoning. Our filtering strategies exclude questions answerable purely by common-sense knowledge, focusing on visual perception and analysis when evaluating Video-LLM models. The benchmark finally contains 5k questions across over 700 videos. We evaluate the latest Video-LLMs on our InstructionBench, finding that closed-source models outperform open-source ones. However, even the best model, GPT-4o, achieves only 53.42% accuracy, indicating significant gaps in temporal reasoning. To advance the field, we also develop a comprehensive instructional video dataset with over 19k Q&A pairs from nearly 2.5k videos, using an automated data generation framework, thereby enriching the community’s research resources.
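这类基准的评测通常归结为按问题粒度分组统计多选题准确率,下面是一个与具体模型无关的纯 Python 示意(样例数据为虚构):

```python
from collections import defaultdict

# 虚构样例:level 区分粗粒度(事件级)与细粒度(对象级)问题
samples = [
    {"level": "coarse", "gold": "A", "pred": "A"},
    {"level": "coarse", "gold": "B", "pred": "C"},
    {"level": "fine",   "gold": "D", "pred": "D"},
    {"level": "fine",   "gold": "A", "pred": "A"},
]

hit, tot = defaultdict(int), defaultdict(int)
for s in samples:
    tot[s["level"]] += 1
    hit[s["level"]] += s["pred"] == s["gold"]   # bool 作为 0/1 累加

acc = {k: hit[k] / tot[k] for k in tot}         # 分组准确率
print(acc)  # {'coarse': 0.5, 'fine': 1.0}
```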
zh

[CV-39] CloSE: A Compact Shape- and Orientation-Agnostic Cloth State Representation

【速读】:本文旨在解决布料操作中的难题,主要由于布料的非刚性特性,使得对其变形的良好表示变得至关重要。论文提出了一种新的布料变形状态表示方法。首先,基于计算出的拓扑指数提出了dGLI圆盘表示法,这些指数针对布料网格边界的边缘段,并排列在一个圆形网格上。dGLI圆盘的热图揭示了与布料状态相关的特征模式,这些特征对于不同形状、大小或位置的布料(如角落和折叠位置)具有一致性。随后,将这些重要特征从dGLI圆盘抽象到一个圆上,称为布料状态表示(CloSE)。此表示方法紧凑、连续且适用于不同形状。最后,在语义标注和高低层次规划两个相关应用中展示了该表示方法的优势。关键在于创新性地设计了dGLI圆盘表示及其进一步抽象得到的CloSE表示,以有效捕捉布料的关键状态特征。

链接: https://arxiv.org/abs/2504.05033
作者: Jay Kamat,Júlia Borràs,Carme Torras
机构: Institut de Robòtica i Informàtica Industrial, CSIC-UPC (机器人与工业信息研究所, CSIC-UPC)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cloth manipulation is a difficult problem mainly because of the non-rigid nature of cloth, which makes a good representation of deformation essential. We present a new representation for the deformation-state of clothes. First, we propose the dGLI disk representation, based on topological indices computed for segments on the edges of the cloth mesh border that are arranged on a circular grid. The heat-map of the dGLI disk uncovers patterns that correspond to features of the cloth state that are consistent for different shapes, sizes, or positions of the cloth, like the corners and the fold locations. We then abstract these important features from the dGLI disk onto a circle, calling it the Cloth StatE representation (CloSE). This representation is compact, continuous, and general for different shapes. Finally, we show the strengths of this representation in two relevant applications: semantic labeling and high- and low-level planning. The code, the dataset and the video can be accessed from: this https URL
zh

[CV-40] AsyReC: A Multimodal Graph-based Framework for Spatio-Temporal Asymmetric Dyadic Relationship Classification

【速读】:本文针对二元社会关系建模中的三个主要挑战:(1) 无法有效建模不对称关系(如一方视对方为朋友而另一方仅视为熟人),(2) 离散帧采样破坏连续交互导致时间连续性中断,以及(3) 忽略周期性行为线索(如有节奏的声音或重复手势)在推断二元关系演化中的重要作用。为解决这些问题,论文提出了AsyReC框架,其核心创新包括:(i) 带节点-边双注意力机制的三元图神经网络,动态加权多模态线索以捕捉交互的不对称性;(ii) 基于片段级关系学习的架构,保留时间连续性以实现真实场景交互动力学的细粒度建模;(iii) 周期性时间编码器,将时间索引投影到正弦/余弦波形以建模重复行为模式。实验结果表明,这些创新显著提升了二元关系分类的鲁棒性,尤其验证了不对称交互建模和周期性时间编码的关键作用。

链接: https://arxiv.org/abs/2504.05030
作者: Wang Tang,Fethiye Irmak Dogan,Linbo Qing,Hatice Gunes
机构: AFAR Lab, Department of Computer Science and Technology, University of Cambridge (剑桥大学); College of Electronics and Information Engineering, Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Dyadic social relationships, which refer to relationships between two individuals who know each other through repeated interactions (or not), are shaped by shared spatial and temporal experiences. Current computational methods for modeling these relationships face three major challenges: (1) the failure to model asymmetric relationships, e.g., one individual may perceive the other as a friend while the other perceives them as an acquaintance, (2) the disruption of continuous interactions by discrete frame sampling, which segments the temporal continuity of interaction in real-world scenarios, and (3) the limitation to consider periodic behavioral cues, such as rhythmic vocalizations or recurrent gestures, which are crucial for inferring the evolution of dyadic relationships. To address these challenges, we propose AsyReC, a multimodal graph-based framework for asymmetric dyadic relationship classification, with three core innovations: (i) a triplet graph neural network with node-edge dual attention that dynamically weights multimodal cues to capture interaction asymmetries (addressing challenge 1); (ii) a clip-level relationship learning architecture that preserves temporal continuity, enabling fine-grained modeling of real-world interaction dynamics (addressing challenge 2); and (iii) a periodic temporal encoder that projects time indices onto sine/cosine waveforms to model recurrent behavioral patterns (addressing challenge 3). Extensive experiments on two public datasets demonstrate state-of-the-art performance, while ablation studies validate the critical role of asymmetric interaction modeling and periodic temporal encoding in improving the robustness of dyadic relationship classification in real-world scenarios. Our code is publicly available at: this https URL.
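摘要中的周期性时间编码器可以用标准的正弦/余弦投影来示意:把时间索引映射到若干固定周期的 sin/cos 波形上,使间隔为整数周期的时刻获得相同编码,从而显式表达重复性行为线索。周期取值 (4, 8, 16, 32) 为假设,仅作演示:

```python
import numpy as np

def periodic_encoding(t, periods=(4, 8, 16, 32)):
    # 将时间索引投影到多组正弦/余弦波形上,显式编码重复性行为模式
    t = np.asarray(t, dtype=float)
    feats = []
    for p in periods:
        feats.append(np.sin(2 * np.pi * t / p))
        feats.append(np.cos(2 * np.pi * t / p))
    return np.stack(feats, axis=-1)      # (T, 2 * len(periods))

enc = periodic_encoding(np.arange(64))
print(enc.shape)  # (64, 8)
```

相隔 32 帧(所有假设周期的公倍数)的两个时刻编码完全一致,这正是模型捕捉节律性声音或重复手势所依赖的性质。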
zh

[CV-41] RS-RAG : Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model

【速读】:该论文旨在解决现有遥感领域视觉语言模型(Remote Sensing Vision-Language Models, RSVLMs)在复杂或上下文依赖性查询中的语义推理能力不足的问题。具体而言,现有的方法通常局限于封闭集场景理解,侧重于通用场景描述,缺乏整合外部知识的能力,这限制了其处理涉及领域特定知识或世界知识的任务的能力。为了解决这些问题,论文提出了一个关键方案:首先构建了一个多模态遥感世界知识(Remote Sensing World Knowledge, RSWK)数据集,包含来自175个国家的14,141个知名地标高分辨率卫星图像及其详细文本描述,融合了遥感领域的专业知识与更广泛的世界知识;在此基础上,提出了一种新颖的遥感检索增强生成框架(Remote Sensing Retrieval-Augmented Generation, RS-RAG)。该框架的核心包括两个关键组件:多模态知识向量数据库构建模块,用于将遥感图像及其关联的文本知识编码到统一的向量空间中;以及知识检索与响应生成模块,通过检索和重新排序相关知识,并将其融入知识增强提示中以指导视觉语言模型生成上下文相关的响应。实验验证表明,RS-RAG 在图像描述、图像分类和视觉问答三项代表性任务中显著优于当前最先进的基线方法。

链接: https://arxiv.org/abs/2504.04988
作者: Congcong Wen,Yiting Lin,Xiaokang Qu,Nan Li,Yong Liao,Hui Lin,Xiang Li
机构: School of Cyber Science and Technology, University of Science and Technology of China (中国科学技术大学网络科学与技术学院); Department of Electrical and Computer Engineering, New York University Abu Dhabi (纽约大学阿布扎比分校电气与计算机工程系); China Academy of Electronics and Information Technology (中国电子科技集团公司); Department of Computer Science at the University of Reading (英国雷丁大学计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent progress in VLMs has demonstrated impressive capabilities across a variety of tasks in the natural image domain. Motivated by these advancements, the remote sensing community has begun to adopt VLMs for remote sensing vision-language tasks, including scene understanding, image captioning, and visual question answering. However, existing remote sensing VLMs typically rely on closed-set scene understanding and focus on generic scene descriptions, yet lack the ability to incorporate external knowledge. This limitation hinders their capacity for semantic reasoning over complex or context-dependent queries that involve domain-specific or world knowledge. To address these challenges, we first introduced a multimodal Remote Sensing World Knowledge (RSWK) dataset, which comprises high-resolution satellite imagery and detailed textual descriptions for 14,141 well-known landmarks from 175 countries, integrating both remote sensing domain knowledge and broader world knowledge. Building upon this dataset, we proposed a novel Remote Sensing Retrieval-Augmented Generation (RS-RAG) framework, which consists of two key components. The Multi-Modal Knowledge Vector Database Construction module encodes remote sensing imagery and associated textual knowledge into a unified vector space. The Knowledge Retrieval and Response Generation module retrieves and re-ranks relevant knowledge based on image and/or text queries, and incorporates the retrieved content into a knowledge-augmented prompt to guide the VLM in producing contextually grounded responses. We validated the effectiveness of our approach on three representative vision-language tasks, including image captioning, image classification, and visual question answering, where RS-RAG significantly outperformed state-of-the-art baselines.
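RS-RAG 的检索环节本质上是统一向量空间中的相似度检索加提示拼接。下面用 NumPy 给出一个假设性草图:知识向量与查询向量均为随机构造(真实系统由多模态编码器产生),检索后将命中的知识文本拼入知识增强提示:

```python
import numpy as np

rng = np.random.default_rng(0)

# 假设性知识库:3 条地标知识及其统一向量表示(真实系统由多模态编码器产生)
kb_texts = ["Eiffel Tower, Paris ...", "Great Wall, China ...", "Sydney Opera House ..."]
kb_vecs = rng.normal(size=(3, 32))
kb_vecs /= np.linalg.norm(kb_vecs, axis=1, keepdims=True)

# 构造一个与第 2 条知识相近的查询向量(加噪后归一化)
query_vec = kb_vecs[1] + 0.1 * rng.normal(size=32)
query_vec /= np.linalg.norm(query_vec)

scores = kb_vecs @ query_vec                  # 余弦相似度检索
top = int(np.argmax(scores))
# 将检索到的知识拼入提示,引导 VLM 生成有依据的回答
prompt = f"[检索到的知识] {kb_texts[top]}\n[问题] 这张遥感影像中的地标是什么?"
print(top)
```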
zh

[CV-42] DiCoTTA: Domain-invariant Learning for Continual Test-time Adaptation

【速读】:该论文致力于解决连续测试时适应(Continual Test-Time Adaptation, CTTA)的问题,即在测试过程中模型需要适应不断变化的未知领域的同时,还需保留之前学到的知识。现有的CTTA方法大多仅关注于当前测试领域的适应,而忽视了对未来可能遇到的任意测试领域的泛化能力。为了解决这一局限性,论文提出了一个名为DiCoTTA的新型在线域不变学习框架。DiCoTTA的关键在于它能够在测试过程中动态地学习特征表示,使其对于当前以及之前的测试领域都具有域不变性。为此,论文设计了一种新的模型架构和一种专门用于学习域不变特征的测试时适应策略,同时提出了一种新的数据结构和优化算法,以有效地管理来自先前测试领域的信息。实验结果显示,DiCoTTA在四个公开的CTTA基准数据集上达到了最先进的性能,并且在未见的测试领域上表现出色的泛化能力。

链接: https://arxiv.org/abs/2504.04981
作者: Sohyun Lee,Nayeong Kim,Juwon Kang,Seong Joon Oh,Suha Kwak
机构: POSTECH; University of Tübingen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper studies continual test-time adaptation (CTTA), the task of adapting a model to constantly changing unseen domains in testing while preserving previously learned knowledge. Existing CTTA methods mostly focus on adaptation to the current test domain only, overlooking generalization to arbitrary test domains a model may face in the future. To tackle this limitation, we present a novel online domain-invariant learning framework for CTTA, dubbed DiCoTTA. DiCoTTA aims to learn feature representation to be invariant to both current and previous test domains on the fly during testing. To this end, we propose a new model architecture and a test-time adaptation strategy dedicated to learning domain-invariant features without corrupting semantic contents, along with a new data structure and optimization algorithm for effectively managing information from previous test domains. DiCoTTA achieved state-of-the-art performance on four public CTTA benchmarks. Moreover, it showed superior generalization to unseen test domains.
zh

[CV-43] REWIND: Real-Time Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning CVPR2025

【速读】:该论文旨在解决从第一人称视角图像输入实时估计高保真人体全身运动(包括身体与手部动作)的问题。现有方法因基于扩散模型的迭代细化过程难以实现因果性和实时性,无法同时捕捉身体与手部姿态之间的相关性。为解决此问题,论文提出的关键方案包括:(1) 级联的身体-手部去噪扩散模块,以快速前馈方式有效建模第一人称身体与手部运动的相关性;(2) 扩散蒸馏技术,通过单一步骤的去噪实现高质量运动估计。此外,模型采用改进的Transformer架构,支持因果建模输出运动的同时增强对未知运动长度的泛化能力。同时,当身份先验信息可用时,论文还提出了基于目标身份少量姿态示例的新颖身份条件方法,进一步提升运动估计质量。实验表明,REWIND 在有无基于示例的身份条件情况下均显著优于现有基线方法。

链接: https://arxiv.org/abs/2504.04956
作者: Jihyun Lee,Weipeng Xu,Alexander Richard,Shih-En Wei,Shunsuke Saito,Shaojie Bai,Te-Li Wang,Minhyuk Sung,Tae-Kyun(T-K)Kim,Jason Saragih
机构: Codec Avatars Lab, Meta (Meta); KAIST (韩国科学技术院); Imperial College London (帝国理工学院)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025, project page: this https URL

点击查看摘要

Abstract:We present REWIND (Real-Time Egocentric Whole-Body Motion Diffusion), a one-step diffusion model for real-time, high-fidelity human motion estimation from egocentric image inputs. While an existing method for egocentric whole-body (i.e., body and hands) motion estimation is non-real-time and acausal due to diffusion-based iterative motion refinement to capture correlations between body and hand poses, REWIND operates in a fully causal and real-time manner. To enable real-time inference, we introduce (1) cascaded body-hand denoising diffusion, which effectively models the correlation between egocentric body and hand motions in a fast, feed-forward manner, and (2) diffusion distillation, which enables high-quality motion estimation with a single denoising step. Our denoising diffusion model is based on a modified Transformer architecture, designed to causally model output motions while enhancing generalizability to unseen motion lengths. Additionally, REWIND optionally supports identity-conditioned motion estimation when identity prior is available. To this end, we propose a novel identity conditioning method based on a small set of pose exemplars of the target identity, which further enhances motion estimation quality. Through extensive experiments, we demonstrate that REWIND significantly outperforms the existing baselines both with and without exemplar-based identity conditioning.
zh

[CV-44] A Taxonomy of Self-Handover DATE

【速读】:该论文试图解决自手交接(self-handover)这一在双手中协调动作中的研究空白问题。尽管自手交接在复杂任务中促进无缝过渡,但其背后的执行策略尚未被充分探索。论文的关键解决方案在于引入了首个系统的自手交接分类法,并通过对手工标注的超过12小时烹饪活动数据的分析,揭示了自手交接不仅是被动的过渡,而是涉及双手前馈调整的高度协调动作。此外,论文展示了利用先进的视觉-语言模型对自手交接类型进行分类的可行性,为人类操作行为的自动化分析提供了新方法,同时强调了自手交接在实现平滑任务转换中的作用,这对适应性双臂机器人能力的发展至关重要。

链接: https://arxiv.org/abs/2504.04939
作者: Naoki Wake,Atsushi Kanehira,Kazuhiro Sasabuchi,Jun Takamatsu,Katsushi Ikeuchi
机构: Microsoft (微软)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures, 1 table, Last updated on April 7th, 2025

点击查看摘要

Abstract:Self-handover, transferring an object between one’s own hands, is a common but understudied bimanual action. While it facilitates seamless transitions in complex tasks, the strategies underlying its execution remain largely unexplored. Here, we introduce the first systematic taxonomy of self-handover, derived from manual annotation of over 12 hours of cooking activity performed by 21 participants. Our analysis reveals that self-handover is not merely a passive transition, but a highly coordinated action involving anticipatory adjustments by both hands. As a step toward automated analysis of human manipulation, we further demonstrate the feasibility of classifying self-handover types using a state-of-the-art vision-language model. These findings offer fresh insights into bimanual coordination, underscoring the role of self-handover in enabling smooth task transitions, an ability essential for adaptive dual-arm robotics.
zh

[CV-45] RCCFormer: A Robust Crowd Counting Network Based on Transformer

【速读】:该论文旨在解决人群计数任务中因尺度变化和复杂背景导致的精度下降问题。解决方案的关键在于提出了一种基于 Transformer 的鲁棒性人群计数网络 RCCFormer,其设计专注于背景抑制与尺度感知能力。具体而言,RCCFormer 引入了多级特征融合模块(Multi-level Feature Fusion Module, MFFM),通过精妙整合主干网络不同阶段提取的特征,构建了一个能够捕捉复杂且全面特征表示的强大基础模型;同时,引入细节嵌入注意力块(Detail-Embedded Attention Block, DEAB),利用全局自注意力与局部注意力结合可学习的方式高效融合上下文信息与局部细节,增强模型聚焦前景区域的能力并有效减少背景噪声干扰;此外,开发了自适应尺度感知模块(Adaptive Scale-Aware Module, ASAM),以创新的输入相关可变形卷积(Input-dependent Deformable Convolution, IDConv)为核心构建单元,动态适应头部目标形状和尺度的变化,显著提升网络应对大规模变化的能力。这些创新点共同构成了 RCCFormer 在多个基准数据集(ShanghaiTech Part_A 和 Part_B、NWPU-Crowd、QNRF)上的卓越表现,验证了其在人群计数任务中的先进性。

链接: https://arxiv.org/abs/2504.04935
作者: Peng Liu,Heng-Chao Li,Sen Lei,Nanqing Liu,Bin Feng,Xiao Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Crowd counting, which is a key computer vision task, has emerged as a fundamental technology in crowd analysis and public safety management. However, challenges such as scale variations and complex backgrounds significantly impact the accuracy of crowd counting. To mitigate these issues, this paper proposes a robust Transformer-based crowd counting network, termed RCCFormer, specifically designed for background suppression and scale awareness. The proposed method incorporates a Multi-level Feature Fusion Module (MFFM), which meticulously integrates features extracted at diverse stages of the backbone architecture. It establishes a strong baseline capable of capturing intricate and comprehensive feature representations, surpassing traditional baselines. Furthermore, the introduced Detail-Embedded Attention Block (DEAB) captures contextual information and local details through global self-attention and local attention along with a learnable manner for efficient fusion. This enhances the model’s ability to focus on foreground regions while effectively mitigating background noise interference. Additionally, we develop an Adaptive Scale-Aware Module (ASAM), with our novel Input-dependent Deformable Convolution (IDConv) as its fundamental building block. This module dynamically adapts to changes in head target shapes and scales, significantly improving the network’s capability to accommodate large-scale variations. The effectiveness of the proposed method is validated on the ShanghaiTech Part_A and Part_B, NWPU-Crowd, and QNRF datasets. The results demonstrate that our RCCFormer achieves excellent performance across all four datasets, showcasing state-of-the-art outcomes.
zh

[CV-46] Inter-event Interval Microscopy for Event Cameras

【速读】:该论文致力于解决从稀疏事件数据重建强度信息这一长期存在的难题,尤其是在荧光显微镜中对静态与动态场景实现事件到强度的转换。传统方法主要依赖于事件积分,而本文提出的方法——基于事件间间隔的显微镜技术(Inter-event Interval Microscopy, IEIM)的关键在于量化每个像素连续事件之间的事件间时间间隔。通过在事件相机中设置固定阈值,这些时间间隔能够精确表示强度值。在硬件层面,该方法集成了脉冲光调制设备于配备事件相机的显微镜系统中,称为基于脉冲调制的事件驱动荧光显微镜(Pulse Modulation-based Event-driven Fluorescence Microscopy)。此外,作者构建了IEIMat数据集以涵盖多种场景,包括高动态范围和高速情况下的实验数据。IEIMat数据集上的实验结果表明,相比其他方法,IEIM在保持较低带宽的同时实现了更高的空间分辨率、时间分辨率以及更广的动态范围。

链接: https://arxiv.org/abs/2504.04924
作者: Changqing Su,Yanqin Chen,Zihan Lin,Zhen Cheng,You Zhou,Bo Xiong,Zhaofei Yu,Tiejun Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Event cameras, an innovative bio-inspired sensor, differ from traditional cameras by sensing changes in intensity rather than directly perceiving intensity and recording these variations as a continuous stream of “events”. The intensity reconstruction from these sparse events has long been a challenging problem. Previous approaches mainly focused on transforming motion-induced events into videos or achieving intensity imaging for static scenes by integrating modulation devices at the event camera acquisition end. In this paper, for the first time, we achieve event-to-intensity conversion using a static event camera for both static and dynamic scenes in fluorescence microscopy. Unlike conventional methods that primarily rely on event integration, the proposed Inter-event Interval Microscopy (IEIM) quantifies the time interval between consecutive events at each pixel. With a fixed threshold in the event camera, the time interval can precisely represent the intensity. At the hardware level, the proposed IEIM integrates a pulse light modulation device within a microscope equipped with an event camera, termed Pulse Modulation-based Event-driven Fluorescence Microscopy. Additionally, we have collected the IEIMat dataset under various scenes, including high dynamic range and high-speed scenarios. Experimental results on the IEIMat dataset demonstrate that the proposed IEIM achieves superior spatial and temporal resolution, as well as a higher dynamic range, with lower bandwidth compared to other methods. The code and the IEIMat dataset will be made publicly available.
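IEIM 的核心关系很简单:在固定对比度阈值 theta 下,相邻事件的时间间隔 Δt 与强度成反比,即 I ≈ theta/Δt。下面的 NumPy 草图按此从单像素事件时间戳估计相对强度(theta 取值与时间戳均为假设):

```python
import numpy as np

theta = 0.2   # 事件相机的固定对比度阈值(假设值)

def intensity_from_intervals(timestamps):
    # 某像素的事件时间戳 -> 由相邻事件的平均间隔估计强度:I ≈ theta / Δt
    dt = np.diff(np.asarray(timestamps, dtype=float))
    return theta / dt.mean()

bright = intensity_from_intervals([0.00, 0.01, 0.02, 0.03])  # 间隔短 -> 强度高
dim    = intensity_from_intervals([0.00, 0.10, 0.20, 0.30])  # 间隔长 -> 强度低
print(bright, dim)
```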
zh

[CV-47] IterMask3D: Unsupervised Anomaly Detection and Segmentation with Test-Time Iterative Mask Refinement in 3D Brain MR

【速读】:该论文旨在解决无监督异常检测与分割方法在处理医学影像(如3D脑部MRI)时因输入图像被人为破坏而导致信息损失的问题,这可能引发次优重建及误报增加。为解决此问题,论文提出IterMask3D,这是一种迭代的空间掩码细化策略,通过逐步揭示“正常”区域并利用其信息指导后续重建,从而减少误报。此外,为了提高重建性能,还引入高频图像内容作为额外结构信息来引导掩码区域的重建。关键在于IterMask3D策略及其结合高频图像内容的创新应用。

链接: https://arxiv.org/abs/2504.04911
作者: Ziyun Liang,Xiaoqing Guo,Wentian Xu,Yasin Ibrahim,Natalie Voets,Pieter M Pretorius,J. Alison Noble,Konstantinos Kamnitsas
机构: Department of Engineering Science, University of Oxford (牛津大学), UK; Department of Computer Science, Hong Kong Baptist University (香港浸会大学); Nuffield Department of Clinical Neurosciences, University of Oxford (牛津大学); Department of Neuroradiology, Oxford University Hospitals NHS Foundation Trust (牛津大学医院 NHS 基金会信托)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Unsupervised anomaly detection and segmentation methods train a model to learn the training distribution as ‘normal’. In the testing phase, they identify patterns that deviate from this normal distribution as ‘anomalies’. To learn the ‘normal’ distribution, prevailing methods corrupt the images and train a model to reconstruct them. During testing, the model attempts to reconstruct corrupted inputs based on the learned ‘normal’ distribution. Deviations from this distribution lead to high reconstruction errors, which indicate potential anomalies. However, corrupting an input image inevitably causes information loss even in normal regions, leading to suboptimal reconstruction and an increased risk of false positives. To alleviate this, we propose IterMask3D, an iterative spatial mask-refining strategy designed for 3D brain MRI. We iteratively spatially mask areas of the image as corruption and reconstruct them, then shrink the mask based on reconstruction error. This process iteratively unmasks ‘normal’ areas to the model, whose information further guides reconstruction of ‘normal’ patterns under the mask to be reconstructed accurately, reducing false positives. In addition, to achieve better reconstruction performance, we also propose using high-frequency image content as additional structural information to guide the reconstruction of the masked area. Extensive experiments on the detection of both synthetic and real-world imaging artifacts, as well as segmentation of various pathological lesions across multiple MRI sequences, consistently demonstrate the effectiveness of our proposed method.
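IterMask3D 的迭代掩码收缩流程可以概括为:遮挡 → 重建 → 将重建误差低的区域解除掩码 → 重复。下面是一个与论文实现无关的玩具示意,其中用全图均值填充代替真实的重建网络(重建器、阈值均为演示假设):

```python
import numpy as np

def iterative_mask_refine(image, reconstruct, err_thresh=0.1, n_iters=4):
    """迭代空间掩码细化的玩具版本(示意,非论文实现)。

    mask==True 表示被遮挡区域;每轮重建被遮挡区域后,
    重建误差低于阈值的像素视为“正常”,从掩码中移除。
    """
    mask = np.ones_like(image, dtype=bool)      # 初始全部遮挡
    for _ in range(n_iters):
        recon = reconstruct(image, mask)        # 基于未遮挡信息重建
        err = np.abs(recon - image)
        newly_normal = mask & (err < err_thresh)
        if not newly_normal.any():
            break
        mask &= ~newly_normal                   # 收缩掩码
    return mask                                 # 残留掩码即疑似异常区域

def mean_fill(image, mask):
    """假设的重建器:用未遮挡区域均值填充掩码区(仅作占位)。"""
    recon = image.copy()
    known = image[~mask]
    recon[mask] = known.mean() if known.size else image.mean()
    return recon

img = np.zeros((8, 8)); img[2:4, 2:4] = 5.0    # 一块“异常”亮斑
residual = iterative_mask_refine(img, mean_fill, err_thresh=0.5)
```

正常背景在迭代中被逐步解除掩码,而重建不出来的异常亮斑留在残余掩码中,对应论文“减少误报、定位异常”的思路。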
zh

[CV-48] Video-Bench: Human-Aligned Video Generation Benchmark CVPR’25

【速读】:该论文旨在解决现有视频生成基准在评估生成模型所产生视频质量时与人类判断不一致的问题。传统基准虽涵盖多维度评估但缺乏与人类感知的良好一致性,而基于大型语言模型(Large Language Model, LLM)的基准虽然具备类人推理能力,但在视频质量指标理解和跨模态一致性方面存在局限性。为此,论文提出了一种名为Video-Bench的新基准,其关键在于通过丰富的提示套件(prompt suite)和广泛的评估维度,系统性地利用多模态大语言模型(Multimodal Large Language Models, MLLMs),结合少量样本评分(few-shot scoring)和查询链(chain-of-query)技术,提供一种结构化且可扩展的生成视频评估方法。实验结果表明,Video-Bench在所有评估维度上均实现了更优的人类偏好一致性,并在框架评估与人工评价分歧时提供了更为客观和准确的洞见。

链接: https://arxiv.org/abs/2504.04907
作者: Hui Han,Siyuan Li,Jiaqi Chen,Yiwen Yuan,Yuling Wu,Chak Tou Leong,Hanwen Du,Junchen Fu,Youhua Li,Jie Zhang,Chi Zhang,Li-jia Li,Yongxin Ni
机构: Shanghai Jiao Tong University (上海交通大学); Stanford University (斯坦福大学); Fellou AI; Fudan University (复旦大学); Carnegie Mellon University (卡内基梅隆大学); Hong Kong Polytechnic University (香港理工大学); Soochow University (苏州大学); University of Glasgow (格拉斯哥大学); City University of Hong Kong (香港城市大学); Westlake University (西湖大学); LiveX AI; National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR’25

点击查看摘要

Abstract:Video generation assessment is essential for ensuring that generative models produce visually realistic, high-quality videos while aligning with human expectations. Current video generation benchmarks fall into two main categories: traditional benchmarks, which use metrics and embeddings to evaluate generated video quality across multiple dimensions but often lack alignment with human judgments; and large language model (LLM)-based benchmarks, though capable of human-like reasoning, are constrained by a limited understanding of video quality metrics and cross-modal consistency. To address these challenges and establish a benchmark that better aligns with human preferences, this paper introduces Video-Bench, a comprehensive benchmark featuring a rich prompt suite and extensive evaluation dimensions. This benchmark represents the first attempt to systematically leverage MLLMs across all dimensions relevant to video generation assessment in generative models. By incorporating few-shot scoring and chain-of-query techniques, Video-Bench provides a structured, scalable approach to generated video evaluation. Experiments on advanced models including Sora demonstrate that Video-Bench achieves superior alignment with human preferences across all dimensions. Moreover, in instances where our framework’s assessments diverge from human evaluations, it consistently offers more objective and accurate insights, suggesting an even greater potential advantage over traditional human judgment.
zh

[CV-49] Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision

【速读】:该论文试图解决低级视觉领域中超过100个子任务的问题,涵盖图像恢复、图像增强、弱语义密集预测以及风格化四大主要类别。论文提出的解决方案Lumina-OmniLV(简称OmniLV)是一种通用的多模态多任务框架,其关键是通过结合文本和视觉提示提供灵活且友好的交互方式,并基于Diffusion Transformer (DiT) 的生成先验支持任意分辨率,在1K分辨率下实现最优性能,同时保持细粒度细节和高保真度。此外,论文强调分别对文本和视觉指令进行编码,并结合浅层特征控制的协同训练对于缓解任务歧义和提升多任务泛化能力至关重要。

链接: https://arxiv.org/abs/2504.04903
作者: Yuandong Pu,Le Zhuo,Kaiwen Zhu,Liangbin Xie,Wenlong Zhang,Xiangyu Chen,Peng Gao,Yu Qiao,Chao Dong,Yihao Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Lumina-OmniLV (abbreviated as OmniLV), a universal multimodal multi-task framework for low-level vision that addresses over 100 sub-tasks across four major categories: image restoration, image enhancement, weak-semantic dense prediction, and stylization. OmniLV leverages both textual and visual prompts to offer flexible and user-friendly interactions. Built on Diffusion Transformer (DiT)-based generative priors, our framework supports arbitrary resolutions – achieving optimal performance at 1K resolution – while preserving fine-grained details and high fidelity. Through extensive experiments, we demonstrate that separately encoding text and visual instructions, combined with co-training using shallow feature control, is essential to mitigate task ambiguity and enhance multi-task generalization. Our findings also reveal that integrating high-level generative tasks into low-level vision models can compromise detail-sensitive restoration. These insights pave the way for more robust and generalizable low-level vision systems.
zh

[CV-50] SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models CVPR2025

【速读】:该论文旨在解决多模态基础模型中因文本与视觉内容交互而引发的字型攻击(Typography Attacks)所导致的误分类问题。现有数据集在规模和多样性方面存在局限性,难以充分研究此类漏洞。为此,论文提出了解决方案的关键在于构建了一个名为SCAM的大规模且多样化的现实字型攻击图像数据集,包含1,162张图像,覆盖数百种类别对象及攻击词。通过在SCAM上的广泛基准测试,发现字型攻击显著降低了视觉-语言模型(Vision-Language Models, VLMs)的性能,并揭示训练数据和模型架构影响其易受攻击的程度。研究进一步表明,尽管最先进的大型视觉-语言模型(Large Vision-Language Models, LVLMs)中的视觉编码器选择使字型攻击持续存在,但更大的大型语言模型(Large Language Models, LLMs)主干有助于减轻这种脆弱性。此外,合成攻击与真实世界的手写攻击高度相似,验证了其在研究中的适用性。论文提供了全面的资源和实证见解,以促进鲁棒且可信的多模态人工智能系统的未来研究。

链接: https://arxiv.org/abs/2504.04893
作者: Justus Westerhoff,Erblina Purellku,Jakob Hackstein,Leo Pinetzki,Lorenz Hufe
机构: BLISS e.V.; Berliner Hochschule für Technik (BHT) (柏林工业大学); Fraunhofer HHI; Technische Universität Berlin (柏林工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to CVPR 2025 Workshop EVAL-FoMo-2

点击查看摘要

Abstract:Typographic attacks exploit the interplay between text and visual content in multimodal foundation models, causing misclassifications when misleading text is embedded within images. However, existing datasets are limited in size and diversity, making it difficult to study such vulnerabilities. In this paper, we introduce SCAM, the largest and most diverse dataset of real-world typographic attack images to date, containing 1,162 images across hundreds of object categories and attack words. Through extensive benchmarking of Vision-Language Models (VLMs) on SCAM, we demonstrate that typographic attacks significantly degrade performance, and identify that training data and model architecture influence the susceptibility to these attacks. Our findings reveal that typographic attacks persist in state-of-the-art Large Vision-Language Models (LVLMs) due to the choice of their vision encoder, though larger Large Language Models (LLMs) backbones help mitigate their vulnerability. Additionally, we demonstrate that synthetic attacks closely resemble real-world (handwritten) attacks, validating their use in research. Our work provides a comprehensive resource and empirical insights to facilitate future research toward robust and trustworthy multimodal AI systems. We publicly release the datasets introduced in this paper under this https URL, along with the code for evaluations at this https URL.
zh

[CV-51] Content-Aware Transformer for All-in-one Image Restoration

【速读】:该论文旨在解决基于窗口的自注意力机制在图像恢复任务中受限于有限感受野的问题。为了解决这一挑战,论文提出了一种名为DSwinIR(Deformable Sliding window Transformer for Image Restoration)的新方法。DSwinIR的关键创新在于引入了一种可变形滑动窗口自注意力机制(deformable sliding window self-attention),该机制能够根据图像内容自适应调整感受野,从而将注意力集中于重要区域并增强与显著特征对齐的特征提取。此外,还设计了一种中心汇聚模式(central ensemble pattern)以减少注意力窗口内无关内容的包含。通过结合这两种技术,DSwinIR不仅放大了卷积神经网络(CNNs)和Transformer架构的优势,还缓解了它们各自的局限性,实现了在多种图像恢复任务中的最新性能表现。

链接: https://arxiv.org/abs/2504.04869
作者: Gang Wu,Junjun Jiang,Kui Jiang,Xianming Liu
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image restoration has witnessed significant advancements with the development of deep learning models. Although Transformer architectures have progressed considerably in recent years, challenges remain, particularly the limited receptive field in window-based self-attention. In this work, we propose DSwinIR, a Deformable Sliding window Transformer for Image Restoration. DSwinIR introduces a novel deformable sliding window self-attention that adaptively adjusts receptive fields based on image content, enabling the attention mechanism to focus on important regions and enhance feature extraction aligned with salient features. Additionally, we introduce a central ensemble pattern to reduce the inclusion of irrelevant content within attention windows. In this way, the proposed DSwinIR model integrates the deformable sliding window Transformer and central ensemble pattern to amplify the strengths of both CNNs and Transformers while mitigating their limitations. Extensive experiments on various image restoration tasks demonstrate that DSwinIR achieves state-of-the-art performance. For example, in image deraining, compared to DRSformer on the SPA dataset, DSwinIR achieves a 0.66 dB PSNR improvement. In all-in-one image restoration, compared to PromptIR, DSwinIR achieves over a 0.66 dB and 1.04 dB improvement on three-task and five-task settings, respectively. Pretrained models and code are available at our project this https URL.
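可变形滑动窗口自注意力的要点是:每个查询位置仍然只看一个局部滑动窗口,但窗口内的采样位置带有内容相关的偏移。下面用一维、单头、整数偏移的极简 numpy 示意说明这一机制(假设性实现;论文为二维、可学习的亚像素偏移,并带中心汇聚等设计,此处均省略):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def deformable_window_attn_1d(q, k, v, offsets, window=3):
    """1D 单头“可变形滑动窗口注意力”的极简示意。

    查询位置 i 关注以 i 为中心、宽 window 的滑动窗口,
    但窗口内每个采样点再加上偏移 offsets[i, j](实际中由网络按内容预测)。
    """
    n, d = q.shape
    half = window // 2
    out = np.zeros_like(v)
    for i in range(n):
        base = np.arange(i - half, i + half + 1)      # 基础窗口位置
        pos = np.clip(base + offsets[i], 0, n - 1)    # 加偏移并裁剪到合法范围
        attn = softmax(q[i] @ k[pos].T / np.sqrt(d))  # 窗口内注意力权重
        out[i] = attn @ v[pos]
    return out

rng = np.random.default_rng(0)
n, d = 6, 4
q = rng.normal(size=(n, d)); k = rng.normal(size=(n, d)); v = rng.normal(size=(n, d))
zero_off = np.zeros((n, 3), dtype=int)   # 偏移全零时退化为普通滑动窗口注意力
out_plain = deformable_window_attn_1d(q, k, v, zero_off)
```

偏移全零即普通滑动窗口;偏移由内容预测时,感受野便可随图像内容自适应移动,这正是 DSwinIR 用来突破固定窗口限制的思路。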
zh

[CV-52] Embracing Dynamics: Dynamics-aware 4D Gaussian Splatting SLAM IROS2025

【速读】:该论文旨在解决基于3D高斯点泼绘(3D Gaussian Splatting, 3DGS)的同步定位与建图(SLAM)技术在动态环境中的两大核心问题:姿态漂移(pose drift)以及无法准确重建动态场景地图。为了解决这些问题,论文提出了D4DGS-SLAM,这是首个基于4D高斯点泼绘(4DGS)动态场景表示的SLAM方法。其关键在于通过引入时间维度增强场景表示,利用动态感知的信息模块(dynamics-aware InfoModule),实现对场景点的动态性、可见性和可靠性分析,并筛选出稳定的静态点用于跟踪;同时,在优化高斯点时针对不同动态特性施加差异化的各向同性正则化项。这些创新使D4DGS-SLAM能够高质量地重建动态场景,并在真实动态场景数据集上的实验结果验证了其在相机位姿跟踪和地图质量方面的优越性能。

链接: https://arxiv.org/abs/2504.04844
作者: Zhicong Sun,Jacqueline Lo,Jinxing Hu
机构: Hong Kong Polytechnic University (香港理工大学); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is currently under reviewed for IROS 2025

点击查看摘要

Abstract:Simultaneous localization and mapping (SLAM) technology now has photorealistic mapping capabilities thanks to the real-time high-fidelity rendering capability of 3D Gaussian splatting (3DGS). However, due to the static representation of scenes, current 3DGS-based SLAM encounters issues with pose drift and failure to reconstruct accurate maps in dynamic environments. To address this problem, we present D4DGS-SLAM, the first SLAM method based on 4DGS map representation for dynamic environments. By incorporating the temporal dimension into scene representation, D4DGS-SLAM enables high-quality reconstruction of dynamic scenes. Utilizing the dynamics-aware InfoModule, we can obtain the dynamics, visibility, and reliability of scene points, and filter stable static points for tracking accordingly. When optimizing Gaussian points, we apply different isotropic regularization terms to Gaussians with varying dynamic characteristics. Experimental results on real-world dynamic scene datasets demonstrate that our method outperforms state-of-the-art approaches in both camera pose tracking and map quality.
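“筛选稳定静态点用于跟踪”这一步,逻辑上可以写成对场景点的动态性、可见性、可靠性三个分数做联合阈值筛选。以下为示意代码,分数的归一化方式与阈值均为演示假设,并非论文 InfoModule 的实际设定:

```python
import numpy as np

def select_tracking_points(dynamics, visibility, reliability,
                           dyn_max=0.2, vis_min=0.5, rel_min=0.5):
    """按动态性/可见性/可靠性筛选用于位姿跟踪的稳定静态点(示意)。

    三个分数均假定已归一化到 [0, 1];动态性低、可见且可靠的点才保留。
    """
    keep = (dynamics < dyn_max) & (visibility >= vis_min) & (reliability >= rel_min)
    return np.nonzero(keep)[0]

dynamics    = np.array([0.05, 0.9, 0.1, 0.4])   # 点1动态性过高,点3偏动态
visibility  = np.array([0.8,  0.9, 0.3, 0.7])   # 点2可见性不足
reliability = np.array([0.9,  0.8, 0.9, 0.6])
idx = select_tracking_points(dynamics, visibility, reliability)
```

只有同时满足三项条件的点(此例中仅第 0 个)参与相机位姿跟踪,动态点则交由 4DGS 的时间维度去建模而非直接剔除。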
zh

[CV-53] FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis

【速读】:该论文致力于解决从单张静态肖像创建可动画化 avatar 的挑战,特别是现有方法难以捕捉微妙的面部表情、相关的全身运动以及动态背景的问题。为了解决这些局限性,论文提出了一种新颖的框架,利用预训练的视频扩散变换模型生成高保真且连贯的带可控运动动力学的说话肖像。方案的关键在于其双阶段的音视频对齐策略:第一阶段采用片段级训练方案,通过在整个场景(包括参考肖像、上下文物体和背景)中对音频驱动的动力学进行对齐来建立一致的全局运动;第二阶段则使用唇部追踪掩码细化帧级别的唇部动作,确保与音频信号的精确同步。此外,通过引入一个专注于面部的跨注意力模块替代传统的参考网络,并集成一个运动强度调制模块以显式控制表情和身体运动的强度,进一步增强了身份保持和运动灵活性。实验结果表明,该方法在真实感、一致性、运动强度和身份保留方面表现更优。

链接: https://arxiv.org/abs/2504.04842
作者: Mengchao Wang,Qiang Wang,Fan Jiang,Yaqi Fan,Yunpeng Zhang,Yonggang Qi,Kun Zhao,Mu Xu
机构: AMAP, Alibaba Group; Beijing University of Posts and Telecommunications
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Creating a realistic animatable avatar from a single static portrait remains challenging. Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. At the core of our work is a dual-stage audio-visual alignment strategy. In the first stage, we employ a clip-level training scheme to establish coherent global motion by aligning audio-driven dynamics across the entire scene, including the reference portrait, contextual objects, and background. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals. To preserve identity without compromising motion flexibility, we replace the commonly used reference network with a facial-focused cross-attention module that effectively maintains facial consistency throughout the video. Furthermore, we integrate a motion intensity modulation module that explicitly controls expression and body motion intensity, enabling controllable manipulation of portrait movements beyond mere lip motion. Extensive experimental results show that our proposed approach achieves higher quality with better realism, coherence, motion intensity, and identity preservation. Ours project page: this https URL.
zh

[CV-54] Prior2Former – Evidential Modeling of Mask Transformers for Assumption-Free Open-World Panoptic Segmentation

【速读】:该论文致力于解决现有全景分割方法在面对新颖类别和分布外(Out-of-Distribution, OOD)数据时的可靠性不足问题,特别是在自动驾驶等安全关键领域,确保模型在未见过场景中的可靠性至关重要。为填补这一性能与可靠性的差距,论文提出了一种基于确信学习(evidential learning)的Prior2Former (P2F),这是首个用于分割视觉变换器的方法。P2F的关键创新在于通过引入Beta先验,扩展了掩码视觉变换器架构以计算像素级二进制掩码分配中的模型不确定性,从而实现高质量的不确定性估计,有效检测新颖和OOD对象,推动异常实例分割和开放世界全景分割达到当前最佳水平。此外,P2F无需使用OOD数据样本或针对空类(即未标记类)进行对比训练,使其在缺乏此类先验信息的实际应用中具有高度适用性。

链接: https://arxiv.org/abs/2504.04841
作者: Sebastian Schmidt,Julius Körner,Dominik Fuchsgruber,Stefano Gasperini,Federico Tombari,Stephan Günnemann
机构: Technical University of Munich (慕尼黑工业大学); BMW Group (宝马集团); Visualais; Google Zurich (谷歌苏黎世)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In panoptic segmentation, individual instances must be separated within semantic classes. As state-of-the-art methods rely on a pre-defined set of classes, they struggle with novel categories and out-of-distribution (OOD) data. This is particularly problematic in safety-critical applications, such as autonomous driving, where reliability in unseen scenarios is essential. We address the gap between outstanding benchmark performance and reliability by proposing Prior2Former (P2F), the first approach for segmentation vision transformers rooted in evidential learning. P2F extends the mask vision transformer architecture by incorporating a Beta prior for computing model uncertainty in pixel-wise binary mask assignments. This design enables high-quality uncertainty estimation that effectively detects novel and OOD objects enabling state-of-the-art anomaly instance segmentation and open-world panoptic segmentation. Unlike most segmentation models addressing unknown classes, P2F operates without access to OOD data samples or contrastive training on void (i.e., unlabeled) classes, making it highly applicable in real-world scenarios where such prior information is unavailable. Additionally, P2F can be flexibly applied to anomaly instance and panoptic segmentation. Through comprehensive experiments on the Cityscapes, COCO, SegmentMeIfYouCan, and OoDIS datasets, we demonstrate the state-of-the-art performance of P2F. It achieves the highest ranking in the OoDIS anomaly instance benchmark among methods not using OOD data in any way.
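P2F 用 Beta 先验刻画像素级二值掩码分配的不确定性。下面按确信学习中常见的 Beta 分布公式给出均值与方差的计算示意(通用公式,非论文逐字推导;证据总量 α+β 越小,不确定性越大):

```python
import numpy as np

def beta_mask_stats(alpha, beta):
    """由像素级 Beta(α, β) 参数计算掩码分配的均值与不确定性(示意)。

    mean = α/(α+β);方差 αβ/((α+β)²(α+β+1)) 作为不确定性的一种度量。
    """
    s = alpha + beta
    mean = alpha / s
    var = (alpha * beta) / (s ** 2 * (s + 1.0))
    return mean, var

# 像素0证据充分(α+β=55),像素1证据稀少(α+β=3)
alpha = np.array([50.0, 2.0])
beta = np.array([5.0, 1.0])
mean, var = beta_mask_stats(alpha, beta)
```

证据稀少的像素方差显著更大,正是这种高不确定性信号被用来标记新颖/分布外对象,而无需任何 OOD 训练样本。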
zh

[CV-55] Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos

【速读】:该论文旨在解决点云视频表示学习中的两大挑战:(1) 现有方法通过手工设计学习运动模式,在预训练阶段产生的运动模式在微调场景下不可迁移;(2) 传统的掩码自编码器(Masked AutoEncoder, MAE)框架难以弥合4D数据中存在的巨大表征鸿沟。为解决这些问题,论文提出了一种自解缠的MAE框架用于在预训练阶段学习判别性的4D表示。关键解决方案包括:(1) 在潜在空间建模运动表征以应对第一个挑战;(2) 引入潜在标记与几何标记相结合,在解码过程中解缠高低级特征以解决第二个挑战。实验验证表明,该自解缠学习框架可显著提升所有4D任务的微调性能,并在HOI4D数据集上实现了+3.8%的分割精度提升,优于现有自监督或全监督方法。

链接: https://arxiv.org/abs/2504.04837
作者: Zhi Zuo,Chenyi Zhuang,Zhiqiang Shen,Pan Gao,Jie Qin
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures

点击查看摘要

Abstract:Point cloud video representation learning is primarily built upon the masking strategy in a self-supervised manner. However, the progress is slow due to several significant challenges: (1) existing methods learn the motion particularly with hand-crafted designs, leading to unsatisfactory motion patterns during pre-training which are non-transferable on fine-tuning scenarios. (2) previous Masked AutoEncoder (MAE) frameworks are limited in resolving the huge representation gap inherent in 4D data. In this study, we introduce the first self-disentangled MAE for learning discriminative 4D representations in the pre-training stage. To address the first challenge, we propose to model the motion representation in a latent space. The second issue is resolved by introducing the latent tokens along with the typical geometry tokens to disentangle high-level and low-level features during decoding. Extensive experiments on MSR-Action3D, NTU-RGBD, HOI4D, NvGesture, and SHREC’17 verify this self-disentangled learning framework. We demonstrate that it can boost the fine-tuning performance on all 4D tasks, which we term Uni4D. Our pre-trained model presents discriminative and meaningful 4D representations, particularly benefits processing long videos, as Uni4D gets +3.8% segmentation accuracy on HOI4D, significantly outperforming either self-supervised or fully-supervised methods after end-to-end fine-tuning.
zh

[CV-56] Inland Waterway Object Detection in Multi-environment: Dataset and Approach

【速读】:该论文旨在解决内河航道船舶视觉感知系统在复杂环境下的适应性不足问题,特别是现有数据集难以应对狭窄航道、多变天气和城市干扰等挑战。为解决这些问题,论文提出了关键方案:首先构建了一个包含32,478张高质量图像的多环境内河船舶数据集(Multi-environment Inland Waterway Vessel Dataset, MEIWVD),覆盖多种场景与环境条件;其次,设计了一种基于场景引导的图像增强模块,以自适应优化水体表面图像质量;同时,通过参数受限的空洞卷积提升船舶特征的表达能力,并采用多尺度空洞残差融合方法整合多尺度特征以实现更优检测效果。这些措施显著提升了检测器在复杂多环境场景中的性能。

链接: https://arxiv.org/abs/2504.04835
作者: Shanshan Wang,Haixiang Xu,Hui Feng,Xiaoqian Wang,Pei Song,Sijie Liu,Jianhua He
机构: Key Laboratory of High Performance Ship Technology (Wuhan University of Technology), Ministry of Education (教育部高性能船技术重点实验室(武汉理工大学)); College of Automobile Technology and Service, Wuhan City Polytechnic (武汉城市职业学院汽车技术与服务学院); School of Naval Architecture, Ocean and Energy Power Engineering, Wuhan University of Technology (武汉理工大学船舶与海洋及能源动力工程学院); School of Computer Science and Electronic Engineering, University of Essex (英国埃塞克斯大学计算机科学与电子工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 37 pages,11 figures,5 tables

点击查看摘要

Abstract:The success of deep learning in intelligent ship visual perception relies heavily on rich image data. However, dedicated datasets for inland waterway vessels remain scarce, limiting the adaptability of visual perception systems in complex environments. Inland waterways, characterized by narrow channels, variable weather, and urban interference, pose significant challenges to object detection systems based on existing datasets. To address these issues, this paper introduces the Multi-environment Inland Waterway Vessel Dataset (MEIWVD), comprising 32,478 high-quality images from diverse scenarios, including sunny, rainy, foggy, and artificial lighting conditions. MEIWVD covers common vessel types in the Yangtze River Basin, emphasizing diversity, sample independence, environmental complexity, and multi-scale characteristics, making it a robust benchmark for vessel detection. Leveraging MEIWVD, this paper proposes a scene-guided image enhancement module to improve water surface images based on environmental conditions adaptively. Additionally, a parameter-limited dilated convolution enhances the representation of vessel features, while a multi-scale dilated residual fusion method integrates multi-scale features for better detection. Experiments show that MEIWVD provides a more rigorous benchmark for object detection algorithms, and the proposed methods significantly improve detector performance, especially in complex multi-environment scenarios.
zh

[CV-57] Learning Affine Correspondences by Integrating Geometric Constraints

【速读】:该论文旨在解决现有基于仿射对应(Affine Correspondence)提取方法在性能上的诸多局限性问题,特别是在图像匹配和姿态估计任务中的准确性与鲁棒性不足。论文的关键创新在于提出了一种新的管道,通过结合密集匹配(Dense Matching)与几何约束(Geometric Constraints),设计了一个新颖的仿射对应提取框架。该框架借助密集匹配技术和一种新型的关键点尺度与方向估计算法,并引入基于几何约束的损失函数,以监督神经网络学习特征的几何属性,从而有效提升准确性。实验结果表明,所提方法在图像匹配任务中的精度和鲁棒性优于现有方法,并进一步通过相对位姿估计任务验证了其有效性,实现了更精确的位姿估计结果。

链接: https://arxiv.org/abs/2504.04834
作者: Pengju Sun,Banglei Guan,Zhenbao Yu,Yang Shang,Qifeng Yu,Daniel Barath
机构: College of Aerospace Science and Engineering, National University of Defense Technology, China (国防科技大学航空航天科学与工程学院); Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation, China (湖南省图像测量与视觉导航重点实验室); ETH Zurich, Switzerland (瑞士苏黎世联邦理工学院); HUN-REN SZTAKI, Hungary (匈牙利人类智能计算研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Affine correspondences have received significant attention due to their benefits in tasks like image matching and pose estimation. Existing methods for extracting affine correspondences still have many limitations in terms of performance; thus, exploring a new paradigm is crucial. In this paper, we present a new pipeline designed for extracting accurate affine correspondences by integrating dense matching and geometric constraints. Specifically, a novel extraction framework is introduced, with the aid of dense matching and a novel keypoint scale and orientation estimator. For this purpose, we propose loss functions based on geometric constraints, which can effectively improve accuracy by supervising neural networks to learn feature geometry. Experiments show that the accuracy and robustness of our method outperform existing ones in image matching tasks. To further demonstrate the effectiveness of the proposed method, we applied it to relative pose estimation. Affine correspondences extracted by our method lead to more accurate poses than the baselines on a range of real-world datasets. The code is available at this https URL.
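由关键点的尺度与主方向构造局部仿射的一种常见近似是 A ≈ J₂J₁⁻¹,其中 Jᵢ = sᵢ·R(θᵢ) 表示两幅图中各自的局部坐标系。以下示意忽略了剪切分量,仅用于说明尺度/方向估计器与仿射对应之间的几何联系,并非论文网络的实际输出:

```python
import numpy as np

def rot(theta):
    """2D 旋转矩阵 R(θ)。"""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def affine_from_scale_orientation(s1, t1, s2, t2):
    """由匹配关键点的 (尺度, 主方向) 构造局部仿射的相似近似(示意)。

    J_i = s_i * R(θ_i),则 A ≈ J2 @ J1^{-1};忽略剪切,仅为常见近似。
    """
    J1 = s1 * rot(t1)
    J2 = s2 * rot(t2)
    return J2 @ np.linalg.inv(J1)

# 第二视图中特征放大 2 倍并旋转 90°
A = affine_from_scale_orientation(1.0, 0.0, 2.0, np.pi / 2)
```

这类几何关系正是论文中损失函数的约束来源:监督网络学到的尺度与方向需与真实局部仿射一致。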
zh

[CV-58] SMF: Template-free and Rig-free Animation Transfer using Kinetic Codes

【速读】:该论文旨在解决动画重定向(Animation Retargeting)中的多个挑战,包括需要标注训练数据、依赖模板形状先验或人工设计的变形绑定、对未见运动和形状的泛化能力有限以及运动抖动等问题。论文提出了一种名为自监督运动场(Self-supervised Motion Fields, SMF)的框架,通过稀疏运动表示进行鲁棒训练,无需特定数据集的标注、模板或绑定。方案的关键在于引入了动能码(Kinetic Codes),这是一种基于新型自动编码器的稀疏运动编码,能够揭示语义丰富的潜在空间,从而简化大规模训练。此外,该架构包含专门的空间和时间梯度预测器,并实现端到端训练。最终网络通过动能码的潜在空间正则化,在不同形状和运动上的泛化性能显著提升。论文在AMASS、D4D、Mixamo数据集以及从单目视频中采样的未见运动上验证了方法的有效性,并在AMASS数据集的未见运动泛化任务中达到了新的State-of-the-Art (SoTA) 水平。

链接: https://arxiv.org/abs/2504.04831
作者: Sanjeev Muralikrishnan,Niladri Shekhar Dutt,Niloy J. Mitra
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Animation retargeting involves applying a sparse motion description (e.g., 2D/3D keypoint sequences) to a given character mesh to produce a semantically plausible and temporally coherent full-body motion. Existing approaches come with a mix of restrictions - they require annotated training data, assume access to template-based shape priors or artist-designed deformation rigs, suffer from limited generalization to unseen motion and/or shapes, or exhibit motion jitter. We propose Self-supervised Motion Fields (SMF) as a self-supervised framework that can be robustly trained with sparse motion representations, without requiring dataset specific annotations, templates, or rigs. At the heart of our method are Kinetic Codes, a novel autoencoder-based sparse motion encoding, that exposes a semantically rich latent space simplifying large-scale training. Our architecture comprises dedicated spatial and temporal gradient predictors, which are trained end-to-end. The resultant network, regularized by the Kinetic Codes’s latent space, has good generalization across shapes and motions. We evaluated our method on unseen motion sampled from AMASS, D4D, Mixamo, and raw monocular video for animation transfer on various characters with varying shapes and topology. We report a new SoTA on the AMASS dataset in the context of generalization to unseen motion. Project webpage at this https URL
zh

[CV-59] From Specificity to Generality: Revisiting Generalizable Artifacts in Detecting Face Deepfakes

【速读】:该论文试图解决如何构建一个对大多数面部深度伪造(Deepfake)有效的通用检测框架的问题。随着生成式 AI (Generative AI) 技术的快速发展,深度伪造检测变得日益重要,但不同生成器产生的伪造痕迹种类繁多且各不相同,难以逐一学习。为应对这一挑战,论文的关键解决方案在于将深度伪造中的伪造痕迹归类为两类互补的核心类型:人脸不一致性伪影(Face Inconsistency Artifacts, FIA)和上采样伪影(Up-Sampling Artifacts, USA)。FIA 源于生成复杂面部细节的困难,导致面部特征与周围区域之间的不一致;而 USA 是解码器在上采样过程中不可避免留下的痕迹。论文提出了一种新的数据级伪深度伪造生成框架,通过仅引入 FIA 和 USA 来构造伪样本,避免引入其他较少通用的伪影,从而实现标准图像分类器在未见过的深度伪造样本上的良好泛化能力。

链接: https://arxiv.org/abs/2504.04827
作者: Long Ma,Zhiyuan Yan,Yize Chen,Jin Xu,Qinglang Guo,Hu Huang,Yong Liao,Hui Lin
机构: School of Cyber Science and Technology, University of Science and Technology of China (中国科学技术大学网络科学与技术学院); School of Electronic and Computer Engineering, Peking University (北京大学电子与计算机工程学院); School of Data Science, The Chinese University of Hong Kong (香港中文大学数据科学学院); School of Information Science and Technology, University of Science and Technology of China (中国科学技术大学信息科学与技术学院); China Academy of Electronics and Information Technology (中国电子科技集团公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Detecting deepfakes has been an increasingly important topic, especially given the rapid development of AI generation techniques. In this paper, we ask: How can we build a universal detection framework that is effective for most facial deepfakes? One significant challenge is the wide variety of deepfake generators available, resulting in varying forgery artifacts (e.g., lighting inconsistency, color mismatch, etc). But should we “teach” the detector to learn all these artifacts separately? It is impossible and impractical to elaborate on them all. So the core idea is to pinpoint the more common and general artifacts across different deepfakes. Accordingly, we categorize deepfake artifacts into two distinct yet complementary types: Face Inconsistency Artifacts (FIA) and Up-Sampling Artifacts (USA). FIA arise from the challenge of generating all intricate details, inevitably causing inconsistencies between the complex facial features and relatively uniform surrounding areas. USA, on the other hand, are the inevitable traces left by the generator’s decoder during the up-sampling process. This categorization stems from the observation that all existing deepfakes typically exhibit one or both of these artifacts. To achieve this, we propose a new data-level pseudo-fake creation framework that constructs fake samples with only the FIA and USA, without introducing extra less-general artifacts. Specifically, we employ a super-resolution to simulate the USA, while designing a Blender module that uses image-level self-blending on diverse facial regions to create the FIA. We surprisingly found that, with this intuitive design, a standard image classifier trained only with our pseudo-fake data can non-trivially generalize well to unseen deepfakes.
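按摘要思路,USA 可用“下采样再上采样”粗略模拟,FIA 可通过人脸区域的自混合(self-blending)制造局部不一致。下面是一个数据层面的玩具示意(区域坐标、混合系数,以及用 USA 退化版本作为混合源,均为演示假设,并非论文的超分辨率与 Blender 模块):

```python
import numpy as np

def add_usa(image, factor=2):
    """下采样后最近邻上采样,粗略模拟解码器的上采样伪影(USA 示意)。"""
    small = image[::factor, ::factor]
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)

def add_fia(image, region, alpha=0.6):
    """在给定人脸区域做自混合,制造局部不一致(FIA 示意)。

    region = (y0, y1, x0, x1);区域内与退化版本按 alpha 混合,区域外保持原样。
    """
    y0, y1, x0, x1 = region
    out = image.astype(float).copy()
    degraded = add_usa(image).astype(float)
    out[y0:y1, x0:x1] = (alpha * out[y0:y1, x0:x1]
                         + (1 - alpha) * degraded[y0:y1, x0:x1])
    return out

img = np.arange(64, dtype=float).reshape(8, 8)   # 玩具“图像”
fake = add_fia(img, region=(2, 6, 2, 6))         # 仅含 FIA+USA 的伪样本
```

这样构造的伪样本只携带两类通用伪影,用它们训练普通分类器即可避免拟合特定生成器的特有痕迹,这正是论文泛化能力的来源。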
zh

[CV-60] SUEDE:Shared Unified Experts for Physical-Digital Face Attack Detection Enhancement ICME2025

【速读】:该论文旨在解决同时检测物理攻击(如打印照片)和数字威胁(如DeepFake)的挑战,这些攻击类型目前分别作为独立的视觉任务(Face Anti-Spoofing和Forgery Detection)进行研究。由于不同攻击类型的固有差异,难以构建一个统一框架来同时识别这两种模态的数据,这是现有方法面临的主要问题。论文的关键创新在于提出SUEDE(Shared Unified Experts for Physical-Digital Face Attack Detection Enhancement),通过结合一个始终激活的共享专家(用于捕获两类攻击的共同特征)和多个按需激活的路由专家(针对特定攻击类型),有效解决了特征分布重叠与差异的问题,同时利用CLIP网络确保共享专家能够从先验视觉知识中受益,并在统一空间中对齐视觉-文本表示。

链接: https://arxiv.org/abs/2504.04818
作者: Zuying Xie,Changtao Miao,Ajian Liu,Jiabao Guo,Feng Li,Dan Guo,Yunfeng Diao
机构: Hefei University of Technology (合肥工业大学), China; University of Science and Technology of China (中国科学技术大学), China; Institute of Automation Chinese Academy of Sciences (中国科学院自动化研究所), China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in ICME 2025

点击查看摘要

Abstract:Face recognition systems are vulnerable to physical attacks (e.g., printed photos) and digital threats (e.g., DeepFake), which are currently being studied as independent visual tasks, such as Face Anti-Spoofing and Forgery Detection. The inherent differences among various attack types present significant challenges in identifying a common feature space, making it difficult to develop a unified framework for detecting data from both attack modalities simultaneously. Inspired by the efficacy of Mixture-of-Experts (MoE) in learning across diverse domains, we explore utilizing multiple experts to learn the distinct features of various attack types. However, the feature distributions of physical and digital attacks overlap and differ. This suggests that relying solely on distinct experts to learn the unique features of each attack type may overlook shared knowledge between them. To address these issues, we propose SUEDE, the Shared Unified Experts for Physical-Digital Face Attack Detection Enhancement. SUEDE combines a shared expert (always activated) to capture common features for both attack types and multiple routed experts (selectively activated) for specific attack types. Further, we integrate CLIP as the base network to ensure the shared expert benefits from prior visual knowledge and align visual-text representations in a unified space. Extensive results demonstrate SUEDE achieves superior performance compared to state-of-the-art unified detection methods.
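SUEDE 的核心结构是“恒激活的共享专家 + 按需选通的路由专家”。以下用 numpy 写一个 top-1 路由的极简 MoE 前向示意(线性专家、随机初始化、门控加权方式均为演示假设,非论文实现):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class SharedRoutedExperts:
    """共享专家(恒激活)+ 路由专家(top-1 选通)的极简 MoE 示意。"""

    def __init__(self, dim, n_experts, seed=0):
        rng = np.random.default_rng(seed)
        self.shared = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.experts = rng.normal(size=(n_experts, dim, dim)) / np.sqrt(dim)
        self.router = rng.normal(size=(dim, n_experts)) / np.sqrt(dim)

    def __call__(self, x):
        gate = softmax(x @ self.router)      # 路由分数
        k = int(np.argmax(gate))             # top-1 选中的路由专家
        # 共享专家捕获两类攻击的共性特征,路由专家捕获特定攻击类型的特征
        return x @ self.shared + gate[k] * (x @ self.experts[k])

moe = SharedRoutedExperts(dim=4, n_experts=3)
y = moe(np.ones(4))
```

共享分支保证物理/数字攻击的共性知识不被丢失,路由分支则为每类攻击保留专门容量,对应摘要中对两者互补性的论述。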
zh

[CV-61] DebGCD: Debiased Learning with Distribution Guidance for Generalized Category Discovery ICLR2025

【速读】:本文旨在解决广义类别发现(Generalized Category Discovery, GCD)的问题。在包含标注图像和未标注图像的数据集中,目标是对未标注子集中的所有图像进行分类,无论这些图像是来自已知类别还是未知类别。论文指出,在GCD中,由于缺乏未知类别的真实标签,已知类别与未知类别之间存在固有的标签偏倚。当前最先进的方法利用通过自我蒸馏训练的参数化分类器结合软标签来处理GCD问题,但未能解决此偏倚问题。此外,这些方法将所有未标注样本均匀对待,忽视了确定性水平的变化,导致学习效果不佳。同时,有效识别已知类别与未知类别之间的语义分布变化这一重要方面也被忽略。

为了解决上述挑战,本文提出了DebGCD框架,即一种带有分布引导的去偏见学习方法。首先,DebGCD在一个与GCD分类器相同的特征空间内协同训练一个辅助的去偏见分类器,逐步增强GCD特征。其次,在一个独立的特征空间中引入语义分布检测器,以隐式提升GCD的学习效能。此外,采用基于语义分布确定性的课程学习策略,以优化的速度引导去偏见学习过程。在GCD基准数据集上的全面评估表明,该框架始终表现出最先进的性能,突显其优越性。

关键在于通过引入去偏见分类器和语义分布检测器来解决标签偏倚问题,并结合课程学习策略优化学习过程。

链接: https://arxiv.org/abs/2504.04804
作者: Yuanpei Liu,Kai Han
机构: Visual AI Lab, The University of Hong Kong (香港大学视觉AI实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as a conference paper at ICLR 2025

点击查看摘要

Abstract:In this paper, we tackle the problem of Generalized Category Discovery (GCD). Given a dataset containing both labelled and unlabelled images, the objective is to categorize all images in the unlabelled subset, irrespective of whether they are from known or unknown classes. In GCD, an inherent label bias exists between known and unknown classes due to the lack of ground-truth labels for the latter. State-of-the-art methods in GCD leverage parametric classifiers trained through self-distillation with soft labels, leaving the bias issue unattended. Besides, they treat all unlabelled samples uniformly, neglecting variations in certainty levels and resulting in suboptimal learning. Moreover, the explicit identification of semantic distribution shifts between known and unknown classes, a vital aspect for effective GCD, has been neglected. To address these challenges, we introduce DebGCD, a Debiased learning with distribution guidance framework for GCD. Initially, DebGCD co-trains an auxiliary debiased classifier in the same feature space as the GCD classifier, progressively enhancing the GCD features. Moreover, we introduce a semantic distribution detector in a separate feature space to implicitly boost the learning efficacy of GCD. Additionally, we employ a curriculum learning strategy based on semantic distribution certainty to steer the debiased learning at an optimized pace. Thorough evaluations on GCD benchmarks demonstrate the consistent state-of-the-art performance of our framework, highlighting its superiority. Project page: this https URL

[CV-62] OrderChain: A General Prompting Paradigm to Improve Ordinal Understanding Ability of MLLM

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在序数回归(Ordinal Regression, OR)任务上性能不足的问题。序数回归是一种重要的分类形式,用于处理具有固有顺序关系的目标变量。论文的关键解决方案是提出了一种名为OrderChain的新颖且通用的提示范式,它通过特定性和共同性建模来提升MLLMs的序数理解能力。具体而言,OrderChain包含一组任务感知提示以促进不同OR任务的特定性建模,并引入了一种新的范围优化链式思维(Range Optimization Chain-of-Thought, RO-CoT),通过将OR任务均匀分解为多个小范围优化子任务来学习共同的思维方式。此外,还提出了类别递归划分(Category Recursive Division, CRD)方法,以生成指令候选类别提示,支持RO-CoT的自动优化。实验结果表明,使用OrderChain的大型语言与视觉助手(Large Language and Vision Assistant, LLaVA)模型在多种OR数据集上的性能显著优于基线模型,例如在Adience数据集上的年龄估计任务从47.5%提升至93.2%,并在Diabetic Retinopathy数据集上从30.0%提升至85.7%。此外,OrderChain使LLaVA在Adience数据集上相比最先进的方法提高了27%的准确率和0.24的MAE。据我们所知,这是首个增强MLLMs进行OR任务的工作,并在广泛的OR数据集上验证了其有效性。

链接: https://arxiv.org/abs/2504.04801
作者: Jinhong Wang,Shuo Tong,Jian liu,Dongqi Tang,Weiqiang Wang,Wentong Li,Hongxia Xu,Danny Chen,Jintai Chen,Jian Wu
机构: ZJU(浙江大学); Ant Group(蚂蚁集团); University of Notre Dame (圣母大学); HKUST (Guangzhou) (香港科技大学(广州));
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite the remarkable progress of multimodal large language models (MLLMs), they continue to face challenges in achieving competitive performance on ordinal regression (OR; a.k.a. ordinal classification). To address this issue, this paper presents OrderChain, a novel and general prompting paradigm that improves the ordinal understanding ability of MLLMs by specificity and commonality modeling. Specifically, our OrderChain consists of a set of task-aware prompts to facilitate the specificity modeling of diverse OR tasks and a new range optimization Chain-of-Thought (RO-CoT), which learns a commonality way of thinking about OR tasks by uniformly decomposing them into multiple small-range optimization subtasks. Further, we propose a category recursive division (CRD) method to generate instruction candidate category prompts to support RO-CoT automatic optimization. Comprehensive experiments show that a Large Language and Vision Assistant (LLaVA) model with our OrderChain improves baseline LLaVA significantly on diverse OR datasets, e.g., from 47.5% to 93.2% accuracy on the Adience dataset for age estimation, and from 30.0% to 85.7% accuracy on the Diabetic Retinopathy dataset. Notably, LLaVA with our OrderChain also remarkably outperforms state-of-the-art methods by 27% on accuracy and 0.24 on MAE on the Adience dataset. To our best knowledge, our OrderChain is the first work that augments MLLMs for OR tasks, and the effectiveness is witnessed across a spectrum of OR datasets.
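摘要中“将OR任务均匀分解为多个小范围优化子任务”的类别递归划分(CRD),其基本思想可以用一个朴素的递归二分来示意(划分粒度 max_range 为假设参数,并非论文官方算法):

```python
def recursive_divide(categories, max_range=3):
    """把有序类别递归二分,直到每个子范围不超过 max_range 个类别,
    从而将一个大范围的序数分类问题化为若干小范围子任务。"""
    if len(categories) <= max_range:
        return [categories]
    mid = len(categories) // 2
    return (recursive_divide(categories[:mid], max_range)
            + recursive_divide(categories[mid:], max_range))

# 例:把 10 个有序类别(如年龄段)划分为若干不超过 3 个类别的小范围
ranges = recursive_divide(list(range(10)), max_range=3)
```

划分得到的每个小范围对应一个子任务提示,由 RO-CoT 逐级缩小预测区间。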

[CV-63] Dynamic Vision Mamba

【速读】:该论文旨在解决Mamba-based视觉模型中存在的空间冗余问题,具体表现为令牌(token)冗余和块(block)冗余。为了解决令牌冗余,作者分析发现早期的令牌剪枝方法会导致训练与推理之间不一致或在推理过程中引入额外计算。为此,他们通过在将剪枝后的序列输入到下一个Mamba块之前重新排列序列来定制化令牌剪枝以适应Mamba结构。针对块冗余,基于观察到Mamba-based视觉模型的推理速度受SSM块数量显著影响,允许每张图像动态选择SSM块。提出的Dynamic Vision Mamba (DyVM) 方法有效减少了浮点运算次数(FLOPs),同时仅造成微小的性能下降,在Vim-S模型上实现了35.2%的FLOPs减少,且精度损失仅为1.7%。此方法还能够在不同Mamba视觉模型架构和不同的视觉任务中良好泛化。关键在于提出了一种既能减少计算量又能保持较好性能的动态机制。

链接: https://arxiv.org/abs/2504.04787
作者: Mengxuan Wu,Zekai Li,Zhiyuan Liang,Moyang Li,Xuanlei Zhao,Samir Khaki,Zheng Zhu,Xiaojiang Peng,Konstantinos N. Plataniotis,Kai Wang,Wangbo Zhao,Yang You
机构: NUS (新加坡国立大学); ETH (瑞士联邦理工学院); University of Toronto (多伦多大学); Tsinghua University (清华大学); Shenzhen Technology University (深圳技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mamba-based vision models have gained extensive attention as a result of being computationally more efficient than attention-based models. However, spatial redundancy still exists in these models, represented by token and block redundancy. For token redundancy, we analytically find that early token pruning methods will result in inconsistency between training and inference or introduce extra computation for inference. Therefore, we customize token pruning to fit the Mamba structure by rearranging the pruned sequence before feeding it into the next Mamba block. For block redundancy, we allow each image to select SSM blocks dynamically based on an empirical observation that the inference speed of Mamba-based vision models is largely affected by the number of SSM blocks. Our proposed method, Dynamic Vision Mamba (DyVM), effectively reduces FLOPs with minor performance drops. We achieve a reduction of 35.2% FLOPs with only a loss of accuracy of 1.7% on Vim-S. It also generalizes well across different Mamba vision model architectures and different vision tasks. Our code will be made public.
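摘要中“在送入下一个 Mamba 块之前重新排列剪枝后的序列”这一步可以简化示意如下(token 重要性得分与保留比例均为假设;关键点是剪枝后按原始顺序保序,因为 Mamba 这类序列模型对 token 顺序敏感):

```python
import numpy as np

def prune_and_rearrange(tokens, scores, keep_ratio=0.7):
    """按得分保留 top-k 个 token,并恢复其原始序列顺序后再送入下一个块。"""
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    keep_idx = np.argsort(scores)[-k:]   # 得分最高的 k 个 token
    keep_idx = np.sort(keep_idx)         # 按原始序列顺序重新排列
    return tokens[keep_idx], keep_idx

tokens = np.arange(20, dtype=float).reshape(10, 2)   # 10 个 token,每个 2 维
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.05, 0.6, 0.15, 0.5])
pruned, idx = prune_and_rearrange(tokens, scores, keep_ratio=0.5)
```

若不做第二步排序,剪枝后的 token 会按得分而非原始位置排列,破坏 SSM 依赖的序列结构。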

[CV-64] Disentangling Instruction Influence in Diffusion Transformers for Parallel Multi-Instruction-Guided Image Editing

【速读】:该论文旨在解决在基于指令引导的图像编辑任务中,同时执行多个指令时累积误差导致的质量下降以及因指令冲突引起的编辑不完整的问题。论文的关键解决方案是提出了一种名为“指令影响解耦(Instruction Influence Disentanglement, IID)”的新框架,该框架专为基于Diffusion Transformer (DiT) 的模型设计,能够在单一去噪过程中实现多指令的并行执行。IID通过分析DiT中的自注意力机制,识别出多指令设置下的独特注意力模式,并推导出针对每个指令的特定注意力掩码以解耦各指令的影响。这些掩码指导编辑过程确保局部化修改的同时保持未编辑区域的一致性。实验结果表明,与现有基线相比,IID在减少扩散步数的同时提高了保真度和指令完成度。

链接: https://arxiv.org/abs/2504.04784
作者: Hui Liu,Bin Zou,Suiyun Zhang,Kecheng Chen,Rui Liu,Haoliang Li
机构: City University of Hong Kong (香港城市大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures

点击查看摘要

Abstract:Instruction-guided image editing enables users to specify modifications using natural language, offering more flexibility and control. Among existing frameworks, Diffusion Transformers (DiTs) outperform U-Net-based diffusion models in scalability and performance. However, while real-world scenarios often require concurrent execution of multiple instructions, step-by-step editing suffers from accumulated errors and degraded quality, and integrating multiple instructions with a single prompt usually results in incomplete edits due to instruction conflicts. We propose Instruction Influence Disentanglement (IID), a novel framework enabling parallel execution of multiple instructions in a single denoising process, designed for DiT-based models. By analyzing self-attention mechanisms in DiTs, we identify distinctive attention patterns in multi-instruction settings and derive instruction-specific attention masks to disentangle each instruction’s influence. These masks guide the editing process to ensure localized modifications while preserving consistency in non-edited regions. Extensive experiments on open-source and custom datasets demonstrate that IID reduces diffusion steps while improving fidelity and instruction completion compared to existing baselines. The codes will be publicly released upon the acceptance of the paper.
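摘要中“推导指令特定的注意力掩码以解耦各指令影响”的核心操作,本质上是带掩码的注意力计算。下面给出一个与具体模型无关的通用示意(掩码在此手工指定;论文中它由 DiT 自注意力模式自动推导):

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """带掩码的注意力:mask 为 0 的位置在 softmax 前被置为极小值,
    从而把某条指令的影响限制在其对应的 token 区域内(简化示意)。"""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask > 0, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

q = np.ones((2, 4))
k = np.zeros((3, 4))
v = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
mask = np.array([[1, 0, 0], [1, 0, 0]])   # 两个查询都只允许关注第 0 个 key
out = masked_attention(q, k, v, mask)
```

被掩码屏蔽的 key 权重趋于 0,输出完全由允许关注的区域决定,这正是“局部化修改、非编辑区域保持一致”的机制所在。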

[CV-65] OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance CVPR2025

【速读】:该论文旨在解决现有大规模视觉-语言多模态模型在理解被遮挡物体方面的不足。当前最先进的多模态大模型难以通过通用视觉编码器和监督学习策略提供令人满意的结果。为了解决这一问题,论文提出了OCC-MLLM-CoT-Alpha框架,其关键在于结合了3D感知监督和链式思维(Chain-of-Thoughts, CoT)引导机制。具体而言,该框架由一个包含大型多模态视觉-语言模型和3D重建专家模型的多模态视觉-语言模型体系结构组成,并通过监督与强化训练策略联合学习对应的多模态链式思维,从而增强模型对遮挡物体的识别能力。此外,构建了一个包含11万样本的大规模多模态链式思维推理数据集,用于训练和验证。实验结果显示,所提出的方法在多种先进模型的两种设置下显著提升了决策分数。

链接: https://arxiv.org/abs/2504.04781
作者: Chaoyi Wang,Baoqing Li,Xinhan Di
机构: Shanghai Institute of Microsystem and Information Technology, CAS, China (上海微系统与信息技术研究所, 中科院, 中国); Giant Network, China (巨人网络, 中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been accepted to the Multimodal Algorithmic Reasoning (MAR) Workshop at CVPR 2025

点击查看摘要

Abstract:Comprehending occluded objects are not well studied in existing large-scale visual-language multi-modal models. Current state-of-the-art multi-modal large models struggles to provide satisfactory results in understanding occluded objects through universal visual encoders and supervised learning strategies. Therefore, we propose OCC-MLLM-CoT-Alpha, a multi-modal large vision language framework that integrates 3D-aware supervision and Chain-of-Thoughts guidance. Particularly, (1) we build a multi-modal large vision-language model framework which is consisted of a large multi-modal vision-language model and a 3D reconstruction expert model. (2) the corresponding multi-modal Chain-of-Thoughts is learned through a combination of supervised and reinforcement training strategies, allowing the multi-modal vision-language model to enhance the recognition ability with learned multi-modal chain-of-thoughts guidance. (3) A large-scale multi-modal chain-of-thoughts reasoning dataset, consisting of 110k samples of occluded objects held in hand, is built. In the evaluation, the proposed methods demonstrate decision score improvement of 15.75%,15.30%,16.98%,14.62%, and 4.42%,3.63%,6.94%,10.70% for two settings of a variety of state-of-the-art models.

[CV-66] Bottom-Up Scattering Information Perception Network for SAR target recognition

【速读】:该论文旨在解决现有基于深度学习的合成孔径雷达(SAR)图像目标识别方法在感知和挖掘目标散射信息方面的不足,这些问题导致算法性能瓶颈和鲁棒性较差。为了解决这一挑战,论文提出了一种新颖的自下而上的散射信息感知网络,用于更可解释的目标识别。其关键在于构建专有的SAR图像解释网络:首先通过提出的局部散射感知感知器替代基于CNN的主干特征提取器,以深入挖掘目标的底层散射信息;其次,提出一种无监督的散射部分特征提取模型,以稳健地表征目标散射部分信息并提供细粒度的目标表示;最后,通过聚合目标部分的知识形成完整的目标描述,从而提升模型的可解释性和判别能力。

链接: https://arxiv.org/abs/2504.04780
作者: Chenxi Zhao,Daochang Wang,Siqian Zhang,Gangyao Kuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning methods based synthetic aperture radar (SAR) image target recognition tasks have been widely studied currently. The existing deep methods are insufficient to perceive and mine the scattering information of SAR images, resulting in performance bottlenecks and poor robustness of the algorithms. To this end, this paper proposes a novel bottom-up scattering information perception network for more interpretable target recognition by constructing the proprietary interpretation network for SAR images. Firstly, the localized scattering perceptron is proposed to replace the backbone feature extractor based on CNN networks to deeply mine the underlying scattering information of the target. Then, an unsupervised scattering part feature extraction model is proposed to robustly characterize the target scattering part information and provide fine-grained target representation. Finally, by aggregating the knowledge of target parts to form the complete target description, the interpretability and discriminative ability of the model is improved. We perform experiments on the FAST-Vehicle dataset and the SAR-ACD dataset to validate the performance of the proposed method.

[CV-67] Enhancing Leaf Disease Classification Using GAT-GCN Hybrid Model

【速读】:该论文旨在解决农业中作物疾病高效低干预识别的需求问题,特别是在创新农业实践普及背景下疾病风险增加的挑战。解决方案的关键在于提出了一种结合图注意力网络(Graph Attention Networks, GATs)和图卷积网络(Graph Convolution Networks, GCNs)的混合模型用于叶片疾病分类。此方法通过超级像素分割实现高效的特征提取,并引入边缘增强技术提升模型鲁棒性,同时采用权重初始化优化训练过程,从而显著提高了疾病检测的泛化能力与准确性。

链接: https://arxiv.org/abs/2504.04764
作者: Shyam Sundhar,Riya Sharma,Priyansh Maheshwari,Suvidha Rupesh Kumar,T. Sunil Kumar
机构: Vellore Institute of Technology (维洛尔理工学院); University of Gavle (耶夫勒大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agriculture plays a critical role in the global economy, providing livelihoods and ensuring food security for billions. As innovative agricultural practices become more widespread, the risk of crop diseases has increased, highlighting the urgent need for efficient, low-intervention disease identification methods. This research presents a hybrid model combining Graph Attention Networks (GATs) and Graph Convolution Networks (GCNs) for leaf disease classification. GCNs have been widely used for learning from graph-structured data, and GATs enhance this by incorporating attention mechanisms to focus on the most important neighbors. The methodology integrates superpixel segmentation for efficient feature extraction, partitioning images into meaningful, homogeneous regions that better capture localized features. The authors have employed an edge augmentation technique to enhance the robustness of the model. The edge augmentation technique has introduced a significant degree of generalization in the detection capabilities of the model. To further optimize training, weight initialization techniques are applied. The hybrid model is evaluated against the individual performance of the GCN and GAT models and the hybrid model achieved a precision of 0.9822, recall of 0.9818, and F1-score of 0.9818 in apple leaf disease classification, a precision of 0.9746, recall of 0.9744, and F1-score of 0.9743 in potato leaf disease classification, and a precision of 0.8801, recall of 0.8801, and F1-score of 0.8799 in sugarcane leaf disease classification. These results demonstrate the robustness and performance of the model, suggesting its potential to support sustainable agricultural practices through precise and effective disease detection. This work is a small step towards reducing the loss of crops and hence supporting sustainable goals of zero hunger and life on land.
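混合模型中 GCN 分支的单层传播采用标准的对称归一化形式 H = σ(D^{-1/2}(A+I)D^{-1/2} X W);GAT 分支则以学习到的注意力系数替代这里的固定归一化。以下为通用 GCN 公式的 NumPy 示意(邻接矩阵与特征均为玩具数据,非论文实现):

```python
import numpy as np

def gcn_layer(A, X, W):
    """单层 GCN 传播:H = ReLU(D^{-1/2} (A+I) D^{-1/2} X W)。
    A 为图(如超像素图)的邻接矩阵,X 为节点特征,W 为可学习权重。"""
    A_hat = A + np.eye(A.shape[0])               # 加自环
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))       # D^{-1/2}
    H = D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W  # 归一化邻域聚合
    return np.maximum(H, 0)                      # ReLU

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3 节点链式图
X = np.eye(3)
W = np.eye(3)
H = gcn_layer(A, X, W)
```

对比之下,GAT 的每条边权重由注意力打分(依赖节点特征)给出,因此能自适应地强调最重要的邻居。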

[CV-68] Continuous Locomotive Crowd Behavior Generation CVPR2025

【速读】:该论文试图解决在心理学、机器人学、交通工程和虚拟环境等领域中,现有方法难以生成连续且逼真的群体行为轨迹的问题。传统方法多专注于合成瞬间场景,而无法有效复制真实世界中人群的持续动态特性。为解决这一挑战,论文提出了一种新颖的方法,通过设计一个包含发射器模型(crowd emitter model)与模拟器(simulator)的框架来实现具有异构行为及个体间交互的连续、逼真人群轨迹的自动生成。关键在于首先从单张输入图像中提取空间布局信息(如分割图、外观图、人口密度图和人口概率图),随后发射器借助扩散模型在时间轴上持续放置个体,并为其分配独立的行为特征(如类型、步速及起点/终点位置)。模拟器再基于马尔可夫链扩展行为以生成长期运动,最终交替使用发射器和模拟器完成复杂场景的人群填充。所有组件均具备用户可控性,并提出了一个基准协议用于评估生成人群在场景级动态和个体级轨迹精度上的真实性与质量。

链接: https://arxiv.org/abs/2504.04756
作者: Inhwan Bae,Junoh Lee,Hae-Gon Jeon
机构: Gwangju Institute of Science and Technology (光州科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Accepted at CVPR 2025. Project page: this https URL

点击查看摘要

Abstract:Modeling and reproducing crowd behaviors are important in various domains including psychology, robotics, transport engineering and virtual environments. Conventional methods have focused on synthesizing momentary scenes, which have difficulty in replicating the continuous nature of real-world crowds. In this paper, we introduce a novel method for automatically generating continuous, realistic crowd trajectories with heterogeneous behaviors and interactions among individuals. We first design a crowd emitter model. To do this, we obtain spatial layouts from single input images, including a segmentation map, appearance map, population density map and population probability, prior to crowd generation. The emitter then continually places individuals on the timeline by assigning independent behavior characteristics such as agents’ type, pace, and start/end positions using diffusion models. Next, our crowd simulator produces their long-term locomotions. To simulate diverse actions, it can augment their behaviors based on a Markov chain. As a result, our overall framework populates the scenes with heterogeneous crowd behaviors by alternating between the proposed emitter and simulator. Note that all the components in the proposed framework are user-controllable. Lastly, we propose a benchmark protocol to evaluate the realism and quality of the generated crowds in terms of the scene-level population dynamics and the individual-level trajectory accuracy. We demonstrate that our approach effectively models diverse crowd behavior patterns and generalizes well across different geographical environments. Code is publicly available at this https URL .
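“模拟器基于马尔可夫链扩展(增广)个体行为”可以用离散状态转移的小例子示意(状态集合与转移矩阵均为假设的玩具数据,仅用于说明机制):

```python
import numpy as np

def simulate_behaviors(P, states, start, steps, rng):
    """用马尔可夫链扩展个体行为:P[i, j] 为从状态 i 转移到状态 j 的概率。"""
    seq = [start]
    s = start
    for _ in range(steps):
        s = rng.choice(len(states), p=P[s])   # 按当前状态的转移分布采样下一状态
        seq.append(s)
    return [states[i] for i in seq]

states = ["walk", "stand", "run"]
P = np.array([[0.7, 0.2, 0.1],
              [0.5, 0.5, 0.0],
              [0.6, 0.0, 0.4]])
rng = np.random.default_rng(0)
traj = simulate_behaviors(P, states, start=0, steps=10, rng=rng)
```

每个个体持有独立的行为链,长期运动即由这类状态序列驱动的连续轨迹组成。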

[CV-69] CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images CVPR2025

【速读】:该论文旨在解决从非约束的真实世界CAD图像(由不同经验的用户轻松捕获)逆向工程生成参数化CAD模型的问题。当前方法通常依赖于昂贵且耗时的3D扫描及后处理,而直接训练模型面临真实世界CAD数据稀缺的挑战。为应对这些挑战,论文提出CADCrafter框架,该框架仅使用合成无纹理CAD数据进行训练,同时在真实世界图像上进行测试。解决方案的关键在于引入了几何编码器以准确捕捉多样化的几何特征,并利用几何特征的纹理不变性提升泛化能力;此外,通过直接偏好优化(DPO)结合CAD序列质量的自动代码检查反馈,施加几何有效性约束,弥补了CAD参数序列编译为显式CAD模型过程中缺乏显式几何监督的问题。最后,论文构建了一个包含多视角图像与相应CAD命令序列对的真实世界数据集来验证方法的有效性。

链接: https://arxiv.org/abs/2504.04753
作者: Cheng Chen,Jiacheng Wei,Tianrun Chen,Chi Zhang,Xiaofeng Yang,Shangzhan Zhang,Bingchen Yang,Chuan-Sheng Foo,Guosheng Lin,Qixing Huang,Fayao Liu
机构: Nanyang Technological University (南洋理工大学); Institute for Infocomm Research, ASTAR (信息通信研究院, 新加坡科技研究局), Singapore; KOKONI3D, Moxin (Huzhou) Technology Co., LTD. (可柯尼3D, 摩新(湖州)科技有限公司); The University of Texas at Austin (德克萨斯大学奥斯汀分校); Westlake University (西湖大学); Centre for Frontier AI Research, ASTAR (前沿人工智能研究中心, 新加坡科技研究局), Singapore; Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2025

点击查看摘要

Abstract:Creating CAD digital twins from the physical world is crucial for manufacturing, design, and simulation. However, current methods typically rely on costly 3D scanning with labor-intensive post-processing. To provide a user-friendly design process, we explore the problem of reverse engineering from unconstrained real-world CAD images that can be easily captured by users of all experiences. However, the scarcity of real-world CAD data poses challenges in directly training such models. To tackle these challenges, we propose CADCrafter, an image-to-parametric CAD model generation framework that trains solely on synthetic textureless CAD data while testing on real-world images. To bridge the significant representation disparity between images and parametric CAD models, we introduce a geometry encoder to accurately capture diverse geometric features. Moreover, the texture-invariant properties of the geometric features can also facilitate the generalization to real-world scenarios. Since compiling CAD parameter sequences into explicit CAD models is a non-differentiable process, the network training inherently lacks explicit geometric supervision. To impose geometric validity constraints, we employ direct preference optimization (DPO) to fine-tune our model with the automatic code checker feedback on CAD sequence quality. Furthermore, we collected a real-world dataset, comprised of multi-view images and corresponding CAD command sequence pairs, to evaluate our method. Experimental results demonstrate that our approach can robustly handle real unconstrained CAD images, and even generalize to unseen general objects.
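论文用 DPO 微调以施加几何有效性约束,其逐样本损失为 L = -log σ(β[(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]),其中 y_w/y_l 分别是代码检查器判定质量更高/更低的 CAD 序列。下面是该公式的直接数值示意(各对数概率与 β 均为示例值):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO 损失:偏好序列相对参考模型的隐式奖励差越大,损失越小。"""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)

# chosen 序列明显优于 rejected 序列时,损失较小
loss_pref = dpo_loss(-1.0, -5.0, -2.0, -2.0)
# 两者与参考模型差异相同(无偏好信号)时,损失退化为 -log σ(0) = log 2
loss_tie = dpo_loss(-3.0, -3.0, -2.0, -2.0)
```

在本文场景中,偏好信号来自自动代码检查器对 CAD 序列编译质量的反馈,替代了显式几何监督。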

[CV-70] Two is Better than One: Efficient Ensemble Defense for Robust and Compact Models CVPR2025

【速读】:该论文旨在解决深度学习驱动的计算机视觉系统在资源受限的移动和边缘设备上部署性能不佳的问题。同时,针对已有模型压缩技术(如剪枝、量化和矩阵分解)导致的对抗攻击高脆弱性,提出了解决方案。论文的关键创新在于引入了高效集成防御(Efficient Ensemble Defense, EED)技术,通过基于不同剪枝重要性分数多样化压缩单一基础模型,并增强集成多样性以实现高对抗鲁棒性和资源效率。EED 在推理阶段动态确定必要子模型的数量,在保持高鲁棒性的同时最小化不必要的计算开销。在 CIFAR-10 和 SVHN 数据集上的实验表明,EED 在对抗鲁棒性方面达到当前最先进的性能,并提升了高达 1.86 倍的推理速度,证明其在资源受限环境中的强大实用性。

链接: https://arxiv.org/abs/2504.04747
作者: Yoojin Jung,Byung Cheol Song
机构: Department of Electrical and Computer Engineering, Inha University (电气与计算机工程系, 仁荷大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2025

点击查看摘要

Abstract:Deep learning-based computer vision systems adopt complex and large architectures to improve performance, yet they face challenges in deployment on resource-constrained mobile and edge devices. To address this issue, model compression techniques such as pruning, quantization, and matrix factorization have been proposed; however, these compressed models are often highly vulnerable to adversarial attacks. We introduce the Efficient Ensemble Defense (EED) technique, which diversifies the compression of a single base model based on different pruning importance scores and enhances ensemble diversity to achieve high adversarial robustness and resource efficiency. EED dynamically determines the number of necessary sub-models during the inference stage, minimizing unnecessary computations while maintaining high robustness. On the CIFAR-10 and SVHN datasets, EED demonstrated state-of-the-art robustness performance compared to existing adversarial pruning techniques, along with an inference speed improvement of up to 1.86 times. This proves that EED is a powerful defense solution in resource-constrained environments.
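“推理阶段动态确定必要子模型数量”的一种可能实现,是逐个累加子模型预测并按置信度提前停止(阈值、判停准则与各子模型概率均为假设的示意,论文的具体准则可能不同):

```python
import numpy as np

def dynamic_ensemble(prob_list, conf_threshold=0.9):
    """逐个累加子模型的类别概率,平均置信度一旦超过阈值即提前停止,
    省去剩余子模型的计算。返回 (预测类别, 实际使用的子模型数)。"""
    acc = np.zeros_like(prob_list[0])
    for n, p in enumerate(prob_list, start=1):
        acc += p
        avg = acc / n
        if avg.max() >= conf_threshold:   # 置信度足够高,提前退出
            return avg.argmax(), n
    return avg.argmax(), n                # 用满所有子模型

# 三个剪枝子模型对同一输入给出的类别概率(假设数据)
probs = [np.array([0.95, 0.03, 0.02]),
         np.array([0.90, 0.05, 0.05]),
         np.array([0.60, 0.30, 0.10])]
pred, used = dynamic_ensemble(probs)
```

对容易的样本只需激活少量子模型,这正是 EED 在保持鲁棒性的同时获得推理加速的来源。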

[CV-71] Grounding 3D Object Affordance with Language Instructions Visual Observations and Interactions CVPR2025

【速读】:该论文旨在解决基于语言指令、视觉观测和交互来定位三维物体操作性(affordance)的问题,这一任务受到认知科学的启发。论文的关键在于提出了一种名为LMAffordance3D的多模态、语言引导的三维物体操作性定位网络,通过融合二维与三维空间特征及语义特征,实现对物体操作性的精准定位。为支持该任务,作者构建了一个包含点云、图像和语言指令的Affordance Grounding数据集(AGPIL),以应对因观察角度、物体旋转或空间遮挡导致的部分观测挑战。实验结果验证了所提方法在该任务上的有效性和优越性。

链接: https://arxiv.org/abs/2504.04744
作者: He Zhu,Quyu Kong,Kechun Xu,Xunlong Xia,Bing Deng,Jieping Ye,Rong Xiong,Yue Wang
机构: Zhejiang University (浙江大学); Alibaba Cloud (阿里云)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: CVPR 2025

点击查看摘要

Abstract:Grounding 3D object affordance is a task that locates objects in 3D space where they can be manipulated, which links perception and action for embodied intelligence. For example, for an intelligent robot, it is necessary to accurately ground the affordance of an object and grasp it according to human instructions. In this paper, we introduce a novel task that grounds 3D object affordance based on language instructions, visual observations and interactions, which is inspired by cognitive science. We collect an Affordance Grounding dataset with Points, Images and Language instructions (AGPIL) to support the proposed task. In the 3D physical world, due to observation orientation, object rotation, or spatial occlusion, we can only get a partial observation of the object. So this dataset includes affordance estimations of objects from full-view, partial-view, and rotation-view perspectives. To accomplish this task, we propose LMAffordance3D, the first multi-modal, language-guided 3D affordance grounding network, which applies a vision-language model to fuse 2D and 3D spatial features with semantic features. Comprehensive experiments on AGPIL demonstrate the effectiveness and superiority of our method on this task, even in unseen experimental settings. Our project is available at this https URL.

[CV-72] AnyArtisticGlyph: Multilingual Controllable Artistic Glyph Generation

【速读】:该论文旨在解决现有艺术字体图像生成方法在细节合成中容易出现模糊或错误纹理的问题,这些问题限制了生成结果的自然度与准确性。为应对这一挑战,论文提出了一种基于扩散模型的多语言可控艺术字体生成方法AnyArtisticGlyph。其关键在于引入了字体融合与嵌入模块(用于生成详细的结构特征)以及视觉-文本融合与嵌入模块(利用CLIP模型编码参考图像,并结合变换描述符实现全局图像的平滑生成),同时通过粗粒度特征级损失函数进一步提升生成精度。实验表明,该方法能够生成自然且细节丰富的艺术字体图像,在性能上达到当前最佳水平。

链接: https://arxiv.org/abs/2504.04743
作者: Xiongbo Lu,Yaxiong Chen,Shengwu Xiong
机构: School of Computer Science and Artificial Intelligence, Wuhan University of Technology (武汉理工大学计算机科学与人工智能学院), China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Artistic Glyph Image Generation (AGIG) differs from current creativity-focused generation models by offering finely controllable deterministic generation. It transfers the style of a reference image to a source while preserving its content. Although advanced and promising, current methods may reveal flaws when scrutinizing synthesized image details, often producing blurred or incorrect textures, posing a significant challenge. Hence, we introduce AnyArtisticGlyph, a diffusion-based, multilingual controllable artistic glyph generation model. It includes a font fusion and embedding module, which generates latent features for detailed structure creation, and a vision-text fusion and embedding module that uses the CLIP model to encode references and blends them with transformation caption embeddings for seamless global image generation. Moreover, we incorporate a coarse-grained feature-level loss to enhance generation accuracy. Experiments show that it produces natural, detailed artistic glyph images with state-of-the-art performance. Our project will be open-sourced on this https URL to advance text generation technology.

[CV-73] Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data

【速读】:该论文试图解决多模态大型语言模型(Multimodal Large Language Models, MLLMs)在场景组成性理解上的困难,即正确识别场景为原子视觉概念的组合。尽管最先进的模型如GPT-4o在区分类似“狗追逐猫”与“猫追逐狗”的场景方面已取得进展,但在Winoground等基准测试中,它们的表现仍远低于人类水平。论文的关键解决方案是通过数据阐明这些视觉概念,训练模型更倾向于选择图像的正确描述而非接近但错误的描述。为此,作者提出了SCRAMBLe(Synthetic Compositional Reasoning Augmentation of MLLMs with Binary preference Learning),一种基于自动生成的偏好数据对开放权重MLLMs进行偏好调优的方法。该方法从多个视觉-语言组成性基准测试中显著提升了模型的组成性推理能力,并在一般视觉问答任务中也取得了小幅但显著的改进。

链接: https://arxiv.org/abs/2504.04740
作者: Samarth Mishra,Kate Saenko,Venkatesh Saligrama
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Compositionality, or correctly recognizing scenes as compositions of atomic visual concepts, remains difficult for multimodal large language models (MLLMs). Even state of the art MLLMs such as GPT-4o can make mistakes in distinguishing compositions like “dog chasing cat” vs “cat chasing dog”. While on Winoground, a benchmark for measuring such reasoning, MLLMs have made significant progress, they are still far from a human’s performance. We show that compositional reasoning in these models can be improved by elucidating such concepts via data, where a model is trained to prefer the correct caption for an image over a close but incorrect one. We introduce SCRAMBLe: Synthetic Compositional Reasoning Augmentation of MLLMs with Binary preference Learning, an approach for preference tuning open-weight MLLMs on synthetic preference data generated in a fully automated manner from existing image-caption data. SCRAMBLe holistically improves these MLLMs’ compositional reasoning capabilities which we can see through significant improvements across multiple vision language compositionality benchmarks, as well as smaller but significant improvements on general question answering tasks. As a sneak peek, SCRAMBLe tuned Molmo-7B model improves on Winoground from 49.5% to 54.8% (best reported to date), while improving by ~1% on more general visual question answering tasks. Code for SCRAMBLe along with tuned models and our synthetic training dataset is available at this https URL.

[CV-74] Inverse: Vision-Centric 3D Semantic Occupancy Prediction Assisted with 3D Object Detection

【速读】:本文旨在解决自动驾驶车辆(Autonomous Vehicles, AVs)利用车载环视相机预测周围环境的详细几何与语义信息(即3D语义占用预测)的问题。现有方法主要通过设计复杂的内部模块(如高效的特征采样与聚合过程或中间特征表示格式)来提升模型性能。本文的关键解决方案是引入多任务学习框架,通过增加一个额外的3D监督信号,即结合辅助的3D目标检测分支,以增强中间特征捕捉场景中小型动态物体的能力。这些小型动态物体通常包括易受伤害的道路使用者(如自行车、摩托车和行人),其检测对于确保自动驾驶的安全性至关重要。实验结果表明,该方法在nuScenes数据集上达到了最先进的性能,在雨天和夜间等具有挑战性的场景中表现出色,尤其在易受伤害道路使用者(Vulnerable Road Users, VRU)的检测方面表现优异。

链接: https://arxiv.org/abs/2504.04732
作者: Zhenxing Ming,Julie Stephany Berrio,Mao Shan,Stewart Worrall
机构: Australian Centre for Robotics (ACFR) at the University of Sydney (NSW, Australia)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:3D semantic occupancy prediction aims to forecast detailed geometric and semantic information of the surrounding environment for autonomous vehicles (AVs) using onboard surround-view cameras. Existing methods primarily focus on intricate inner structure module designs to improve model performance, such as efficient feature sampling and aggregation processes or intermediate feature representation formats. In this paper, we explore multitask learning by introducing an additional 3D supervision signal by incorporating an additional 3D object detection auxiliary branch. This extra 3D supervision signal enhances the model’s overall performance by strengthening the capability of the intermediate features to capture small dynamic objects in the scene, and these small dynamic objects often include vulnerable road users, i.e. bicycles, motorcycles, and pedestrians, whose detection is crucial for ensuring driving safety in autonomous vehicles. Extensive experiments conducted on the nuScenes datasets, including challenging rainy and nighttime scenarios, showcase that our approach attains state-of-the-art results, achieving an IoU score of 31.73% and a mIoU score of 20.91% and excels at detecting vulnerable road users (VRU). The code will be made available at:this https URL

[CV-75] Exploring Kernel Transformations for Implicit Neural Representations

【速读】:本文旨在探索输入/输出的核变换(kernel transformation)对隐式神经表示(Implicit Neural Representations, INRs)性能的影响,而非关注模型内部组件(如激活函数)的变化。论文的关键在于提出了一种简单而有效的结合尺度(scale)和偏移(shift)的方法,通过核变换显著提升INRs的性能,且几乎不增加计算开销。此外,作者从网络深度和归一化两个视角解读了尺度与偏移变换带来的性能优势。这一研究为理解与改进INRs提供了新的视角和方法路径。

链接: https://arxiv.org/abs/2504.04728
作者: Sheng Zheng,Chaoning Zhang,Dongshen Han,Fachrina Dewi Puspitasari,Xinhong Hao,Yang Yang,Heng Tao Shen
机构: Center for Future Media and the School of Computer Science and Engineering, University of Electronic Science and Technology of China (电子科技大学未来媒体中心和计算机科学与工程学院); School of Mechatronical Engineering, Beijing Institute of Technology (北京理工大学机电工程学院); School of Computing, Kyung Hee University (庆熙大学计算学院); Institute of Electronic and Information Engineering, University of Electronic Science and Technology of China (电子科技大学电子与信息工程学院); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE Transactions on Multimedia (TMM) on December 20, 2024 (To appear on IEEE Website soon)

点击查看摘要

Abstract:Implicit neural representations (INRs), which leverage neural networks to represent signals by mapping coordinates to their corresponding attributes, have garnered significant attention. They are extensively utilized for image representation, with pixel coordinates as input and pixel values as output. In contrast to prior works focusing on investigating the effect of the model’s inside components (activation function, for instance), this work pioneers the exploration of the effect of kernel transformation of input/output while keeping the model itself unchanged. A byproduct of our findings is a simple yet effective method that combines scale and shift to significantly boost INR with negligible computation overhead. Moreover, we present two perspectives, depth and normalization, to interpret the performance benefits caused by scale and shift transformation. Overall, our work provides a new avenue for future works to understand and improve INR through the lens of kernel transformation.
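文中对输入施加的尺度 + 偏移核变换,最直观的例子是把像素坐标线性映射到 [-1, 1] 后再送入 INR(以下仅示意该变换本身,INR 网络部分省略):

```python
import numpy as np

def scale_shift(x, scale, shift):
    """对 INR 的输入坐标做核变换:x' = scale * x + shift。
    例如把像素坐标 [0, W-1] 映射到 [-1, 1]。"""
    return scale * x + shift

W = 256
coords = np.arange(W, dtype=float)        # 原始像素坐标 0..255
scale = 2.0 / (W - 1)
x = scale_shift(coords, scale, -1.0)      # 归一化到 [-1, 1]
```

论文的观点是:这类输入/输出端的尺度与偏移选择本身就显著影响 INR 性能,其收益可从网络深度与归一化两个角度解释。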
zh
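上述尺度与偏移核变换的作用位置,可以用一个假设性的极简示意(非论文代码)来理解:MLP 本体保持不变,仅在输入坐标端和输出像素端分别施加 `scale * x + shift`:

```python
import numpy as np

def scale_shift(x, scale, shift):
    # 核变换示意: y = scale * x + shift
    return scale * x + shift

# 一个随机初始化的极简 INR(坐标 -> RGB), 仅演示变换施加的位置
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.normal(size=(64, 3)) * 0.1, np.zeros(3)

def inr(coords, in_scale=2.0, in_shift=-1.0, out_scale=0.5, out_shift=0.5):
    # 输入端: 将 [0,1] 坐标缩放平移到 [-1,1]
    h = np.tanh(scale_shift(coords, in_scale, in_shift) @ W1 + b1)
    # 输出端: 将网络原始输出缩放平移回像素值范围
    return scale_shift(h @ W2 + b2, out_scale, out_shift)

grid = np.linspace(0.0, 1.0, 8)
coords = np.stack(np.meshgrid(grid, grid), -1).reshape(-1, 2)  # 8x8 像素坐标
rgb = inr(coords)
```

其中具体的 scale/shift 取值与网络结构均为示意性假设,论文从网络深度与归一化两个视角解释这类变换为何能带来增益。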

[CV-76] TactileNet: Bridging the Accessibility Gap with AI-Generated Tactile Graphics for Individuals with Vision Impairment

【速读】:该论文旨在解决传统方法在创建触觉图形(tactile graphics)过程中因劳动密集型导致的需求难以满足的问题。解决方案的关键在于提出TactileNet,这是一个基于文本到图像Stable Diffusion (SD) 模型的AI驱动框架,并结合Low-Rank Adaptation (LoRA) 和DreamBooth技术,通过微调SD模型生成符合触觉标准且细节忠实的高质量触觉图形,同时大幅降低计算成本。这种方法不仅实现了高精度(92.86%的标准合规性)和高保真度,还展示了可扩展性,能够生成大量多样化且定制化的触觉图形输出。

链接: https://arxiv.org/abs/2504.04722
作者: Adnan Khan,Alireza Choubineh,Mai A. Shaaban,Abbas Akkasi,Majid Komeili
机构: School of Computer Science, Carleton University (卡尔顿大学计算机学院), Ottawa, Canada; Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学), Abu Dhabi, UAE
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tactile graphics are essential for providing access to visual information for the 43 million people globally living with vision loss, as estimated by global prevalence data. However, traditional methods for creating these tactile graphics are labor-intensive and struggle to meet demand. We introduce TactileNet, the first comprehensive dataset and AI-driven framework for generating tactile graphics using text-to-image Stable Diffusion (SD) models. By integrating Low-Rank Adaptation (LoRA) and DreamBooth, our method fine-tunes SD models to produce high-fidelity, guideline-compliant tactile graphics while reducing computational costs. Evaluations involving tactile experts show that generated graphics achieve 92.86% adherence to tactile standards and 100% alignment with natural images in posture and features. Our framework also demonstrates scalability, generating 32,000 images (7,050 filtered for quality) across 66 classes, with prompt editing enabling customizable outputs (e.g., adding/removing details). Our work empowers designers to focus on refinement, significantly accelerating accessibility efforts. It underscores the transformative potential of AI for social good, offering a scalable solution to bridge the accessibility gap in education and beyond.
zh

[CV-77] On the Robustness of GUI Grounding Models Against Image Attacks

【速读】:该论文试图解决GUI接地(GUI grounding)模型在实际应用中因自然噪声和对抗性扰动导致的鲁棒性不足的问题。论文的关键在于系统性地评估了当前最先进的GUI接地模型(如UGround)在三种条件下的鲁棒性:自然噪声、非目标对抗攻击和目标对抗攻击。广泛实验表明,这些模型对对抗扰动和低分辨率条件表现出高度敏感性。研究揭示了GUI接地模型的脆弱性,并为未来提升其实际应用中的鲁棒性提供了重要基准。

链接: https://arxiv.org/abs/2504.04716
作者: Haoren Zhao,Tianyi Chen,Zhen Wang
机构: Hangzhou Dianzi University (杭州电子科技大学); Microsoft (微软); Hangzhou Dianzi University (杭州电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Graphical User Interface (GUI) grounding models are crucial for enabling intelligent agents to understand and interact with complex visual interfaces. However, these models face significant robustness challenges in real-world scenarios due to natural noise and adversarial perturbations, and their robustness remains underexplored. In this study, we systematically evaluate the robustness of state-of-the-art GUI grounding models, such as UGround, under three conditions: natural noise, untargeted adversarial attacks, and targeted adversarial attacks. Our experiments, which were conducted across a wide range of GUI environments, including mobile, desktop, and web interfaces, have clearly demonstrated that GUI grounding models exhibit a high degree of sensitivity to adversarial perturbations and low-resolution conditions. These findings provide valuable insights into the vulnerabilities of GUI grounding models and establish a strong benchmark for future research aimed at enhancing their robustness in practical applications. Our code is available at this https URL.
zh
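论文评估的"自然噪声"条件与接地命中率指标,可以用一个假设性的最小示意来说明(并非论文的评估代码,`hit_rate` 的定义为本文假设的点击点落框判定):

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    # 向 [0,1] 归一化图像加入高斯噪声并截断回有效范围, 模拟自然噪声条件
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def hit_rate(pred_points, gt_boxes):
    # GUI 接地评估示意: 预测点击点 (px,py) 是否落入目标控件框 (x1,y1,x2,y2)
    hits = [(x1 <= px <= x2 and y1 <= py <= y2)
            for (px, py), (x1, y1, x2, y2) in zip(pred_points, gt_boxes)]
    return sum(hits) / len(hits)

rng = np.random.default_rng(0)
clean = rng.random((4, 32, 32, 3))                     # 一批干净截图
noisy = add_gaussian_noise(clean, sigma=0.1, rng=rng)  # 扰动后输入模型, 比较命中率下降
```

对比模型在 `clean` 与 `noisy` 输入上的 `hit_rate` 即可量化鲁棒性退化;对抗攻击条件则需在此基础上按梯度方向构造扰动。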

[CV-78] SapiensID: Foundation for Human Recognition CVPR2025

【速读】:该论文旨在解决现有行人再识别(ReID)系统在处理姿态(pose)、可见性(visibility)及场景多样性变化时表现不佳的问题。传统方法通常依赖于独立的面部和身体分析模型,限制了其在真实世界复杂场景中的有效性。为弥合这一差距,论文提出了一种统一的模型SapiensID,其关键创新点包括:(i) 动态区域生成机制Retina Patch (RP),用于自适应调整感兴趣区域的尺度并保持一致的特征分割;(ii) 可变长度令牌学习的掩码识别模型Masked Recognition Model (MRM);以及(iii) 姿态不变特征学习模块Semantic Attention Head (SAH),通过聚合关键身体部位周围的特征实现姿态无关的表征。此外,为了支持训练,作者构建了一个大规模数据集WebBody4M,包含丰富的姿态与尺度变化。实验结果表明,SapiensID在多种行人ReID基准测试中达到最先进的性能,并在短期与长期场景中均优于专用模型,同时在跨姿态-尺度行人ReID任务中建立了强大的基线能力。

链接: https://arxiv.org/abs/2504.04708
作者: Minchul Kim,Dingqiang Ye,Yiyang Su,Feng Liu,Xiaoming Liu
机构: Department of Computer Science and Engineering, Michigan State University (密歇根州立大学); Department of Computer Science, Drexel University (德雷塞尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in CVPR2025

点击查看摘要

Abstract:Existing human recognition systems often rely on separate, specialized models for face and body analysis, limiting their effectiveness in real-world scenarios where pose, visibility, and context vary widely. This paper introduces SapiensID, a unified model that bridges this gap, achieving robust performance across diverse settings. SapiensID introduces (i) Retina Patch (RP), a dynamic patch generation scheme that adapts to subject scale and ensures consistent tokenization of regions of interest, (ii) a masked recognition model (MRM) that learns from variable token length, and (iii) Semantic Attention Head (SAH), an module that learns pose-invariant representations by pooling features around key body parts. To facilitate training, we introduce WebBody4M, a large-scale dataset capturing diverse poses and scale variations. Extensive experiments demonstrate that SapiensID achieves state-of-the-art results on various body ReID benchmarks, outperforming specialized models in both short-term and long-term scenarios while remaining competitive with dedicated face recognition systems. Furthermore, SapiensID establishes a strong baseline for the newly introduced challenge of Cross Pose-Scale ReID, demonstrating its ability to generalize to complex, real-world conditions.
zh

[CV-79] DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation CVPR2025

【速读】:该论文旨在探讨是否有必要像处理RGB图像那样通过神经网络显式编码深度信息,并提出一种新的方法来学习RGBD特征表示。关键在于设计了一种名为DFormerv2的强大RGBD编码器,它将深度图作为几何先验而非通过神经网络编码深度信息。其核心解决方案是提取深度图及图像块令牌之间的几何线索,并将其用作自注意力机制中的几何先验以分配注意力权重。实验结果表明,DFormerv2在多种RGBD语义分割基准测试中表现出色。

链接: https://arxiv.org/abs/2504.04701
作者: Bo-Wen Yin,Jiao-Long Cao,Ming-Ming Cheng,Qibin Hou
机构: Nankai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Recent advances in scene understanding benefit a lot from depth maps because of the 3D geometry information, especially in complex conditions (e.g., low light and overexposed). Existing approaches encode depth maps along with RGB images and perform feature fusion between them to enable more robust predictions. Taking into account that depth can be regarded as a geometry supplement for RGB images, a straightforward question arises: Do we really need to explicitly encode depth information with neural networks as done for RGB images? Based on this insight, in this paper, we investigate a new way to learn RGBD feature representations and present DFormerv2, a strong RGBD encoder that explicitly uses depth maps as geometry priors rather than encoding depth information with neural networks. Our goal is to extract the geometry clues from the depth and spatial distances among all the image patch tokens, which will then be used as geometry priors to allocate attention weights in self-attention. Extensive experiments demonstrate that DFormerv2 exhibits exceptional performance in various RGBD semantic segmentation benchmarks. Code is available at: this https URL.
zh
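"以深度作为几何先验分配注意力权重"的思路,可以用如下假设性示意来理解(衰减形式 `-lam * |深度差|` 为本文为演示而选的简化,并非论文的精确公式):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def geometry_self_attention(q, k, v, depth, lam=1.0):
    # depth: (N,) 各 patch token 的平均深度
    logits = q @ k.T / np.sqrt(q.shape[-1])
    # 几何先验: 深度相近的 token 获得更高的注意力偏置
    geo_prior = -lam * np.abs(depth[:, None] - depth[None, :])
    attn = softmax(logits + geo_prior)
    return attn @ v, attn

rng = np.random.default_rng(0)
N, d = 6, 8
q, k, v = (rng.normal(size=(N, d)) for _ in range(3))
depth = rng.random(N)
out, attn = geometry_self_attention(q, k, v, depth)
```

注意这里深度图完全不经过神经网络编码,只作为偏置项进入自注意力,与 DFormerv2 摒弃显式深度编码的设计动机一致。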

[CV-80] Bridging Knowledge Gap Between Image Inpainting and Large-Area Visible Watermark Removal AAAI2025

【速读】:本文旨在解决可见水印去除(包括水印清理和背景内容恢复)中的两大挑战:一是现有基于深度神经网络(Deep Neural Network, DNN)的模型在处理大面积水印时表现不佳;二是这些模型对水印掩码预测质量的依赖性过强。为克服这些问题,论文提出了一种新颖的特征适配框架,利用预训练图像修复模型的表征建模能力。关键在于通过将水印下的残余背景信息融入修复主干模型来弥合图像修复与水印去除之间的知识鸿沟,并设计了一个双分支系统以捕捉和嵌入残余背景特征,这些特征通过门控特征融合模块与修复主干模型的中间特征合并。此外,为减少对高质量水印掩码的依赖,引入了一种新的训练范式,使用粗粒度水印掩码指导推理过程,从而实现对测试阶段水印掩码质量不敏感的可见水印去除模型。实验结果表明,该方法显著优于现有的最新技术。

链接: https://arxiv.org/abs/2504.04687
作者: Yicheng Leng,Chaowei Fang,Junye Chen,Yixiang Fang,Sheng Li,Guanbin Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: To be published in AAAI 2025

点击查看摘要

Abstract:Visible watermark removal which involves watermark cleaning and background content restoration is pivotal to evaluate the resilience of watermarks. Existing deep neural network (DNN)-based models still struggle with large-area watermarks and are overly dependent on the quality of watermark mask prediction. To overcome these challenges, we introduce a novel feature adapting framework that leverages the representation modeling capacity of a pre-trained image inpainting model. Our approach bridges the knowledge gap between image inpainting and watermark removal by fusing information of the residual background content beneath watermarks into the inpainting backbone model. We establish a dual-branch system to capture and embed features from the residual background content, which are merged into intermediate features of the inpainting backbone model via gated feature fusion modules. Moreover, for relieving the dependence on high-quality watermark masks, we introduce a new training paradigm by utilizing coarse watermark masks to guide the inference process. This contributes to a visible image removal model which is insensitive to the quality of watermark mask during testing. Extensive experiments on both a large-scale synthesized dataset and a real-world dataset demonstrate that our approach significantly outperforms existing state-of-the-art methods. The source code is available in the supplementary materials.
zh

[CV-81] DeclutterNeRF: Generative-Free 3D Scene Recovery for Occlusion Removal CVPR2025

【速读】:该论文旨在解决现有新型视角合成(NVS)技术在去除遮挡(occlusion removal)时存在的两个主要问题:一是依赖生成先验的方法虽能填补空洞但容易引入新伪影和模糊;二是现有评估遮挡移除方法的数据集缺乏真实场景的复杂性和视点变化。为解决这些问题,论文提出了DeclutterSet数据集和DeclutterNeRF方法。DeclutterNeRF的关键创新在于摒弃生成先验,通过联合多视角优化可学习相机参数、遮挡退火正则化,并采用可解释的随机结构相似性损失函数,从而实现高质量且无伪影的不完整图像重建。

链接: https://arxiv.org/abs/2504.04679
作者: Wanzhou Liu,Zhexiao Xiong,Xinyu Li,Nathan Jacobs
机构: Washington University in St. Louis (圣路易斯华盛顿大学); Georgia Institute of Technology (乔治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025 4th CV4Metaverse Workshop. 15 pages, 10 figures. Code and data at: this https URL

点击查看摘要

Abstract:Recent novel view synthesis (NVS) techniques, including Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have greatly advanced 3D scene reconstruction with high-quality rendering and realistic detail recovery. Effectively removing occlusions while preserving scene details can further enhance the robustness and applicability of these techniques. However, existing approaches for object and occlusion removal predominantly rely on generative priors, which, despite filling the resulting holes, introduce new artifacts and blurriness. Moreover, existing benchmark datasets for evaluating occlusion removal methods lack realistic complexity and viewpoint variations. To address these issues, we introduce DeclutterSet, a novel dataset featuring diverse scenes with pronounced occlusions distributed across foreground, midground, and background, exhibiting substantial relative motion across viewpoints. We further introduce DeclutterNeRF, an occlusion removal method free from generative priors. DeclutterNeRF introduces joint multi-view optimization of learnable camera parameters, occlusion annealing regularization, and employs an explainable stochastic structural similarity loss, ensuring high-quality, artifact-free reconstructions from incomplete images. Experiments demonstrate that DeclutterNeRF significantly outperforms state-of-the-art methods on our proposed DeclutterSet, establishing a strong baseline for future research.
zh

[CV-82] Dual Consistent Constraint via Disentangled Consistency and Complementarity for Multi-view Clustering

【速读】:该论文旨在解决多视图聚类中现有方法仅关注表征一致性学习而忽视各视图互补性贡献的问题,这限制了多视图表征学习的效果。为应对这一挑战,论文提出了一种新颖的多视图聚类框架,引入解耦变分自编码器来分离多视图的共享信息(一致性信息)与私有信息(互补性信息)。关键解决方案在于首先通过对比学习最大化不同视图间的互信息以学习一致且信息丰富的表征,随后利用一致性推理约束显式利用互补信息,在保证跨所有视图共享信息一致性的过程中实现互补信息的有效整合。具体而言,通过视图内的重建和跨视图的共享信息重建施加双重一致性约束,不仅提升了数据表征质量,还易于扩展至其他场景,尤其是复杂多视图场景中。这一方法首次尝试在统一的多视图聚类理论框架下采用双重一致性约束。

链接: https://arxiv.org/abs/2504.04676
作者: Bo Li,Jing Yun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-view clustering can explore common semantics from multiple views and has received increasing attention in recent years. However, current methods focus on learning consistency in representation, neglecting the contribution of each view’s complementarity aspect in representation learning. This limit poses a significant challenge in multi-view representation learning. This paper proposes a novel multi-view clustering framework that introduces a disentangled variational autoencoder that separates multi-view into shared and private information, i.e., consistency and complementarity information. We first learn informative and consistent representations by maximizing mutual information across different views through contrastive learning. This process will ignore complementary information. Then, we employ consistency inference constraints to explicitly utilize complementary information when attempting to seek the consistency of shared information across all views. Specifically, we perform a within-reconstruction using the private and shared information of each view and a cross-reconstruction using the shared information of all views. The dual consistency constraints are not only effective in improving the representation quality of data but also easy to extend to other scenarios, especially in complex multi-view scenes. This could be the first attempt to employ dual consistent constraint in a unified MVC theoretical framework. During the training procedure, the consistency and complementarity features are jointly optimized. Extensive experiments show that our method outperforms baseline methods.
zh
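论文的双重一致性约束(视图内重建 + 跨视图重建)可以用一个假设性的线性玩具实现来示意(共享信息聚合方式与解码器形式均为本文为演示而设的简化):

```python
import numpy as np

def within_reconstruction(shared, private, decoder):
    # 视图内重建: 用该视图自己的共享信息 + 私有信息重建
    return np.concatenate([shared, private], axis=-1) @ decoder

def cross_reconstruction(shared_views, private, decoder):
    # 跨视图重建: 用所有视图共享信息的均值替代本视图的共享信息
    mean_shared = np.mean(shared_views, axis=0)
    return np.concatenate([mean_shared, private], axis=-1) @ decoder

def dual_consistency_loss(x, shared_views, private, decoder, view_idx):
    r_within = within_reconstruction(shared_views[view_idx], private, decoder)
    r_cross = cross_reconstruction(shared_views, private, decoder)
    return float(np.mean((x - r_within) ** 2) + np.mean((x - r_cross) ** 2))

rng = np.random.default_rng(0)
V, B, s, p, D = 3, 4, 4, 3, 5  # 视图数/批大小/共享维/私有维/数据维
shared_views = rng.normal(size=(V, B, s))
private = rng.normal(size=(B, p))
decoder = rng.normal(size=(s + p, D))
x = rng.normal(size=(B, D))
loss = dual_consistency_loss(x, shared_views, private, decoder, view_idx=0)
```

跨视图重建项迫使各视图的共享表征可互换,从而在保证一致性的同时显式利用了私有(互补)信息。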

[CV-83] 3DM-WeConvene: Learned Image Compression with 3D Multi-Level Wavelet-Domain Convolution and Entropy Model

【速读】:该论文旨在解决现有学习型图像压缩(Learned Image Compression, LIC)方法主要在空间域操作且缺乏有效减少频域相关性的机制的问题。为了解决这一挑战,论文提出了一种新颖的框架,通过将低复杂度的三维多级离散小波变换(3D Multi-level Discrete Wavelet Transform, 3D Multi-level DWT)集成到卷积层和熵编码中,以同时减少空间和通道相关性,从而提升频率选择性和率失真性能(Rate-Distortion Performance, R-D)。该框架的关键创新在于引入了三维多级小波域卷积(3D Multi-level Wavelet-Domain Convolution, 3DM-WeConv)层以及三维小波域通道自回归熵模型(3D Wavelet-Domain Channel-wise Autoregressive Entropy Model, 3DWeChARM),前者实现了数据在小波域的灵活处理与逆变换,后者则通过基于切片的熵编码进一步优化高频和低频分量的编码效率,并采用两阶段训练策略平衡不同频率分量的率失真表现。实验结果表明,所提方法在多种测试集上显著优于当前最先进的基于CNN的LIC方法,特别是在高分辨率图像上的压缩增益更为明显。

链接: https://arxiv.org/abs/2504.04658
作者: Haisheng Fu,Jie Liang,Feng Liang,Zhenman Fang,Guohe Zhang,Jingning Han
机构: School of Engineering Science, Simon Fraser University (西蒙弗雷泽大学工程科学学院), Canada; School of Microelectronics, Xi’an Jiaotong University (西安交通大学微电子学院), China; Google Inc. (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注: 13 pages

点击查看摘要

Abstract:Learned image compression (LIC) has recently made significant progress, surpassing traditional methods. However, most LIC approaches operate mainly in the spatial domain and lack mechanisms for reducing frequency-domain correlations. To address this, we propose a novel framework that integrates low-complexity 3D multi-level Discrete Wavelet Transform (DWT) into convolutional layers and entropy coding, reducing both spatial and channel correlations to improve frequency selectivity and rate-distortion (R-D) performance. Our proposed 3D multi-level wavelet-domain convolution (3DM-WeConv) layer first applies 3D multi-level DWT (e.g., 5/3 and 9/7 wavelets from JPEG 2000) to transform data into the wavelet domain. Then, different-sized convolutions are applied to different frequency subbands, followed by inverse 3D DWT to restore the spatial domain. The 3DM-WeConv layer can be flexibly used within existing CNN-based LIC models. We also introduce a 3D wavelet-domain channel-wise autoregressive entropy model (3DWeChARM), which performs slice-based entropy coding in the 3D DWT domain. Low-frequency (LF) slices are encoded first to provide priors for high-frequency (HF) slices. A two-step training strategy is adopted: first balancing LF and HF rates, then fine-tuning with separate weights. Extensive experiments demonstrate that our framework consistently outperforms state-of-the-art CNN-based LIC methods in R-D performance and computational complexity, with larger gains for high-resolution images. On the Kodak, Tecnick 100, and CLIC test sets, our method achieves BD-Rate reductions of -12.24%, -15.51%, and -12.97%, respectively, compared to H.266/VVC. 
zh
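"变换到小波域、按子带处理、再逆变换回空间域"的流程,可以用单级 2D Haar 小波作最简示意(论文实际使用的是 JPEG 2000 的 5/3、9/7 小波和三维多级变换,此处的 Haar 实现仅为本文的演示性假设):

```python
import numpy as np

def haar_dwt2(x):
    # 单级 2D Haar 分解: 先对行、再对列做低/高通, 得到四个子带
    a = (x[0::2, :] + x[1::2, :]) / np.sqrt(2)   # 行方向低频
    d = (x[0::2, :] - x[1::2, :]) / np.sqrt(2)   # 行方向高频
    ll = (a[:, 0::2] + a[:, 1::2]) / np.sqrt(2)
    lh = (a[:, 0::2] - a[:, 1::2]) / np.sqrt(2)
    hl = (d[:, 0::2] + d[:, 1::2]) / np.sqrt(2)
    hh = (d[:, 0::2] - d[:, 1::2]) / np.sqrt(2)
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    # 逆变换, 与上面的分解构成完美重建对
    a = np.empty((ll.shape[0], ll.shape[1] * 2))
    a[:, 0::2] = (ll + lh) / np.sqrt(2)
    a[:, 1::2] = (ll - lh) / np.sqrt(2)
    d = np.empty_like(a)
    d[:, 0::2] = (hl + hh) / np.sqrt(2)
    d[:, 1::2] = (hl - hh) / np.sqrt(2)
    x = np.empty((a.shape[0] * 2, a.shape[1]))
    x[0::2, :] = (a + d) / np.sqrt(2)
    x[1::2, :] = (a - d) / np.sqrt(2)
    return x

img = np.random.default_rng(0).random((8, 8))
ll, lh, hl, hh = haar_dwt2(img)
# 3DM-WeConv 的思想: 在此处对不同频率子带施加不同尺寸的卷积, 再逆变换回空间域
recon = haar_idwt2(ll, lh, hl, hh)
```

由于变换可逆且计算量低,这种"小波域卷积"层可以直接嵌入现有的 CNN 压缩模型而不破坏端到端训练。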

[CV-84] DanceMosaic: High-Fidelity Dance Generation with Multimodal Editability

【速读】:该论文旨在解决现有舞蹈生成方法在生成高保真舞蹈序列时面临的挑战,包括缺乏出色的现实感、精确的舞姿-音乐同步性、丰富的动作多样性和物理合理性。此外,现有方法难以灵活编辑舞蹈序列以适应多模态引导信号(如音乐提示、姿态约束、动作标签和流派描述),限制了其创造性和适应性。为解决这些问题,论文提出的关键方案是DanceMosaic,它通过引入一种多模态掩码运动模型,融合文本到运动模型与音乐及姿态适配器,利用渐进生成掩码训练学习从多样化引导信号到高质量舞蹈运动序列的概率映射。同时,提出了无分类器的多模态引导机制及推理时优化机制,进一步强化生成动作与多模态引导的一致性。这些创新显著提升了舞蹈生成的质量和可编辑性。

链接: https://arxiv.org/abs/2504.04634
作者: Foram Niravbhai Shah,Parshwa Shah,Muhammad Usama Saleem,Ekkasit Pinyoanuntapong,Pu Wang,Hongfei Xue,Ahmed Helmy
机构: University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in dance generation have enabled automatic synthesis of 3D dance motions. However, existing methods still struggle to produce high-fidelity dance sequences that simultaneously deliver exceptional realism, precise dance-music synchronization, high motion diversity, and physical plausibility. Moreover, existing methods lack the flexibility to edit dance sequences according to diverse guidance signals, such as musical prompts, pose constraints, action labels, and genre descriptions, significantly restricting their creative utility and adaptability. Unlike the existing approaches, DanceMosaic enables fast and high-fidelity dance generation, while allowing multimodal motion editing. Specifically, we propose a multimodal masked motion model that fuses the text-to-motion model with music and pose adapters to learn probabilistic mapping from diverse guidance signals to high-quality dance motion sequences via progressive generative masking training. To further enhance the motion generation quality, we propose multimodal classifier-free guidance and inference-time optimization mechanism that further enforce the alignment between the generated motions and the multimodal guidance. Extensive experiments demonstrate that our method establishes a new state-of-the-art performance in dance generation, significantly advancing the quality and editability achieved by existing approaches.
zh

[CV-85] M2IV: Towards Efficient and Fine-grained Multimodal In-Context Learning in Large Vision-Language Models

【速读】:该论文旨在解决多模态in-context learning (ICL) 在大规模视觉-语言模型 (Large Vision-Language Models, LVLMs) 中应用所面临的两大挑战:一是输入数据的令牌密集型特性导致计算负担重,二是跨模态少量学习的高复杂性限制了表示方法的表达能力。为了解决这些问题,论文提出了一种名为M2IV的方法,其关键是用可学习的in-context向量(learnable In-context Vectors)替代显式的演示样本,并将这些向量直接集成到LVLMs中。通过结合多头注意力机制(multi-head attention, MHA)和多层感知机(multi-layer perceptron, MLP)的优势,M2IV实现了稳健的跨模态保真度和细粒度语义蒸馏,显著提升了不同LVLMs在多种任务上的性能,并高效扩展至多样本场景,同时绕过了上下文窗口的限制。此外,论文还引入了VLibrary来存储和检索M2IV,以支持灵活的任务导向模型调整,如跨模态对齐、定制化生成和安全性提升。实验结果表明,M2IV相比传统的Vanilla ICL及先前的表征工程方法,在七个基准测试集和三种LVLMs上的平均准确率提升了3.74%,并且具有显著的效率优势。

链接: https://arxiv.org/abs/2504.04633
作者: Yanshu Li,Hongyang He,Yi Cao,Qisen Cheng,Xiang Fu,Ruixiang Tang
机构: Brown University (布朗大学); University of Warwick (华威大学); Independent Researcher (独立研究员); Samsung (三星); Boston University (波士顿大学); Rutgers University (罗格斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint, 28 pages, 10 figures, 15 tables

点击查看摘要

Abstract:Multimodal in-context learning (ICL) is a vital capability for Large Vision-Language Models (LVLMs), allowing task adaptation via contextual prompts without parameter retraining. However, its application is hindered by the token-intensive nature of inputs and the high complexity of cross-modal few-shot learning, which limits the expressive power of representation methods. To tackle these challenges, we propose \textbfM2IV, a method that substitutes explicit demonstrations with learnable \textbfIn-context \textbfVectors directly integrated into LVLMs. By exploiting the complementary strengths of multi-head attention (\textbfMHA) and multi-layer perceptrons (\textbfMLP), M2IV achieves robust cross-modal fidelity and fine-grained semantic distillation through training. This significantly enhances performance across diverse LVLMs and tasks and scales efficiently to many-shot scenarios, bypassing the context window limitations. We also introduce \textbfVLibrary, a repository for storing and retrieving M2IV, enabling flexible LVLM steering for tasks like cross-modal alignment, customized generation and safety improvement. Experiments across seven benchmarks and three LVLMs show that M2IV surpasses Vanilla ICL and prior representation engineering approaches, with an average accuracy gain of \textbf3.74% over ICL with the same shot count, alongside substantial efficiency advantages.
zh
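"用可学习的 in-context 向量替代显式演示样本"的注入方式,可以用如下假设性示意理解(逐层相加并广播到序列维是本文为演示选取的最简形式,论文中 MHA 与 MLP 分支的实际融合方式更精细):

```python
import numpy as np

def inject_icv(hidden_states, v_mha, v_mlp, alpha=0.1):
    # 将每层可学习的 in-context 向量加到该层隐藏状态上 (广播到序列维)
    # hidden_states: 长度为 L 的列表, 每项形状 (seq, d)
    # v_mha / v_mlp: (L, d), 分别对应 MHA 与 MLP 分支蒸馏出的向量
    return [h + alpha * (v_mha[l] + v_mlp[l])
            for l, h in enumerate(hidden_states)]

rng = np.random.default_rng(0)
num_layers, seq_len, d = 4, 5, 16
hidden = [rng.normal(size=(seq_len, d)) for _ in range(num_layers)]
v_mha = rng.normal(size=(num_layers, d))
v_mlp = rng.normal(size=(num_layers, d))
steered = inject_icv(hidden, v_mha, v_mlp)
```

由于演示信息被压缩进固定长度的向量,推理时无需再占用上下文窗口存放多模态示例,这正是 M2IV 能扩展到多样本场景的原因。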

[CV-86] Systematic Literature Review on Vehicular Collaborative Perception – A Computer Vision Perspective

【速读】:该论文旨在解决单辆自动驾驶车辆感知系统在视觉遮挡和长距离检测能力等方面的局限性问题。论文的关键解决方案在于通过协作感知(Collaborative Perception, CP)方法,利用车对车(Vehicle-to-Vehicle, V2V)和车对基础设施(Vehicle-to-Infrastructure, V2I)通信技术实现多车辆之间的信息协同,从而提升感知系统的可靠性和性能。此外,论文通过系统性文献综述分析了现有研究的工作模式、协作方案及核心感知任务,并揭示了当前评估方法与协作感知目标之间的不匹配问题,为未来研究提供了方向性建议。

链接: https://arxiv.org/abs/2504.04631
作者: Lei Wan,Jianxin Zhao,Andreas Wiedholz,Manuel Bied,Mateus Martinez de Lucena,Abhishek Dinkar Jagtap,Andreas Festag,Antônio Augusto Fröhlich,Hannan Ejaz Keen,Alexey Vinel
机构: XITASO GmbH (XITASO有限公司), 86153 Augsburg, Germany; Karlsruhe Institute of Technology (KIT) (卡尔斯鲁厄理工学院(KIT)), 76133 Karlsruhe, Germany; Technische Hochschule Ingolstadt (THI) (英戈尔施塔特技术大学(THI)), CARISSMA Institute of Electric, COnnected and Secure Mobility (C-ECOS), 85049 Ingolstadt, Germany; Federal University of Santa Catarina (UFSC) (圣卡塔琳娜联邦大学(UFSC)), Brazil; Car2Car Communication Consortium (C2C-CC)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 39 pages, 25 figures

点击查看摘要

Abstract:The effectiveness of autonomous vehicles relies on reliable perception capabilities. Despite significant advancements in artificial intelligence and sensor fusion technologies, current single-vehicle perception systems continue to encounter limitations, notably visual occlusions and limited long-range detection capabilities. Collaborative Perception (CP), enabled by Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communication, has emerged as a promising solution to mitigate these issues and enhance the reliability of autonomous systems. Beyond advancements in communication, the computer vision community is increasingly focusing on improving vehicular perception through collaborative approaches. However, a systematic literature review that thoroughly examines existing work and reduces subjective bias is still lacking. Such a systematic approach helps identify research gaps, recognize common trends across studies, and inform future research directions. In response, this study follows the PRISMA 2020 guidelines and includes 106 peer-reviewed articles. These publications are analyzed based on modalities, collaboration schemes, and key perception tasks. Through a comparative analysis, this review illustrates how different methods address practical issues such as pose errors, temporal latency, communication constraints, domain shifts, heterogeneity, and adversarial attacks. Furthermore, it critically examines evaluation methodologies, highlighting a misalignment between current metrics and CP’s fundamental objectives. By delving into all relevant topics in-depth, this review offers valuable insights into challenges, opportunities, and risks, serving as a reference for advancing research in vehicular collaborative perception.
zh

[CV-87] Targetless LiDAR-Camera Calibration with Anchored 3D Gaussians

【速读】:该论文致力于解决无标定目标情况下激光雷达与相机的联合校准问题,旨在从任意场景中同时优化传感器位姿和场景几何结构,而不依赖传统的标定目标(如棋盘格或球面反射器)。解决方案的关键在于采用基于3D高斯分布的场景表示方法,通过将可靠的激光雷达点冻结为锚点,并以可微的方式联合优化传感器位姿和辅助高斯参数,利用光度损失函数实现端到端的训练。这种方法显著减少了传感器间的错位问题,提升了渲染质量,并在KITTI-360和Waymo等数据集上的实验验证中表现出比传统标定结果更高的峰值信噪比(PSNR)。

链接: https://arxiv.org/abs/2504.04597
作者: Haebeom Jung,Namtae Kim,Jungwoo Kim,Jaesik Park
机构: Seoul National University (首尔国立大学); Konkuk University (建国大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We present a targetless LiDAR-camera calibration method that jointly optimizes sensor poses and scene geometry from arbitrary scenes, without relying on traditional calibration targets such as checkerboards or spherical reflectors. Our approach leverages a 3D Gaussian-based scene representation. We first freeze reliable LiDAR points as anchors, then jointly optimize the poses and auxiliary Gaussian parameters in a fully differentiable manner using a photometric loss. This joint optimization significantly reduces sensor misalignment, resulting in higher rendering quality and consistently improved PSNR compared to the carefully calibrated poses provided in popular datasets. We validate our method through extensive experiments on two real-world autonomous driving datasets, KITTI-360 and Waymo, each featuring distinct sensor configurations. Additionally, we demonstrate the robustness of our approach using a custom LiDAR-camera setup, confirming strong performance across diverse hardware configurations.
zh
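光度损失驱动标定的核心是"把雷达锚点投影到图像上、比较投影处亮度与参考亮度"。下面给出一个假设性的针孔投影与光度残差示意(最近邻取样、灰度图等均为本文简化假设,论文中实际通过可微的 3D 高斯渲染计算损失):

```python
import numpy as np

def project(points, K, R, t):
    # 针孔模型: 雷达系 3D 点 -> 相机像素坐标
    cam = points @ R.T + t       # 外参变换到相机系
    uv = cam @ K.T               # 内参投影
    return uv[:, :2] / uv[:, 2:3]

def photometric_loss(points, colors, image, K, R, t):
    # 光度损失示意: 投影点处图像亮度与点的参考亮度之差 (最近邻取样)
    uv = np.round(project(points, K, R, t)).astype(int)
    h, w = image.shape[:2]
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    sampled = image[uv[valid, 1], uv[valid, 0]]
    return float(np.mean((sampled - colors[valid]) ** 2))

K = np.array([[100.0, 0.0, 16.0],
              [0.0, 100.0, 16.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
pts = np.array([[0.0, 0.0, 2.0]])   # 光轴上的一个点
uv = project(pts, K, R, t)          # 应落在主点 (16, 16)
```

对 `R`、`t`(以及辅助高斯参数)关于该损失做梯度下降,即可在无标定板的任意场景中联合优化传感器位姿。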

[CV-88] Your Image Generator Is Your New Private Dataset

【速读】:该论文旨在解决在文本条件图像生成中构建有意义的提示、适应特定领域以及确保鲁棒性能等关键挑战,以有效利用生成扩散模型合成训练数据集。论文提出的Text-Conditioned Knowledge Recycling (TCKR) 管道通过动态图像描述生成、参数高效扩散模型微调及生成知识蒸馏技术相结合,创建了针对图像分类任务定制的合成数据集。其核心在于结合多种技术手段,实现既能保持高精度又能显著提升隐私保护的合成数据生成方法。

链接: https://arxiv.org/abs/2504.04582
作者: Nicolo Resmini,Eugenio Lomurno,Cristian Sbrolli,Matteo Matteucci
机构: Politecnico di Milano (米兰理工大学); Department of Electronics, Information and Bioengineering (电子、信息和生物工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generative diffusion models have emerged as powerful tools to synthetically produce training data, offering potential solutions to data scarcity and reducing labelling costs for downstream supervised deep learning applications. However, effectively leveraging text-conditioned image generation for building classifier training sets requires addressing key issues: constructing informative textual prompts, adapting generative models to specific domains, and ensuring robust performance. This paper proposes the Text-Conditioned Knowledge Recycling (TCKR) pipeline to tackle these challenges. TCKR combines dynamic image captioning, parameter-efficient diffusion model fine-tuning, and Generative Knowledge Distillation techniques to create synthetic datasets tailored for image classification. The pipeline is rigorously evaluated on ten diverse image classification benchmarks. The results demonstrate that models trained solely on TCKR-generated data achieve classification accuracies on par with (and in several cases exceeding) models trained on real images. Furthermore, the evaluation reveals that these synthetic-data-trained models exhibit substantially enhanced privacy characteristics: their vulnerability to Membership Inference Attacks is significantly reduced, with the membership inference AUC lowered by 5.49 points on average compared to using real training data, demonstrating a substantial improvement in the performance-privacy trade-off. These findings indicate that high-fidelity synthetic data can effectively replace real data for training classifiers, yielding strong performance whilst simultaneously providing improved privacy protection as a valuable emergent property. The code and trained models are available in the accompanying open-source repository.
zh

[CV-89] Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric

【速读】:本文旨在解决长视频精确检索中的多模态相关性问题,特别是处理未见过的词汇和场景时的复杂性,同时避免对特定数据集的先验训练。为应对这一挑战,论文提出了一种统一框架,结合视觉匹配流与听觉匹配流,并引入基于字幕的视频分割方法。听觉流的关键创新在于其互补的两阶段音频检索机制,显著提升了长视频检索的效果。此外,为了更好地评估长视频检索任务,论文还设计了一种新的评价方法。通过在YouCook2数据集上的实验验证,展示了所提方法的优越检索性能。

链接: https://arxiv.org/abs/2504.04572
作者: Mohamed Eltahir,Osamah Sarraj,Mohammed Bremoo,Mohammed Khurd,Abdulrahman Alfrihidi,Taha Alshatiri,Mohammad Almatrafi,Tanveer Hussain
机构: King Abdullah University of Science and Technology (KAUST)(阿卜杜拉国王科技大学); Edge Hill University (边山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Precise video retrieval requires multi-modal correlations to handle unseen vocabulary and scenes, becoming more complex for lengthy videos where models must perform effectively without prior training on a specific dataset. We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitles-based video segmentation approach. Additionally, the aural stream includes a complementary audio-based two-stage retrieval mechanism that enhances performance on long-duration videos. Considering the complex nature of retrieval from lengthy videos and its corresponding evaluation, we introduce a new retrieval evaluation method specifically designed for long-video retrieval to support further research. We conducted experiments on the YouCook2 benchmark, showing promising retrieval performance.
zh
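
论文采用基于字幕的视频分段方法。以下为按字幕时间间隔切分长视频的极简草图(`max_gap` 阈值与示例数据均为假设,论文未公开具体切分规则):

```python
def segment_by_subtitles(cues, max_gap=5.0):
    """cues: 按时间排序的 (start, end, text) 列表;
    相邻字幕间隔超过 max_gap 秒即切分出新片段。"""
    segments, current = [], []
    for cue in cues:
        if current and cue[0] - current[-1][1] > max_gap:
            segments.append(current)
            current = []
        current.append(cue)
    if current:
        segments.append(current)
    return segments

cues = [(0, 3, "boil water"), (4, 7, "add noodles"), (20, 24, "plate and serve")]
segs = segment_by_subtitles(cues)
print(len(segs))  # 2:前两条字幕连续,第三条间隔 13 秒被切分
```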

[CV-90] DyCON: Dynamic Uncertainty-aware Consistency and Contrastive Learning for Semi-supervised Medical Image Segmentation CVPR2025

【速读】:该论文旨在解决半监督学习在医学图像分割中面临的类不平衡(Class Imbalance)和病理变异引起的高不确定性(High Uncertainty from Pathology Variations)问题,这些问题导致现有方法在三维医学图像分割中的准确性不足。论文的关键解决方案是提出DyCON框架,这是一个动态不确定性感知一致性与对比学习框架。DyCON通过两种互补的损失函数增强一致性方法的泛化能力:不确定性感知一致性损失(Uncertainty-aware Consistency Loss, UnCL)和焦点熵感知对比损失(Focal Entropy-aware Contrastive Loss, FeCL)。UnCL通过动态加权每个体素对一致性损失的贡献来实现全局一致性,并保留高不确定性区域,而非直接过滤它们;初始阶段优先从不确定性较高的体素中学习以探索挑战性区域,随后逐步将惩罚转移到置信度较高的体素以优化预测并确保全局一致性。而FeCL则通过引入双重焦点机制和自适应置信调整,增强了局部特征在类别不平衡区域的区分能力,同时关注不确定样本对,从而有效捕捉类别不平衡下的细微病变变化。

链接: https://arxiv.org/abs/2504.04566
作者: Maregu Assefa,Muzammal Naseer,Iyyakutti Iyappan Ganapathi,Syed Sadaf Ali,Mohamed L Seghier,Naoufel Werghi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Semi-supervised learning in medical image segmentation leverages unlabeled data to reduce annotation burdens through consistency learning. However, current methods struggle with class imbalance and high uncertainty from pathology variations, leading to inaccurate segmentation in 3D medical images. To address these challenges, we present DyCON, a Dynamic Uncertainty-aware Consistency and Contrastive Learning framework that enhances the generalization of consistency methods with two complementary losses: Uncertainty-aware Consistency Loss (UnCL) and Focal Entropy-aware Contrastive Loss (FeCL). UnCL enforces global consistency by dynamically weighting the contribution of each voxel to the consistency loss based on its uncertainty, preserving high-uncertainty regions instead of filtering them out. Initially, UnCL prioritizes learning from uncertain voxels with lower penalties, encouraging the model to explore challenging regions. As training progresses, the penalty shifts toward confident voxels to refine predictions and ensure global consistency. Meanwhile, FeCL enhances local feature discrimination in imbalanced regions by introducing dual focal mechanisms and adaptive confidence adjustments into the contrastive principle. These mechanisms jointly prioritize hard positives and negatives while focusing on uncertain sample pairs, effectively capturing subtle lesion variations under class imbalance. Extensive evaluations on four diverse medical image segmentation datasets (ISLES'22, BraTS'19, LA, Pancreas) show DyCON's superior performance against SOTA methods.
zh
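
UnCL 的核心是按体素不确定性对一致性损失加权。下面给出一个高度简化的草图:以教师预测的二类熵为不确定性、用指数形式加权,固定的 `beta` 仅为演示假设,论文中惩罚权重会随训练进程动态从不确定体素转向高置信体素:

```python
import math

def uncl(student_probs, teacher_probs, beta=1.0):
    """权重 w = exp(-beta * u),u 为教师预测的二类熵:
    u 越大(体素越不确定)权重越小,惩罚越轻。"""
    total, weight_sum = 0.0, 0.0
    for ps, pt in zip(student_probs, teacher_probs):
        u = -(pt * math.log(pt + 1e-8) + (1 - pt) * math.log(1 - pt + 1e-8))
        w = math.exp(-beta * u)
        total += w * (ps - pt) ** 2
        weight_sum += w
    return total / weight_sum

print(uncl([0.9, 0.1], [0.9, 0.1]))             # 预测一致时损失为 0
print(round(uncl([0.5, 0.5], [0.99, 0.5]), 3))  # 高置信体素上的不一致贡献更大
```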

[CV-91] Advancing Egocentric Video Question Answering with Multimodal Large Language Models

【速读】:该论文旨在解决第一人称视角视频问答(Egocentric Video QA)任务中的长时序时空推理、第一人称视角理解以及频繁摄像机运动等挑战。为实现这一目标,论文引入了QaEgo4Dv2数据集,并系统性评估了多种多模态大语言模型(MLLMs)在开放问答(OpenQA)和闭合问答(CloseQA)设置下的表现。论文的关键解决方案在于采用零样本(zero-shot)和微调(fine-tuned)方法对模型进行评估,其中特别强调了Video-LLaVa-7B和Qwen2-VL-7B-Instruct模型的微调策略,这些模型在QaEgo4Dv2上的微调实现了新的性能基准,分别提升了+2.6%的ROUGE/METEOR分数(针对OpenQA)和+13%的准确性(针对CloseQA)。

链接: https://arxiv.org/abs/2504.04550
作者: Alkesh Patel,Vibhav Chitalia,Yinfei Yang
机构: Apple
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 8 pages

点击查看摘要

Abstract:Egocentric Video Question Answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges like frequent camera movement. This paper systematically evaluates both proprietary and open-source Multimodal Large Language Models (MLLMs) on QaEgo4Dv2 - a refined dataset of egocentric videos derived from QaEgo4D. Four popular MLLMs (GPT-4o, Gemini-1.5-Pro, Video-LLaVa-7B and Qwen2-VL-7B-Instruct) are assessed using zero-shot and fine-tuned approaches for both OpenQA and CloseQA settings. We introduce QaEgo4Dv2 to mitigate annotation noise in QaEgo4D, enabling more reliable comparison. Our results show that fine-tuned Video-LLaVa-7B and Qwen2-VL-7B-Instruct achieve new state-of-the-art performance, surpassing previous benchmarks by up to +2.6% ROUGE/METEOR (for OpenQA) and +13% accuracy (for CloseQA). We also present a thorough error analysis, indicating the model’s difficulty in spatial reasoning and fine-grained object recognition - key areas for future improvement.
zh

[CV-92] Opening the black box of deep learning: Validating the statistical association between explainable artificial intelligence (XAI) and clinical domain knowledge in fundus image-based glaucoma diagnosis

【速读】:该论文旨在解决深度学习模型在青光眼分类任务中的可解释性问题,其核心挑战在于深度学习模型的“黑箱”特性限制了其在实际医疗场景中的广泛应用。论文的关键解决方案是通过采用多种类激活图(Class Activation Map, CAM)技术(如Grad-CAM、XGrad-CAM、Score-CAM、Eigen-CAM和Layer-CAM),生成模型的关注区域,并将其与临床领域知识中青光眼解剖结构(如视杯、视盘和血管)进行对比分析。研究通过配对样本t检验验证模型关注区域内解剖结构比例显著高于整体图像的比例,并利用皮尔逊相关性和斯皮尔曼相关性测试进一步证明模型预测能力与关注区域内解剖结构比例之间的正相关关系。这一方法为揭示深度神经网络与人类临床医生决策逻辑的一致性提供了统计学依据,从而有助于缓解临床医生对深度学习应用于医疗领域的信任顾虑。代码和数据集已公开发布以促进研究复现。

链接: https://arxiv.org/abs/2504.04549
作者: Han Yuan,Lican Kang,Yong Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While deep learning has exhibited remarkable predictive capabilities in various medical image tasks, its inherent black-box nature has hindered its widespread implementation in real-world healthcare settings. Our objective is to unveil the decision-making processes of deep learning models in the context of glaucoma classification by employing several Class Activation Map (CAM) techniques to generate model focus regions and comparing them with clinical domain knowledge of the anatomical area (optic cup, optic disk, and blood vessels). Four deep neural networks, including VGG-11, ResNet-18, DeiT-Tiny, and Swin Transformer-Tiny, were developed using binary diagnostic labels of glaucoma and five CAM methods (Grad-CAM, XGrad-CAM, Score-CAM, Eigen-CAM, and Layer-CAM) were employed to highlight the model focus area. We applied the paired-sample t-test to compare the percentage of anatomies in the model focus area to the proportion of anatomies in the entire image. After that, Pearson’s and Spearman’s correlation tests were implemented to examine the relationship between model predictive ability and the percentage of anatomical structures in the model focus area. On five public glaucoma datasets, all deep learning models consistently displayed statistically significantly higher percentages of anatomical structures in the focus area than the proportions of anatomical structures in the entire image. Also, we validated the positive relationship between the percentage of anatomical structures in the focus area and model predictive performance. Our study provides evidence of the convergence of decision logic between deep neural networks and human clinicians through rigorous statistical tests. We anticipate that it can help alleviate clinicians’ concerns regarding the trustworthiness of deep learning in healthcare. For reproducibility, the code and dataset have been released at GitHub.
zh
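
论文使用配对样本 t 检验比较模型关注区域内解剖结构占比与整图占比。下面是该检验统计量的手工实现示意(五组占比数据为虚构):

```python
import math

def paired_t(focus_pct, image_pct):
    """配对样本 t 统计量:t = mean(d) / sqrt(var(d) / n),d 为逐样本差值。"""
    diffs = [a - b for a, b in zip(focus_pct, image_pct)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # 无偏样本方差
    return mean / math.sqrt(var / n)

# 假想数据:5 张眼底图像,关注区域内解剖结构占比系统性高于整图
t = paired_t([0.62, 0.58, 0.71, 0.65, 0.60], [0.30, 0.28, 0.35, 0.33, 0.31])
print(round(t, 2))  # 远超 df=4、p<0.05 的临界值 2.776,差异显著
```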

[CV-93] The Point the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models?

【速读】:该论文旨在探究点云是否真正提升了三维大语言模型(3D LLMs)的空间推理能力。为解答这一研究问题,论文首先通过将点云替换为视觉或文本模态来评估不同输入模态下LLMs的空间推理能力;其次,提出了一种新的三维问答基准——ScanReQA,用于全面评估模型对二元空间关系的理解。关键在于通过对比实验揭示LLMs在缺乏点云输入时仍可能表现出色,并发现现有3D LLMs在理解二元空间关系以及利用点云结构坐标进行细粒度空间推理方面存在局限性。这些结论不仅有助于推动3D LLMs的发展,也为其他模态的基础模型提供了启示。论文同时公开了数据集及可复现代码。

链接: https://arxiv.org/abs/2504.04540
作者: Weichen Zhang,Ruiying Peng,Chen Gao,Jianjie Fang,Xin Zeng,Kaiyuan Li,Ziyou Wang,Jinqiang Cui,Xin Wang,Xinlei Chen,Yong Li
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:3D Large Language Models (LLMs) leveraging spatial information in point clouds for 3D spatial reasoning attract great attention. Despite some promising results, the role of point clouds in 3D spatial reasoning remains under-explored. In this work, we comprehensively evaluate and analyze these models to answer the research question: *Does point cloud truly boost the spatial reasoning capacities of 3D LLMs?* We first evaluate the spatial reasoning capacity of LLMs with different input modalities by replacing the point cloud with the visual and text counterparts. We then propose a novel 3D QA (Question-answering) benchmark, ScanReQA, that comprehensively evaluates models' understanding of binary spatial relationships. Our findings reveal several critical insights: 1) LLMs without point input could even achieve competitive performance even in a zero-shot manner; 2) existing 3D LLMs struggle to comprehend the binary spatial relationships; 3) 3D LLMs exhibit limitations in exploiting the structural coordinates in point clouds for fine-grained spatial reasoning. We think these conclusions can help the next step of 3D LLMs and also offer insights for foundation models in other modalities. We release datasets and reproducible codes in the anonymous project page: this https URL.
zh
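
针对二元空间关系理解,可以用一个极简规则判断器说明任务形式(坐标与判定规则均为假设,ScanReQA 的实际标注方式以论文为准):

```python
def spatial_relation(center_a, center_b, axis=0):
    """沿指定坐标轴比较两物体中心,返回 A 相对 B 的方位;axis=0 取 x 轴。"""
    if center_a[axis] < center_b[axis]:
        return "left-of"
    if center_a[axis] > center_b[axis]:
        return "right-of"
    return "aligned"

chair = (1.2, 0.0, 0.4)   # 假想的椅子中心坐标
table = (2.5, 0.1, 0.5)   # 假想的桌子中心坐标
print(spatial_relation(chair, table))  # left-of
```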

[CV-94] SnapPix: Efficient-Coding–Inspired In-Sensor Compression for Edge Vision

【速读】:该论文旨在解决边缘设备上高效图像采集的问题,特别是在传感器节点计算能力较弱且需要将数据传输到远程服务器进行处理的遥感应用场景中。为降低边缘能耗,论文提出了一种名为SnapPix的传感-算法协同设计系统,通过在传感器内以模拟域方式压缩原始像素来实现节能。该方案的关键在于采用编码曝光(Coded Exposure, CE)作为传感器内的压缩策略,它能够在空间和时间上灵活选择性地曝光像素。具体而言,SnapPix提出了基于高效编码经典理论的任务无关采样/曝光模式学习方法,协同设计下游视觉模型以解决CE压缩图像特有的像素级非均匀性问题,并对图像传感器硬件进行轻量级增强以支持CE压缩。实验结果表明,SnapPix在动作识别和视频重建任务中,以相同速度超越了现有基于视频的方法,同时将能耗降低了高达15.4倍。

链接: https://arxiv.org/abs/2504.04535
作者: Weikai Lin,Tianrui Ma,Adith Boloor,Yu Feng,Ruofan Xing,Xuan Zhang,Yuhao Zhu
机构: University of Rochester (罗切斯特大学); Institute of Computing Technology, CAS (中国科学院计算技术研究所); Washington University in St. Louis (圣路易斯华盛顿大学); Shanghai Jiao Tong University (上海交通大学); Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, Accepted to Design Automation Conference (DAC), 2025

点击查看摘要

Abstract:Energy-efficient image acquisition on the edge is crucial for enabling remote sensing applications where the sensor node has weak compute capabilities and must transmit data to a remote server/cloud for processing. To reduce the edge energy consumption, this paper proposes a sensor-algorithm co-designed system called SnapPix, which compresses raw pixels in the analog domain inside the sensor. We use coded exposure (CE) as the in-sensor compression strategy as it offers the flexibility to sample, i.e., selectively expose pixels, both spatially and temporally. SnapPix has three contributions. First, we propose a task-agnostic strategy to learn the sampling/exposure pattern based on the classic theory of efficient coding. Second, we co-design the downstream vision model with the exposure pattern to address the pixel-level non-uniformity unique to CE-compressed images. Finally, we propose lightweight augmentations to the image sensor hardware to support our in-sensor CE compression. Evaluating on action recognition and video reconstruction, SnapPix outperforms state-of-the-art video-based methods at the same speed while reducing the energy by up to 15.4x. We have open-sourced the code at: this https URL.
zh
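
编码曝光(CE)在传感器内对像素做时空选择性曝光并累加。下面用嵌套列表模拟这一压缩过程的离散近似(帧数据与曝光模式均为虚构,真实系统在传感器的模拟域内完成求和):

```python
def coded_exposure(frames, pattern):
    """frames: T 个同尺寸二维列表;pattern[t][i][j]=1 表示该像素在第 t 帧曝光。
    输出为各曝光帧的逐像素累加,把 T 帧压缩为单帧。"""
    h, w = len(frames[0]), len(frames[0][0])
    out = [[0.0] * w for _ in range(h)]
    for frame, mask in zip(frames, pattern):
        for i in range(h):
            for j in range(w):
                out[i][j] += frame[i][j] * mask[i][j]
    return out

frames = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
pattern = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]  # 两帧互补的曝光模式
print(coded_exposure(frames, pattern))  # [[1.0, 6.0], [7.0, 4.0]]
```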

[CV-95] SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation

【速读】:该论文旨在解决多目标跟踪(Multi-Object Tracking, MOT)的问题,通过扩展单对象跟踪方法Segment Anything 2 (SAM2)至多对象场景,提出了一种新的基于分割的跟踪范式SAM2MOT。解决方案的关键在于直接从分割掩模生成跟踪框,减少了对检测精度的依赖,并结合零样本泛化能力和强对象关联性,同时引入轨迹管理器系统以实现精确的对象添加与移除,以及跨对象交互模块来处理遮挡情况。实验结果表明,SAM2MOT在DanceTrack、UAVDT和BDD100K数据集上取得了最先进的性能,尤其在DanceTrack上实现了+2.1的HOTA和+4.5的IDF1提升。

链接: https://arxiv.org/abs/2504.04519
作者: Junjie Jiang,Zelin Wang,Manqi Zhao,Yin Li,DongSheng Jiang
机构: Huawei Cloud (华为云)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Segment Anything 2 (SAM2) enables robust single-object tracking using segmentation. To extend this to multi-object tracking (MOT), we propose SAM2MOT, introducing a novel Tracking by Segmentation paradigm. Unlike Tracking by Detection or Tracking by Query, SAM2MOT directly generates tracking boxes from segmentation masks, reducing reliance on detection accuracy. SAM2MOT has two key advantages: zero-shot generalization, allowing it to work across datasets without fine-tuning, and strong object association, inherited from SAM2. To further improve performance, we integrate a trajectory manager system for precise object addition and removal, and a cross-object interaction module to handle occlusions. Experiments on DanceTrack, UAVDT, and BDD100K show state-of-the-art results. Notably, SAM2MOT outperforms existing methods on DanceTrack by +2.1 HOTA and +4.5 IDF1, highlighting its effectiveness in MOT.
zh
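
SAM2MOT 直接从分割掩模生成跟踪框。下述函数演示由二值掩模求紧致包围框的基本做法(非论文代码,仅说明"以分割生成跟踪框"的思路):

```python
def mask_to_box(mask):
    """mask: 二维 0/1 列表;返回 (x_min, y_min, x_max, y_max),无前景时返回 None。"""
    coords = [(i, j) for i, row in enumerate(mask) for j, v in enumerate(row) if v]
    if not coords:
        return None
    ys = [i for i, _ in coords]
    xs = [j for _, j in coords]
    return (min(xs), min(ys), max(xs), max(ys))

mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
print(mask_to_box(mask))  # (1, 1, 3, 2)
```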

[CV-96] Enhance Then Search: An Augmentation-Search Strategy with Foundation Models for Cross-Domain Few-Shot Object Detection

【速读】:该论文旨在解决跨域小样本目标检测(Cross-Domain Few-Shot Object Detection, CD-FSOD)任务中基础模型性能提升的问题。论文的关键解决方案在于结合基于图像的数据增强技术与基于网格的子域搜索策略,通过在广泛的基础模型预训练数据集上进行少量样本训练,有效优化模型在未见过的域中的表现。该方法利用GroundingDINO模型,引入多种图像增强方法并定义优化目标,从而高效搜索最佳子域配置,实现少样本目标检测任务的性能提升,并为视觉-语言模型在数据稀缺环境下的部署提供了重要参考,同时减少了重新训练的工作量。

链接: https://arxiv.org/abs/2504.04517
作者: Jiancheng Pan,Yanxing Liu,Xiao He,Long Peng,Jiahao Li,Yuze Sun,Xiaomeng Huang
机构: Tsinghua University (清华大学); University of Chinese Academy of Sciences (中国科学院大学); Wuhan University (武汉大学); University of Science and Technology of China (中国科学技术大学); AI4EarthLab (AI4地球实验室); NTIRE 2025 CD-FSOD Challenge (NTIRE 2025 CD-FSOD 挑战赛)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Foundation models pretrained on extensive datasets, such as GroundingDINO and LAE-DINO, have performed remarkably in the cross-domain few-shot object detection (CD-FSOD) task. Through rigorous few-shot training, we found that the integration of image-based data augmentation techniques and grid-based sub-domain search strategy significantly enhances the performance of these foundation models. Building upon GroundingDINO, we employed several widely used image augmentation methods and established optimization objectives to effectively navigate the expansive domain space in search of optimal sub-domains. This approach facilitates efficient few-shot object detection and introduces an approach to solving the CD-FSOD problem by efficiently searching for the optimal parameter configuration from the foundation model. Our findings substantially advance the practical deployment of vision-language models in data-scarce environments, offering critical insights into optimizing their cross-domain generalization capabilities without labor-intensive retraining. Code is available at this https URL.
zh

[CV-97] Attributed Synthetic Data Generation for Zero-shot Domain-specific Image Classification

【速读】:该论文旨在解决零样本领域特定图像分类问题,即在缺乏目标领域标注训练样本的情况下对真实图像进行分类的挑战。现有方法依赖于简单的提示策略来从文本生成合成训练图像,但这种方式限制了合成图像的多样性,导致性能逊色于真实图像。论文的关键解决方案是提出AttrSyn,它利用大型语言模型生成属性提示(attributed prompts),这些提示能够生成更加多样化且具有属性信息的合成图像。实验结果表明,使用AttrSyn生成的合成图像进行训练,在大多数情况下显著优于CLIP的零样本分类性能,并且始终超越简单的提示策略。

链接: https://arxiv.org/abs/2504.04510
作者: Shijian Wang,Linxin Song,Ryotaro Shimizu,Masayuki Goto,Hanqian Wu
机构: Southeast University (东南大学, China); Waseda University (早稻田大学, Japan); ZOZO Research (日本); University of California San Diego (加州大学圣地亚哥分校, USA)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zero-shot domain-specific image classification is challenging in classifying real images without ground-truth in-domain training examples. Recent research involved knowledge from texts with a text-to-image model to generate in-domain training images in zero-shot scenarios. However, existing methods heavily rely on simple prompt strategies, limiting the diversity of synthetic training images, thus leading to inferior performance compared to real images. In this paper, we propose AttrSyn, which leverages large language models to generate attributed prompts. These prompts allow for the generation of more diverse attributed synthetic images. Experiments for zero-shot domain-specific image classification on two fine-grained datasets show that training with synthetic images generated by AttrSyn significantly outperforms CLIP’s zero-shot classification under most situations and consistently surpasses simple prompt strategies.
zh
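
AttrSyn 利用大型语言模型生成属性化提示以提升合成图像多样性。下面以属性候选值的笛卡尔积示意"属性化提示"的构造(属性集与模板为虚构,论文中的属性实际由 LLM 生成):

```python
import itertools

def attributed_prompts(class_name, attributes):
    """attributes: {属性名: 候选值列表};对候选值做笛卡尔积并填入同一模板。"""
    keys = sorted(attributes)
    prompts = []
    for combo in itertools.product(*(attributes[k] for k in keys)):
        desc = ", ".join(combo)
        prompts.append(f"a photo of a {desc} {class_name}")
    return prompts

prompts = attributed_prompts("sparrow", {"pose": ["perched", "flying"],
                                         "light": ["sunny", "overcast"]})
print(len(prompts))  # 4 条属性化提示,覆盖姿态 × 光照的全部组合
```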

[CV-98] Active Learning with a Noisy Annotator

【速读】:该论文试图解决在标注预算有限(低预算)且存在噪声标签的情况下,传统主动学习(Active Learning, AL)方法性能下降的问题。解决方案的关键在于提出了一种名为噪声感知主动采样(Noise-Aware Active Sampling, NAS)的新框架,该框架扩展了现有的基于贪心覆盖的主动学习策略,使其能够处理噪声标注。NAS通过识别因选择噪声样本代表而未被充分覆盖的区域,并重新从这些区域采样,从而改善了模型的学习效果。此外,NAS还引入了一种简单有效的噪声过滤方法,可以在模型训练前用于降低噪声的影响,进一步提升主动学习方法在不同噪声类型和比率下的性能表现。

链接: https://arxiv.org/abs/2504.04506
作者: Netta Shafir,Guy Hacohen,Daphna Weinshall
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Active Learning (AL) aims to reduce annotation costs by strategically selecting the most informative samples for labeling. However, most active learning methods struggle in the low-budget regime where only a few labeled examples are available. This issue becomes even more pronounced when annotators provide noisy labels. A common AL approach for the low- and mid-budget regimes focuses on maximizing the coverage of the labeled set across the entire dataset. We propose a novel framework called Noise-Aware Active Sampling (NAS) that extends existing greedy, coverage-based active learning strategies to handle noisy annotations. NAS identifies regions that remain uncovered due to the selection of noisy representatives and enables resampling from these areas. We introduce a simple yet effective noise filtering approach suitable for the low-budget regime, which leverages the inner mechanism of NAS and can be applied for noise filtering before model training. On multiple computer vision benchmarks, including CIFAR100 and ImageNet subsets, NAS significantly improves performance for standard active learning methods across different noise types and rates.
zh
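
NAS 扩展了贪心覆盖式主动学习策略。以下为基础的 k-center 贪心覆盖选择草图(欧氏距离与示例点均为假设;NAS 在此之上增加了对噪声代表点所覆盖区域的重采样):

```python
import math

def greedy_coverage(points, k, seed=0):
    """k-center 贪心:每轮选出与已选代表点集合最小距离最大的点,最大化覆盖。"""
    selected = [seed]
    while len(selected) < k:
        best, best_d = None, -1.0
        for i, p in enumerate(points):
            if i in selected:
                continue
            d = min(math.dist(p, points[s]) for s in selected)
            if d > best_d:
                best, best_d = i, d
        selected.append(best)
    return selected

pts = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]
print(greedy_coverage(pts, 3))  # [0, 4, 2]:依次挑出相互远离的代表点
```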

[CV-99] AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection

【速读】:该论文旨在解决视频异常检测在复杂环境下的信息不足及高误报率问题。传统基于视觉的方法难以应对这些挑战,尤其是在信息不充分的情况下。为了解决这些问题,论文提出了一种新的弱监督框架,利用音频-视觉协作来实现鲁棒的视频异常检测。该方案的关键创新在于两个方面:一是引入了一种高效的音频-视觉融合机制,通过轻量级参数化适应实现自适应跨模态集成,同时保持冻结的CLIP主干不变;二是设计了一个新颖的音频-视觉提示,动态增强文本嵌入以包含关键多模态信息,显著提升了CLIP在视频异常检测任务中的泛化能力。此外,为了提高推理过程中对模态缺失的鲁棒性,还开发了一种基于不确定性的特征蒸馏模块,从仅视觉输入中合成音频-视觉表示。此模块采用基于音频-视觉特征多样性的不确定性建模,在蒸馏过程中动态强调具有挑战性的特征。这些创新使得所提出的框架在多个基准数据集上表现出色,并显著提高了不同场景下的异常检测准确性。

链接: https://arxiv.org/abs/2504.04495
作者: Peng Wu,Wanshun Su,Guansong Pang,Yujia Sun,Qingsen Yan,Peng Wang,Yanning Zhang
机构: School of Computer Science, Northwestern Polytechnical University (西北工业大学计算机学院); School of Computing and Information Systems, Singapore Management University (新加坡管理大学计算与信息系统学院); School of Artifical Intelligence, Xidian University (西安电子科技大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures, 6 tables

点击查看摘要

Abstract:With the increasing adoption of video anomaly detection in intelligent surveillance domains, conventional visual-based detection approaches often struggle with information insufficiency and high false-positive rates in complex environments. To address these limitations, we present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Capitalizing on the exceptional cross-modal representation learning capabilities of Contrastive Language-Image Pretraining (CLIP) across visual, audio, and textual domains, our framework introduces two major innovations: an efficient audio-visual fusion that enables adaptive cross-modal integration through lightweight parametric adaptation while maintaining the frozen CLIP backbone, and a novel audio-visual prompt that dynamically enhances text embeddings with key multimodal information based on the semantic correlation between audio-visual features and textual labels, significantly improving CLIP’s generalization for the video anomaly detection task. Moreover, to enhance robustness against modality deficiency during inference, we further develop an uncertainty-driven feature distillation module that synthesizes audio-visual representations from visual-only inputs. This module employs uncertainty modeling based on the diversity of audio-visual features to dynamically emphasize challenging features during the distillation process. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy in various scenarios. Notably, with unimodal data enhanced by uncertainty-driven distillation, our approach consistently outperforms current unimodal VAD methods.
zh

[CV-100] Skin Color Measurement from Dermatoscopic Images: An Evaluation on a Synthetic Dataset

【速读】:该论文旨在解决皮肤颜色测量方法在皮肤镜图像中的鲁棒性和光照不变性问题。论文通过构建一个包含受控基础真值黑色素含量、病灶形状、毛发模型以及18种不同光照条件的合成数据集(S-SYNTH),评估了四种类别的图像色度学方法:基于分割的方法、基于补丁的方法、颜色量化方法和神经网络方法。这些方法用于从皮肤镜图像中估计个体类型角(ITA)和Fitzpatrick皮肤类型。关键在于发现基于分割的方法和颜色量化方法能够提供鲁棒且光照不变的估计,而基于补丁的方法存在显著的光照依赖偏差,需要校准。此外,结合重度模糊以减少过拟合的神经网络模型能够在一定程度上提供光照不变的Fitzpatrick预测,但其在真实世界图像中的泛化能力尚未验证。最终,论文提出了设计公平且可靠皮肤颜色估计算法的实际建议。

链接: https://arxiv.org/abs/2504.04494
作者: Marin Benčević,Robert Šojo,Irena Galić
机构: Faculty of Electrical Engineering, Computer Science and Information Technology (电气工程、计算机科学与信息技术学院), Osijek (奥西耶克), Croatia (克罗地亚)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a comprehensive evaluation of skin color measurement methods from dermatoscopic images using a synthetic dataset (S-SYNTH) with controlled ground-truth melanin content, lesion shapes, hair models, and 18 distinct lighting conditions. This allows for rigorous assessment of the robustness and invariance to lighting conditions. We assess four classes of image colorimetry approaches: segmentation-based, patch-based, color quantization, and neural networks. We use these methods to estimate the Individual Typology Angle (ITA) and Fitzpatrick types from dermatoscopic images. Our results show that segmentation-based and color quantization methods yield robust, lighting-invariant estimates, whereas patch-based approaches exhibit significant lighting-dependent biases that require calibration. Furthermore, neural network models, particularly when combined with heavy blurring to reduce overfitting, can provide light-invariant Fitzpatrick predictions, although their generalization to real-world images remains unverified. We conclude with practical recommendations for designing fair and reliable skin color estimation methods.
zh
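
个体类型角(ITA)由 CIELAB 空间的 L* 与 b* 计算:ITA = arctan((L* − 50)/b*) × 180/π。以下实现与分级阈值采用文献常见划分,仅作演示,论文的具体设置请以原文为准:

```python
import math

def ita_degrees(L, b):
    """ITA = arctan((L* - 50) / b*),单位为度。"""
    return math.degrees(math.atan2(L - 50.0, b))

def ita_category(ita):
    """文献中常见的 ITA 肤色分级阈值(假设性划分,非论文原设置)。"""
    bounds = [(55, "very light"), (41, "light"), (28, "intermediate"),
              (10, "tan"), (-30, "brown")]
    for lo, name in bounds:
        if ita > lo:
            return name
    return "dark"

ita = ita_degrees(65.0, 18.0)
print(round(ita, 1), ita_category(ita))  # 39.8 intermediate
```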

[CV-101] Learning Conditionally Independent Transformations using Normal Subgroups in Group Theory

【速读】:该论文旨在解决无监督表征学习中分离不同变换(transformations)的核心挑战,特别是那些条件独立但非交换的变换。现有方法主要基于代数独立性来分解变换,但这些方法局限于交换变换,无法有效处理非交换情况。论文的关键创新在于从伽罗瓦理论中汲取灵感,利用正规子群(normal subgroups)分解群结构的方法,提供了一种分析和分离条件独立变换的新途径。这种方法在无需交换性假设的情况下实现了变换的有效分类,并通过图像几何变换的实验验证了其有效性,表明正规子群的群分解与表征学习中的变换分类之间存在紧密联系。

链接: https://arxiv.org/abs/2504.04490
作者: Kayato Nishitsunoi,Yoshiyuki Ohmura,Takayuki Komatsu,Yasuo Kuniyoshi
机构: School of Information Science and Technology, The University of Tokyo (东京大学信息科学技术学院, 东大); Next Generation Artificial Intelligence Research Center (AI Center), The University of Tokyo (东京大学下一代人工智能研究中心, 东大)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 10 figures, conference paper

点击查看摘要

Abstract:Humans develop certain cognitive abilities to recognize objects and their transformations without explicit supervision, highlighting the importance of unsupervised representation learning. A fundamental challenge in unsupervised representation learning is to separate different transformations in learned feature representations. Although algebraic approaches have been explored, a comprehensive theoretical framework remains underdeveloped. Existing methods decompose transformations based on algebraic independence, but these methods primarily focus on commutative transformations and do not extend to cases where transformations are conditionally independent but noncommutative. To extend current representation learning frameworks, we draw inspiration from Galois theory, where the decomposition of groups through normal subgroups provides an approach for the analysis of structured transformations. Normal subgroups naturally extend commutativity under certain conditions and offer a foundation for the categorization of transformations, even when they do not commute. In this paper, we propose a novel approach that leverages normal subgroups to enable the separation of conditionally independent transformations, even in the absence of commutativity. Through experiments on geometric transformations in images, we show that our method successfully categorizes conditionally independent transformations, such as rotation and translation, in an unsupervised manner, suggesting a close link between group decomposition via normal subgroups and transformation categorization in representation learning.
zh
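
正规子群满足 gHg⁻¹ = H(对所有 g ∈ G)。下面在三元对称群 S3 中验证:交错群 A3 是正规子群,而由单个对换生成的二阶子群不是(该示例仅说明正规子群条件本身,与论文的表征学习实验无直接代码对应):

```python
from itertools import permutations

def compose(p, q):
    """置换复合:先作用 q,再作用 p。"""
    return tuple(p[q[i]] for i in range(len(q)))

def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

def is_normal(G, H):
    """检查 gHg⁻¹ ⊆ H 对所有 g ∈ G 成立(H 有限时即等价于 gHg⁻¹ = H)。"""
    Hs = set(H)
    return all(compose(compose(g, h), inverse(g)) in Hs for g in G for h in H)

S3 = list(permutations(range(3)))
# A3:偶置换(逆序数为偶数)构成的交错群
A3 = [p for p in S3
      if sum(1 for i in range(3) for j in range(i) if p[j] > p[i]) % 2 == 0]
swap = [(0, 1, 2), (1, 0, 2)]  # 由单个对换生成的二阶子群
print(is_normal(S3, A3), is_normal(S3, swap))  # True False
```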

[CV-102] Building LLM Agents by Incorporating Insights from Computer Systems

【速读】:该论文旨在解决大型语言模型(LLM)驱动的自主智能体设计缺乏系统性原则的问题,导致现有智能体结构在通用性和可扩展性方面的局限性。论文的关键在于提出了一种基于冯·诺依曼架构(von Neumann Architecture)启发的结构化框架,强调模块化设计与普适性原则,以实现更系统化的LLM智能体设计。通过从计算机系统视角全面回顾LLM智能体,并结合计算机系统设计理念识别挑战与未来方向,同时探索超越传统计算机系统的LLM学习机制,为LLM智能体的理论基础和未来发展提供支持。

链接: https://arxiv.org/abs/2504.04485
作者: Yapeng Mi,Zhi Gao,Xiaojian Ma,Qing Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:LLM-driven autonomous agents have emerged as a promising direction in recent years. However, many of these LLM agents are designed empirically or based on intuition, often lacking systematic design principles, which results in diverse agent structures with limited generality and scalability. In this paper, we advocate for building LLM agents by incorporating insights from computer systems. Inspired by the von Neumann architecture, we propose a structured framework for LLM agentic systems, emphasizing modular design and universal principles. Specifically, this paper first provides a comprehensive review of LLM agents from the computer system perspective, then identifies key challenges and future directions inspired by computer system design, and finally explores the learning mechanisms for LLM agents beyond the computer system. The insights gained from this comparative analysis offer a foundation for systematic LLM agent design and advancement.
zh

[CV-103] Statistical Guarantees Of False Discovery Rate In Medical Instance Segmentation Tasks Based on Conformal Risk Control

【速读】:本文旨在解决医学图像分析中实例分割模型在高风险场景应用时面临的置信度校准问题,这些问题可能导致误诊。论文的关键解决方案在于提出了一种基于一致性预测(conformal prediction)理论的鲁棒质量控制框架,该框架创新性地构建了一个风险感知动态阈值机制,能够根据临床需求自适应调整分割决策边界。此外,设计了一种校准感知(calibration-aware)损失函数,依据用户定义的风险水平 α 动态调节分割阈值,利用可交换校准数据确保测试数据上的期望假阴性率(FNR)或假发现率(FDR)以高概率低于 α。此方法兼容主流分割模型(如 Mask R-CNN、BlendMask+ResNet-50-FPN)及数据集(PASCAL VOC 格式),无需修改模型架构即可实现。实验结果表明,通过所开发的校准框架严格限制了测试集上的 FDR 指标。

链接: https://arxiv.org/abs/2504.04482
作者: Mengxia Dai,Wenqian Luo,Tianyang Li
机构: Yanbian University (延边大学); Shenzhen Technology University (深圳技术大学); Hebei University of Economics and Business (河北经贸大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Instance segmentation plays a pivotal role in medical image analysis by enabling precise localization and delineation of lesions, tumors, and anatomical structures. Although deep learning models such as Mask R-CNN and BlendMask have achieved remarkable progress, their application in high-risk medical scenarios remains constrained by confidence calibration issues, which may lead to misdiagnosis. To address this challenge, we propose a robust quality control framework based on conformal prediction theory. This framework innovatively constructs a risk-aware dynamic threshold mechanism that adaptively adjusts segmentation decision boundaries according to clinical needs. Specifically, we design a calibration-aware loss function that dynamically tunes the segmentation threshold based on a user-defined risk level α. Utilizing exchangeable calibration data, this method ensures that the expected FNR or FDR on test data remains below α with high probability. The framework maintains compatibility with mainstream segmentation models (e.g., Mask R-CNN, BlendMask+ResNet-50-FPN) and datasets (PASCAL VOC format) without requiring architectural modifications. Empirical results demonstrate that we rigorously bound the FDR metric marginally over the test set via our developed calibration framework.
zh
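
共形风险控制(CRC)选取阈值 λ,使校准集上的经验风险满足 (n·R̂(λ) + B)/(n + 1) ≤ α(损失上界 B = 1)。以下以假阴性率(FNR)为损失给出极简流程(数据与阈值网格均为虚构,仅示意论文所依赖的校准思想):

```python
def fnr(scores, labels, lam):
    """预测集取 {score >= lam};FNR = 被漏掉的正类比例。"""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not pos:
        return 0.0
    return sum(1 for s in pos if s < lam) / len(pos)

def crc_threshold(calib, alpha, grid):
    """calib: [(scores, labels)];在阈值网格上从大到小搜索,
    返回满足 (n*R̂ + B)/(n+1) <= alpha 的最大阈值(B = 1)。"""
    n = len(calib)
    for lam in sorted(grid, reverse=True):
        risk = sum(fnr(s, y, lam) for s, y in calib) / n
        if (n * risk + 1.0) / (n + 1) <= alpha:
            return lam
    return min(grid)

calib = [([0.9, 0.4, 0.2], [1, 1, 0]), ([0.8, 0.7, 0.1], [1, 0, 0])]
print(crc_threshold(calib, alpha=0.5, grid=[0.1, 0.3, 0.5, 0.7]))  # 0.7
```

阈值越高分割越保守,FNR 越大;CRC 在可交换性假设下保证测试期望风险不超过 α。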

[CV-104] VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT

【速读】:该论文旨在解决长视频理解中的两个主要问题:(1) 当前基于大型语言模型(Large Language Models, LLMs)的方法在处理长视频场景时缺乏专门的推理增强机制;(2) 对外部工具引入的错误或噪声较为敏感。为了解决这些问题,论文提出了一种专为长视频分析设计的链式思维(Chain-of-Thought, CoT)过程,并引入计划调整模式,使LLM能够逐步规划和调整信息收集策略。此外,通过结合LLM与外部工具的启发式不确定性估计,进一步指导CoT过程,从而提高新收集信息的可靠性,并优化信息收集策略以做出更稳健的决策。关键在于通过不确定性感知的CoT过程减少外部工具噪声的影响,同时提升最终输出的可靠性。这一方法被集成到名为VideoAgent2的系统中,并在三个专用长视频基准测试中展现出比先前最先进的基于代理的方法高出平均13.1%的性能。

链接: https://arxiv.org/abs/2504.04471
作者: Zhuo Zhi,Qiangqiang Wu,Minghe shen,Wenbo Li,Yinchuan Li,Kun Shao,Kaiwen Zhou
机构: University College London (伦敦大学学院); City University of Hong Kong (香港城市大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Long video understanding has emerged as an increasingly important yet challenging task in computer vision. Agent-based approaches are gaining popularity for processing long videos, as they can handle extended sequences and integrate various tools to capture fine-grained information. However, existing methods still face several challenges: (1) they often rely solely on the reasoning ability of large language models (LLMs) without dedicated mechanisms to enhance reasoning in long video scenarios; and (2) they remain vulnerable to errors or noise from external tools. To address these issues, we propose a specialized chain-of-thought (CoT) process tailored for long video analysis. Our proposed CoT with plan-adjust mode enables the LLM to incrementally plan and adapt its information-gathering strategy. We further incorporate heuristic uncertainty estimation of both the LLM and external tools to guide the CoT process. This allows the LLM to assess the reliability of newly collected information, refine its collection strategy, and make more robust decisions when synthesizing final answers. Empirical experiments show that our uncertainty-aware CoT effectively mitigates noise from external tools, leading to more reliable outputs. We implement our approach in a system called VideoAgent2, which also includes additional modules such as general context acquisition and specialized tool design. Evaluation on three dedicated long video benchmarks (and their subsets) demonstrates that VideoAgent2 outperforms the previous state-of-the-art agent-based method, VideoAgent, by an average of 13.1% and achieves leading performance among all zero-shot approaches.
zh

[CV-105] Domain Generalization for Face Anti-spoofing via Content-aware Composite Prompt Engineering

【速读】:该论文旨在解决跨域泛化(Domain Generalization, DG)在人脸活体检测(Face Anti-Spoofing, FAS)中的挑战,即域特定信号对微妙欺骗线索的显著干扰。为缓解这一问题,近期一些基于CLIP的方法通过调整视觉分类器权重来减少干扰,但这些方法存在两个不足:(1) 面部类别(如真实或欺骗)对CLIP模型缺乏语义信息,难以学习准确的类别描述;(2) 单一形式的提示无法充分表达多种类型的欺骗。论文的关键解决方案是提出了一种新的基于内容感知的复合提示工程(Content-aware Composite Prompt Engineering, CCPE),它生成实例级复合提示,包括固定模板和可学习提示。具体而言,CCPE从两方面构建内容感知提示:(1) 基于指令的大语言模型(Large Language Model, LLM)的内在内容提示能够充分利用丰富的迁移知识;(2) 可学习的内容提示通过Q-Former隐式提取最具信息量的视觉内容。此外,设计了一个跨模态引导模块(Cross-Modal Guidance Module, CGM),动态调整单模态特征以实现更好的融合效果。最终,实验验证表明,CCPE在多个跨域任务中表现出色,并达到了当前最先进的性能。

链接: https://arxiv.org/abs/2504.04470
作者: Jiabao Guo,Ajian Liu,Yunfeng Diao,Jin Zhang,Hui Ma,Bo Zhao,Richang Hong,Meng Wang
机构: School of Computer Science, Hefei University of Technology (合肥工业大学计算机科学学院); State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所多模态人工智能系统国家重点实验室); School of Electronic information Engineering, Taiyuan University of Technology (太原理工大学电子信息技术工程学院); School of Computer Science and Engineering, Faculty of Innovation Engineering, Macau University of Science and Technology (澳门科技大学计算机与工程学院创新工程系); School of Cyber Science and Engineering, Wuhan University (武汉大学网络空间安全学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The challenge of Domain Generalization (DG) in Face Anti-Spoofing (FAS) is the significant interference of domain-specific signals on subtle spoofing clues. Recently, some CLIP-based algorithms have been developed to alleviate this interference by adjusting the weights of visual classifiers. However, our analysis shows that this class-wise prompt engineering suffers from two shortcomings for DG FAS: (1) The facial categories, such as real or spoof, have no semantics for the CLIP model, making it difficult to learn accurate category descriptions. (2) A single form of prompt cannot portray the various types of spoofing. In this work, instead of class-wise prompts, we propose a novel Content-aware Composite Prompt Engineering (CCPE) that generates instance-wise composite prompts, including both fixed template and learnable prompts. Specifically, our CCPE constructs content-aware prompts from two branches: (1) Inherent content prompts explicitly benefit from abundant knowledge transferred from the instruction-based Large Language Model (LLM). (2) Learnable content prompts implicitly extract the most informative visual content via Q-Former. Moreover, we design a Cross-Modal Guidance Module (CGM) that dynamically adjusts unimodal features for fusion to achieve better generalized FAS. Finally, our CCPE has been validated for its effectiveness in multiple cross-domain experiments and achieves state-of-the-art (SOTA) results.
zh
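To make the "composite prompt" idea concrete, the sketch below scores an image feature against the sum of a fixed template embedding and a learnable content embedding via cosine similarity. The dimensions, the additive composition, and all variable names are illustrative assumptions, not CCPE's actual architecture.

```python
import numpy as np

# Toy sketch of a composite prompt: a fixed template embedding plus a
# learnable, content-dependent embedding, scored against an image feature
# by cosine similarity. Purely illustrative; not CCPE's real model.

rng = np.random.default_rng(0)
d = 8
image_feat = rng.normal(size=d)

fixed_prompt = rng.normal(size=d)      # e.g. embedding of "a photo of a real face"
learnable_prompt = rng.normal(size=d)  # content prompt, optimized during training

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

composite = fixed_prompt + learnable_prompt  # simplest possible composition
score = cosine(image_feat, composite)
print(score)  # a similarity in [-1, 1]
```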

[CV-106] Spatial-Geometry Enhanced 3D Dynamic Snake Convolutional Neural Network for Hyperspectral Image Classification

【速读】:该论文旨在解决高光谱图像分类中的几个关键挑战,包括复杂且稀疏的地物分布、小而聚集的结构以及多分支延伸特征导致的漏检问题。为更好地适应地物分布并实现自适应动态特征响应,同时跳过冗余信息,论文提出了一种基于改进3D-DenseNet模型的空间-几何增强三维动态蛇网络(Spatial-Geometry Enhanced 3D Dynamic Snake Network, SG-DSCNet)。其核心解决方案在于引入动态蛇卷积(Dynamic Snake Convolution, DSCConv),通过约束自学习引入可变形偏移量以增强核灵活性,从而提升对地物区域的感知能力。此外,论文还提出了一种多视角特征融合策略,利用DSCConv生成多种形态的核模板,从不同视角观察目标结构,并通过总结关键特性实现高效特征融合。这种方法使模型能够更灵活地关注不同区域的关键空间结构,而非依赖单一静态核固定的感受野。DSC模块通过动态核聚合增强了模型的表征能力,且未增加网络深度或宽度。实验结果表明,该方法在IN、UP和KSC数据集上的表现优于主流高光谱分类方法。

链接: https://arxiv.org/abs/2504.04463
作者: Guandong Li,Mengxia Ye
机构: aiFLYTEK (科大讯飞); Aegon THTF (安信信托)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep neural networks face several challenges in hyperspectral image classification, including complex and sparse ground object distributions, small clustered structures, and elongated multi-branch features that often lead to missing detections. To better adapt to ground object distributions and achieve adaptive dynamic feature responses while skipping redundant information, this paper proposes a Spatial-Geometry Enhanced 3D Dynamic Snake Network (SG-DSCNet) based on an improved 3D-DenseNet model. The network employs Dynamic Snake Convolution (DSCConv), which introduces deformable offsets to enhance kernel flexibility through constrained self-learning, thereby improving regional perception of ground objects. Additionally, we propose a multi-view feature fusion strategy that generates multiple morphological kernel templates from DSCConv to observe target structures from different perspectives and achieve efficient feature fusion through summarizing key characteristics. This dynamic approach enables the model to focus more flexibly on critical spatial structures when processing different regions, rather than relying on fixed receptive fields of single static kernels. The DSC module enhances model representation capability through dynamic kernel aggregation without increasing network depth or width. Experimental results demonstrate superior performance on the IN, UP, and KSC datasets, outperforming mainstream hyperspectral classification methods.
zh
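The core mechanism here, sampling at learned deformable offsets rather than a fixed grid, can be illustrated in one dimension: a 3-tap kernel reads the input at base positions shifted by offsets, letting the receptive field bend along a structure. The offsets below are hand-set constants; in DSCConv they are predicted and constrained by self-learning.

```python
import numpy as np

# Minimal 1-D illustration of deformable ("snake") sampling: a kernel
# reads the input at base positions shifted by learned offsets. Offsets
# here are hand-set; this is not the paper's DSCConv implementation.

signal = np.arange(10, dtype=float)   # toy 1-D feature row
center = 5
base = np.array([-1, 0, 1])           # ordinary 3-tap kernel positions
offsets = np.array([-1, 0, 2])        # deformation offsets (assumed values)

positions = np.clip(center + base + offsets, 0, len(signal) - 1)
samples = signal[positions]
print(positions, samples)  # [3 5 8] [3. 5. 8.]
```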

[CV-107] VSLAM-LAB: A Comprehensive Framework for Visual SLAM Methods and Datasets

【速读】:本文针对视觉同时定位与建图(VSLAM)研究中存在的工具链碎片化、系统配置复杂以及评估方法不一致等问题提出了解决方案。论文的关键在于引入了一个名为VSLAM-LAB的统一框架,该框架通过提供无缝集成和配置VSLAM算法的能力、自动化数据集下载与预处理、标准化实验设计与执行及评估等功能,简化了整个工作流。所有这些功能都可以通过一个命令行界面访问。此外,VSLAM-LAB支持多种VSLAM系统和数据集,具有广泛的兼容性和可扩展性,并通过一致的评估指标和分析工具促进了结果的可重复性。通过降低实现复杂度并减少配置开销,VSLAM-LAB使研究人员能够专注于推进VSLAM方法论的发展,并加速向实际应用迈进的步伐。

链接: https://arxiv.org/abs/2504.04457
作者: Alejandro Fontan,Tobias Fischer,Javier Civera,Michael Milford
机构: QUT Centre for Robotics, School of Electrical Engineering and Robotics, Queensland University of Technology (昆士兰科技大学); I3A, Universidad de Zaragoza (西班牙萨拉戈萨大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Simultaneous Localization and Mapping (VSLAM) research faces significant challenges due to fragmented toolchains, complex system configurations, and inconsistent evaluation methodologies. To address these issues, we present VSLAM-LAB, a unified framework designed to streamline the development, evaluation, and deployment of VSLAM systems. VSLAM-LAB simplifies the entire workflow by enabling seamless compilation and configuration of VSLAM algorithms, automated dataset downloading and preprocessing, and standardized experiment design, execution, and evaluation, all accessible through a single command-line interface. The framework supports a wide range of VSLAM systems and datasets, offering broad compatibility and extendability while promoting reproducibility through consistent evaluation metrics and analysis tools. By reducing implementation complexity and minimizing configuration overhead, VSLAM-LAB empowers researchers to focus on advancing VSLAM methodologies and accelerates progress toward scalable, real-world solutions. We demonstrate the ease with which user-relevant benchmarks can be created: here, we introduce difficulty-level-based categories, but one could envision environment-specific or condition-specific categories.
zh
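One of the "consistent evaluation metrics" such a framework standardizes is the absolute trajectory error (ATE) RMSE. The sketch below computes it for two toy trajectories; for brevity it assumes the trajectories are already aligned, whereas a real evaluation first solves for a rigid or Sim(3) alignment. This is an illustration of the metric, not VSLAM-LAB's code.

```python
import numpy as np

# ATE RMSE on toy 2-D trajectories, assumed pre-aligned. A real SLAM
# evaluation first aligns estimate to ground truth; omitted here.

gt = np.array([[0, 0], [1, 0], [2, 0]], dtype=float)        # ground-truth xy
est = np.array([[0, 0.1], [1, -0.1], [2, 0.1]], dtype=float)  # estimated xy

errors = np.linalg.norm(gt - est, axis=1)          # per-pose position error
ate_rmse = float(np.sqrt(np.mean(errors ** 2)))
print(ate_rmse)  # 0.1 (up to float rounding)
```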

[CV-108] PRISM: Probabilistic Representation for Integrated Shape Modeling and Generation

【速读】:该论文旨在解决在3D形状生成中准确建模复杂几何结构与语义信息的挑战,特别是在形状部件数量各异的情况下。当前方法难以有效将三维形状的上下文和结构信息整合到生成过程中。为了解决这些问题,论文提出了一种名为PRISM的新颖组合式方法,该方法将类别扩散模型与统计形状模型(Statistical Shape Model, SSM)以及高斯混合模型(Gaussian Mixture Model, GMM)相结合。关键在于利用组合式SSM捕捉部件级别的几何变化,并使用GMM以连续空间表示部件语义,从而实现生成形状的高保真度和多样性,同时保持结构一致性。通过广泛的形状生成和操作任务实验,证明了该方法在部件级别操作的质量和可控性方面显著优于先前的方法。

链接: https://arxiv.org/abs/2504.04454
作者: Lei Cheng,Mahdi Saleh,Qing Cheng,Lu Sang,Hongli Xu,Daniel Cremers,Federico Tombari
机构: Technical University of Munich (慕尼黑工业大学); Google (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Despite the advancements in 3D full-shape generation, accurately modeling complex geometries and semantics of shape parts remains a significant challenge, particularly for shapes with varying numbers of parts. Current methods struggle to effectively integrate the contextual and structural information of 3D shapes into their generative processes. We address these limitations with PRISM, a novel compositional approach for 3D shape generation that integrates categorical diffusion models with Statistical Shape Models (SSM) and Gaussian Mixture Models (GMM). Our method employs compositional SSMs to capture part-level geometric variations and uses GMM to represent part semantics in a continuous space. This integration enables both high fidelity and diversity in generated shapes while preserving structural coherence. Through extensive experiments on shape generation and manipulation tasks, we demonstrate that our approach significantly outperforms previous methods in both quality and controllability of part-level operations. Our code will be made publicly available.
zh

[CV-109] hermoxels: a voxel-based method to generate simulation-ready 3D thermal models

【速读】:该论文旨在解决现有建筑能耗评估方法在支持数据驱动的节能决策方面的局限性。当前方法主要依赖定性的热成像,缺乏精确的定量分析,而基于有限元分析(Finite Element Analysis, FEA)的定量评估虽然提供精准洞察,但需要繁琐且易出错的手动CAD设计。论文提出的关键解决方案是Thermoxels,这是一种基于体素的新方法,能够从稀疏的RGB和热图像生成兼容FEA的模型,包括几何结构和温度信息。通过优化处理,Thermoxels生成的模型可转化为适用于FEA的四面体网格,并展示了优于现有最先进方法的能力。其关键是实现了从图像到FEA兼容模型的自动化转换,同时保持几何与温度信息的精确性。

链接: https://arxiv.org/abs/2504.04448
作者: Etienne Chassaing,Florent Forest,Olga Fink,Malcolm Mielle
机构: IMOS, École Polytechnique Fédérale de Lausanne (EPFL)(洛桑联邦理工学院), Lausanne, Switzerland; Schindler EPFL Lab(迅达EPFL实验室), Lausanne, Switzerland
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 7 pages, 2 figures

点击查看摘要

Abstract:In the European Union, buildings account for 42% of energy use and 35% of greenhouse gas emissions. Since most existing buildings will still be in use by 2050, retrofitting is crucial for emissions reduction. However, current building assessment methods rely mainly on qualitative thermal imaging, which limits data-driven decisions for energy savings. On the other hand, quantitative assessments using finite element analysis (FEA) offer precise insights but require manual CAD design, which is tedious and error-prone. Recent advances in 3D reconstruction, such as Neural Radiance Fields (NeRF) and Gaussian Splatting, enable precise 3D modeling from sparse images but lack clearly defined volumes and the interfaces between them needed for FEA. We propose Thermoxels, a novel voxel-based method able to generate FEA-compatible models, including both geometry and temperature, from a sparse set of RGB and thermal images. Using pairs of RGB and thermal images as input, Thermoxels represents a scene’s geometry as a set of voxels comprising color and temperature information. After optimization, a simple process is used to transform Thermoxels’ models into tetrahedral meshes compatible with FEA. We demonstrate Thermoxels’ capability to generate RGB+Thermal meshes of 3D scenes, surpassing other state-of-the-art methods. To showcase the practical applications of Thermoxels’ models, we conduct a simple heat conduction simulation using FEA, achieving convergence from an initial state defined by Thermoxels’ thermal reconstruction. Additionally, we compare Thermoxels’ image synthesis abilities with current state-of-the-art methods, showing competitive results, and discuss the limitations of existing metrics in assessing mesh quality.
zh
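To illustrate the kind of simulation an FEA-ready thermal model enables, the sketch below runs an explicit finite-difference heat-conduction relaxation on a 1-D row of "voxels" with fixed hot and cold boundaries. It is a didactic stand-in for the paper's FEA pipeline, not Thermoxels' implementation.

```python
import numpy as np

# Explicit finite-difference heat conduction on a 1-D voxel row with
# Dirichlet boundaries. Didactic stand-in, not Thermoxels' FEA pipeline.

temps = np.array([100.0, 20.0, 20.0, 20.0, 20.0])  # initial temperatures (°C)
alpha = 0.25                                        # stable diffusion number

for _ in range(200):
    lap = np.zeros_like(temps)
    lap[1:-1] = temps[2:] - 2 * temps[1:-1] + temps[:-2]  # discrete Laplacian
    temps = temps + alpha * lap
    temps[0] = 100.0   # fixed boundary (hot wall)
    temps[-1] = 20.0   # fixed boundary (cold wall)

print(np.round(temps, 2))  # converges to the linear steady state 100..20
```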

[CV-110] Evaluation framework for Image Segmentation Algorithms

【速读】:本文旨在构建一个全面的图像分割算法评估框架,涵盖简单方法、机器学习方法以及深度学习技术,以解决现有图像分割算法在不同应用场景下的性能评估与优化问题。论文的关键在于提出了一种结合算法与用户交互的综合解决方案,通过三种主要方式——算法辅助用户、用户辅助算法及混合方法,实现图像分割精度的提升。同时,论文利用交并比(Intersection over Union, IoU)、计算时间和用户交互时间等评价指标,对多种分割方法进行系统性比较分析,揭示其优势、局限及权衡。关键创新点在于强调交互式分割的作用,并探索如何通过实时反馈、弱监督与自监督学习等手段进一步提高分割算法的准确性和效率。

链接: https://arxiv.org/abs/2504.04435
作者: Tatiana Merkulova,Bharani Jayakumar
机构: Technische Universität Ilmenau (伊尔梅瑙工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a comprehensive evaluation framework for image segmentation algorithms, encompassing naive methods, machine learning approaches, and deep learning techniques. We begin by introducing the fundamental concepts and importance of image segmentation, and the role of interactive segmentation in enhancing accuracy. A detailed background theory section explores various segmentation methods, including thresholding, edge detection, region growing, feature extraction, random forests, support vector machines, convolutional neural networks, U-Net, and Mask R-CNN. The implementation and experimental setup are thoroughly described, highlighting three primary approaches: algorithm assisting user, user assisting algorithm, and hybrid methods. Evaluation metrics such as Intersection over Union (IoU), computation time, and user interaction time are employed to measure performance. A comparative analysis presents detailed results, emphasizing the strengths, limitations, and trade-offs of each method. The paper concludes with insights into the practical applicability of these approaches across various scenarios and outlines future work, focusing on expanding datasets, developing more representative approaches, integrating real-time feedback, and exploring weakly supervised and self-supervised learning paradigms to enhance segmentation accuracy and efficiency. Keywords: Image Segmentation, Interactive Segmentation, Machine Learning, Deep Learning, Computer Vision
zh
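The Intersection over Union (IoU) metric named above has a compact definition for binary masks; the masks below are toy data and the empty-vs-empty convention is one common choice among several.

```python
import numpy as np

# IoU for binary segmentation masks: |pred ∩ gt| / |pred ∪ gt|.

def iou(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0  # empty-vs-empty counts as perfect

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
print(iou(pred, gt))  # 2 intersecting / 4 union pixels = 0.5
```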

[CV-111] FluentLip: A Phonemes-Based Two-stage Approach for Audio-Driven Lip Synthesis with Optical Flow Consistency

【速读】:该论文旨在解决音频驱动的唇部运动连续图像生成任务中的唇同步性、可理解性以及视频流畅性等挑战。现有研究虽在同步性和视觉质量方面取得进展,但唇部动作的清晰度与视频流畅性仍存在不足。为应对这些问题,论文提出了一种名为FluentLip的两阶段方法,其关键在于结合三种特色策略:通过引入音素提取器和编码器融合音频与音素信息以实现多模态学习,从而提升唇同步性和可理解性;利用光流一致性损失确保图像帧间自然过渡;在生成对抗网络(GANs)训练过程中引入扩散链以增强模型的稳定性和效率。实验结果表明,FluentLip在多个指标上表现出色,尤其在Fréchet Inception Distance (FID) 和自定义的Phoneme Error Rate (PER) 上分别提升了16.3%和35.2%,显著改善了生成视频的平滑度和自然度。

链接: https://arxiv.org/abs/2504.04427
作者: Shiyan Liu,Rui Qu,Yan Jin
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Generating consecutive images of lip movements that align with a given speech in audio-driven lip synthesis is a challenging task. While previous studies have made strides in synchronization and visual quality, lip intelligibility and video fluency remain persistent challenges. This work proposes FluentLip, a two-stage approach for audio-driven lip synthesis, incorporating three featured strategies. To improve lip synchronization and intelligibility, we integrate a phoneme extractor and encoder to generate a fusion of audio and phoneme information for multimodal learning. Additionally, we employ optical flow consistency loss to ensure natural transitions between image frames. Furthermore, we incorporate a diffusion chain during the training of Generative Adversarial Networks (GANs) to improve both stability and efficiency. We evaluate our proposed FluentLip through extensive experiments, comparing it with five state-of-the-art (SOTA) approaches across five metrics, including a proposed metric called Phoneme Error Rate (PER) that evaluates lip pose intelligibility and video fluency. The experimental results demonstrate that our FluentLip approach is highly competitive, achieving significant improvements in smoothness and naturalness. In particular, it outperforms these SOTA approaches by approximately 16.3% in Fréchet Inception Distance (FID) and 35.2% in PER.
zh
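The optical-flow consistency idea can be sketched as: warp frame t by a flow field and penalize the difference to frame t+1. The toy below uses a known integer flow and a crude wrap-around warp; real losses use estimated flow and sub-pixel (bilinear) warping, so this is a simplified illustration rather than FluentLip's exact formulation.

```python
import numpy as np

# Toy optical-flow consistency loss: warp frame t by a known integer flow
# and take the L1 difference to frame t+1. Simplified illustration only.

frame_t = np.zeros((4, 4)); frame_t[1, 1] = 1.0
frame_t1 = np.zeros((4, 4)); frame_t1[1, 2] = 1.0  # the pixel moved right by 1
flow_x = 1                                         # assumed horizontal flow

warped = np.roll(frame_t, shift=flow_x, axis=1)    # crude wrap-around warp
loss = float(np.abs(warped - frame_t1).mean())
print(loss)  # 0.0: the warp fully explains the motion
```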

[CV-112] UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding CVPR2025

【速读】:该论文旨在解决视觉理解与图像生成任务之间的割裂问题,提出一种能够统一处理这两种任务的模型。解决方案的关键在于引入UniToken模型,它通过结合离散和连续表示来编码视觉输入,从而实现视觉理解和图像生成的无缝集成。与依赖单一视觉表征的传统方法不同,UniToken的统一视觉编码框架既能捕捉高层语义又能保留低层细节,提供多维度信息以支持异构任务根据自身特性选择性地吸收领域特定知识。这一创新方法通过实验验证了其在多个基准数据集上的卓越性能,超越现有技术,确立了其作为未来研究的坚实基础。

链接: https://arxiv.org/abs/2504.04423
作者: Yang Jiao,Haibo Qiu,Zequn Jie,Shaoxiang Chen,Jingjing Chen,Lin Ma,Yu-Gang Jiang
机构: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University (复旦大学计算机学院智能信息处理重点实验室); Shanghai Collaborative Innovation Center on Intelligent Visual Computing (上海智能视觉创新中心); Meituan (美团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2025 workshop

点击查看摘要

Abstract:We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations, enabling seamless integration of unified visual understanding and image generation tasks. Unlike previous approaches that rely on unilateral visual representations, our unified visual encoding framework captures both high-level semantics and low-level details, delivering multidimensional information that empowers heterogeneous tasks to selectively assimilate domain-specific knowledge based on their inherent characteristics. Through in-depth experiments, we uncover key principles for developing a unified model capable of both visual understanding and image generation. Extensive evaluations across a diverse range of prominent benchmarks demonstrate that UniToken achieves state-of-the-art performance, surpassing existing approaches. These results establish UniToken as a robust foundation for future research in this domain. The code and models are available at this https URL.
zh
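The "discrete plus continuous" encoding idea can be sketched as vector quantization against a codebook, concatenated with the continuous feature it came from. Codebook size, dimension, and the concatenation scheme are illustrative assumptions, not UniToken's actual encoder.

```python
import numpy as np

# Sketch of combining a discrete (vector-quantized) visual token with the
# continuous feature it was quantized from. Illustrative only.

rng = np.random.default_rng(1)
codebook = rng.normal(size=(16, 4))   # 16 discrete visual tokens (assumed size)
feat = rng.normal(size=4)             # continuous visual feature

token_id = int(np.argmin(np.linalg.norm(codebook - feat, axis=1)))
discrete = codebook[token_id]                 # nearest codebook entry
combined = np.concatenate([discrete, feat])   # both views feed the LLM

print(token_id, combined.shape)  # some id in [0, 16), shape (8,)
```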

[CV-113] Hypothesis Testing for Progressive Kernel Estimation and VCM Framework

【速读】:该论文旨在解决无偏核估计的半径选择问题,这对于辐射度估计的效率至关重要,但确定合适的半径及保证无偏性仍面临重大挑战。论文的关键解决方案在于首先提出了一种基于逐步核估计的光子样本及其贡献的统计模型,并证明若该模型的零假设成立,则核估计无偏。接着,通过方差分析中的F检验方法,提出了判断是否拒绝统计总体(即光子样本)零假设的方法,以此实现逐步光子映射(PPM)算法中核半径的确定,用于无偏辐射度估计。此外,论文还提出了VCM+,即Vertex Connection and Merging (VCM) 的增强版本,并推导出其理论上无偏的公式。VCM+结合了基于假设检验的PPM与双向路径追踪(BDPT),并通过多重重要性采样(MIS)整合两者贡献,其中核半径能够充分利用PPM和BDPT的优势。实验结果表明,所提方法能够缓解现有辐射度估计算法中的光照泄漏和视觉模糊伪影,并在多种场景下观察到整体性能的提升。

链接: https://arxiv.org/abs/2504.04411
作者: Zehui Lin,Chenxiao Hu,Jinzhu Jia,Sheng Li
机构: School of Computer Science, Peking University, China (北京大学计算机学院); Dept. of Biostatistics and Center for Statistical Science, Peking University, China (北京大学生物统计学系和统计科学中心)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been published in IEEE Transactions on Visualization and Computer Graphics. This version is a preprint one

点击查看摘要

Abstract:Identifying an appropriate radius for unbiased kernel estimation is crucial for the efficiency of radiance estimation. However, determining both the radius and unbiasedness still faces big challenges. In this paper, we first propose a statistical model of photon samples and associated contributions for progressive kernel estimation, under which the kernel estimation is unbiased if the null hypothesis of this statistical model stands. Then, we present a method to decide whether to reject the null hypothesis about the statistical population (i.e., photon samples) by the F-test in the Analysis of Variance. Hereby, we implement a progressive photon mapping (PPM) algorithm, wherein the kernel radius is determined by this hypothesis test for unbiased radiance estimation. Secondly, we propose VCM+, a reinforcement of Vertex Connection and Merging (VCM), and derive its theoretically unbiased formulation. VCM+ combines hypothesis testing-based PPM with bidirectional path tracing (BDPT) via multiple importance sampling (MIS), wherein our kernel radius can leverage the contributions from PPM and BDPT. We test our new algorithms, improved PPM and VCM+, on diverse scenarios with different lighting settings. The experimental results demonstrate that our method can alleviate light leaks and visual blur artifacts of prior radiance estimate algorithms. We also evaluate the asymptotic performance of our approach and observe an overall improvement over the baseline in all testing scenarios.
zh
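The F-test in one-way ANOVA, which the paper uses to decide whether groups of photon samples share a mean (the null hypothesis), can be computed by hand. The data below are synthetic; the PPM-specific statistical model is not reproduced.

```python
import numpy as np

# Hand-rolled one-way ANOVA F statistic on synthetic groups: small F when
# the null (equal group means) holds, large F when it is violated.

def f_statistic(groups):
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = np.mean(np.concatenate(groups))
    ss_between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(((g - np.mean(g)) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

rng = np.random.default_rng(0)
same = [rng.normal(0, 1, 50) for _ in range(3)]      # null hypothesis holds
shifted = [rng.normal(m, 1, 50) for m in (0, 0, 3)]  # null hypothesis violated

print(f_statistic(same), f_statistic(shifted))  # small F vs. large F
```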

[CV-114] Future-Proof Yourself: An AI Era Survival Guide

【速读】:该论文旨在帮助读者理解和适应快速发展的人工智能技术在日常生活中的应用与影响。其核心问题是如何以易于理解的方式向普通读者介绍人工智能的基本原理、发展历程及其未来趋势。论文的关键解决方案在于采用简单且贴近生活的语言,逐步从基础的数据学习概念引入现代人工智能的方法,并通过回顾历史突破和展望新兴趋势(如数字孪生、可穿戴设备和虚拟环境的结合),使复杂的技术思想变得清晰易懂,从而让非专业人士也能掌握即将塑造未来的科技知识。

链接: https://arxiv.org/abs/2504.04378
作者: Taehoon Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 chapters, 259 pages, Textbook for “Data AI” and “Artificial Intelligence” at Sogang University Graduate School of Metaverse

点击查看摘要

Abstract:Future-Proof Yourself is a practical guide that helps readers navigate the fast-changing world of artificial intelligence in everyday life. The book begins by explaining how computers learn from data in simple, relatable terms, and gradually introduces the methods used in modern AI. It shows how basic ideas in machine learning evolve into advanced systems that can recognize images, understand language, and even make decisions. The guide also reviews the history of AI and highlights the major breakthroughs that have shaped its growth. Looking ahead, the book explores emerging trends such as the integration of AI with digital twins, wearable devices, and virtual environments. Designed for a general audience, the text avoids heavy technical jargon and presents complex ideas in clear, straightforward language so that anyone can gain a solid understanding of the technology that is set to transform our future.
zh

[CV-115] OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning

【速读】:本文旨在解决将视觉-语言模型(Vision-Language Models, VLMs)在二维(2D)场景中的强大推理能力扩展到三维(3D)全空间理解的问题,以满足自动驾驶领域真实应用场景的需求。为应对这一挑战,论文提出OmniDrive,这是一个整合代理模型与3D驾驶任务的综合视觉-语言数据集,通过反事实推理(Counterfactual Reasoning)实现对潜在场景及其结果的评估,类似于人类驾驶员考虑替代行为的过程。解决方案的关键在于利用基于反事实的合成数据标注流程生成大规模高质量数据集,提供更密集的监督信号,从而弥合规划轨迹与基于语言推理之间的差距。此外,通过设计Omni-L和Omni-Q两种高级框架,进一步探索视觉-语言对齐与3D感知的重要性,为构建高效的大规模语言模型(LLM)代理提供了重要见解,并在DriveLM Q&A基准测试和nuScenes开环规划任务中验证了方法的有效性。

链接: https://arxiv.org/abs/2504.04348
作者: Shihao Wang,Zhiding Yu,Xiaohui Jiang,Shiyi Lan,Min Shi,Nadine Chang,Jan Kautz,Ying Li,Jose M. Alvarez
机构: NVIDIA(英伟达); The Hong Kong Polytechnic University(香港理工大学); Beijing Institute of Technology(北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The advances in vision-language models (VLMs) have led to a growing interest in autonomous driving to leverage their strong reasoning capabilities. However, extending these capabilities from 2D to full 3D understanding is crucial for real-world applications. To address this challenge, we propose OmniDrive, a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. This approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions. Our counterfactual-based synthetic data annotation process generates large-scale, high-quality datasets, providing denser supervision signals that bridge planning trajectories and language-based reasoning. Further, we explore two advanced OmniDrive-Agent frameworks, namely Omni-L and Omni-Q, to assess the importance of vision-language alignment versus 3D perception, revealing critical insights into designing effective LLM-agents. Significant improvements on the DriveLM Q&A benchmark and nuScenes open-loop planning demonstrate the effectiveness of our dataset and methods.
zh

[CV-116] AnomalyHybrid: A Domain-agnostic Generative Framework for General Anomaly Detection CVPR2025

【速读】:该论文旨在解决异常检测任务中数据稀缺的问题,通过提出一种跨领域的异常生成方法,以生成真实且多样化的异常样本,从而缓解特定应用领域(如工业)之外的异常生成难题。论文的关键解决方案在于提出了一种名为AnomalyHybrid的领域无关框架,基于生成对抗网络(GAN),利用两个解码器分别将参考图像的外观特征融入目标图像的深度结构和边缘结构中。这种设计不仅能够实现具有深度变化(如凸起或凹陷)的异常的真实生成,还通过放松边缘解码器的精细结构控制,增加了生成结果的多样性。此外,AnomalyHybrid无需标注信息,仅依赖于具有不同增强方式的彩色、深度和边缘图像集合即可进行训练,从而有效应对多领域异常生成及其下游分类、检测和分割任务。

链接: https://arxiv.org/abs/2504.04340
作者: Ying Zhao
机构: Ricoh Software Research Center (Beijing) Co., Ltd. (理光软件研究院(北京)有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025 workshop on Harnessing Generative Models for Synthetic Visual Datasets (SyntaGen)

点击查看摘要

Abstract:Anomaly generation is an effective way to mitigate data scarcity for the anomaly detection task. Most existing works shine at industrial anomaly generation with multiple specialists or large generative models, rarely generalizing to anomalies in other applications. In this paper, we present AnomalyHybrid, a domain-agnostic framework designed to generate authentic and diverse anomalies simply by combining the reference and target images. AnomalyHybrid is a Generative Adversarial Network (GAN)-based framework having two decoders that integrate the appearance of the reference image into the depth and edge structures of the target image, respectively. With the help of the depth decoder, AnomalyHybrid achieves authentic generation, especially for anomalies with changing depth values, such as protrusions and dents. Moreover, it relaxes the fine-granularity structural control of the edge decoder and brings more diversity. Without using annotations, AnomalyHybrid is easily trained with sets of color, depth and edge maps of the same images under different augmentations. Extensive experiments carried out on the HeliconiusButterfly, MVTecAD and MVTec3D datasets demonstrate that AnomalyHybrid surpasses the GAN-based state-of-the-art on anomaly generation and its downstream anomaly classification, detection and segmentation tasks. On the MVTecAD dataset, AnomalyHybrid achieves 2.06/0.32 IS/LPIPS for anomaly generation, 52.6 Acc for anomaly classification with ResNet34, and 97.3/72.9 AP for image/pixel-level anomaly detection with a simple UNet.
zh

[CV-117] NCL-CIR: Noise-aware Contrastive Learning for Composed Image Retrieval ICASSP2025

【速读】:该论文旨在解决现有合成图像检索(Composed Image Retrieval, CIR)方法在处理查询与目标图像部分或完全不匹配时易产生噪声对(False Positive Pairs, FFPs)的问题。这些问题源于修改文本不准确、目标图像质量低下以及标注错误等实际情况。忽视这些不匹配会导致模型过拟合,从而降低性能。为了解决这一挑战,论文提出了针对CIR的噪声感知对比学习框架(Noise-aware Contrastive Learning for CIR, NCL-CIR),其关键是包含两个核心组件:权重补偿块(Weight Compensation Block, WCB)和噪声对过滤块(Noise-pair Filter Block, NFB)。WCB通过多样化的权重图确保多模态查询和目标图像的稳定表示;而NFB结合高斯混合模型(Gaussian Mixture Model, GMM)预测噪声对,并生成软标签,进而设计基于软标签的噪声对比估计(Noise Contrastive Estimation, NCE)损失函数。最终,该架构有效减轻了不匹配样本的影响,在基准数据集上的实验结果验证了其卓越性能。

链接: https://arxiv.org/abs/2504.04339
作者: Peng Gao,Yujian Lee,Zailong Chen,Hui zhang,Xubo Liu,Yiyang Hu,Guquang Jing
机构: Hong Kong Baptist University (香港浸会大学); BNU-HKBU United International College (北京师范大学-香港浸会大学联合国际学院, 中国); University of Wollongong (卧龙岗大学, 澳大利亚); University of Surrey (萨里大学, 英国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Has been accepted by ICASSP2025

点击查看摘要

Abstract:Composed Image Retrieval (CIR) seeks to find a target image using a multi-modal query, which combines an image with modification text to pinpoint the target. While recent CIR methods have shown promise, they mainly focus on exploring relationships between the query pairs (image and text) through data augmentation or model design. These methods often assume perfect alignment between queries and target images, an idealized scenario rarely encountered in practice. In reality, pairs are often partially or completely mismatched due to issues like inaccurate modification texts, low-quality target images, and annotation errors. Ignoring these mismatches leads to numerous False Positive Pairs (FFPs), denoted as noise pairs in the dataset, causing the model to overfit and ultimately reducing its performance. To address this problem, we propose the Noise-aware Contrastive Learning for CIR (NCL-CIR), comprising two key components: the Weight Compensation Block (WCB) and the Noise-pair Filter Block (NFB). The WCB, coupled with diverse weight maps, can ensure more stable token representations of multi-modal queries and target images. Meanwhile, the NFB, in conjunction with a Gaussian Mixture Model (GMM), predicts noise pairs by evaluating loss distributions and generates soft labels accordingly, allowing the design of a soft-label-based Noise Contrastive Estimation (NCE) loss function. Consequently, the overall architecture helps to mitigate the influence of mismatched and partially matched samples, with experimental results demonstrating that NCL-CIR achieves exceptional performance on the benchmark datasets.
zh
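The GMM-based noise-pair filtering step can be sketched as: fit a two-component 1-D Gaussian mixture to per-pair losses by EM, then read the posterior of the low-loss component as a soft "clean" label. The loss values and EM details below are a simplified illustration, not NCL-CIR's implementation.

```python
import numpy as np

# Fit a 2-component 1-D GMM to per-pair losses by EM; the posterior of
# the low-loss component serves as a soft "clean" label. Toy data only.

rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.5, 0.1, 80),   # clean pairs: low loss
                         rng.normal(2.0, 0.3, 20)])  # noise pairs: high loss

mu = np.array([losses.min(), losses.max()])  # component 0 = low-loss component
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

for _ in range(50):  # EM iterations
    # E-step: responsibility of each component for each loss value
    dens = pi * np.exp(-0.5 * ((losses[:, None] - mu) / sigma) ** 2) / sigma
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update mixture parameters
    nk = resp.sum(axis=0)
    mu = (resp * losses[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (losses[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    pi = nk / len(losses)

clean_prob = resp[:, 0]  # soft label: probability that a pair is clean
print(clean_prob[:3].round(2), clean_prob[-3:].round(2))
```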

[CV-118] Data Scaling Laws for End-to-End Autonomous Driving CVPR2025

【速读】:该论文试图解决传统自动驾驶(Autonomous Vehicle, AV)堆栈中因模块化分解设计导致的信息丢失、计算开销增加以及误差累积的问题。为应对这些挑战,论文提出了一种端到端可微分模型的集成架构,强调通过数据工程而非软件集成来实现系统性能的整体优化。解决方案的关键在于构建一个端到端的驾驶架构,并通过大规模训练数据提升模型性能,从而在开放环路(open-loop)指标和闭环仿真(closed-loop simulation)中评估其效果,同时研究达到特定性能增益所需的额外训练数据量,以指导自动驾驶开发中的数据驱动决策。

链接: https://arxiv.org/abs/2504.04338
作者: Alexander Naumann,Xunjiang Gu,Tolga Dimlioglu,Mariusz Bojarski,Alperen Degirmenci,Alexander Popov,Devansh Bisla,Marco Pavone,Urs Müller,Boris Ivanovic
机构: NVIDIA(英伟达); University of Toronto(多伦多大学); New York University(纽约大学); Stanford University(斯坦福大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 11 figures, 4 tables, CVPR 2025 Workshop on Autonomous Driving

点击查看摘要

Abstract:Autonomous vehicle (AV) stacks have traditionally relied on decomposed approaches, with separate modules handling perception, prediction, and planning. However, this design introduces information loss during inter-module communication, increases computational overhead, and can lead to compounding errors. To address these challenges, recent works have proposed architectures that integrate all components into an end-to-end differentiable model, enabling holistic system optimization. This shift emphasizes data engineering over software integration, offering the potential to enhance system performance by simply scaling up training resources. In this work, we evaluate the performance of a simple end-to-end driving architecture on internal driving datasets ranging in size from 16 to 8192 hours with both open-loop metrics and closed-loop simulations. Specifically, we investigate how much additional training data is needed to achieve a target performance gain, e.g., a 5% improvement in motion prediction accuracy. By understanding the relationship between model performance and training dataset size, we aim to provide insights for data-driven decision-making in autonomous driving development.
zh
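The question "how much additional training data for a target gain?" is commonly answered by fitting a power law, error ≈ a · N^(-b), on log-log axes and extrapolating. The sketch below does this on synthetic points; the numbers are not the paper's measurements.

```python
import numpy as np

# Fit a power law error = a * N^b on log-log axes and extrapolate the
# dataset size needed for a target error. Synthetic data, not the paper's.

hours = np.array([16, 64, 256, 1024, 4096], dtype=float)
error = 2.0 * hours ** -0.3          # synthetic scaling-law measurements

b, log_a = np.polyfit(np.log(hours), np.log(error), 1)  # slope, intercept
a = np.exp(log_a)                    # recovered: a ≈ 2.0, b ≈ -0.3

target_error = 0.5
needed_hours = (target_error / a) ** (1 / b)
print(round(a, 3), round(b, 3), round(needed_hours))  # 2.0 -0.3 102
```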

[CV-119] MedM-VL: What Makes a Good Medical LVLM?

【速读】:该论文旨在解决医学视觉-语言任务中传统浅层模型在复杂性和可扩展性方面的局限性问题,特别是在多模态临床应用场景下的挑战。论文的关键在于基于广泛采用的LLaVA框架,提出了一种遵循编码器-连接器-大型语言模型(Encoder-Connector-LLM)范式的医学大型视觉-语言模型(Medical Large Vision-Language Model, LVLM)架构设计。通过构建分别针对二维(2D)和三维(3D)模态的两种特定模型,论文实现了支持通用医学任务以及领域特定微调的能力,从而作为有效的基础模型服务于临床实践。为促进可重复研究与进一步探索,作者开发了一个模块化且可扩展的代码库MedM-VL,并发布了两个LVLM变体:MedM-VL-2D用于二维医学图像分析,MedM-VL-CT-Chest用于基于三维CT的胸部应用。

链接: https://arxiv.org/abs/2504.04323
作者: Yiming Shi,Shaoshuai Yang,Xun Zhu,Haoyu Wang,Miao Li,Ji Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical image analysis is a fundamental component of clinical practice. As deep learning progresses, the focus has shifted from single-task applications, such as classification and segmentation, to more complex multimodal tasks, including medical visual question answering and report generation. Traditional shallow and task-specific models are increasingly limited in addressing the complexity and scalability required in clinical practice. The emergence of large language models (LLMs) has driven the development of medical Large Vision-Language Models (LVLMs), offering a unified solution for diverse vision-language tasks. In this study, we investigate various architectural designs for medical LVLMs based on the widely adopted LLaVA framework, which follows an encoder-connector-LLM paradigm. We construct two distinct models targeting 2D and 3D modalities, respectively. These models are designed to support both general-purpose medical tasks and domain-specific fine-tuning, thereby serving as effective foundation models. To facilitate reproducibility and further research, we develop a modular and extensible codebase, MedM-VL, and release two LVLM variants: MedM-VL-2D for 2D medical image analysis and MedM-VL-CT-Chest for 3D CT-based applications. The code and models are available at: this https URL
zh

[CV-120] Variational Self-Supervised Learning NEURIPS2025

[Quick Read]: This paper targets limitations of traditional Variational Autoencoders (VAEs) in representation learning: their reliance on a decoder for input reconstruction adds computational cost and restricts applicability, while existing self-supervised methods still leave room to improve semantic alignment in high-dimensional latent spaces. The proposed Variational Self-Supervised Learning (VSSL) framework symmetrically couples two Gaussian-output encoders: a momentum-updated teacher network defines a dynamic, data-dependent prior, while the student encoder produces an approximate posterior from augmented views. The reconstruction term of the ELBO is replaced by a cross-view denoising objective, and cosine-based formulations of the KL divergence and log-likelihood terms enhance semantic consistency in high-dimensional latent spaces. The result is efficient representation learning without generative reconstruction, bridging variational modeling and modern self-supervised techniques.

Link: https://arxiv.org/abs/2504.04318
Authors: Mehmet Can Yavuz, Berrin Yanikoglu
Affiliations: Işık University, Istanbul; Sabanci University
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to NeurIPS 2025

Abstract:We present Variational Self-Supervised Learning (VSSL), a novel framework that combines variational inference with self-supervised learning to enable efficient, decoder-free representation learning. Unlike traditional VAEs that rely on input reconstruction via a decoder, VSSL symmetrically couples two encoders with Gaussian outputs. A momentum-updated teacher network defines a dynamic, data-dependent prior, while the student encoder produces an approximate posterior from augmented views. The reconstruction term in the ELBO is replaced with a cross-view denoising objective, preserving the analytical tractability of Gaussian KL divergence. We further introduce cosine-based formulations of KL and log-likelihood terms to enhance semantic alignment in high-dimensional latent spaces. Experiments on CIFAR-10, CIFAR-100, and ImageNet-100 show that VSSL achieves competitive or superior performance to leading self-supervised methods, including BYOL and MoCo V3. VSSL offers a scalable, probabilistically grounded approach to learning transferable representations without generative reconstruction, bridging the gap between variational modeling and modern self-supervised techniques.

[CV-121] 3R-GS: Best Practice in Optimizing Camera Poses Along with 3DGS

[Quick Read]: This paper tackles the challenges that 3D Gaussian Splatting (3DGS)-based neural rendering faces when depending on camera poses from Structure-from-Motion (SfM) systems, in particular robustness in textureless scenes and the precision of camera parameter estimation. The key innovation, 3R-GS, jointly optimizes 3D Gaussians and camera parameters using the large reconstruction prior MASt3R-SfM. It overcomes the sensitivity of naive joint optimization to SfM initialization quality and its limited capacity for global optimization, enabling robust scene reconstruction even with imperfect camera registration.

Link: https://arxiv.org/abs/2504.04294
Authors: Zhisheng Huang, Peng Wang, Jingdong Zhang, Yuan Liu, Xin Li, Wenping Wang
Affiliations: Texas A&M University; Hong Kong University; Hong Kong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting (3DGS) has revolutionized neural rendering with its efficiency and quality, but like many novel view synthesis methods, it heavily depends on accurate camera poses from Structure-from-Motion (SfM) systems. Although recent SfM pipelines have made impressive progress, questions remain about how to further improve both their robust performance in challenging conditions (e.g., textureless scenes) and the precision of camera parameter estimation simultaneously. We present 3R-GS, a 3D Gaussian Splatting framework that bridges this gap by jointly optimizing 3D Gaussians and camera parameters from large reconstruction priors MASt3R-SfM. We note that naively performing joint 3D Gaussian and camera optimization faces two challenges: the sensitivity to the quality of SfM initialization, and its limited capacity for global optimization, leading to suboptimal reconstruction results. Our 3R-GS, overcomes these issues by incorporating optimized practices, enabling robust scene reconstruction even with imperfect camera registration. Extensive experiments demonstrate that 3R-GS delivers high-quality novel view synthesis and precise camera pose estimation while remaining computationally efficient. Project page: this https URL
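The core idea of jointly refining scene parameters and camera parameters by gradient descent on a shared residual can be caricatured in one dimension. This is purely illustrative (a scalar "point" plus a scalar "camera offset" fit to one observation), not the 3R-GS pipeline:

```python
# Toy joint optimization: refine a camera offset and a point position together
# by gradient descent on a squared reprojection-style residual.
def residual(point, cam_offset, observation):
    return (point + cam_offset) - observation

point, cam_offset = 0.0, 0.0
observation = 2.0   # the "observed projection" we want to explain
lr = 0.1
for _ in range(200):
    r = residual(point, cam_offset, observation)
    point -= lr * r       # gradient of r^2 w.r.t. point is 2r (factor folded into lr)
    cam_offset -= lr * r  # same gradient w.r.t. the camera parameter
print(round(point + cam_offset, 3))  # -> 2.0
```

The point of the toy is the coupling: because both variables receive gradients from the same residual, a poor initialization of one (here, the camera) can be compensated by the other, which is the failure mode and the opportunity the paper's optimized practices address.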

[CV-122] ADA-Net: Attention-Guided Domain Adaptation Network with Contrastive Learning for Standing Dead Tree Segmentation Using Aerial Imagery

[Quick Read]: This paper addresses the difficulty of detecting large-scale, climate-driven tree mortality across wide geographic regions, where information on standing dead trees, important for understanding forest ecosystem functioning and resilience, has been lacking. The authors propose a method for segmenting standing dead trees from aerial multispectral orthoimages. Because annotated datasets are hard to obtain in forest remote sensing (forest expertise is required), they perform domain transfer via domain adaptation, learning a transformation from a source domain X to a target domain Y. In this image-to-image translation task, annotations available in the target domain are exploited by pre-training a segmentation network; images from a new, unannotated study site (source domain X) are translated into the target domain, and inference is then performed with transfer learning on the domain-adapted images. Beyond assessing the feasibility of existing domain adaptation approaches for this objective, the paper introduces the Attention-guided Domain Adaptation Network (ADA-Net) with enhanced contrastive learning, which delivers new state-of-the-art domain adaptation performance. The approach is evaluated on two datasets from Finland and the US; the USA images are converted to the Finland domain, and the synthesized USA2Finland dataset exhibits characteristics similar to the Finland-domain images. The key lies in combining domain adaptation with the contrastive-learning-enhanced ADA-Net architecture. The software implementation and the data are publicly available at the links in the abstract.

Link: https://arxiv.org/abs/2504.04271
Authors: Mete Ahishali, Anis Ur Rahman, Einari Heinaro, Samuli Junttila
Affiliations: School of Forest Sciences, Faculty of Science, Forestry and Technology, University of Eastern Finland, Finland
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Information on standing dead trees is important for understanding forest ecosystem functioning and resilience but has been lacking over large geographic regions. Climate change has caused large-scale tree mortality events that can remain undetected due to limited data. In this study, we propose a novel method for segmenting standing dead trees using aerial multispectral orthoimages. Because access to annotated datasets has been a significant problem in forest remote sensing due to the need for forest expertise, we introduce a method for domain transfer by leveraging domain adaptation to learn a transformation from a source domain X to target domain Y. In this Image-to-Image translation task, we aim to utilize available annotations in the target domain by pre-training a segmentation network. When images from a new study site without annotations are introduced (source domain X), these images are transformed into the target domain. Then, transfer learning is applied by inferring the pre-trained network on domain-adapted images. In addition to investigating the feasibility of current domain adaptation approaches for this objective, we propose a novel approach called the Attention-guided Domain Adaptation Network (ADA-Net) with enhanced contrastive learning. Accordingly, the ADA-Net approach provides new state-of-the-art domain adaptation performance levels outperforming existing approaches. We have evaluated the proposed approach using two datasets from Finland and the US. The USA images are converted to the Finland domain, and we show that the synthetic USA2Finland dataset exhibits similar characteristics to the Finland domain images. The software implementation is shared at this https URL. The data is publicly available at this https URL.

[CV-123] LOGLO-FNO: Efficient Learning of Local and Global Features in Fourier Neural Operators ICLR2025 ISCA

[Quick Read]: This paper addresses the challenge of modeling high-frequency information in scientific machine learning, in particular reconstructing the high-frequency signals produced by turbulent flows at high Reynolds numbers (Re > 3500). Conventional deep neural networks exhibit a spectral bias toward low frequencies, and while Fourier Neural Operators (FNOs) have achieved notable results on several PDE benchmark problems, they perform poorly at capturing local features carried by non-dominant frequencies. The paper attributes this limitation to the inherent spectral bias of neural networks and to the explicit exclusion of high-frequency modes in FNOs and their variants.

To mitigate these issues and broaden the range of frequency components the FNO can learn, two key architectural enhancements are proposed: (i) a parallel branch performing local spectral convolutions, and (ii) a high-frequency propagation module. A novel frequency-sensitive loss term based on radially binned spectral errors is also introduced. These changes reduce the number of trainable parameters by up to 50% while matching the accuracy of a baseline FNO that relies solely on global convolutions, and experiments on three challenging PDE problems from fluid mechanics and biological pattern formation show the method outperforms state-of-the-art neural operator baselines.

Link: https://arxiv.org/abs/2504.04260
Authors: Marimuthu Kalimuthu, David Holzmüller, Mathias Niepert
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)
Comments: Accepted for Oral Presentation at the ICLR 2025 Workshop on Machine Learning Multiscale Processes (MLMP), Singapore

Abstract:Modeling high-frequency information is a critical challenge in scientific machine learning. For instance, fully turbulent flow simulations of Navier-Stokes equations at Reynolds numbers 3500 and above can generate high-frequency signals due to swirling fluid motions caused by eddies and vortices. Faithfully modeling such signals using neural networks depends on accurately reconstructing moderate to high frequencies. However, it has been well known that deep neural nets exhibit the so-called spectral bias toward learning low-frequency components. Meanwhile, Fourier Neural Operators (FNOs) have emerged as a popular class of data-driven models in recent years for solving Partial Differential Equations (PDEs) and for surrogate modeling in general. Although impressive results have been achieved on several PDE benchmark problems, FNOs often perform poorly in learning non-dominant frequencies characterized by local features. This limitation stems from the spectral bias inherent in neural networks and the explicit exclusion of high-frequency modes in FNOs and their variants. Therefore, to mitigate these issues and improve FNO’s spectral learning capabilities to represent a broad range of frequency components, we propose two key architectural enhancements: (i) a parallel branch performing local spectral convolutions (ii) a high-frequency propagation module. Moreover, we propose a novel frequency-sensitive loss term based on radially binned spectral errors. This introduction of a parallel branch for local convolutions reduces number of trainable parameters by up to 50% while achieving the accuracy of baseline FNO that relies solely on global convolutions. Experiments on three challenging PDE problems in fluid mechanics and biological pattern formation, and the qualitative and spectral analysis of predictions show the effectiveness of our method over the state-of-the-art neural operator baselines.
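The "radially binned spectral error" idea can be sketched in a few lines of NumPy: compute the squared error between the Fourier transforms of prediction and target, then average it within annuli of wavenumber radius. The binning scheme and function name below are our own assumptions, not the paper's implementation:

```python
import numpy as np

def radially_binned_spectral_error(pred, target, n_bins=8):
    """Mean squared Fourier-coefficient error per radial wavenumber bin.

    A frequency-sensitive loss can then weight the higher-frequency bins
    more heavily to counteract spectral bias.
    """
    err = np.abs(np.fft.fft2(pred) - np.fft.fft2(target)) ** 2
    h, w = err.shape
    ky = np.fft.fftfreq(h) * h                     # integer wavenumbers
    kx = np.fft.fftfreq(w) * w
    radius = np.sqrt(ky[:, None] ** 2 + kx[None, :] ** 2)
    edges = np.linspace(0.0, radius.max() + 1e-9, n_bins + 1)
    return np.array([err[(radius >= lo) & (radius < hi)].mean()
                     for lo, hi in zip(edges[:-1], edges[1:])])

rng = np.random.default_rng(0)
target = rng.standard_normal((32, 32))
errors = radially_binned_spectral_error(target + 0.1 * rng.standard_normal((32, 32)), target)
print(errors.shape)  # (8,)
```

Bin 0 holds the lowest frequencies and the last bin the highest; a perfect prediction gives all-zero bins.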

[CV-124] Progressive Multi-Source Domain Adaptation for Personalized Facial Expression Recognition

[Quick Read]: This paper addresses unsupervised multi-source domain adaptation (MSDA) for personalized facial expression recognition (FER), in particular when there are significant distribution shifts between the target subject and multiple source subjects. Conventional methods adapt to a target by integrating all source data to reduce the domain gap, which can cause negative transfer as well as increased computational cost and misalignment with the target. The key idea of the proposed Progressive MSDA is to gradually introduce information from source subjects ordered by their similarity to the target subject, so that only the most relevant sources are selected and negative transfer from dissimilar sources is avoided; a density-based memory mechanism preserves the most relevant historical source samples to mitigate catastrophic forgetting caused by the incremental introduction of sources. Experiments on the Biovid and UNBC-McMaster pain datasets validate the effectiveness of the method.

Link: https://arxiv.org/abs/2504.04252
Authors: Muhammad Osama Zeeshan, Marco Pedersoli, Alessandro Lameiras Koerich, Eric Granger
Affiliations: ETS Montreal, LIVIA, ILLS
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Personalized facial expression recognition (FER) involves adapting a machine learning model using samples from labeled sources and unlabeled target domains. Given the challenges of recognizing subtle expressions with considerable interpersonal variability, state-of-the-art unsupervised domain adaptation (UDA) methods focus on the multi-source UDA (MSDA) setting, where each domain corresponds to a specific subject, and improve model accuracy and robustness. However, when adapting to a specific target, the diverse nature of multiple source domains translates to a large shift between source and target data. State-of-the-art MSDA methods for FER address this domain shift by considering all the sources to adapt to the target representations. Nevertheless, adapting to a target subject presents significant challenges due to large distributional differences between source and target domains, often resulting in negative transfer. In addition, integrating all sources simultaneously increases computational costs and causes misalignment with the target. To address these issues, we propose a progressive MSDA approach that gradually introduces information from subjects based on their similarity to the target subject. This will ensure that only the most relevant sources from the target are selected, which helps avoid the negative transfer caused by dissimilar sources. We first exploit the closest sources to reduce the distribution shift with the target and then move towards the furthest while only considering the most relevant sources based on the predetermined threshold. Furthermore, to mitigate catastrophic forgetting caused by the incremental introduction of source subjects, we implemented a density-based memory mechanism that preserves the most relevant historical source samples for adaptation. Our experiments show the effectiveness of our proposed method on pain datasets: Biovid and UNBC-McMaster.
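The "closest sources first, within a threshold" selection step can be sketched independently of any model. The distance measure (Euclidean between per-subject mean features) and all names below are illustrative assumptions, not the paper's implementation:

```python
import math

def progressive_source_schedule(source_feats, target_feat, threshold):
    """Order source subjects from closest to farthest from the target and
    keep only those whose distance is within `threshold`."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(source_feats.items(), key=lambda kv: dist(kv[1], target_feat))
    return [name for name, feat in ranked if dist(feat, target_feat) <= threshold]

# Toy per-subject mean features: s2 is far from the target and gets excluded.
sources = {"s1": (0.0, 0.0), "s2": (3.0, 4.0), "s3": (1.0, 0.0)}
print(progressive_source_schedule(sources, (0.0, 0.0), threshold=2.0))  # ['s1', 's3']
```

Training would then consume the returned list incrementally, adapting on the closest subject first and moving outward, which is the progressive schedule the abstract describes.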

[CV-125] Loss Functions in Deep Learning: A Comprehensive Review

[Quick Read]: This survey examines the central role of loss functions in deep learning and how to choose or design one for a given application, since the choice directly affects model convergence, generalization, and overall performance. It comprehensively reviews classical and advanced loss functions (e.g., Mean Squared Error, Cross-Entropy, adversarial and diffusion losses), analyzing their mathematical foundations, their impact on model training, and selection strategies across computer vision, tabular data prediction, and time series forecasting. It also highlights complex scenarios involving multimodal data, class imbalance, and real-world constraints, and calls for more adaptive and robust loss functions that improve interpretability, scalability, and generalization. The core contribution is a synthesis of the strengths and weaknesses of existing loss functions together with future research directions toward more effective and resilient deep learning models.

Link: https://arxiv.org/abs/2504.04242
Authors: Omar Elharrouss, Yasir Mahmood, Yassine Bechqito, Mohamed Adel Serhani, Elarbi Badidi, Jamal Riffi, Hamid Tairi
Affiliations: Department of Computer Science and Software Engineering, College of Information Technology, United Arab Emirates University; Department of Information Systems, College of Computing and Informatics, University of Sharjah, Sharjah, United Arab Emirates; Department of Informatics, Faculty of Sciences Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez, Morocco
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Loss functions are at the heart of deep learning, shaping how models learn and perform across diverse tasks. They are used to quantify the difference between predicted outputs and ground truth labels, guiding the optimization process to minimize errors. Selecting the right loss function is critical, as it directly impacts model convergence, generalization, and overall performance across various applications, from computer vision to time series forecasting. This paper presents a comprehensive review of loss functions, covering fundamental metrics like Mean Squared Error and Cross-Entropy to advanced functions such as Adversarial and Diffusion losses. We explore their mathematical foundations, impact on model training, and strategic selection for various applications, including computer vision (Discriminative and generative), tabular data prediction, and time series forecasting. For each of these categories, we discuss the most used loss functions in the recent advancements of deep learning techniques. Also, this review explore the historical evolution, computational efficiency, and ongoing challenges in loss function design, underlining the need for more adaptive and robust solutions. Emphasis is placed on complex scenarios involving multi-modal data, class imbalances, and real-world constraints. Finally, we identify key future directions, advocating for loss functions that enhance interpretability, scalability, and generalization, leading to more effective and resilient deep learning models.
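For concreteness, here are minimal NumPy implementations of the two fundamental losses the review starts from, Mean Squared Error for regression and categorical cross-entropy from raw logits (with the usual max-shift for numerical stability); these are textbook definitions, not code from the paper:

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean Squared Error: average squared difference between prediction and target."""
    return float(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

def cross_entropy(logits, label):
    """Categorical cross-entropy of one sample, computed from raw logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()                  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

print(mse([1.0, 2.0], [1.0, 4.0]))        # -> 2.0
print(round(cross_entropy([2.0, 0.0, 0.0], 0), 4))  # small loss: correct class dominant
```

With uniform logits over k classes, cross-entropy reduces to log(k), a handy sanity check when debugging a new loss implementation.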

[CV-126] Resilience of Vision Transformers for Domain Generalisation in the Presence of Out-of-Distribution Noisy Images

[Quick Read]: This paper addresses domain generalisation (DG): modern AI models excel in controlled settings but often fail in real-world scenarios where data distributions shift unpredictably. The key contribution is a rigorous evaluation of masked-image-modelling (MIM) vision transformers, specifically the BEIT architecture, on synthetic out-of-distribution (OOD) benchmarks designed to mimic real-world noise and occlusions. The authors introduce a framework that generates OOD test cases by strategically masking object regions with grid patterns (25%, 50%, 75% occlusion), using zero-shot segmentation via Segment Anything and Grounding DINO for precise object localisation. Despite significant occlusion, BEIT maintains 94% accuracy on PACS and 87% on Office-Home, outperforming CNNs and other vision transformers by up to 37 percentage points; analysis of self-attention distances shows that BEIT's reliance on global features correlates with its resilience. The synthetic benchmarks also expose critical failure modes: performance degrades sharply when occlusions disrupt object shapes. The work delivers two key advances: (1) a scalable method for generating OOD benchmarks with controllable noise, and (2) empirical evidence that MIM and self-attention in vision transformers enhance DG by learning invariant features, bridging the gap between lab-trained models and real-world deployment and offering a blueprint for AI systems that generalise reliably under uncertainty.

Link: https://arxiv.org/abs/2504.04225
Authors: Hamza Riaz, Alan F. Smeaton
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages

Abstract:Modern AI models excel in controlled settings but often fail in real-world scenarios where data distributions shift unpredictably - a challenge known as domain generalisation (DG). This paper tackles this limitation by rigorously evaluating vision tramsformers, specifically the BEIT architecture which is a model pre-trained with masked image modelling (MIM), against synthetic out-of-distribution (OOD) benchmarks designed to mimic real-world noise and occlusions. We introduce a novel framework to generate OOD test cases by strategically masking object regions in images using grid patterns (25%, 50%, 75% occlusion) and leveraging cutting-edge zero-shot segmentation via Segment Anything and Grounding DINO to ensure precise object localisation. Experiments across three benchmarks (PACS, Office-Home, DomainNet) demonstrate BEIT’s known robustness while maintaining 94% accuracy on PACS and 87% on Office-Home, despite significant occlusions, outperforming CNNs and other vision transformers by margins of up to 37%. Analysis of self-attention distances reveals that the BEIT dependence on global features correlates with its resilience. Furthermore, our synthetic benchmarks expose critical failure modes: performance degrades sharply when occlusions disrupt object shapes e.g. 68% drop for external grid masking vs. 22% for internal masking. This work provides two key advances (1) a scalable method to generate OOD benchmarks using controllable noise, and (2) empirical evidence that MIM and self-attention mechanism in vision transformers enhance DG by learning invariant features. These insights bridge the gap between lab-trained models and real-world deployment that offer a blueprint for building AI systems that generalise reliably under uncertainty.
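A grid-pattern occlusion mask at a chosen coverage level (25%/50%/75%, as in the benchmarks above) is easy to generate. The cell size and the random choice of which cells to hide are our assumptions; the paper masks localized object regions, which this sketch does not attempt:

```python
import numpy as np

def grid_occlusion_mask(height, width, cell=8, occlusion=0.5, seed=0):
    """Boolean mask (True = occluded) hiding ~`occlusion` of cell x cell blocks."""
    rng = np.random.default_rng(seed)
    gh, gw = height // cell, width // cell
    n_cells = gh * gw
    chosen = rng.choice(n_cells, size=int(round(occlusion * n_cells)), replace=False)
    cell_mask = np.zeros(n_cells)
    cell_mask[chosen] = 1.0
    # Expand each chosen grid cell to a cell x cell pixel block.
    return np.kron(cell_mask.reshape(gh, gw), np.ones((cell, cell))).astype(bool)

mask = grid_occlusion_mask(64, 64, cell=8, occlusion=0.25)
print(mask.shape, mask.mean())  # (64, 64) 0.25
```

Applying `image[mask] = 0` (or a fill colour) then yields the occluded OOD test image.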

[CV-127] Evaluating Graphical Perception with Multimodal LLMs

[Quick Read]: This paper investigates the underexplored ability of Multimodal Large Language Models (MLLMs) to regress values from charts, i.e., their performance on graphical perception tasks. The key approach is to reproduce Cleveland and McGill's seminal 1984 experiment and compare MLLM performance against human task performance, evaluating fine-tuned models, pretrained models, and zero-shot prompting to determine whether they closely match, or exceed, human graphical perception.

Link: https://arxiv.org/abs/2504.04221
Authors: Rami Huu Nguyen, Kenichi Maeda, Mahsa Geshvadi, Daniel Haehn
Affiliations: University of Massachusetts Boston
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures, 1 teaser, IEEE Pacific Visualization 2025 Conference

Abstract:Multimodal Large Language Models (MLLMs) have remarkably progressed in analyzing and understanding images. Despite these advancements, accurately regressing values in charts remains an underexplored area for MLLMs. For visualization, how do MLLMs perform when applied to graphical perception tasks? Our paper investigates this question by reproducing Cleveland and McGill’s seminal 1984 experiment and comparing it against human task performance. Our study primarily evaluates fine-tuned and pretrained models and zero-shot prompting to determine if they closely match human graphical perception. Our findings highlight that MLLMs outperform human task performance in some cases but not in others. We highlight the results of all experiments to foster an understanding of where MLLMs succeed and fail when applied to data visualization.

[CV-128] The Effects of Grouped Structural Global Pruning of Vision Transformers on Domain Generalisation

[Quick Read]: This paper addresses the challenge of deploying large pre-trained vision transformers (ViT, BeiT, DeiT) on devices with limited computational resources for domain generalisation (DG) tasks. The key innovation is a grouped structural pruning method that uses dependency graph analysis to identify and remove redundant groups of neurons, weights, filters, or attention heads within the transformer, guided by a range of selection metrics. Pruning is applied at ratios of 50%, 75%, and 95% on the PACS and Office-Home DG benchmarks, after which the models are fine-tuned on selected distributions to evaluate their overall DG performance. Results show substantial gains in inference speed and fine-tuning time with minimal trade-offs in accuracy and DG performance: on PACS, pruning ViT, BeiT, and DeiT by 50% using the Hessian metric costs only -2.94%, -1.42%, and -1.72% accuracy respectively, while delivering 2.5x, 1.81x, and 2.15x speedups, demonstrating an effective balance between model efficiency and domain generalisation performance.

Link: https://arxiv.org/abs/2504.04196
Authors: Hamza Riaz, Alan F. Smeaton
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages

Abstract:With the growing sizes of AI models like large language models (LLMs) and vision transformers, deploying them on devices with limited computational resources is a significant challenge particularly when addressing domain generalisation (DG) tasks. This paper introduces a novel grouped structural pruning method for pre-trained vision transformers (ViT, BeiT, and DeiT), evaluated on the PACS and Office-Home DG benchmarks. Our method uses dependency graph analysis to identify and remove redundant groups of neurons, weights, filters, or attention heads within transformers, using a range of selection metrics. Grouped structural pruning is applied at pruning ratios of 50%, 75% and 95% and the models are then fine-tuned on selected distributions from DG benchmarks to evaluate their overall performance in DG tasks. Results show significant improvements in inference speed and fine-tuning time with minimal trade-offs in accuracy and DG task performance. For instance, on the PACS benchmark, pruning ViT, BeiT, and DeiT models by 50% using the Hessian metric resulted in accuracy drops of only -2.94%, -1.42%, and -1.72%, respectively, while achieving speed boosts of 2.5x, 1.81x, and 2.15x. These findings demonstrate the effectiveness of our approach in balancing model efficiency with domain generalisation performance.
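The core of structural (group) pruning, scoring whole groups of weights and removing the least important ones, can be sketched with a simple L1-norm importance metric (one of the classic selection metrics; the paper also evaluates others such as Hessian-based scores). The group layout and names below are toy assumptions, not the paper's dependency-graph machinery:

```python
import numpy as np

def rank_groups_by_l1(weight, n_groups):
    """Score equal row-groups (e.g. attention heads) by L1 norm;
    return group indices ordered least-important first."""
    groups = np.split(weight, n_groups, axis=0)
    scores = [np.abs(g).sum() for g in groups]
    return list(np.argsort(scores))

def prune_groups(weight, n_groups, ratio):
    """Zero out the lowest-scoring `ratio` of groups (structural pruning sketch)."""
    order = rank_groups_by_l1(weight, n_groups)
    k = int(round(ratio * n_groups))
    rows = weight.shape[0] // n_groups
    pruned = weight.copy()
    for g in order[:k]:
        pruned[g * rows:(g + 1) * rows] = 0.0
    return pruned

w = np.arange(1.0, 9.0).reshape(8, 1)   # 4 groups of 2 rows each
print(prune_groups(w, n_groups=4, ratio=0.5).ravel())  # first two groups zeroed
```

In a real pipeline the zeroed groups are physically removed (shrinking the matrices), and a dependency graph ensures that coupled layers are pruned consistently.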

[CV-129] GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

[Quick Read]: This paper targets a key challenge in open-vocabulary physical skill learning for simulated agents: manually designed rewards in existing reinforcement learning approaches do not scale across diverse tasks, while demonstration-based methods struggle to generalize beyond their training distribution. The proposed GROVE framework uses Large Language Models (LLMs) and Vision Language Models (VLMs) as complementary guidance: LLMs generate precise physical constraints capturing task requirements, VLMs evaluate motion semantics and naturalness, and VLM-based feedback iteratively refines the LLM-generated constraints, forming a self-improving reward system. To bridge the domain gap between simulation and natural images, the authors develop Pose2CLIP, a lightweight mapper that efficiently projects agent poses directly into a semantic feature space without computationally expensive rendering. Across diverse embodiments and learning paradigms, GROVE achieves 22.2% higher motion naturalness and 25.7% better task-completion scores while training 8.4x faster than previous methods, establishing a new foundation for scalable physical skill acquisition in simulated environments.

Link: https://arxiv.org/abs/2504.04191
Authors: Jieming Cui, Tengyu Liu, Ziyu Meng, Jiale Yu, Ran Song, Wei Zhang, Yixin Zhu, Siyuan Huang
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Learning open-vocabulary physical skills for simulated agents presents a significant challenge in artificial intelligence. Current reinforcement learning approaches face critical limitations: manually designed rewards lack scalability across diverse tasks, while demonstration-based methods struggle to generalize beyond their training distribution. We introduce GROVE, a generalized reward framework that enables open-vocabulary physical skill learning without manual engineering or task-specific demonstrations. Our key insight is that Large Language Models(LLMs) and Vision Language Models(VLMs) provide complementary guidance – LLMs generate precise physical constraints capturing task requirements, while VLMs evaluate motion semantics and naturalness. Through an iterative design process, VLM-based feedback continuously refines LLM-generated constraints, creating a self-improving reward system. To bridge the domain gap between simulation and natural images, we develop Pose2CLIP, a lightweight mapper that efficiently projects agent poses directly into semantic feature space without computationally expensive rendering. Extensive experiments across diverse embodiments and learning paradigms demonstrate GROVE’s effectiveness, achieving 22.2% higher motion naturalness and 25.7% better task completion scores while training 8.4x faster than previous methods. These results establish a new foundation for scalable physical skill acquisition in simulated environments.

[CV-130] Interpretable Single-View 3D Gaussian Splatting using Unsupervised Hierarchical Disentangled Representation Learning

[Quick Read]: This paper addresses the limited understanding of underlying 3D semantics in existing Gaussian Splatting (GS) methods, which hampers model controllability and interpretability. The key is an interpretable single-view 3D GS framework, 3DisGS, that discovers both coarse- and fine-grained 3D semantics via hierarchical disentangled representation learning (DRL). Specifically, the model adopts a dual-branch architecture, consisting of a point-cloud initialization branch and a triplane-Gaussian generation branch, to achieve coarse-grained disentanglement by separating 3D geometry from visual appearance features; fine-grained semantic representations within each modality are then discovered through DRL-based encoder-adapters. To the authors' knowledge, this is the first work to achieve unsupervised interpretable 3DGS. Evaluations indicate the model attains 3D disentanglement while preserving high-quality, rapid reconstruction.

Link: https://arxiv.org/abs/2504.04190
Authors: Yuyang Zhang, Baao Xie, Hu Zhu, Qi Wang, Huanting Guo, Xin Jin, Wenjun Zeng
Affiliations: Ningbo Institute of Digital Twin, Eastern Institute of Technology; Shanghai Jiao Tong University; Zhejiang Key Laboratory of Industrial Intelligence and Digital Twin, Eastern Institute of Technology; Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Gaussian Splatting (GS) has recently marked a significant advancement in 3D reconstruction, delivering both rapid rendering and high-quality results. However, existing 3DGS methods pose challenges in understanding underlying 3D semantics, which hinders model controllability and interpretability. To address it, we propose an interpretable single-view 3DGS framework, termed 3DisGS, to discover both coarse- and fine-grained 3D semantics via hierarchical disentangled representation learning (DRL). Specifically, the model employs a dual-branch architecture, consisting of a point cloud initialization branch and a triplane-Gaussian generation branch, to achieve coarse-grained disentanglement by separating 3D geometry and visual appearance features. Subsequently, fine-grained semantic representations within each modality are further discovered through DRL-based encoder-adapters. To our knowledge, this is the first work to achieve unsupervised interpretable 3DGS. Evaluations indicate that our model achieves 3D disentanglement while preserving high-quality and rapid reconstruction.

[CV-131] SDEIT: Semantic-Driven Electrical Impedance Tomography

[Quick Read]: This paper tackles the difficulty of designing effective regularization and integrating prior information for ill-posed inverse problems in medical imaging such as Electrical Impedance Tomography (EIT), where anatomical structures are complex and variable. The key contribution is SDEIT, a novel semantic-driven framework that integrates a large-scale text-to-image generative model (Stable Diffusion 3.5) into EIT for the first time. SDEIT uses natural-language prompts as semantic priors to guide the reconstruction process, coupling an implicit neural representation (INR) network with a plug-and-play optimization scheme that treats SD-generated images as generative priors, thereby improving structural consistency and recovering fine details. Because the method does not rely on paired training datasets, it adapts readily to varied EIT scenarios.

Link: https://arxiv.org/abs/2504.04185
Authors: Dong Liu, Yuanchao Wu, Bowen Tong, Jiansong Deng
Affiliations: University of Science and Technology of China, Hefei, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Regularization methods using prior knowledge are essential in solving ill-posed inverse problems such as Electrical Impedance Tomography (EIT). However, designing effective regularization and integrating prior information into EIT remains challenging due to the complexity and variability of anatomical structures. In this work, we introduce SDEIT, a novel semantic-driven framework that integrates Stable Diffusion 3.5 into EIT, marking the first use of large-scale text-to-image generation models in EIT. SDEIT employs natural language prompts as semantic priors to guide the reconstruction process. By coupling an implicit neural representation (INR) network with a plug-and-play optimization scheme that leverages SD-generated images as generative priors, SDEIT improves structural consistency and recovers fine details. Importantly, this method does not rely on paired training datasets, increasing its adaptability to varied EIT scenarios. Extensive experiments on both simulated and experimental data demonstrate that SDEIT outperforms state-of-the-art techniques, offering superior accuracy and robustness. This work opens a new pathway for integrating multimodal priors into ill-posed inverse problems like EIT.

[CV-132] Learning about the Physical World through Analytic Concepts

[Quick Read]: This paper aims at two core questions: (1) what is a proper abstraction of general physical-world concepts for machine intelligence, and (2) how can structured prior knowledge be systematically integrated with neural networks so that AI systems comply with physical laws. The key idea is the "analytic concept": representing concepts related to the physical world as programs of mathematical procedures, giving machine intelligence a portal to perceive, reason about, and interact with the physical world. Beyond detailing the design philosophy and guidelines for applying analytic concepts, the paper also describes the infrastructure that has been built around them.

Link: https://arxiv.org/abs/2504.04170
Authors: Jianhua Sun, Cewu Lu
Affiliations: Shanghai Jiao Tong University
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Reviewing the progress in artificial intelligence over the past decade, various significant advances (e.g. object detection, image generation, large language models) have enabled AI systems to produce more semantically meaningful outputs and achieve widespread adoption in internet scenarios. Nevertheless, AI systems still struggle when it comes to understanding and interacting with the physical world. This reveals an important issue: relying solely on semantic-level concepts learned from internet data (e.g. texts, images) to understand the physical world is far from sufficient – machine intelligence currently lacks an effective way to learn about the physical world. This research introduces the idea of analytic concept – representing the concepts related to the physical world through programs of mathematical procedures, providing machine intelligence a portal to perceive, reason about, and interact with the physical world. Except for detailing the design philosophy and providing guidelines for the application of analytic concepts, this research also introduce about the infrastructure that has been built around analytic concepts. I aim for my research to contribute to addressing these questions: What is a proper abstraction of general concepts in the physical world for machine intelligence? How to systematically integrate structured priors with neural networks to constrain AI systems to comply with physical laws?

[CV-133] JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration

[Quick Read]: This paper addresses the unpredictable, coupled weather degradations that vision-centric perception systems face in the wild; current solutions either depend on specific degradation priors or suffer from significant domain gaps. To enable robust and autonomous operation in real-world conditions, the authors propose JarvisIR, a VLM-powered agent that uses a vision-language model (VLM) as a controller to manage multiple expert restoration models. The key is a novel two-stage framework of supervised fine-tuning followed by human feedback alignment, which improves system robustness, reduces hallucinations, and strengthens generalization in real adverse weather. To support training and evaluation, the authors introduce CleanBench, a comprehensive dataset of high-quality, large-scale instruction-response pairs (150K synthetic and 80K real entries). Extensive experiments show that JarvisIR exhibits superior decision-making and restoration capabilities, achieving a 50% improvement over existing methods in the average of all perception metrics on CleanBench-Real.

Link: https://arxiv.org/abs/2504.04158
Authors: Yunlong Lin, Zixu Lin, Haoyu Chen, Panwang Pan, Chenxin Li, Sixiang Chen, Yeying Jin, Wenbo Li, Xinghao Ding
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; The Hong Kong University of Science and Technology; Bytedance's Pico; Tencent; Huawei Noah's Ark Lab; The Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 15 figures

Abstract:Vision-centric perception systems struggle with unpredictable and coupled weather degradations in the wild. Current solutions are often limited, as they either depend on specific degradation priors or suffer from significant domain gaps. To enable robust and autonomous operation in real-world conditions, we propose JarvisIR, a VLM-powered agent that leverages the VLM as a controller to manage multiple expert restoration models. To further enhance system robustness, reduce hallucinations, and improve generalizability in real-world adverse weather, JarvisIR employs a novel two-stage framework consisting of supervised fine-tuning and human feedback alignment. Specifically, to address the lack of paired data in real-world scenarios, the human feedback alignment enables the VLM to be fine-tuned effectively on large-scale real-world data in an unsupervised manner. To support the training and evaluation of JarvisIR, we introduce CleanBench, a comprehensive dataset consisting of high-quality and large-scale instruction-responses pairs, including 150K synthetic entries and 80K real entries. Extensive experiments demonstrate that JarvisIR exhibits superior decision-making and restoration capabilities. Compared with existing methods, it achieves a 50% improvement in the average of all perception metrics on CleanBench-Real. Project page: this https URL.

[CV-134] CoMBO: Conflict Mitigation via Branched Optimization for Class Incremental Segmentation CVPR2025

[Quick Read]: This paper addresses the inherent conflict in Class Incremental Segmentation (CIS) between mitigating catastrophic forgetting and retaining enough plasticity to integrate new classes. The key is Conflict Mitigation via Branched Optimization (CoMBO), whose Query Conflict Reduction module explicitly refines queries for new classes through lightweight, class-specific adapters while preserving the original queries for distillation. Two further strategies follow the branched structure: Half-Learning Half-Distillation (HDHL) over classification probabilities, which learns the probabilities of queries matched to new-class ground truth while aligning unmatched ones to the corresponding old probabilities, thus retaining old knowledge while absorbing new classes; and Importance-Based Knowledge Distillation (IKD) over query features, which assesses query importance by the degree of matching to old classes, prioritizing distillation of important features while allowing less critical features to evolve. Extensive experiments in Class Incremental Panoptic and Semantic Segmentation settings demonstrate CoMBO's superior performance.

Link: https://arxiv.org/abs/2504.04156
Authors: Kai Fang, Anqi Zhang, Guangyu Gao, Jianbo Jiao, Chi Harold Liu, Yunchao Wei
Affiliations: Beijing Institute of Technology; University of Birmingham; Beijing Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:Effective Class Incremental Segmentation (CIS) requires simultaneously mitigating catastrophic forgetting and ensuring sufficient plasticity to integrate new classes. The inherent conflict above often leads to a back-and-forth, which turns the objective into finding the balance between the performance of previous~(old) and incremental~(new) classes. To address this conflict, we introduce a novel approach, Conflict Mitigation via Branched Optimization~(CoMBO). Within this approach, we present the Query Conflict Reduction module, designed to explicitly refine queries for new classes through lightweight, class-specific adapters. This module provides an additional branch for the acquisition of new classes while preserving the original queries for distillation. Moreover, we develop two strategies to further mitigate the conflict following the branched structure, \textiti.e., the Half-Learning Half-Distillation~(HDHL) over classification probabilities, and the Importance-Based Knowledge Distillation~(IKD) over query features. HDHL selectively engages in learning for classification probabilities of queries that match the ground truth of new classes, while aligning unmatched ones to the corresponding old probabilities, thus ensuring retention of old knowledge while absorbing new classes via learning negative samples. Meanwhile, IKD assesses the importance of queries based on their matching degree to old classes, prioritizing the distillation of important features and allowing less critical features to evolve. Extensive experiments in Class Incremental Panoptic and Semantic Segmentation settings have demonstrated the superior performance of CoMBO. Project page: this https URL.
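The HDHL idea, cross-entropy for queries matched to new-class ground truth, distillation toward the old model's distribution for unmatched queries, can be sketched per query as follows. The per-query formulation and all names are our simplifying assumptions, not the paper's exact loss:

```python
import math

def hdhl_loss(probs_new, probs_old, gt_label):
    """Half-Learning Half-Distillation for one query (toy sketch).

    Matched query (gt_label is not None): learn via cross-entropy.
    Unmatched query: distil via KL(old || new) to retain old knowledge.
    """
    if gt_label is not None:
        return -math.log(probs_new[gt_label] + 1e-12)
    return sum(p_old * math.log((p_old + 1e-12) / (p_new + 1e-12))
               for p_old, p_new in zip(probs_old, probs_new))

print(round(hdhl_loss([0.7, 0.2, 0.1], None, gt_label=0), 4))      # learning branch
print(round(hdhl_loss([0.5, 0.5], [0.5, 0.5], gt_label=None), 4))  # distillation -> 0.0
```

Summing this over all queries in a batch gives a single objective in which learning new classes and preserving old ones do not fight over the same query.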
zh
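上文 HDHL(Half-Learning Half-Distillation)策略的核心是:与新类真值匹配的 query 参与新类分类学习,未匹配的 query 则向旧模型的输出概率做蒸馏。下面给出一个基于 NumPy 的极简示意(非论文官方实现,`hdhl_loss` 等命名与具体损失形式均为假设):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hdhl_loss(logits_new, labels, old_probs, matched):
    """HDHL 的极简示意:matched 为 True 的 query 对新类标签做交叉熵学习(学习一半),
    未匹配的 query 则向旧模型概率做蒸馏(蒸馏一半),从而在吸收新类的同时保留旧知识。"""
    p = softmax(logits_new)
    n = logits_new.shape[0]
    ce = -np.log(p[np.arange(n), labels] + 1e-12)       # 学习项:新类交叉熵
    kd = -(old_probs * np.log(p + 1e-12)).sum(axis=1)   # 蒸馏项:对齐旧概率
    return np.where(matched, ce, kd).mean()
```

按 query 逐条在"学习"与"蒸馏"间二选一,正是摘要中"half-learning half-distillation"一词的含义。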

[CV-135] Video4DGen: Enhancing Video and 4D Generation through Mutual Optimization

【速读】:该论文致力于解决动态四维(4D,即序列三维)内容生成的问题,旨在从单个或多个生成视频中创建高保真度的虚拟内容,同时确保空间和时间的一致性。论文的关键创新在于提出了Video4DGen框架,并结合Dynamic Gaussian Surfels (DGS) 表示方法,通过优化时变扭曲函数将静态高斯曲面元转换为动态扭曲状态,从而实现对物体姿态、运动及形变的精确描述。此外,为了处理多视频输入并捕捉空间、时间及姿态维度上的表示,设计了多视频对齐、根姿态优化及姿态引导帧采样策略。最终,通过引入连续扭曲场和基于置信度过滤的DGS,进一步提升了新视角视频生成的质量,为虚拟现实和动画等领域提供了强大的工具支持。

链接: https://arxiv.org/abs/2504.04153
作者: Yikai Wang,Guangce Liu,Xinzhou Wang,Zilong Chen,Jiafang Li,Xin Liang,Fuchun Sun,Jun Zhu
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Published in TPAMI 2025. Code: this https URL , Project page: this https URL

点击查看摘要

Abstract:The advancement of 4D (i.e., sequential 3D) generation opens up new possibilities for lifelike experiences in various applications, where users can explore dynamic objects or characters from any viewpoint. Meanwhile, video generative models are receiving particular attention given their ability to produce realistic and imaginative frames. These models are also observed to exhibit strong 3D consistency, indicating the potential to act as world simulators. In this work, we present Video4DGen, a novel framework that excels in generating 4D representations from single or multiple generated videos as well as generating 4D-guided videos. This framework is pivotal for creating high-fidelity virtual contents that maintain both spatial and temporal coherence. The 4D outputs generated by Video4DGen are represented using our proposed Dynamic Gaussian Surfels (DGS), which optimizes time-varying warping functions to transform Gaussian surfels (surface elements) from a static state to a dynamically warped state. We design warped-state geometric regularization and refinements on Gaussian surfels, to preserve the structural integrity and fine-grained appearance details. To perform 4D generation from multiple videos and capture representation across spatial, temporal, and pose dimensions, we design multi-video alignment, root pose optimization, and pose-guided frame sampling strategies. The leveraging of continuous warping fields also enables a precise depiction of pose, motion, and deformation over per-video frames. Further, to improve the overall fidelity from the observation of all camera poses, Video4DGen performs novel-view video generation guided by the 4D content, with the proposed confidence-filtered DGS to enhance the quality of generated sequences. With the ability of 4D and video generation, Video4DGen offers a powerful tool for applications in virtual reality, animation, and beyond.
zh

[CV-136] Scaling Federated Learning Solutions with Kubernetes for Synthesizing Histopathology Images

【速读】:该论文旨在解决组织病理学领域中因数据稀缺和隐私保护带来的挑战。具体而言,研究针对结直肠癌相关组织图像的获取成本高且涉及敏感医疗信息的问题,提出了一种结合视觉Transformer(Vision Transformer)与生成对抗网络(Generative Adversarial Networks, GAN)的方法,用于生成高质量的组织病理学图像,并通过增强训练数据集提升分类准确性。关键在于利用生成的图像扩充数据集,同时在联邦学习框架下通过多节点Kubernetes部署模拟分布式医院环境,确保数据隐私的同时实现模型性能的提升。

链接: https://arxiv.org/abs/2504.04130
作者: Andrei-Alexandru Preda,Iulian-Marius Tăiatu,Dumitru-Clementin Cercel
机构: National University of Science and Technology POLITEHNICA Bucharest (国立布加勒斯特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In the field of deep learning, large architectures often obtain the best performance for many tasks, but also require massive datasets. In the histological domain, tissue images are expensive to obtain and constitute sensitive medical information, raising concerns about data scarcity and privacy. Vision Transformers are state-of-the-art computer vision models that have proven helpful in many tasks, including image classification. In this work, we combine vision Transformers with generative adversarial networks to generate histopathological images related to colorectal cancer and test their quality by augmenting a training dataset, leading to improved classification accuracy. Then, we replicate this performance using the federated learning technique and a realistic Kubernetes setup with multiple nodes, simulating a scenario where the training dataset is split among several hospitals unable to share their information directly due to privacy concerns.
zh

[CV-137] Multi-identity Human Image Animation with Structural Video Diffusion

【速读】:该论文旨在解决从单张图像生成高质量且精确可控的人类视频的难题,特别是在涉及多身份交互和物体互动的复杂场景中。现有方法虽在单人场景中有效,但难以处理多身份间的复杂交互,主要因为其无法正确关联人类外观与姿态条件对,并建模三维感知动态分布。为克服这些限制,论文提出了Structural Video Diffusion框架,用于生成逼真的多人视频。其关键创新包括:引入身份特定嵌入以保持个体间一致的外观,以及结合深度和表面法线线索的结构学习机制以建模人类-物体交互。此外,通过扩展包含25K新视频的数据集,涵盖了多样化的多人及物体交互场景,为训练提供了坚实基础。实验结果表明,该方法在生成具有动态和丰富交互的逼真、连贯多人视频方面表现出色,推动了以人为中心的视频生成技术的发展。

链接: https://arxiv.org/abs/2504.04126
作者: Zhenzhi Wang,Yixuan Li,Yanhong Zeng,Yuwei Guo,Dahua Lin,Tianfan Xue,Bo Dai
机构: The Chinese University of Hong Kong (香港中文大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages

点击查看摘要

Abstract:Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of human appearance and pose condition and model the distribution of 3D-aware dynamics. To address these limitations, we present Structural Video Diffusion, a novel framework designed for generating realistic multi-human videos. Our approach introduces two core innovations: identity-specific embeddings to maintain consistent appearances across individuals and a structural learning mechanism that incorporates depth and surface-normal cues to model human-object interactions. Additionally, we expand existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios, providing a robust foundation for training. Experimental results demonstrate that Structural Video Diffusion achieves superior performance in generating lifelike, coherent videos for multiple subjects with dynamic and rich interactions, advancing the state of human-centric video generation.
zh

[CV-138] EMF: Event Meta Formers for Event-based Real-time Traffic Object Detection

【速读】:本文旨在解决事件相机在性能关键应用(如自动驾驶)中未能取代传统RGB相机的问题。尽管事件相机具有更高的时间分辨率且存储和带宽需求更低,但基于事件的方法在性能上仍显不足。现有基于事件的目标检测方法通过采用计算成本高昂的Transformer模型试图缩小这一差距,但由于其资源密集型组件,无法高效利用事件数据的稀疏性和高时间分辨率,并且缺乏针对事件相机的特定优化。为了解决这些问题,论文提出了一种新颖的事件目标检测骨干网络,其关键在于引入了专为事件数据设计的Event Progression Extractor模块,并结合基于卷积的高效元形式器(Metaformer)架构。这种创新设计不仅提升了检测性能,还显著降低了推理时间,同时增强了模型的泛化能力和数据扩展性。

链接: https://arxiv.org/abs/2504.04124
作者: Muhammad Ahmed Ullah Khan,Abdul Hannan Khan,Andreas Dengel
机构: Department of Computer Science, RPTU Kaiserslautern-Landau (莱茵兰-普法尔茨技术大学计算机科学系); German Research Center for Artificial Intelligence (DFKI GmbH)(德国人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 2 figures

点击查看摘要

Abstract:Event cameras have higher temporal resolution, and require less storage and bandwidth compared to traditional RGB cameras. However, due to the relatively lagging performance of event-based approaches, event cameras have not yet replaced traditional cameras in performance-critical applications like autonomous driving. Recent approaches in event-based object detection try to bridge this gap by employing computationally expensive transformer-based solutions. However, due to their resource-intensive components, these solutions fail to exploit the sparsity and higher temporal resolution of event cameras efficiently. Moreover, these solutions are adopted from the vision domain, lacking specificity to event cameras. In this work, we explore efficient and performant alternatives to recurrent vision transformer models and propose a novel event-based object detection backbone. The proposed backbone employs a novel Event Progression Extractor module, tailored specifically for event data, and uses the Metaformer concept with convolution-based efficient components. We evaluate the resultant model on well-established traffic object detection benchmarks and conduct cross-dataset evaluation to test its ability to generalize. The proposed model outperforms the state-of-the-art on the Prophesee Gen1 dataset by 1.6 mAP while reducing inference time by 14%. Our proposed EMF becomes the fastest DNN-based architecture in the domain, outperforming the most efficient event-based object detectors. Moreover, the proposed model shows better ability to generalize to unseen data and scales better with the abundance of data.
zh

[CV-139] ARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection

【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在实际应用中因幻觉(hallucinations)问题而受到的限制。幻觉问题源于语言模型固有的生成偏差、视觉编码器在感知上的局限性以及多模态数据引入的偏见。为应对这一挑战,论文探索了多种缓解幻觉的方法,并提出了一种新的无需训练(training-free)的方法——Temporal Attention Real-time Accumulative Connection (TARAC)。TARAC的关键在于通过动态累积和更新模型在生成过程中的图像标记(image tokens)注意力,增强模型对图像信息的关注,从而有效减轻因注意力衰减导致的幻觉现象。实验验证表明,与对比方法VCD相比,TARAC在CHAIR基准测试中将C_S(句子级幻觉指标)降低了25.2、C_I(实例级幻觉指标)降低了8.7,证明了其有效性。

链接: https://arxiv.org/abs/2504.04099
作者: Chunzhao Xie,Tongxuan Liu,Lei Jiang,Yuting Zeng,Jinrong Guo,Yunheng Shen,Weizhe Huang,Jing Li,Xiaohua Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision-Language Models have demonstrated remarkable performance across various tasks; however, the challenge of hallucinations constrains their practical applications. The hallucination problem arises from multiple factors, including the inherent hallucinations in language models, the limitations of visual encoders in perception, and biases introduced by multimodal data. Extensive research has explored ways to mitigate hallucinations. For instance, OPERA prevents the model from overly focusing on “anchor tokens”, thereby reducing hallucinations, whereas VCD mitigates hallucinations by employing a contrastive decoding approach. In this paper, we investigate the correlation between the decay of attention to image tokens and the occurrence of hallucinations. Based on this finding, we propose Temporal Attention Real-time Accumulative Connection (TARAC), a novel training-free method that dynamically accumulates and updates LVLMs’ attention on image tokens during generation. By enhancing the model’s attention to image tokens, TARAC mitigates hallucinations caused by the decay of attention on image tokens. We validate the effectiveness of TARAC across multiple models and datasets, demonstrating that our approach substantially mitigates hallucinations. In particular, TARAC reduces C_S by 25.2 and C_I by 8.7 compared to VCD on the CHAIR benchmark.
zh
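TARAC 的做法可以概括为:在生成过程中对每一步分配给图像 token 的注意力做实时累积,再用归一化后的累积量补偿当前步注意力的衰减。下面用 NumPy 给出一个单步更新的极简示意(`alpha` 等参数与具体混合方式均为示意性假设,非论文官方实现):

```python
import numpy as np

def tarac_step(acc, attn_img, alpha=0.5):
    """单步示意:acc 为历史累积的图像注意力,attn_img 为当前步
    对各图像 token 的注意力;先累积,再用归一化的累积量增强当前注意力。"""
    acc = acc + attn_img                          # 实时累积
    boosted = attn_img + alpha * acc / acc.sum()  # 增强对图像 token 的关注
    return acc, boosted / boosted.sum()           # 重新归一化为分布
```

在自回归生成的每个解码步调用一次该函数,即可模拟摘要中"动态累积并更新对图像 token 的注意力"的过程。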

[CV-140] DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning CVPR2025

【速读】:该论文旨在解决文档图像分割领域因文档格式多样性和任务多样性导致的挑战,现有方法通常将不同任务单独处理,造成泛化能力有限且资源浪费的问题。为应对这一挑战,论文提出DocSAM(基于Transformer的统一框架),通过将多种文档图像分割任务(如文档版面分析、多粒度文本分割和表格结构识别)建模为实例分割与语义分割的组合来实现统一处理。其关键在于利用Sentence-BERT将各数据集中的类别名称映射为语义查询,并与实例查询通过注意力机制交互,同时与图像特征进行交叉注意以预测实例和语义分割掩码。实例类别的预测依赖于实例查询与语义查询之间的点积计算及softmax归一化。这种方法使DocSAM能够在异构数据集上联合训练,提升模型的鲁棒性、泛化能力和效率,同时减少计算和存储开销。实验结果表明,DocSAM在准确性、效率和适应性方面超越现有方法,展示了其在推动文档图像理解和分割方面的潜力。

链接: https://arxiv.org/abs/2504.04085
作者: Xiao-Hui Li,Fei Yin,Cheng-Lin Liu
机构: Institute of Automation of Chinese Academy of Sciences (自动化研究所, 中科院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper has been accepted by CVPR 2025

点击查看摘要

Abstract:Document image segmentation is crucial for document analysis and recognition but remains challenging due to the diversity of document formats and segmentation tasks. Existing methods often address these tasks separately, resulting in limited generalization and resource wastage. This paper introduces DocSAM, a transformer-based unified framework designed for various document image segmentation tasks, such as document layout analysis, multi-granularity text segmentation, and table structure recognition, by modelling these tasks as a combination of instance and semantic segmentation. Specifically, DocSAM employs Sentence-BERT to map category names from each dataset into semantic queries that match the dimensionality of instance queries. These two sets of queries interact through an attention mechanism and are cross-attended with image features to predict instance and semantic segmentation masks. Instance categories are predicted by computing the dot product between instance and semantic queries, followed by softmax normalization of scores. Consequently, DocSAM can be jointly trained on heterogeneous datasets, enhancing robustness and generalization while reducing computational and storage resources. Comprehensive evaluations show that DocSAM surpasses existing methods in accuracy, efficiency, and adaptability, highlighting its potential for advancing document image understanding and segmentation across various applications. Codes are available at this https URL.
zh
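摘要中"实例 query 与语义 query 做点积、再对得分做 softmax 归一化"的类别预测步骤,可以用几行 NumPy 说明(仅为示意,维度与函数命名均为假设,非 DocSAM 官方代码):

```python
import numpy as np

def predict_instance_categories(inst_q, sem_q):
    """DocSAM 类别预测步骤的示意:实例 query (N_inst, C) 与
    语义 query (N_class, C) 做点积得到分数矩阵,再按行 softmax
    归一化,得到每个实例的类别概率分布。"""
    scores = inst_q @ sem_q.T                       # (N_inst, N_class)
    scores = scores - scores.max(axis=1, keepdims=True)  # 数值稳定
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)
```

由于语义 query 来自各数据集的类别名(经 Sentence-BERT 映射),这一点积分类方式使模型可以在类别集合不同的异构数据集上联合训练。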

[CV-141] UniRVQA: A Unified Framework for Retrieval-Augmented Vision Question Answering via Self-Reflective Joint Training

【速读】:本文针对基于知识的视觉问答(KB-VQA)系统中复杂视觉问题的回答挑战展开研究。现有方法通常采用分离式的检索器与生成器框架,并通过有限的参数共享实现协作,但这种分离可能导致任务间信息理解不充分,从而影响整体性能。此外,多模态信息的高效整合也是一个重要难题,通用多模态预训练模型虽擅长跨模态表征学习,但在细粒度知识密集型问题的精确检索方面表现不足,而专用模型虽可缓解此问题,却面临计算开销大的挑战。为解决上述问题,论文提出了一种统一的检索增强型视觉问答框架(UniRVQA)。其关键在于通过统一框架适配通用多模态预训练模型,实现跨任务的参数级知识共享及现有多模态表征能力的扩展;同时引入反思性回答机制以显式评估和优化模型的知识边界,并在检索增强生成联合训练过程中加入晚期交互,提升查询与文档的细粒度理解能力。最终,该方法在性能上显著优于现有技术,将回答准确率提升了4.7%,并使基础多模态语言模型(MLLMs)在VQA任务上的性能平均提高了7.5%。

链接: https://arxiv.org/abs/2504.04065
作者: Jiaqi Deng,Kaize Shi,Zonghan Wu,Huan Huo,Dingxian Wang,Guandong Xu
机构: The University of Technology Sydney(Sydney, New South Wales, Australia); East China Normal University(华东师范大学, Shanghai, China)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Knowledge-based Vision Question Answering (KB-VQA) systems address complex visual-grounded questions requiring external knowledge, such as web-sourced encyclopedia articles. Existing methods often use sequential and separate frameworks for the retriever and the generator with limited parametric knowledge sharing. However, since both retrieval and generation tasks require accurate understanding of contextual and external information, such separation can potentially lead to suboptimal system performance. Another key challenge is the integration of multimodal information. General-purpose multimodal pre-trained models, while adept at multimodal representation learning, struggle with fine-grained retrieval required for knowledge-intensive visual questions. Recent specialized pre-trained models mitigate the issue, but are computationally expensive. To bridge the gap, we propose a Unified Retrieval-Augmented VQA framework (UniRVQA). UniRVQA adapts general multimodal pre-trained models for fine-grained knowledge-intensive tasks within a unified framework, enabling cross-task parametric knowledge sharing and the extension of existing multimodal representation learning capability. We further introduce a reflective-answering mechanism that allows the model to explicitly evaluate and refine its knowledge boundary. Additionally, we integrate late interaction into the retrieval-augmented generation joint training process to enhance fine-grained understanding of queries and documents. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy, and brings an average 7.5% boost in base MLLMs’ VQA performance.
zh

[CV-142] Can You Count to Nine? A Human Evaluation Benchmark for Counting Limits in Modern Text-to-Video Models

【速读】:该论文旨在解决当前最先进的文本到视频(Text-to-Video, T2V)生成模型在遵循人类指令,特别是处理简单的数值约束任务时存在的根本性挑战。论文提出了一套名为T2VCountBench的专门基准测试,用于评估截至2025年最先进T2V模型的计数能力。该基准通过严格的人工评估来测量生成对象的数量,并涵盖了开源与商业模型的广泛范围。研究发现,现有所有模型在生成包含9个或更少对象的视频时几乎总是失败。此外,论文还探讨了视频风格、时间动态以及多语言输入等因素如何影响计数性能,并尝试通过任务分解等提示优化技术缓解这些问题,但结果表明这种方法未能轻易克服现有局限性。论文的关键在于揭示了当前文本到视频生成领域的重要挑战,并为未来改进模型遵守基本数值约束的能力提供了方向性指导。

链接: https://arxiv.org/abs/2504.04051
作者: Xuyang Guo,Zekai Huang,Jiayan Huo,Yingyu Liang,Zhenmei Shi,Zhao Song,Jiahao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generative models have driven significant progress in a variety of AI tasks, including text-to-video generation, where models like Video LDM and Stable Video Diffusion can produce realistic, movie-level videos from textual instructions. Despite these advances, current text-to-video models still face fundamental challenges in reliably following human commands, particularly in adhering to simple numerical constraints. In this work, we present T2VCountBench, a specialized benchmark aiming at evaluating the counting capability of SOTA text-to-video models as of 2025. Our benchmark employs rigorous human evaluations to measure the number of generated objects and covers a diverse range of generators, including both open-source and commercial models. Extensive experiments reveal that all existing models struggle with basic numerical tasks, almost always failing to generate videos with an object count of 9 or fewer. Furthermore, our comprehensive ablation studies explore how factors like video style, temporal dynamics, and multilingual inputs may influence counting performance. We also explore prompt refinement techniques and demonstrate that decomposing the task into smaller subtasks does not easily alleviate these limitations. Our findings highlight important challenges in current text-to-video generation and provide insights for future research aimed at improving adherence to basic numerical constraints.
zh

[CV-143] A Survey of Pathology Foundation Model: Progress and Future Directions

【速读】:该论文旨在解决计算病理学领域中基于多实例学习框架的病理切片图像分析模型在特征提取与聚合方面的性能瓶颈问题,并系统性地填补现有病理基础模型(Pathology Foundation Models, PFMs)缺乏统一分析框架的空白。论文的关键解决方案在于提出了一种自顶向下的分层分类方法,通过模型范围、预训练方式及模型设计三个维度组织和分析PFMs,同时系统化地将PFM评估任务划分为幻灯片级、Patch级、多模态及生物学任务,构建全面的基准评测标准。此外,论文深入剖析了PFM开发(如病理特异性方法、端到端预训练、数据-模型可扩展性等)与应用(如有效适配、模型维护)中的关键挑战,为未来研究提供了明确方向。

链接: https://arxiv.org/abs/2504.04045
作者: Conghao Xiong,Hao Chen,Joseph J. Y. Sung
机构: Department of Computer Science and Engineering, The Chinese University of Hong Kong (香港中文大学); Department of Computer Science and Engineering and Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology (香港科技大学); Lee Kong Chian School of Medicine, Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Computational pathology, analyzing whole slide images for automated cancer diagnosis, relies on the multiple instance learning framework where performance heavily depends on the feature extractor and aggregator. Recent Pathology Foundation Models (PFMs), pretrained on large-scale histopathology data, have significantly enhanced capabilities of extractors and aggregators but lack systematic analysis frameworks. This survey presents a hierarchical taxonomy organizing PFMs through a top-down philosophy that can be utilized to analyze FMs in any domain: model scope, model pretraining, and model design. Additionally, we systematically categorize PFM evaluation tasks into slide-level, patch-level, multimodal, and biological tasks, providing comprehensive benchmarking criteria. Our analysis identifies critical challenges in both PFM development (pathology-specific methodology, end-to-end pretraining, data-model scalability) and utilization (effective adaptation, model maintenance), paving the way for future directions in this promising field. Resources referenced in this survey are available at this https URL.
zh

[CV-144] UCS: A Universal Model for Curvilinear Structure Segmentation

【速读】:本文旨在解决现有曲线结构分割(CSS)方法在特定领域表现优异但泛化能力有限的问题,同时指出如Segment Anything Model (SAM)等大规模模型虽具有强大的泛化能力,但未针对CSS任务进行优化。为填补这一空白,论文提出了一种名为Universal Curvilinear structure Segmentation (UCS)的新模型,通过适配SAM以完成CSS任务并增强其泛化性能。UCS的关键创新包括:引入Sparse Adapter以继承预训练SAM编码器的泛化能力且减少微调参数;设计Prompt Generation模块利用快速傅里叶变换与高通滤波生成特定曲线提示;采用无需人工交互的mask解码器,包含Hierarchical Feature Compression模块用于增强细节保留及Guidance Feature Compression模块提取压缩图像引导特征。综合评估表明,UCS在涵盖多种自然曲线结构的多领域数据集上实现了最先进的泛化和开集分割性能,确立了通用CSS的新基准。

链接: https://arxiv.org/abs/2504.04034
作者: Dianshuo Li,Li Chen,Yunxiang Cao,Kai Zhu,Jun Cheng
机构: School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China (武汉科技大学计算机科学与技术学院, 中国, 武汉); Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Wuhan, 430065, China (湖北省智能信息处理与实时工业系统重点实验室, 武汉科技大学, 中国, 武汉); Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (ASTAR), 1 Fusionopolis Way, #21-01, Connexis South Tower, Singapore 138632, Republic of Singapore (新加坡信息通信研究院 (I2R), 新加坡科技研究局 (ASTAR), 新加坡)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 9 figures

点击查看摘要

Abstract:Curvilinear structure segmentation (CSS) is vital in various domains, including medical imaging, landscape analysis, industrial surface inspection, and plant analysis. While existing methods achieve high performance within specific domains, their generalizability is limited. On the other hand, large-scale models such as Segment Anything Model (SAM) exhibit strong generalization but are not optimized for curvilinear structures. Existing adaptations of SAM primarily focus on general object segmentation and lack specialized design for CSS tasks. To bridge this gap, we propose the Universal Curvilinear structure Segmentation (UCS) model, which adapts SAM to CSS tasks while enhancing its generalization. UCS features a novel encoder architecture integrating a pretrained SAM encoder with two innovations: a Sparse Adapter, strategically inserted to inherit the pre-trained SAM encoder's generalization capability while minimizing the number of fine-tuning parameters, and a Prompt Generation module, which leverages Fast Fourier Transform with a high-pass filter to generate curve-specific prompts. Furthermore, UCS incorporates a mask decoder that eliminates reliance on manual interaction through a dual-compression module: a Hierarchical Feature Compression module, which aggregates the outputs of the sampled encoder to enhance detail preservation, and a Guidance Feature Compression module, which extracts and compresses image-driven guidance features. Evaluated on a comprehensive multi-domain dataset, including an in-house dataset covering eight natural curvilinear structures, UCS demonstrates state-of-the-art generalization and open-set segmentation performance across medical, engineering, natural, and plant imagery, establishing a new benchmark for universal CSS.
zh
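UCS 中 Prompt Generation 模块"FFT + 高通滤波"的思路可以用 NumPy 简单演示:在频域滤除低频背景后,曲线、边缘等高频结构的响应被保留,可作为提示信号(截止半径 `radius` 为假设参数,非论文设定):

```python
import numpy as np

def highpass_prompt(img, radius=4):
    """FFT + 高通滤波的示意:抑制低频背景,突出高频的曲线/边缘响应。"""
    F = np.fft.fftshift(np.fft.fft2(img))            # 低频移到频谱中心
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 > radius ** 2  # 高通掩码
    resp = np.fft.ifft2(np.fft.ifftshift(F * mask)).real
    return np.abs(resp)
```

对一幅纯常数(只有低频直流分量)的图像,该滤波的输出应接近全零;而含细线结构的图像会在曲线附近产生显著响应。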

[CV-145] Simultaneous Motion And Noise Estimation with Event Cameras

【速读】:该论文试图解决事件相机数据去噪的问题,同时考虑了运动估计这一固有特性。传统方法通常将去噪与其他任务(如运动估计)分开处理,而事件数据中的运动(如自运动、光流等)是其内在属性,没有运动场景边缘无法被感知。论文的关键在于提出了一种同时估计运动和噪声的方法,并且该方法具有灵活性,可以将常用的对比度最大化框架中的单步运动估计替换为其他运动估计算法,例如深度神经网络。实验表明,该方法在E-MLB去噪基准上达到了最先进的结果,在DND21基准上也取得了竞争性的结果,同时在运动估计和强度重建任务中表现出有效性。

链接: https://arxiv.org/abs/2504.04029
作者: Shintaro Shiba,Yoshimitsu Aoki,Guillermo Gallego
机构: Keio University (庆应义塾大学), Japan; Technische Universität Berlin (柏林工业大学); Einstein Center Digital Future, Robotics Institute Germany, and Science of Intelligence Excellence Cluster, Germany
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 13 pages, 13 figures, 6 tables

点击查看摘要

Abstract:Event cameras are emerging vision sensors, whose noise is challenging to characterize. Existing denoising methods for event cameras consider other tasks such as motion estimation separately (i.e., sequentially after denoising). However, motion is an intrinsic part of event data, since scene edges cannot be sensed without motion. This work proposes, to the best of our knowledge, the first method that simultaneously estimates motion in its various forms (e.g., ego-motion, optical flow) and noise. The method is flexible, as it allows replacing the 1-step motion estimation of the widely-used Contrast Maximization framework with any other motion estimator, such as deep neural networks. The experiments show that the proposed method achieves state-of-the-art results on the E-MLB denoising benchmark and competitive results on the DND21 benchmark, while showing its efficacy on motion estimation and intensity reconstruction tasks. We believe that the proposed approach contributes to strengthening the theory of event-data denoising, as well as impacting practical denoising use-cases, as we release the code upon acceptance. Project page: this https URL
zh
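摘要提到该方法把广泛使用的对比度最大化(Contrast Maximization, CMax)框架中的单步运动估计替换为任意运动估计器。CMax 的基本思想可以用一维事件的 NumPy 玩具例子说明:按候选速度把事件扭曲到参考时刻并累积成图像,运动估计正确时图像最"锐利"(方差/对比度最大)。以下网格搜索仅为示意,并非论文实现:

```python
import numpy as np

def contrast(events, v, img_size=16):
    """按候选速度 v 将一维事件 (x, t) 反向扭曲到 t=0,
    累积成直方图图像并返回其方差(即对比度)。"""
    x, t = events[:, 0], events[:, 1]
    xw = np.round(x - v * t).astype(int) % img_size   # 扭曲后的坐标
    img = np.bincount(xw, minlength=img_size).astype(float)
    return img.var()

def estimate_velocity(events, candidates):
    """单步运动估计的网格搜索示意;论文指出该步可替换为任意运动估计器,
    例如深度神经网络。"""
    return max(candidates, key=lambda v: contrast(events, v))
```

对以速度 2 匀速移动的边缘产生的事件,只有 v=2 能把所有事件对齐到同一像素,对比度因此最大。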

[CV-146] Artificial intelligence application in lymphoma diagnosis: from Convolutional Neural Network to Vision Transformer

【速读】:该论文旨在探索视觉Transformer模型在诊断间变性大细胞淋巴瘤与经典霍奇金淋巴瘤方面的应用,并将其分类性能与之前设计的卷积神经网络(Convolutional Neural Network, CNN)进行直接比较。论文的关键在于利用病理全片图像(Whole Slide Images, WSIs)中的组织学染色切片(H&E slides),通过从每张全片图像中提取60个大小为100×100像素、放大倍率为20的图像块,构建了一个包含1200个图像块的数据集。研究结果表明,尽管卷积神经网络具有更成熟的架构且通常在大规模预训练数据不可得时表现更优,但视觉Transformer模型在相对较小的数据集上仍展现出与卷积神经网络相当的卓越诊断准确性(100%)。因此,该研究的关键在于验证视觉Transformer模型在有限数据条件下的潜力及其与传统卷积神经网络的直接对比。

链接: https://arxiv.org/abs/2504.04025
作者: Daniel Rivera,Jacob Huddin,Alexander Banerjee,Rongzhen Zhang,Brenda Mai,Hanadi El Achi,Jacob Armstrong,Amer Wahed,Andy Nguyen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 6 figures, 1 table

点击查看摘要

Abstract:Recently, vision transformers were shown to be capable of outperforming convolutional neural networks when pretrained on sufficiently large datasets. Vision transformer models show good accuracy on large scale datasets, with features of multi-modal training. Due to their promising feature detection, we aim to explore vision transformer models for diagnosis of anaplastic large cell lymphoma versus classical Hodgkin lymphoma using pathology whole slide images of H&E slides. We compared the classification performance of the vision transformer to our previously designed convolutional neural network on the same dataset. The dataset includes whole slide images of H&E slides for 20 cases, including 10 cases in each diagnostic category. From each whole slide image, 60 image patches having size of 100 by 100 pixels and at magnification of 20 were obtained to yield 1200 image patches, from which 90 percent were used for training, 9 percent for validation, and 10 percent for testing. The test results from the convolutional neural network model had previously shown an excellent diagnostic accuracy of 100 percent. The test results from the vision transformer model also showed a comparable accuracy at 100 percent. To the best of the authors' knowledge, this is the first direct comparison of predictive performance between a vision transformer model and a convolutional neural network model using the same dataset of lymphoma. Overall, convolutional neural network has a more mature architecture than vision transformer and is usually the best choice when large scale pretraining is not an available option. Nevertheless, our current study shows comparable and excellent accuracy of vision transformer compared to that of convolutional neural network even with a relatively small dataset of anaplastic large cell lymphoma and classical Hodgkin lymphoma.
zh

[CV-147] Window Token Concatenation for Efficient Visual Large Language Models

【速读】:该论文旨在有效减少视觉大语言模型(Visual Large Language Models, VLLMs)中的视觉token数量,以提升模型效率。为解决这一问题,论文提出了一种名为Window Token Concatenation (WiCo) 的新方法,其关键是通过滑动窗口将空间相邻的视觉token拼接起来,并进一步微调视觉编码器的最后几层,使同一窗口内的视觉token特征更加一致,从而避免直接拼接导致的细节丢失。此外,为了增强细粒度视觉理解任务的表现,论文还提出了WiCo+,在大语言模型的后几层分解视觉token,既利用了大语言模型的感知范围优势,又保持较少的视觉token数量以实现高效推理。代码已开源。

链接: https://arxiv.org/abs/2504.04024
作者: Yifan Li,Wentao Bao,Botao Ye,Zhen Tan,Tianlong Chen,Huan Liu,Yu Kong
机构: Michigan State Univerisity (密歇根州立大学); ETH Zürich (瑞士联邦理工学院); Arizona State University (亚利桑那州立大学); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:To effectively reduce the visual tokens in Visual Large Language Models (VLLMs), we propose a novel approach called Window Token Concatenation (WiCo). Specifically, we employ a sliding window to concatenate spatially adjacent visual tokens. However, directly concatenating these tokens may group diverse tokens into one, and thus obscure some fine details. To address this challenge, we propose fine-tuning the last few layers of the vision encoder to adaptively adjust the visual tokens, encouraging that those within the same window exhibit similar features. To further enhance the performance on fine-grained visual understanding tasks, we introduce WiCo+, which decomposes the visual tokens in later layers of the LLM. Such a design enjoys the merits of the large perception field of the LLM for fine-grained visual understanding while keeping a small number of visual tokens for efficient inference. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance compared with existing token reduction projectors. The code is available: this https URL.
zh
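WiCo 的核心操作是用滑动窗口拼接空间相邻的视觉 token。以下 NumPy 示意采用"窗口大小=步长"的非重叠简化(论文中真正的滑动窗口设置与后续对视觉编码器末几层的微调此处省略),展示 token 数如何缩减为原来的 1/window²、通道维如何相应扩大:

```python
import numpy as np

def wico_concat(tokens, grid, window=2):
    """把 (H*W, C) 的视觉 token 还原为 H×W 网格,按 window×window
    的窗口拼接相邻 token:输出形状为 (H*W/window², C*window²)。"""
    h, w = grid
    c = tokens.shape[1]
    t = tokens.reshape(h, w, c)
    t = t.reshape(h // window, window, w // window, window, c)
    t = t.transpose(0, 2, 1, 3, 4)                 # (块行, 块列, 窗内行, 窗内列, C)
    return t.reshape(-1, window * window * c)      # 每个窗口拼成一个长 token
```

例如 4×4 网格、C=3 的 16 个 token,经 2×2 窗口拼接后变为 4 个 12 维 token,输入 LLM 的序列长度即缩短为 1/4。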

[CV-148] Detection-Friendly Nonuniformity Correction: A Union Framework for Infrared UAVTarget Detection CVPR2025

【速读】:该论文旨在解决红外无人机(UAV)图像因温度依赖性低频非均匀性导致对比度降低的问题,并在此非均匀条件下优化无人机目标检测性能。现有方法通常将红外非均匀性校正(NUC)作为检测的预处理步骤,但这种分离的方式会导致次优性能。论文的关键在于提出了一种名为UniCD的联合框架,通过端到端方式同时处理红外NUC和UAV目标检测任务。其核心解决方案包括:1)将NUC建模为一个由先验知识和数据共同驱动、仅含少量参数的估计问题,生成有利于检测的图像;2)在红外UAV目标检测网络中引入带目标掩码监督的新辅助损失,以强化目标特征并抑制背景;3)设计检测引导的自监督损失,减少校正与检测任务之间的特征差异,从而增强检测对不同非均匀程度的鲁棒性。此外,论文构建了一个包含50,000张不同类型非均匀红外图像、多尺度无人机目标及丰富背景的新基准数据集IRBFD,验证了UniCD在实时处理能力下的鲁棒性。

链接: https://arxiv.org/abs/2504.04012
作者: Houzhang Fang,Xiaolin Wang,Zengyang Li,Lu Wang,Qingshan Li,Yi Chang,Luxin Yan
机构: Xidian University (西安电子科技大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted by CVPR2025

点击查看摘要

Abstract:Infrared unmanned aerial vehicle (UAV) images captured using thermal detectors are often affected by temperature-dependent low-frequency nonuniformity, which significantly reduces the contrast of the images. Detecting UAV targets under nonuniform conditions is crucial in UAV surveillance applications. Existing methods typically treat infrared nonuniformity correction (NUC) as a preprocessing step for detection, which leads to suboptimal performance. Balancing the two tasks while enhancing detection-beneficial information remains challenging. In this paper, we present a detection-friendly union framework, termed UniCD, that simultaneously addresses both infrared NUC and UAV target detection tasks in an end-to-end manner. We first model NUC as a parameter estimation problem with a small number of parameters, jointly driven by priors and data, to generate detection-conducive images. Then, we incorporate a new auxiliary loss with target mask supervision into the backbone of the infrared UAV target detection network to strengthen target features while suppressing the background. To better balance correction and detection, we introduce a detection-guided self-supervised loss to reduce feature discrepancies between the two tasks, thereby enhancing detection robustness to varying nonuniformity levels. Additionally, we construct a new benchmark composed of 50,000 infrared images with various nonuniformity types, multi-scale UAV targets and rich backgrounds with target annotations, called IRBFD. Extensive experiments on IRBFD demonstrate that our UniCD is a robust union framework for NUC and UAV target detection while achieving real-time processing capabilities. The dataset is available at this https URL.
zh

[CV-149] DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion

【速读】:该论文旨在解决生成自然且细腻的倾听者动作以支持长时间互动这一开放性问题。现有方法通常依赖低维运动编码生成面部行为,再进行照片级真实感渲染,限制了视觉保真度与表现力的丰富性。为此,论文提出DiTaiListener,一种基于带多模态条件的视频扩散模型的方法。其关键在于:首先由DiTaiListener-Gen根据说话者的语音与面部动作生成倾听者头部肖像的短片段,通过引入因果时间多模态适配器(CTM-Adapter)使Diffusion Transformer (DiT) 适配倾听者头部肖像生成任务,以因果方式融入说话者的视听线索,确保时间上连贯的倾听者反应;随后针对长视频生成,利用视频到视频扩散模型DiTaiListener-Edit优化片段间的过渡帧,将DiTaiListener-Gen生成的短片段融合为流畅连续的视频,保证面部表情与图像质量的时间一致性。

链接: https://arxiv.org/abs/2504.04010
作者: Maksim Siniukov,Di Chang,Minh Tran,Hongkun Gong,Ashutosh Chaubey,Mohammad Soleymani
机构: University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker’s speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames via DiTaiListener-Edit for a seamless transition. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process speakers’ auditory and visual cues. CTM-Adapter integrates speakers’ input in a causal manner into the video generation process to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition refinement video-to-video diffusion model. The model fuses video segments into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality when merging short video segments produced by DiTaiListener-Gen. Quantitatively, DiTaiListener achieves the state-of-the-art performance on benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on VICO) spaces. User studies confirm the superior performance of DiTaiListener, with the model being the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.
zh

[CV-150] Edge Approximation Text Detector

【速读】:该论文旨在解决现有场景文本检测模型在表示不规则文本形状时存在的粗略轮廓或复杂pipeline的问题。为应对这些挑战,论文引入EdgeText方法,通过将文本的两个长边视为平滑曲线,并利用参数化曲线拟合函数来紧凑地拟合文本轮廓,从而减少不必要的轮廓重建过程。关键在于通过边缘逼近问题的公式化处理以及设计双边增强感知(Bilateral Enhanced Perception, BEP)模块来加强边缘特征的识别,同时引入比例积分损失(Proportional Integral Loss, PI-loss)以加速曲线函数参数的学习并聚焦于曲线分布,而非受文本尺度干扰。

链接: https://arxiv.org/abs/2504.04001
作者: Chuang Yang,Xu Han,Tao Han,Han Han,Bingxuan Zhao,Qi Wang
机构: School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University (西北工业大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pursuing efficient text shape representations helps scene text detection models focus on compact foreground regions and optimize the contour reconstruction steps to simplify the whole detection pipeline. Current approaches either represent irregular shapes via box-to-polygon strategy or decomposing a contour into pieces for fitting gradually, the deficiency of coarse contours or complex pipelines always exists in these models. Considering the above issues, we introduce EdgeText to fit text contours compactly while alleviating excessive contour rebuilding processes. Concretely, it is observed that the two long edges of texts can be regarded as smooth curves. It allows us to build contours via continuous and smooth edges that cover text regions tightly instead of fitting piecewise, which helps avoid the two limitations in current models. Inspired by this observation, EdgeText formulates the text representation as the edge approximation problem via parameterized curve fitting functions. In the inference stage, our model starts with locating text centers, and then creating curve functions for approximating text edges relying on the points. Meanwhile, truncation points are determined based on the location features. In the end, extracting curve segments from curve functions by using the pixel coordinate information brought by truncation points to reconstruct text contours. Furthermore, considering the deep dependency of EdgeText on text edges, a bilateral enhanced perception (BEP) module is designed. It encourages our model to pay attention to the recognition of edge features. Additionally, to accelerate the learning of the curve function parameters, we introduce a proportional integral loss (PI-loss) to force the proposed model to focus on the curve distribution and avoid being disturbed by text scales.
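摘要中"将文本两条长边视为平滑曲线、用参数化曲线拟合函数逼近边缘"这一核心思路,可以用最小二乘多项式拟合作一个极简示意(仅为说明,曲线形式、次数与截断点处理均为笔者假设,非论文模型):

```python
import numpy as np

def fit_edge_curve(points, degree=3):
    """用参数化多项式曲线 y = f(x) 紧致拟合一条文本长边。"""
    x, y = points[:, 0], points[:, 1]
    return np.polynomial.Polynomial.fit(x, y, degree)

def build_contour(top_pts, bot_pts, x_start, x_end, n=25, degree=3):
    """由两条边缘曲线与截断点 (x_start, x_end) 重建闭合文本轮廓。"""
    top = fit_edge_curve(top_pts, degree)
    bot = fit_edge_curve(bot_pts, degree)
    xs = np.linspace(x_start, x_end, n)
    upper = np.stack([xs, top(xs)], axis=1)
    lower = np.stack([xs[::-1], bot(xs[::-1])], axis=1)
    return np.concatenate([upper, lower], axis=0)   # 闭合多边形顶点

# 模拟一条弯曲文本实例的上下边缘采样点
xs = np.linspace(0, 100, 40)
top_pts = np.stack([xs, 10 + 5 * np.sin(xs / 30)], axis=1)
bot_pts = np.stack([xs, 30 + 5 * np.sin(xs / 30)], axis=1)
contour = build_contour(top_pts, bot_pts, 0, 100, n=25)
top_curve = fit_edge_curve(top_pts)
fit_err = np.abs(top_curve(top_pts[:, 0]) - top_pts[:, 1]).max()
```

整条边由一条连续光滑曲线覆盖,避免了分段拟合;轮廓由少量曲线参数与截断点即可还原,这正是"紧致表示"带来的好处。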
zh

[CV-151] View2CAD: Reconstructing View-Centric CAD Models from Single RGB-D Scans

【速读】:该论文旨在解决从单目RGB-D图像精确重建参数化CAD模型(Boundary Representation, B-Rep)的问题。传统方法依赖于完整且无噪声的3D数据以恢复B-Reps,但这类数据获取困难且成本高昂。论文的关键创新在于提出了一种新颖的视角中心B-Rep(View-Centric B-Rep, VB-Rep)表示方法,通过引入可见性处理结构与几何不确定性编码,解决了仅基于单一视图的部分观测难题,并避免了错误几何的生成。此外,结合全景图像分割与迭代几何优化技术,进一步提升了重建质量。实验结果表明,该方法能够在合成及真实RGB-D数据上实现高质量的CAD形状重建,有效弥合现实与数字化之间的差距。

链接: https://arxiv.org/abs/2504.04000
作者: James Noeckel,Benjamin Jones,Adriana Schulz,Brian Curless
机构: University of Washington (华盛顿大学) (Seattle WA USA); Massachusetts Institute of Technology (麻省理工学院) (Cambridge MA USA)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Parametric CAD models, represented as Boundary Representations (B-reps), are foundational to modern design and manufacturing workflows, offering the precision and topological breakdown required for downstream tasks such as analysis, editing, and fabrication. However, B-Reps are often inaccessible due to conversion to more standardized, less expressive geometry formats. Existing methods to recover B-Reps from measured data require complete, noise-free 3D data, which are laborious to obtain. We alleviate this difficulty by enabling the precise reconstruction of CAD shapes from a single RGB-D image. We propose a method that addresses the challenge of reconstructing only the observed geometry from a single view. To allow for these partial observations, and to avoid hallucinating incorrect geometry, we introduce a novel view-centric B-rep (VB-Rep) representation, which incorporates structures to handle visibility limits and encode geometric uncertainty. We combine panoptic image segmentation with iterative geometric optimization to refine and improve the reconstruction process. Our results demonstrate high-quality reconstruction on synthetic and real RGB-D data, showing that our method can bridge the reality gap.
zh

[CV-152] TGraphX: Tensor-Aware Graph Neural Network for Multi-Dimensional Feature Learning

【速读】:该论文旨在解决传统卷积神经网络(CNNs)在视觉推理任务中缺乏建模对象间关系能力,以及图神经网络(GNNs)通常丢弃空间细节的问题。论文的关键创新在于通过CNN生成保留局部空间语义的多维节点特征(如 3×128×128 张量),并将这些特征集成到图结构中,利用1×1卷积进行消息传递以融合相邻特征并保持结构信息。此外,采用带有残差连接的深层CNN聚合器来稳健地优化融合后的消息,确保梯度稳定流动及端到端可训练性。这种方案不仅弥合了空间特征提取与关系推理之间的差距,还在目标检测细化和组合推理方面展现了显著提升。

链接: https://arxiv.org/abs/2504.03953
作者: Arash Sajjadi,Mark Eramian
机构: University of Saskatchewan (萨斯喀彻温大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to arXiv. Code repository: this https URL ||| this https URL

点击查看摘要

Abstract:TGraphX presents a novel paradigm in deep learning by unifying convolutional neural networks (CNNs) with graph neural networks (GNNs) to enhance visual reasoning tasks. Traditional CNNs excel at extracting rich spatial features from images but lack the inherent capability to model inter-object relationships. Conversely, conventional GNNs typically rely on flattened node features, thereby discarding vital spatial details. TGraphX overcomes these limitations by employing CNNs to generate multi-dimensional node features (e.g., 3×128×128 tensors) that preserve local spatial semantics. These spatially aware nodes participate in a graph where message passing is performed using 1×1 convolutions, which fuse adjacent features while maintaining their structure. Furthermore, a deep CNN aggregator with residual connections is used to robustly refine the fused messages, ensuring stable gradient flow and end-to-end trainability. Our approach not only bridges the gap between spatial feature extraction and relational reasoning but also demonstrates significant improvements in object detection refinement and ensemble reasoning.
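"节点特征为保留空间结构的张量、消息传递用 1×1 卷积完成"这一机制可以用 NumPy 写成一个最小草图(示意性质,图结构、尺寸与残差方式均为笔者假设):1×1 卷积等价于在每个像素位置上做通道间线性变换,因此邻居聚合不会破坏空间布局。

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 3, 8, 8            # 示意用的小尺寸 (论文中节点特征为 3x128x128)
num_nodes = 4
# 每个节点是一个保留空间语义的 CNN 特征张量
x = rng.standard_normal((num_nodes, C, H, W))
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]])            # 无向图邻接矩阵

W1x1 = rng.standard_normal((C, C)) * 0.1  # 1x1 卷积核 = 通道间线性映射

def message_passing(x, adj, W1x1):
    """用 1x1 卷积生成消息并按邻接关系聚合, 融合邻居特征且保持空间结构。"""
    # 1x1 卷积: 在每个 (h, w) 位置上对通道做同一个线性变换
    msgs = np.einsum('oc,nchw->nohw', W1x1, x)
    # 聚合邻居消息, 残差连接保留节点自身特征
    agg = np.einsum('uv,vchw->uchw', adj, msgs)
    return x + agg

out = message_passing(x, adj, W1x1)
```

由于 1×1 卷积是线性的,节点 0 的输出恰好等于自身特征加上对邻居特征之和做同一通道变换的结果,空间维度 (H, W) 全程不变。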
zh

[CV-153] ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition

【速读】:该论文致力于解决开放世界第一人称活动识别问题,这一任务因其无约束的特性而具有挑战性,要求模型能够从一个庞大且部分观测到的搜索空间中推断未见过的活动。论文的关键创新在于提出了ProbRes(概率残差搜索框架),它基于跳跃扩散过程,通过平衡先验引导的探索与基于似然驱动的利用,在搜索空间中高效导航。ProbRes的核心在于结合结构化的常识先验构建语义一致的搜索空间,并使用视觉语言模型(Vision-Language Models, VLMs)自适应地优化预测,同时采用随机搜索机制定位高似然的活动标签,从而有效减少穷举搜索的需求。

链接: https://arxiv.org/abs/2504.03948
作者: Sanjoy Kundu,Shanmukha Vellamchetti,Sathyanarayanan N. Aakur
机构: CSSE Department, Auburn University (计算机与软件工程系, 奥本大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 6 figures, 3 tables. Under review

点击查看摘要

Abstract:Open-world egocentric activity recognition poses a fundamental challenge due to its unconstrained nature, requiring models to infer unseen activities from an expansive, partially observed search space. We introduce ProbRes, a Probabilistic Residual search framework based on jump-diffusion that efficiently navigates this space by balancing prior-guided exploration with likelihood-driven exploitation. Our approach integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs) and employs a stochastic search mechanism to locate high-likelihood activity labels while minimizing exhaustive enumeration efficiently. We systematically evaluate ProbRes across multiple openness levels (L0 - L3), demonstrating its adaptability to increasing search space complexity. In addition to achieving state-of-the-art performance on benchmark datasets (GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego), we establish a clear taxonomy for open-world recognition, delineating the challenges and methodological advancements necessary for egocentric activity understanding. Our results highlight the importance of structured search strategies, paving the way for scalable and efficient open-world activity recognition.
zh

[CV-154] Improving Brain Disorder Diagnosis with Advanced Brain Function Representation and Kolmogorov-Arnold Networks

【速读】:该论文试图解决传统基于预定义脑图谱量化功能连接(Functional Connectivity, FC)在诊断脑部疾病时存在的选择偏倚以及忽视特异性的局限性。论文提出了一种基于Transformer的新型分类网络AFBR-KAN,其关键在于利用Kolmogorov-Arnold Network (KAN) 块替代传统的多层感知机(Multi-Layer Perceptron, MLP)组件,以更有效地表征脑功能,从而辅助自闭症谱系障碍(Autism Spectrum Disorder, ASD)的诊断,并通过实验验证了该方法在不同模型架构配置下的有效性。

链接: https://arxiv.org/abs/2504.03923
作者: Tyler Ward,Abdullah-Al-Zubaer Imran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Paper accepted at MIDL 2025

点击查看摘要

Abstract:Quantifying functional connectivity (FC), a vital metric for the diagnosis of various brain disorders, traditionally relies on the use of a pre-defined brain atlas. However, using such atlases can lead to issues regarding selection bias and lack of regard for specificity. Addressing this, we propose a novel transformer-based classification network (AFBR-KAN) with effective brain function representation to aid in diagnosing autism spectrum disorder (ASD). AFBR-KAN leverages Kolmogorov-Arnold Network (KAN) blocks replacing traditional multi-layer perceptron (MLP) components. Thorough experimentation reveals the effectiveness of AFBR-KAN in improving the diagnosis of ASD under various configurations of the model architecture. Our code is available at this https URL
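AFBR-KAN 的核心替换是"用 KAN 块取代 MLP":KAN 在每条边上学习一个一元函数,而 MLP 在节点上使用固定激活。下面是笔者的一个极简 KAN 层草图(基函数选择、尺寸与初始化均为说明用的假设,与论文实现无关):

```python
import numpy as np

def silu(x):
    return x / (1 + np.exp(-x))

class TinyKANLayer:
    """极简 KAN 层示意: 每条边学习一个一元函数 (基函数的加权和),
    区别于 MLP 在节点上套用固定的非线性激活。"""
    def __init__(self, d_in, d_out, n_centers=5, rng=None):
        rng = rng or np.random.default_rng(0)
        self.centers = np.linspace(-2, 2, n_centers)
        # coef[i, j, k]: 输入 i 到输出 j 这条边上第 k 个基函数的系数
        self.coef = rng.standard_normal((d_in, d_out, n_centers + 1)) * 0.1

    def __call__(self, x):                               # x: (batch, d_in)
        # 基函数: 一个 SiLU 项 + 若干高斯 RBF
        rbf = np.exp(-(x[..., None] - self.centers) ** 2)  # (batch, d_in, K)
        basis = np.concatenate([silu(x)[..., None], rbf], axis=-1)
        # 每条边的一元函数输出, 再对输入维求和得到各输出单元
        return np.einsum('bik,ijk->bj', basis, self.coef)

layer = TinyKANLayer(d_in=4, d_out=2)
x = np.random.default_rng(1).standard_normal((8, 4))
y = layer(x)
```

实际 KAN 通常用可学习样条作为边函数,这里用"SiLU + RBF 基"只是为了展示"边上可学习一元函数"这一结构差异。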
zh

[CV-155] Leveraging Gait Patterns as Biomarkers: An attention-guided Deep Multiple Instance Learning Network for Scoliosis Classification

【速读】:该论文致力于解决青少年脊柱侧弯(Scoliosis)早期检测困难的问题,特别是传统方法依赖临床经验且X射线成像存在辐射风险,限制了大规模筛查的应用。论文提出了一种基于注意力引导的深度多实例学习方法(Gait-MIL),其灵感来源于ScoNet-MT利用步态模式进行检测的开创性工作。该方法的关键在于有效提取步态模式中的判别特征,并在首个基于步态模式的大规模数据集上验证了其在脊柱侧弯分类中的性能。实验结果表明,Gait-MIL显著提升了步态作为生物标志物的检测性能,尤其在Neutral情况下表现突出,同时在类别不平衡场景下也表现出稳健性,为大规模脊柱侧弯筛查提供了有前景的工具。

链接: https://arxiv.org/abs/2504.03894
作者: Haiqing Li,Yuzhi Guo,Feng Jiang,Qifeng Zhou,Hehuan Ma,Junzhou Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures

点击查看摘要

Abstract:Scoliosis is a spinal curvature disorder that is difficult to detect early and can compress the chest cavity, impacting respiratory function and cardiac health. Especially for adolescents, delayed detection and treatment result in worsening compression. Traditional scoliosis detection methods heavily rely on clinical expertise, and X-ray imaging poses radiation risks, limiting large-scale early screening. We propose an Attention-Guided Deep Multi-Instance Learning method (Gait-MIL) to effectively capture discriminative features from gait patterns, which is inspired by ScoNet-MT’s pioneering use of gait patterns for scoliosis detection. We evaluate our method on the first large-scale dataset based on gait patterns for scoliosis classification. The results demonstrate that our study improves the performance of using gait as a biomarker for scoliosis detection, significantly enhances detection accuracy for the particularly challenging Neutral cases, where subtle indicators are often overlooked. Our Gait-MIL also performs robustly in imbalanced scenarios, making it a promising tool for large-scale scoliosis screening.
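注意力引导的多实例学习(MIL)通常把一段步态序列视为一个"包",对包内各片段嵌入做注意力加权池化,权重即反映各片段的判别贡献。下面是笔者按经典注意力 MIL 池化(Ilse 等人提出的形式)写的草图,Gait-MIL 的具体结构可能不同,所有尺寸与变量名均为假设:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_mil_pool(H, V, w):
    """注意力引导的多实例池化: 对包内实例特征加权求和,
    权重反映各步态片段对分类的判别贡献。"""
    scores = np.tanh(H @ V.T) @ w          # (n_instances,)
    a = np.exp(scores - scores.max())
    a = a / a.sum()                        # softmax 注意力权重
    return a @ H, a                        # 包级特征与每实例权重

n, d, d_att = 6, 16, 8                     # 一个包含 6 个步态片段的包
H = rng.standard_normal((n, d))            # 实例 (片段) 嵌入
V = rng.standard_normal((d_att, d))
w = rng.standard_normal(d_att)
bag_feat, attn = attention_mil_pool(H, V, w)
```

注意力权重非负且和为 1,因此除了得到包级特征用于分类,还可按权重检视哪些步态片段携带了细微的侧弯线索。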
zh

[CV-156] WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments

【速读】:本文针对传统单目 RGB SLAM 系统在动态环境(Dynamic Environments)下性能下降的问题,提出了一种鲁棒且高效的单目 RGB SLAM 系统——WildGS-SLAM。其核心解决方案在于引入一种不确定性感知的几何映射方法,通过结合深度信息与不确定性信息来增强跟踪、建图及渲染性能,特别是在存在移动物体的情况下。关键创新点包括:(1) 利用浅层多层感知机(Shallow Multi-Layer Perceptron)和 DINOv2 特征预测不确定性图(Uncertainty Map),用于指导动态对象移除;(2) 将该不确定性图应用于密集束调整(Dense Bundle Adjustment)和高斯地图优化(Gaussian Map Optimization),从而提升重建精度。实验结果表明,WildGS-SLAM 在动态场景中的表现优于现有最先进的方法。

链接: https://arxiv.org/abs/2504.03886
作者: Jianhao Zheng,Zihan Zhu,Valentin Bieri,Marc Pollefeys,Songyou Peng,Iro Armeni
机构: Stanford University (斯坦福大学); ETH Zürich (瑞士联邦理工学院); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We present WildGS-SLAM, a robust and efficient monocular RGB SLAM system designed to handle dynamic environments by leveraging uncertainty-aware geometric mapping. Unlike traditional SLAM systems, which assume static scenes, our approach integrates depth and uncertainty information to enhance tracking, mapping, and rendering performance in the presence of moving objects. We introduce an uncertainty map, predicted by a shallow multi-layer perceptron and DINOv2 features, to guide dynamic object removal during both tracking and mapping. This uncertainty map enhances dense bundle adjustment and Gaussian map optimization, improving reconstruction accuracy. Our system is evaluated on multiple datasets and demonstrates artifact-free view synthesis. Results showcase WildGS-SLAM’s superior performance in dynamic environments compared to state-of-the-art methods.
zh

[CV-157] 3D Scene Understanding Through Local Random Access Sequence Modeling

【速读】:该论文旨在解决单图像三维场景理解的问题,这一问题是计算机视觉领域的重要挑战,广泛应用于图形学、增强现实和机器人等领域。现有基于扩散模型的方法虽有潜力,但在保持物体与场景一致性方面表现欠佳,特别是在复杂的真实场景中。为克服这些局限性,论文提出了一种名为局部随机访问序列(Local Random Access Sequence, LRAS)的自回归生成方法。LRAS的关键在于采用局部补丁量化和随机顺序序列生成策略,并利用光流作为中间表示进行三维场景编辑。实验结果表明,LRAS在新视角合成和三维物体操作任务上达到了最先进的性能。此外,通过简单调整序列设计,该框架还能自然扩展至自监督深度估计任务。因此,LRAS提供了一个统一且高效的框架,用于构建下一代三维视觉模型。

链接: https://arxiv.org/abs/2504.03875
作者: Wanhee Lee,Klemen Kotar,Rahul Mysore Venkatesh,Jared Watrous,Honglin Chen,Khai Loong Aw,Daniel L. K. Yamins
机构: Stanford University (斯坦福大学); OpenAI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project webpage: this https URL

点击查看摘要

Abstract:3D scene understanding from single images is a pivotal problem in computer vision with numerous downstream applications in graphics, augmented reality, and robotics. While diffusion-based modeling approaches have shown promise, they often struggle to maintain object and scene consistency, especially in complex real-world scenarios. To address these limitations, we propose an autoregressive generative approach called Local Random Access Sequence (LRAS) modeling, which uses local patch quantization and randomly ordered sequence generation. By utilizing optical flow as an intermediate representation for 3D scene editing, our experiments demonstrate that LRAS achieves state-of-the-art novel view synthesis and 3D object manipulation capabilities. Furthermore, we show that our framework naturally extends to self-supervised depth estimation through a simple modification of the sequence design. By achieving strong performance on multiple 3D scene understanding tasks, LRAS provides a unified and effective framework for building the next generation of 3D vision models.
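LRAS 的两个要素"局部补丁量化"与"随机顺序序列生成"可以用一个玩具例子示意:补丁被映射到码本索引,序列元素是 (位置, token) 对,位置显式编码使任意(随机)顺序都能解码。以下为笔者的说明性草图,码本、补丁大小等均为假设:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_patches(img, patch, codebook):
    """局部补丁量化: 将图像切块并映射到最近的码本索引。"""
    H, W = img.shape
    tokens = {}
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            p = img[i:i + patch, j:j + patch].ravel()
            idx = np.argmin(((codebook - p) ** 2).sum(axis=1))
            tokens[(i // patch, j // patch)] = int(idx)
    return tokens

def random_order_sequence(tokens, rng):
    """随机顺序序列: 每个元素是 (位置, token) 对,
    位置显式给出, 因此序列可按任意顺序生成与解码 (局部随机访问)。"""
    items = list(tokens.items())
    rng.shuffle(items)
    return items

img = rng.random((16, 16))
patch = 4
codebook = rng.random((8, patch * patch))   # 8 个码字的示意码本
tokens = quantize_patches(img, patch, codebook)
seq = random_order_sequence(tokens, rng)
decoded = dict(seq)                         # 与生成顺序无关, 还原 token 网格
```

真实模型在这种 (位置, token) 序列上训练自回归 Transformer;这里只演示表示本身为何天然支持乱序访问。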
zh

[CV-158] Control Map Distribution using Map Query Bank for Online Map Generation

【速读】:该论文旨在解决基于视觉的在线高清地图生成(Visual-based Online Map Generation, OMG)任务中,初始地图查询分布质量受限于有限查询数量的问题。现有方法依赖预训练的初始地图查询分布,但其容量有限,导致在特定场景下无法实现最优的地图预测。为了解决这一问题,论文提出将整个高清地图分布分解为一组点表示,即地图查询库(Map Query Bank, MQBank)。同时,通过引入低成本的标准定义地图(Standard Definition Map, SD map)数据作为先验知识,构建针对不同场景的具体初始地图查询分布。此外,为了在地图解码器网络的每一层保留实例级别的地图查询特征的同时保持点级别的详细信息,论文进一步利用地图查询库方法来保持点级别的信息交互。最终实验表明,该方法不仅提供了关于标准定义地图先验的新见解,还在OpenLaneV2基准测试中分别于车辆车道和行人区域取得了40.5%与45.7%的mAP新纪录。

链接: https://arxiv.org/abs/2504.03868
作者: Ziming Liu,Leichen Wang,Ge Yang,Xinrun Li,Xingtao Hu,Hao Sun,Guangyu Gao
机构: AID-OMG team, Bosch Research (博世研究), Shanghai, China; University of Stuttgart (斯图加特大学), Stuttgart, Germany; Newcastle University (纽卡斯尔大学), Newcastle upon Tyne, England; Beijing Institute of Technology (北京理工大学), Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable autonomous driving systems require high-definition (HD) map that contains detailed map information for planning and navigation. However, pre-build HD map requires a large cost. Visual-based Online Map Generation (OMG) has become an alternative low-cost solution to build a local HD map. Query-based BEV Transformer has been a base model for this task. This model learns HD map predictions from an initial map queries distribution which is obtained by offline optimization on training set. Besides the quality of BEV feature, the performance of this model also highly relies on the capacity of initial map query distribution. However, this distribution is limited because the limited query number. To make map predictions optimal on each test sample, it is essential to generate a suitable initial distribution for each specific scenario. This paper proposes to decompose the whole HD map distribution into a set of point representations, namely map query bank (MQBank). To build specific map query initial distributions of different scenarios, low-cost standard definition map (SD map) data is introduced as a kind of prior knowledge. Moreover, each layer of map decoder network learns instance-level map query features, which will lose detailed information of each point. However, BEV feature map is a point-level dense feature. It is important to keep point-level information in map queries when interacting with BEV feature map. This can also be solved with map query bank method. Final experiments show a new insight on SD map prior and a new record on OpenLaneV2 benchmark with 40.5%, 45.7% mAP on vehicle lane and pedestrian area.
zh

[CV-159] Can ChatGPT Learn My Life From a Week of First-Person Video?

【速读】:该论文旨在探索基础模型(Foundation Models)通过第一人称相机数据(First-Person Camera Data)学习佩戴者个人生活的能力。论文的关键解决方案在于设计了一个实验,作者在一周内佩戴相机头戴设备累计54小时,生成不同长度的摘要(如分钟级、小时级和天级摘要),并将这些摘要用于微调GPT-4o和GPT-4o-mini模型。通过查询微调后的模型,研究者能够评估模型从数据中学到了哪些信息。关键之处在于利用生成的摘要层次结构作为训练数据,从而揭示模型在学习佩戴者个人信息方面的表现与局限性。

链接: https://arxiv.org/abs/2504.03857
作者: Keegan Harris
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Motivated by recent improvements in generative AI and wearable camera devices (e.g. smart glasses and AI-enabled pins), I investigate the ability of foundation models to learn about the wearer’s personal life through first-person camera data. To test this, I wore a camera headset for 54 hours over the course of a week, generated summaries of various lengths (e.g. minute-long, hour-long, and day-long summaries), and fine-tuned both GPT-4o and GPT-4o-mini on the resulting summary hierarchy. By querying the fine-tuned models, we are able to learn what the models learned about me. The results are mixed: Both models learned basic information about me (e.g. approximate age, gender). Moreover, GPT-4o correctly deduced that I live in Pittsburgh, am a PhD student at CMU, am right-handed, and have a pet cat. However, both models also suffered from hallucination and would make up names for the individuals present in the video footage of my life.
zh

[CV-160] Detection Limits and Statistical Separability of Tree Ring Watermarks in Rectified Flow-based Text-to-Image Generation Models

【速读】:该论文旨在解决树环水印(Tree-Ring Watermarking)技术在修正流模型(rectified flow-based models)中有效性未被充分探索的问题,特别是这些模型中存在的潜码反演(latent inversion)噪声挑战。论文通过广泛实验评估并比较了SD 2.1和FLUX.1-dev模型中水印检测与可分离性的性能,分析了不同文本引导配置及增强攻击下的表现,揭示了潜码反演局限性对水印恢复以及带水印与未带水印图像统计分离的影响。研究的关键在于强调现有最先进的(SOTA)模型中树环水印的局限性,并指出需要改进潜码反演方法以实现可靠的水印检测与可分离性。所有实验结果及相关资源已公开发布。

链接: https://arxiv.org/abs/2504.03850
作者: Ved Umrajkar,Aakash Kumar Singh
机构: Department of Mathematics (数学系), Indian Institute of Technology Roorkee (印度理工学院鲁尔基分校); Mehta Family School of Data Science and Artificial Intelligence (Mehta家族数据科学与人工智能学院), Indian Institute of Technology Roorkee (印度理工学院鲁尔基分校); Data Science Group, IIT Roorkee (印度理工学院鲁尔基分校数据科学组)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Tree-Ring Watermarking is a significant technique for authenticating AI-generated images. However, its effectiveness in rectified flow-based models remains unexplored, particularly given the inherent challenges of these models with noise latent inversion. Through extensive experimentation, we evaluated and compared the detection and separability of watermarks between SD 2.1 and FLUX.1-dev models. By analyzing various text guidance configurations and augmentation attacks, we demonstrate how inversion limitations affect both watermark recovery and the statistical separation between watermarked and unwatermarked images. Our findings provide valuable insights into the current limitations of Tree-Ring Watermarking in the current SOTA models and highlight the critical need for improved inversion methods to achieve reliable watermark detection and separability. The official implementation, dataset release and all experimental results are available at this https URL.
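树环水印的基本机制是:在初始噪声的傅里叶域同心圆环内写入固定图案(密钥),检测时将图像反演回噪声,比较圆环区域与密钥的距离。下面是笔者的一个理想化草图,假设反演是完美的(论文研究的正是反演噪声如何破坏这种可分性);半径、图案取值等均为说明用的假设:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64

def ring_mask(n, r_in, r_out):
    """频域同心圆环掩码 (树环水印的嵌入/检测区域)。"""
    yy, xx = np.mgrid[0:n, 0:n] - n // 2
    r = np.sqrt(xx**2 + yy**2)
    return (r >= r_in) & (r < r_out)

mask = ring_mask(N, 8, 12)
key = np.zeros((N, N), dtype=complex)       # 圆环内的固定水印图案 (密钥)
key[mask] = 4.0

def embed(noise, key, mask):
    """在初始噪声的傅里叶域圆环内写入水印图案。"""
    F = np.fft.fftshift(np.fft.fft2(noise))
    F[mask] = key[mask]
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

def detect_stat(noise, key, mask):
    """检测统计量: 圆环区域与密钥的 L1 距离, 越小越可能带水印。"""
    F = np.fft.fftshift(np.fft.fft2(noise))
    return np.abs(F[mask] - key[mask]).mean()

wm_noise = embed(rng.standard_normal((N, N)), key, mask)
d_wm = detect_stat(wm_noise, key, mask)             # 完美反演下接近 0
d_clean = detect_stat(rng.standard_normal((N, N)), key, mask)
```

带水印噪声与无水印噪声的统计量差距极大,即"统计可分性";论文指出在修正流模型中反演误差会把 d_wm 推高,从而压缩这一差距。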
zh

[CV-161] A Hybrid Wavelet-Fourier Method for Next-Generation Conditional Diffusion Models

【速读】:该论文旨在解决传统扩散模型(Diffusion Models)在生成高保真图像时空间定位能力不足的问题,并提出一种结合多尺度频域表示的新框架。论文的关键创新在于引入了一种基于小波-傅里叶混合变换的扩散模型,即Wavelet-Fourier-Diffusion。该方法通过将小波子带分解与部分傅里叶步骤相结合,利用小波的空间局部化特性补充传统的傅里叶分析,从而更有效地捕捉全局结构和细粒度特征。这种策略在正向和逆向扩散过程中于混合频域内逐步降解和重构图像,显著提升了生成图像的空间定位精度与细节表现力。此外,论文进一步扩展该方法至条件图像生成任务,通过跨注意力机制集成嵌入或条件特征,实现对生成结果的精细控制。实验表明,该方法在CIFAR-10、CelebA-HQ及ImageNet数据集上的表现优于基准扩散模型和最先进的生成对抗网络(GANs),特别是在Fréchet Inception Distance (FID) 和Inception Score (IS) 指标上。

链接: https://arxiv.org/abs/2504.03821
作者: Andrew Kiruluta,Andreas Lemos
机构: School of Information, University of California (加州大学信息学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:We present a novel generative modeling framework,Wavelet-Fourier-Diffusion, which adapts the diffusion paradigm to hybrid frequency representations in order to synthesize high-quality, high-fidelity images with improved spatial localization. In contrast to conventional diffusion models that rely exclusively on additive noise in pixel space, our approach leverages a multi-transform that combines wavelet sub-band decomposition with partial Fourier steps. This strategy progressively degrades and then reconstructs images in a hybrid spectral domain during the forward and reverse diffusion processes. By supplementing traditional Fourier-based analysis with the spatial localization capabilities of wavelets, our model can capture both global structures and fine-grained features more effectively. We further extend the approach to conditional image generation by integrating embeddings or conditional features via cross-attention. Experimental evaluations on CIFAR-10, CelebA-HQ, and a conditional ImageNet subset illustrate that our method achieves competitive or superior performance relative to baseline diffusion models and state-of-the-art GANs, as measured by Fréchet Inception Distance (FID) and Inception Score (IS). We also show how the hybrid frequency-based representation improves control over global coherence and fine texture synthesis, paving the way for new directions in multi-scale generative modeling.
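"小波子带分解 + 对部分子带做傅里叶变换"的混合变换本身是可逆的,这是在该混合谱域中做正向加噪/逆向重构的前提。下面是笔者用单层 Haar 小波手写的最小示意(仅演示混合变换的往返一致性,扩散过程本身与网络结构不在此列;子带选择等均为假设):

```python
import numpy as np

def haar_dwt2(x):
    """单层 2D Haar 小波分解: 返回近似子带与三个细节子带。"""
    a = (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 2
    h = (x[0::2, 0::2] - x[0::2, 1::2] + x[1::2, 0::2] - x[1::2, 1::2]) / 2
    v = (x[0::2, 0::2] + x[0::2, 1::2] - x[1::2, 0::2] - x[1::2, 1::2]) / 2
    d = (x[0::2, 0::2] - x[0::2, 1::2] - x[1::2, 0::2] + x[1::2, 1::2]) / 2
    return a, h, v, d

def haar_idwt2(a, h, v, d):
    n = a.shape[0] * 2
    x = np.empty((n, n))
    x[0::2, 0::2] = (a + h + v + d) / 2
    x[0::2, 1::2] = (a - h + v - d) / 2
    x[1::2, 0::2] = (a + h - v - d) / 2
    x[1::2, 1::2] = (a - h - v + d) / 2
    return x

def hybrid_forward(x):
    """混合变换: 小波子带分解 + 对近似子带做部分傅里叶变换。"""
    a, h, v, d = haar_dwt2(x)
    return np.fft.fft2(a), h, v, d

def hybrid_inverse(Fa, h, v, d):
    return haar_idwt2(np.real(np.fft.ifft2(Fa)), h, v, d)

rng = np.random.default_rng(0)
img = rng.random((16, 16))
rec = hybrid_inverse(*hybrid_forward(img))
err = np.abs(rec - img).max()
```

细节子带保留了小波的空间定位能力,近似子带的傅里叶表示则便于刻画全局结构;扩散的加噪与去噪即在这组系数上进行。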
zh

[CV-162] From Keypoints to Realism: A Realistic and Accurate Virtual Try-on Network from 2D Images

【速读】:该论文旨在解决图像虚拟试穿中目标服装细节无法有效再现以及泛化能力不足的问题。现有方法通常难以精确保留目标服装的细节特征,并且在处理新场景时缺乏适应性。论文提出的方法通过完全移除人物初始服装后,利用预测的关键点进行精确的形变操作,以实现目标服装与人物身体结构及姿态的精准对齐。关键在于基于形变后的服装生成更准确的身体分割图,并通过与分割图中预测服装区域对齐感知的分割归一化技术消除不匹配区域。最终,生成器输出具有高视觉质量的图像,精确重建目标服装的整体形状和纹理特征。此方法强调保持服装特性并提高对多样化姿态的适应性,从而提升其在不同应用场景中的通用性。

链接: https://arxiv.org/abs/2504.03807
作者: Maliheh Toozandehjani,Ali Mousavi,Reza Taheri
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: in Persian language

点击查看摘要

Abstract:The aim of image-based virtual try-on is to generate realistic images of individuals wearing target garments, ensuring that the pose, body shape and characteristics of the target garment are accurately preserved. Existing methods often fail to reproduce the fine details of target garments effectively and lack generalizability to new scenarios. In the proposed method, the person’s initial garment is completely removed. Subsequently, a precise warping is performed using the predicted keypoints to fully align the target garment with the body structure and pose of the individual. Based on the warped garment, a body segmentation map is more accurately predicted. Then, using an alignment-aware segment normalization, the misaligned areas between the warped garment and the predicted garment region in the segmentation map are removed. Finally, the generator produces the final image with high visual quality, reconstructing the precise characteristics of the target garment, including its overall shape and texture. This approach emphasizes preserving garment characteristics and improving adaptability to various poses, providing better generalization for diverse applications.
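"利用预测关键点做精确形变、使服装与人体姿态对齐"这一步的最简版本,是由关键点对应关系最小二乘估计一个几何变换。论文的形变方式应当比仿射更富表达力(如薄板样条),这里仅以仿射为笔者的说明性替身,关键点与变换参数均为假设:

```python
import numpy as np

def fit_affine(src, dst):
    """由关键点对应关系最小二乘估计仿射形变 (服装关键点 -> 人体关键点)。"""
    n = src.shape[0]
    A = np.hstack([src, np.ones((n, 1))])       # (n, 3) 齐次坐标
    M, *_ = np.linalg.lstsq(A, dst, rcond=None) # dst ≈ A @ M, M 为 3x2 参数
    return M

def warp(points, M):
    return np.hstack([points, np.ones((len(points), 1))]) @ M

src = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])   # 服装关键点
theta = 0.2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
dst = src @ R.T * 1.5 + np.array([2.0, 3.0])               # 目标人体关键点
M = fit_affine(src, dst)
err = np.abs(warp(src, M) - dst).max()
```

当目标变换本身是仿射(旋转 + 缩放 + 平移)时,最小二乘可精确恢复;形变后的服装再用于预测更准确的身体分割图。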
zh

[CV-163] FAST: Federated Active Learning with Foundation Models for Communication-efficient Sampling and Training

【速读】:本文旨在解决在人机协作学习中减少通信成本的同时最小化标注工作量的最佳实践问题。现有联邦主动学习(Federated Active Learning, FAL)方法通常依赖于将主动采样与联邦更新分离的迭代标注过程,导致高昂的通信和标注开销,尤其是在跨机构(cross-silo)设置中,当客户端拥有大量本地数据时尤为明显。为应对这一挑战,论文提出了一种名为FAST的两阶段联邦主动学习框架,其关键在于利用基础模型(foundation models)进行初步弱标注,随后聚焦于最具不确定性的样本进行精炼。通过整合基础模型的知识表示以及将精炼步骤融入高效的工作流,FAST显著降低了由迭代主动采样带来的开销。实验结果表明,在有限的5%标注预算下,FAST相比现有FAL方法平均提升了4.36%的性能,并将通信轮次数减少了八倍。

链接: https://arxiv.org/abs/2504.03783
作者: Haoyuan Li,Jindong Wang,Mathias Funk,Aaqib Saeed
机构: Department of Industrial Design, Eindhoven University of Technology (埃因霍芬理工大学); Department of Arts & Sciences, College of William & Mary (威廉与玛丽学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Federated Active Learning (FAL) has emerged as a promising framework to leverage large quantities of unlabeled data across distributed clients while preserving data privacy. However, real-world deployments remain limited by high annotation costs and communication-intensive sampling processes, particularly in a cross-silo setting, when clients possess substantial local datasets. This paper addresses the crucial question: What is the best practice to reduce communication costs in human-in-the-loop learning with minimal annotator effort? Existing FAL methods typically rely on iterative annotation processes that separate active sampling from federated updates, leading to multiple rounds of expensive communication and annotation. In response, we introduce FAST, a two-pass FAL framework that harnesses foundation models for weak labeling in a preliminary pass, followed by a refinement pass focused exclusively on the most uncertain samples. By leveraging representation knowledge from foundation models and integrating refinement steps into a streamlined workflow, FAST substantially reduces the overhead incurred by iterative active sampling. Extensive experiments on diverse medical and natural image benchmarks demonstrate that FAST outperforms existing FAL methods by an average of 4.36% while reducing communication rounds eightfold under a limited 5% labeling budget.
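FAST 的两阶段思路是:第一遍用基础模型给出弱标签,第二遍只把最不确定的样本送去精标,从而在 5% 预算内完成标注。下面用熵作为不确定性的一个常见度量写出笔者的示意(FAST 的具体打分方式未必如此,数值与函数名均为假设):

```python
import numpy as np

def entropy(p):
    return -(p * np.log(np.clip(p, 1e-12, 1.0))).sum(axis=-1)

def two_pass_selection(probs, budget):
    """两阶段示意: 第一遍由基础模型概率产生弱标签,
    第二遍只挑熵最高 (最不确定) 的样本进行人工精标。"""
    weak_labels = probs.argmax(axis=1)
    unc = entropy(probs)
    refine_idx = np.argsort(unc)[-budget:]     # 不确定性最高的样本
    return weak_labels, refine_idx

rng = np.random.default_rng(0)
n, c = 100, 5
logits = rng.standard_normal((n, c))
logits[:10] *= 5.0                             # 前 10 个样本置信度很高
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
budget = int(0.05 * n)                         # 5% 标注预算
weak_labels, refine_idx = two_pass_selection(probs, budget)
unc = entropy(probs)
```

与迭代式主动采样不同,这里的选择一次完成,不需要多轮"采样-标注-联邦更新"的通信往返。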
zh

[CV-164] A Study on Adversarial Robustness of Discriminative Prototypical Learning

【速读】:该论文旨在解决深度神经网络在对抗性扰动下的显著脆弱性问题,特别是在关键应用中的潜在风险,并指出当前对抗性训练方法虽提升了模型的鲁棒性,但通常以牺牲原始干净数据上的准确性为代价。此外,现有方法未充分利用潜在空间中的几何结构。为应对这些问题,论文提出了一种名为“对抗性深度正负原型(Adv-DPNP)”的新框架。其关键在于将基于判别原型的学习与对抗性训练相结合,通过统一的类原型同时作为分类器权重和鲁棒锚点,增强潜在空间内的类内紧凑性和类间分离度。此外,双分支训练机制确保原型仅用干净数据更新以保持稳定性,而特征提取层则结合干净和对抗样本进行学习以抵抗扰动。论文还设计了一个复合损失函数,结合正原型对齐、负原型排斥及一致性正则化,进一步提升判别能力、对抗鲁棒性以及清洁数据精度。实验验证表明,Adv-DPNP在标准基准数据集上优于现有技术,实现了更高的清洁数据准确率和竞争性的鲁棒性。

链接: https://arxiv.org/abs/2504.03782
作者: Ramin Zarei Sabzevar,Hamed Mohammadzadeh,Tahmineh Tavakoli,Ahad Harati
机构: University of Mashhad (UM)(马什哈德大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep neural networks demonstrate significant vulnerability to adversarial perturbations, posing risks for critical applications. Current adversarial training methods predominantly focus on robustness against attacks without explicitly leveraging geometric structures in the latent space, usually resulting in reduced accuracy on the original clean data. To address these issues, we propose a novel adversarial training framework named Adversarial Deep Positive-Negative Prototypes (Adv-DPNP), which integrates discriminative prototype-based learning with adversarial training. Adv-DPNP uses unified class prototypes serving dual roles as classifier weights and robust anchors, enhancing both intra-class compactness and inter-class separation in the latent space. Moreover, a novel dual-branch training mechanism maintains stable prototypes by updating them exclusively with clean data; while the feature extractor layers are learned using both clean and adversarial data to remain invariant against adversarial perturbations. In addition, our approach utilizes a composite loss function combining positive prototype alignment, negative prototype repulsion, and consistency regularization to further enhance discrimination, adversarial robustness, and clean accuracy. Extensive experiments conducted on standard benchmark datasets confirm the effectiveness of Adv-DPNP compared to state-of-the-art methods, achieving higher clean accuracy and competitive robustness under adversarial perturbations and common corruptions. Our code is available at this https URL
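复合损失中的"正原型对齐 + 负原型排斥"两项可以写成一个简单的基于距离的损失:拉近特征与本类原型,同时以带间隔的 hinge 项推远最近的他类原型。以下是笔者的示意实现(论文的具体损失形式、间隔与一致性正则项可能不同,数值均为假设):

```python
import numpy as np

def dpnp_loss(z, y, prototypes, margin=1.0):
    """正负原型复合损失示意: 正项拉近特征与本类原型,
    负项以带间隔的 hinge 推远最近的他类原型。"""
    n = z.shape[0]
    # 特征到各类原型的欧氏距离矩阵 (n, k)
    d = np.linalg.norm(z[:, None, :] - prototypes[None, :, :], axis=-1)
    pos = d[np.arange(n), y]                        # 与本类 (正) 原型的距离
    d_neg = d.copy()
    d_neg[np.arange(n), y] = np.inf
    neg = d_neg.min(axis=1)                         # 最近的负原型距离
    repulse = np.maximum(0.0, margin + pos - neg)   # hinge 式排斥项
    return (pos + repulse).mean()

rng = np.random.default_rng(0)
prototypes = np.array([[0.0, 0.0], [4.0, 4.0]])     # 两个类的统一原型
y = np.array([0, 0, 1, 1])
z_good = prototypes[y] + 0.1 * rng.standard_normal((4, 2))       # 对齐良好
z_bad = prototypes[1 - y] + 0.1 * rng.standard_normal((4, 2))    # 靠近错误原型
loss_good = dpnp_loss(z_good, y, prototypes)
loss_bad = dpnp_loss(z_bad, y, prototypes)
```

特征靠近本类原型且远离他类原型时损失很小,反之显著增大;在完整框架中原型只用干净数据更新,特征提取器则同时接受干净与对抗样本。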
zh
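下面用NumPy给出"正原型对齐 + 负原型排斥"这类复合原型损失的概念性示意。其中的欧氏距离度量、hinge式排斥项与 `margin` 取值均为本文的演示用假设,并非论文官方实现:

```python
import numpy as np

def prototype_loss(features, labels, prototypes, margin=1.0):
    """原型损失示意:拉近样本与其类原型(正项),
    并以带 margin 的 hinge 推远样本与其它类原型(负项)。"""
    n = features.shape[0]
    # 样本到每个类原型的欧氏距离矩阵,形状 (n, num_classes)
    dists = np.linalg.norm(features[:, None, :] - prototypes[None, :, :], axis=-1)
    pos = dists[np.arange(n), labels]                # 到正原型的距离
    neg_mask = np.ones_like(dists, dtype=bool)
    neg_mask[np.arange(n), labels] = False
    neg = dists[neg_mask].reshape(n, -1)             # 到各负原型的距离
    repulsion = np.maximum(0.0, margin - neg).mean(axis=1)
    return (pos + repulsion).mean()

# 玩具示例:两类、二维特征,原型相距很远
protos = np.array([[0.0, 0.0], [10.0, 0.0]])
x = np.array([[0.1, 0.0], [9.8, 0.1]])
y = np.array([0, 1])
loss = prototype_loss(x, y, protos)
```

当特征落在正确原型附近时损失接近零;若把标签交换,正项距离骤增,损失随之变大,这正是原型作为"鲁棒锚点"的直观含义。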

[CV-165] Improved visual-information-driven model for crowd simulation and its modular application

【速读】:该论文旨在解决数据驱动人群仿真模型在跨场景通用性(Generalizability)方面的不足,当前大多数数据驱动方法仅针对单一场景设计,鲜有模型能在两个以上场景中表现良好。论文的关键在于有效且准确地捕捉不同场景下支配行人导航的核心共同影响特征,特别是通过改进视觉信息提取方法及出口线索来增强模型的泛化能力。研究验证了所提模型在瓶颈、走廊、拐角和T字路口四种基本模块中的表现,并通过与经典知识驱动模型对比以及复合场景应用结果表明,该模型不仅在单场景中性能优异,还能较好匹配真实世界的人群运动模式,在复合场景中的仿真轨迹和基本图(fundamental diagram)也显示出与实际模式的高度一致性。

链接: https://arxiv.org/abs/2504.03758
作者: Xuanwen Liang,Jiayu Chen,Eric Wai Ming Lee,Wei Xie
机构: Department of Architecture and Civil Engineering (建筑与土木工程系), City University of Hong Kong (香港城市大学), Hong Kong; Department of Construction Management (建设管理系), Tsinghua University (清华大学), Beijing, China; Sichuan University-The Hong Kong Polytechnic University Institute for Disaster Management and Reconstruction (四川大学-香港理工大学灾害管理与重建研究所), Sichuan University (四川大学), Chengdu, China
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Data-driven crowd simulation models offer advantages in enhancing the accuracy and realism of simulations, and improving their generalizability is essential for promoting application. Current data-driven approaches are primarily designed for a single scenario, with very few models validated across more than two scenarios. It is still an open question to develop data-driven crowd simulation models with strong generalizability. We notice that the key to addressing this challenge lies in effectively and accurately capturing the core common influential features that govern pedestrians’ navigation across diverse scenarios. Particularly, we believe that visual information is one of the most dominant influencing features. In light of this, this paper proposes a data-driven model incorporating a refined visual information extraction method and exit cues to enhance generalizability. The proposed model is examined on four common fundamental modules: bottleneck, corridor, corner and T-junction. The evaluation results demonstrate that our model performs excellently across these scenarios, aligning with pedestrian movement in real-world experiments, and significantly outperforms the classical knowledge-driven model. Furthermore, we introduce a modular approach to apply our proposed model in composite scenarios, and the results regarding trajectories and fundamental diagrams indicate that our simulations closely match real-world patterns in the composite scenario. The research outcomes can provide inspiration for the development of data-driven crowd simulation models with high generalizability and advance the application of data-driven approaches.
zh

[CV-166] Semi-Self Representation Learning for Crowdsourced WiFi Trajectories

【速读】:本文旨在解决基于WiFi指纹的轨迹(Trajectory-based)定位中大规模无标注WiFi轨迹数据自动标注的问题。传统方法依赖于大量人工标注的数据,而轨迹数据集的规模通常随物理环境大小呈指数级增长,导致人工标注代价高昂。为缓解这一问题,论文提出了一种半自表示学习(semi-self representation learning)方案,其关键是通过一种基于“剪切与翻转”(“cut-and-flip”)增强策略的中间相遇(meet-in-the-middle)范式,利用小规模标注数据集 $\tilde{C}$ 来自动标注大规模无标注数据集 $C$。该方案首先通过两阶段学习对未标注数据进行轨迹嵌入和端点嵌入,然后借助标注数据集 $\tilde{C}$ 的辅助完成表示学习,并将其与神经网络定位模型结合。此方法显著降低了轨迹定位中的人工标注需求,同时保持了较高的定位精度。

链接: https://arxiv.org/abs/2504.03756
作者: Yu-Lin Kuo,Yu-Chee Tseng,Ting-Hui Chiang,Yan-Ann Chen
机构: Department of Computer Science, National Yang Ming Chiao Tung University (国立阳明交通大学计算机科学系); Advanced Technology Laboratory, Chunghwa Telecom Laboratories (中华电信先进技术实验室); Department of Computer Science and Engineering, Yuan Ze University (元智大学计算机科学与工程系)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by VTC2025-Spring

点击查看摘要

Abstract:WiFi fingerprint-based localization has been studied intensively. Point-based solutions rely on position annotations of WiFi fingerprints. Trajectory-based solutions, however, require end-position annotations of WiFi trajectories, where a WiFi trajectory is a multivariate time series of signal features. A trajectory dataset is much larger than a pointwise dataset as the number of potential trajectories in a field may grow exponentially with respect to the size of the field. This work presents a semi-self representation learning solution, where a large dataset C of crowdsourced unlabeled WiFi trajectories can be automatically labeled by a much smaller dataset \tilde C of labeled WiFi trajectories. The size of \tilde C only needs to be proportional to the size of the physical field, while the unlabeled C could be much larger. This is made possible through a novel "cut-and-flip" augmentation scheme based on the meet-in-the-middle paradigm. A two-stage learning consisting of trajectory embedding followed by endpoint embedding is proposed for the unlabeled C. Then the learned representations are labeled by \tilde C and connected to a neural-based localization network. The result, while delivering promising accuracy, significantly relieves the burden of human annotations for trajectory-based localization.
zh
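论文的"剪切与翻转"增强基于"中间相遇"思想。以下是一个假设性的NumPy示意:把一条轨迹在某个位置剪断,并将后半段沿时间维翻转,得到两条"从两端走向剪切点"的子轨迹;端点标签的具体处理方式为本文假设,仅供理解这一范式:

```python
import numpy as np

def cut_and_flip(traj, cut):
    """traj: 形状 (T, d) 的多变量时间序列;cut: 剪切位置 (0 < cut < T)。
    返回前半段与时间反转后的后半段,二者在剪切点"相遇"。"""
    head = traj[:cut]
    tail = traj[cut:][::-1]   # 翻转后半段,使其同样"走向"剪切点
    return head, tail

traj = np.arange(12, dtype=float).reshape(6, 2)  # 玩具轨迹:T=6, d=2
head, tail = cut_and_flip(traj, 4)
```

这样一条众包轨迹即可派生出多对"共享中间相遇点"的训练样本,用于后续的轨迹嵌入与端点嵌入学习。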

[CV-167] Attention in Diffusion Model: A Survey

【速读】:该论文旨在系统性地分析注意力机制在扩散模型中的作用、设计模式及其操作方式,特别是在不同模态和任务中的表现。论文的关键在于提出了一种统一的分类法,将与注意力相关的改进按其影响的结构组件进行分类,以清晰地展示其功能多样性。此外,论文探讨了注意力机制如何提升多种应用中的性能,并指出了当前存在的局限性和未充分探索的研究领域,同时提出了未来研究的方向。通过这些工作,论文为理解扩散模型的发展提供了有价值的洞见,特别强调了注意力机制在其整合性和普遍性方面的重要作用。

链接: https://arxiv.org/abs/2504.03738
作者: Litao Hua,Fan Liu,Jie Su,Xingyu Miao,Zizhou Ouyang,Zeyu Wang,Runze Hu,Zhenyu Wen,Bing Zhai,Yang Long,Haoran Duan,Yuan Zhou
机构: School of Artificial Intelligence, Nanjing University of Information Science and Technology, China (南京信息工程大学人工智能学院,中国); Department of Automation, Tsinghua University (清华大学自动化系); Institute of Cyberspace Security and College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China (浙江工业大学网络空间安全学院与信息工程学院,杭州 310023,中国); University of Edinburgh, UK (英国爱丁堡大学); Beijing Institute of Technology (北京理工大学); College of Computer Science and Engineering, Dalian Minzu University (大连民族大学计算机科学与工程学院); Department of Computer Science, Durham University, UK (英国杜伦大学计算机科学系); Department of Computer and Information Sciences, Northumbria University, Newcastle Upon Tyne, UK (英国诺桑比亚大学计算机与信息科学系)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Attention mechanisms have become a foundational component in diffusion models, significantly influencing their capacity across a wide range of generative and discriminative tasks. This paper presents a comprehensive survey of attention within diffusion models, systematically analysing its roles, design patterns, and operations across different modalities and tasks. We propose a unified taxonomy that categorises attention-related modifications into parts according to the structural components they affect, offering a clear lens through which to understand their functional diversity. In addition to reviewing architectural innovations, we examine how attention mechanisms contribute to performance improvements in diverse applications. We also identify current limitations and underexplored areas, and outline potential directions for future research. Our study provides valuable insights into the evolving landscape of diffusion models, with a particular focus on the integrative and ubiquitous role of attention.
zh
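作为背景补充:该综述讨论的各类注意力模块大多以标准的缩放点积注意力为基础。下面用NumPy给出其定义 Attention(Q,K,V) = softmax(QKᵀ/√d_k)V 的最小示意(与该综述本身的代码无关):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # 数值稳定
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """标准缩放点积注意力:softmax(QK^T / sqrt(d_k)) V。"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

Q = np.eye(3)                                   # 3 个查询
K = np.eye(3)                                   # 3 个键
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out, w = scaled_dot_product_attention(Q, K, V)
```

扩散模型中的自注意力与交叉注意力只是 Q、K、V 来源不同(同一特征图,或文本等条件分支),核心计算与此一致。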

[CV-168] Uncertainty Propagation in XAI: A Comparison of Analytical and Empirical Estimators

【速读】:本文旨在解决可解释人工智能(XAI)中不确定性量化与解释的问题,以增强机器学习模型的可信度和决策可靠性。关键在于提出了一种统一框架,通过定义一个通用的解释函数 $e_\theta(x, f)$,系统性地捕捉输入数据扰动和模型参数变化等关键来源的不确定性传播。该框架结合解析估计和经验估计来评估解释的方差,从而量化不确定性对解释的影响。其中,解析估计采用一阶不确定性传播方法。通过在异构数据集上的全面评估,论文比较了不同估计方法的性能,并揭示了一些现有XAI方法在捕获和传递不确定性方面的不一致性。研究结果强调了高风险应用场景下不确定性感知解释的重要性,并为改进现有XAI方法提供了新视角。实验代码已公开于指定仓库。

链接: https://arxiv.org/abs/2504.03736
作者: Teodor Chiaburu,Felix Bießmann,Frank Haußer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 10 figures, accepted at WCXAI 2025 Istanbul

点击查看摘要

Abstract:Understanding uncertainty in Explainable AI (XAI) is crucial for building trust and ensuring reliable decision-making in Machine Learning models. This paper introduces a unified framework for quantifying and interpreting Uncertainty in XAI by defining a general explanation function e_\theta(x, f) that captures the propagation of uncertainty from key sources: perturbations in input data and model parameters. By using both analytical and empirical estimates of explanation variance, we provide a systematic means of assessing the impact of uncertainty on explanations. We illustrate the approach using a first-order uncertainty propagation as the analytical estimator. In a comprehensive evaluation across heterogeneous datasets, we compare analytical and empirical estimates of uncertainty propagation and evaluate their robustness. Extending previous work on inconsistencies in explanations, our experiments identify XAI methods that do not reliably capture and propagate uncertainty. Our findings underscore the importance of uncertainty-aware explanations in high-stakes applications and offer new insights into the limitations of current XAI methods. The code for the experiments can be found in our repository at this https URL
zh
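一阶(delta 方法)不确定性传播把解释函数在输入处线性化:Cov[e(x)] ≈ JΣJᵀ,其中 J 为解释对输入的雅可比、Σ 为输入协方差。以下NumPy示意用数值雅可比实现这一通用公式;其中的玩具解释函数 `explain` 为本文假设(线性情形下一阶传播是精确的,便于核对):

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-5):
    """中心差分估计向量值函数的雅可比:J[i, j] = df_i/dx_j。"""
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        d = np.zeros_like(x)
        d[j] = eps
        J[:, j] = (f(x + d) - f(x - d)) / (2 * eps)
    return J

def propagate_uncertainty(f, x, input_cov):
    """一阶传播:Cov[f(x)] ≈ J @ Sigma @ J^T。"""
    J = numerical_jacobian(f, x)
    return J @ input_cov @ J.T

# 玩具解释函数 e(x) = A x;输入协方差取单位阵
A = np.array([[2.0, 0.0], [0.0, 3.0]])
explain = lambda x: A @ x
cov = propagate_uncertainty(explain, np.array([1.0, 1.0]), np.eye(2))
```

对非线性解释函数,同一公式给出的是局部线性近似,这正是论文中"解析估计"与"经验估计(反复采样扰动)"需要对比的原因。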

[CV-169] Scalable heliostat surface predictions from focal spots: Sim-to-Real transfer of inverse Deep Learning Raytracing

【速读】:该论文旨在解决集中式太阳能发电(CSP)电站中定日镜表面实际误差与理想假设之间的不匹配问题,这直接影响了接收器上的太阳辐照分布精度及系统整体性能和安全性。传统方法难以对大量定日镜表面进行实际测量,因此控制系统的优化通常基于理想的定日镜表面假设,导致性能下降和潜在的安全隐患。

解决方案的关键在于引入了一种名为逆向深度学习光线追踪(inverse Deep Learning Raytracing, iDLR)的新方法,通过分析标准校准过程中记录的目标图像来推断定日镜的真实表面轮廓。本文实现了iDLR从仿真到现实场景的首次迁移,并验证了其在真实运行条件下对63台定日镜的预测能力,其中iDLR的表面预测中值平均绝对误差(MAE)为0.17毫米,在84%的情况下与斜率计实测结果高度一致。此外,当用于光线追踪模拟时,iDLR能够以90%的平均精度预测辐照密度,比常用的理想定日镜表面假设提升了26%。研究还证明了iDLR在面对未见过的太阳位置和接收器投影的双重外推场景中仍能保持高精度,体现了其良好的泛化能力。这一成果表明,iDLR是一种可扩展、自动化且成本效益高的工具,可用于将真实的定日镜表面模型集成到数字孪生系统中,从而提升未来CSP电站的辐照控制精度、性能建模精确度以及整体效率和安全性。

链接: https://arxiv.org/abs/2504.03712
作者: Jan Lewen,Max Pargmann,Jenia Jitsev,Mehdi Cherti,Robert Pitz-Paal,Daniel Maldonado Quinto
机构: German Aerospace Center (DLR)(德国航空航天中心); Research Center Jülich (于利希研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Concentrating Solar Power (CSP) plants are a key technology in the transition toward sustainable energy. A critical factor for their safe and efficient operation is the distribution of concentrated solar flux on the receiver. However, flux distributions from individual heliostats are sensitive to surface imperfections. Measuring these surfaces across many heliostats remains impractical in real-world deployments. As a result, control systems often assume idealized heliostat surfaces, leading to suboptimal performance and potential safety risks. To address this, inverse Deep Learning Raytracing (iDLR) has been introduced as a novel method for inferring heliostat surface profiles from target images recorded during standard calibration procedures. In this work, we present the first successful Sim-to-Real transfer of iDLR, enabling accurate surface predictions directly from real-world target images. We evaluate our method on 63 heliostats under real operational conditions. iDLR surface predictions achieve a median mean absolute error (MAE) of 0.17 mm and show good agreement with deflectometry ground truth in 84% of cases. When used in raytracing simulations, it enables flux density predictions with a mean accuracy of 90% compared to deflectometry over our dataset, and outperforms the commonly used ideal heliostat surface assumption by 26%. We tested this approach in a challenging double-extrapolation scenario, involving unseen sun positions and receiver projection, and found that iDLR maintains high predictive accuracy, highlighting its generalization capabilities. Our results demonstrate that iDLR is a scalable, automated, and cost-effective solution for integrating realistic heliostat surface models into digital twins. This opens the door to improved flux control, more precise performance modeling, and ultimately, enhanced efficiency and safety in future CSP plants.
zh

[CV-170] Semi-supervised learning for marine anomaly detection on board satellites

【速读】:该论文旨在解决海洋异常检测中因标注数据不足导致的高成本和低效率问题。由于海洋环境中的异常(如海洋垃圾、有害藻华等)需要专家标注,获取大规模标注数据极具挑战性,而传统的全监督学习方法依赖于大量标注数据。为应对这一难题,论文提出利用半监督学习方法(如FixMatch算法)来充分利用未标注数据以提升模型性能。解决方案的关键在于通过半监督学习在有限标注数据下设置合适的置信度阈值(0.9),并在不同标注数据量条件下评估其与全监督模型的性能差异,最终验证了在标注数据有限时,半监督模型优于全监督模型,而在标注数据充足时,全监督模型略胜一筹。

链接: https://arxiv.org/abs/2504.03705
作者: Luca Marini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Master’s project

点击查看摘要

Abstract:Aquatic bodies face numerous environmental threats caused by several marine anomalies. Marine debris can devastate habitats and endanger marine life through entanglement, while harmful algal blooms can produce toxins that negatively affect marine ecosystems. Additionally, ships may discharge oil or engage in illegal and overfishing activities, causing further harm. These marine anomalies can be identified by applying trained deep learning (DL) models on multispectral satellite imagery. Furthermore, the detection of other anomalies, such as clouds, could be beneficial in filtering out irrelevant images. However, DL models often require a large volume of labeled data for training, which can be both costly and time-consuming, particularly for marine anomaly detection where expert annotation is needed. A potential solution is the use of semi-supervised learning methods, which can also utilize unlabeled data. In this project, we implement and study the performance of FixMatch for Semantic Segmentation, a semi-supervised algorithm for semantic segmentation. Firstly, we found that semi-supervised models perform best with a high confidence threshold of 0.9 when there is a limited amount of labeled data. Secondly, we compare the performance of semi-supervised models with fully-supervised models under varying amounts of labeled data. Our findings suggest that semi-supervised models outperform fully-supervised models with limited labeled data, while fully-supervised models have a slightly better performance with larger volumes of labeled data. We propose two hypotheses to explain why fully-supervised models surpass semi-supervised ones when a high volume of labeled data is used. All of our experiments were conducted using a U-Net model architecture with a limited number of parameters to ensure compatibility with space-rated hardware.
zh
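FixMatch 的核心步骤是:对无标注样本在弱增强下的预测,仅当最大类别置信度超过阈值(本研究发现 0.9 最优)时才生成伪标签,用于监督强增强分支。下面用NumPy示意这一筛选步骤(非该项目源码,分割任务中同样的逻辑按像素应用):

```python
import numpy as np

def fixmatch_pseudo_labels(probs, threshold=0.9):
    """probs: (N, C) 弱增强下的类别概率。
    返回伪标签与掩码:仅置信度达到阈值的样本参与无监督损失。"""
    confidence = probs.max(axis=1)
    pseudo = probs.argmax(axis=1)
    mask = confidence >= threshold
    return pseudo, mask

probs = np.array([
    [0.95, 0.05],   # 高置信 -> 保留
    [0.60, 0.40],   # 低置信 -> 丢弃
    [0.08, 0.92],   # 高置信 -> 保留
])
pseudo, mask = fixmatch_pseudo_labels(probs, threshold=0.9)
```

阈值越高,伪标签噪声越少但可用样本越少;论文结论表明在标注数据很少时,高阈值(0.9)的取舍更有利。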

[CV-171] PointSplit: Towards On-device 3D Object Detection with Heterogeneous Low-power Accelerators

【速读】:该论文旨在解决在多加速器边缘设备上实现高效3D目标检测的问题。随着边缘设备配备多种低功耗加速器(如移动GPU和NPU),如何充分利用这些异构计算资源以应对传统单加速器设备难以处理的高负载任务成为新的挑战。论文提出了一种名为PointSplit的新型3D目标检测框架,其关键是通过以下创新设计解决相关技术难题:(1) 基于2D语义信息的偏置点采样;(2) 并行化3D特征提取;以及(3) 基于角色分组的量级量化。实验结果表明,PointSplit在保持相似精度的前提下,比基于GPU的全精度2D-3D融合方法快24.7倍。

链接: https://arxiv.org/abs/2504.03654
作者: Keondo Park,You Rim Choi,Inhoe Lee,Hyung-Sin Kim
机构: Graduate School of Data Science, Seoul National University (首尔国立大学数据科学研究生院)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Running deep learning models on resource-constrained edge devices has drawn significant attention due to its fast response, privacy preservation, and robust operation regardless of Internet connectivity. While these devices already cope with various intelligent tasks, the latest edge devices that are equipped with multiple types of low-power accelerators (i.e., both mobile GPU and NPU) can bring another opportunity; a task that used to be too heavy for an edge device in the single-accelerator world might become viable in the upcoming heterogeneous-accelerator world. To realize the potential in the context of 3D object detection, we identify several technical challenges and propose PointSplit, a novel 3D object detection framework for multi-accelerator edge devices that addresses the problems. Specifically, our PointSplit design includes (1) 2D semantics-aware biased point sampling, (2) parallelized 3D feature extraction, and (3) role-based group-wise quantization. We implement PointSplit on TensorFlow Lite and evaluate it on a customized hardware platform comprising both mobile GPU and EdgeTPU. Experimental results on representative RGB-D datasets, SUN RGB-D and Scannet V2, demonstrate that PointSplit on a multi-accelerator device is 24.7 times faster with similar accuracy compared to the full-precision, 2D-3D fusion-based 3D detector on a GPU-only device.
zh
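"2D语义感知偏置点采样"的大意是:利用2D语义分割结果,提高投影落在前景区域内的3D点被采样的概率。以下NumPy示意按此理解编写,其中的权重设置(前景权重 `fg_weight`)为本文的演示用假设,并非论文的实际参数:

```python
import numpy as np

def biased_point_sampling(points, pixel_uv, semantic_mask, n_samples,
                          fg_weight=4.0, rng=None):
    """points: (N,3) 点云;pixel_uv: (N,2) 各点投影到图像的整数坐标 (u,v);
    semantic_mask: (H,W) 0/1 前景掩码。前景点以更高权重被无放回采样。"""
    rng = rng or np.random.default_rng(0)
    fg = semantic_mask[pixel_uv[:, 1], pixel_uv[:, 0]].astype(float)
    w = 1.0 + (fg_weight - 1.0) * fg     # 背景权重 1,前景权重 fg_weight
    p = w / w.sum()
    idx = rng.choice(len(points), size=n_samples, replace=False, p=p)
    return points[idx]

# 玩具示例:100 个点,前 20 个投影在前景像素 (1,1) 内
pts = np.random.default_rng(1).normal(size=(100, 3))
uv = np.zeros((100, 2), dtype=int)
uv[:20] = [1, 1]
mask = np.zeros((4, 4))
mask[1, 1] = 1
sampled = biased_point_sampling(pts, uv, mask, n_samples=30)
```

这样下采样后的点云把有限的计算预算更多留给可能属于目标物体的区域。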

[CV-172] Universal Lymph Node Detection in Multiparametric MRI with Selective Augmentation

【速读】:该论文旨在解决在多参数磁共振成像(multiparametric MRI, mpMRI)中淋巴结(Lymph Nodes, LNs)定位不准确以及小而潜在转移性淋巴结易被遗漏的问题,这些问题影响了淋巴结病理性评估及后续癌症分期的准确性。论文的关键解决方案在于提出了一种通用的检测流程,利用最近提出的VFNet神经网络,在不同扫描设备和检查协议下的T2脂肪抑制序列与弥散加权成像(Diffusion Weighted Imaging, DWI)中识别良性与转移性淋巴结,从而实现后续精确测量。此外,通过引入选择性增强技术——Intra-Label LISA(ILL),提高了模型在训练阶段对输入样本多样性的适应能力,增强了其在评估阶段的鲁棒性。实验结果显示,使用ILL后灵敏度达到约83%,相比未使用ILL提升了约3个百分点,并在每体积假阳性为4个的情况下比现有基于mpMRI的淋巴结检测方法提高了约9%的灵敏度。

链接: https://arxiv.org/abs/2504.05196
作者: Tejas Sudharshan Mathai,Sungwon Lee,Thomas C. Shen,Zhiyong Lu,Ronald M. Summers
机构: Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Radiology and Imaging Sciences, Clinical Center, National Institutes of Health (国立卫生研究院) ; National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health (国立卫生研究院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Published at SPIE Medical Imaging 2023

点击查看摘要

Abstract:Robust localization of lymph nodes (LNs) in multiparametric MRI (mpMRI) is critical for the assessment of lymphadenopathy. Radiologists routinely measure the size of LN to distinguish benign from malignant nodes, which would require subsequent cancer staging. Sizing is a cumbersome task compounded by the diverse appearances of LNs in mpMRI, which renders their measurement difficult. Furthermore, smaller and potentially metastatic LNs could be missed during a busy clinical day. To alleviate these imaging and workflow problems, we propose a pipeline to universally detect both benign and metastatic nodes in the body for their ensuing measurement. The recently proposed VFNet neural network was employed to identify LN in T2 fat suppressed and diffusion weighted imaging (DWI) sequences acquired by various scanners with a variety of exam protocols. We also use a selective augmentation technique known as Intra-Label LISA (ILL) to diversify the input data samples the model sees during training, such that it improves its robustness during the evaluation phase. We achieved a sensitivity of ~83% with ILL vs. ~80% without ILL at 4 FP/vol. Compared with current LN detection approaches evaluated on mpMRI, we show a sensitivity improvement of ~9% at 4 FP/vol.
zh
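Intra-Label LISA(ILL)可理解为"只在同标签样本之间做插值"的选择性增强:在不改变标签语义的前提下增加输入多样性。下面的NumPy示意按此理解编写,插值系数 `lam` 的采样区间为本文假设,并非论文设定:

```python
import numpy as np

def intra_label_mix(images, labels, rng=None):
    """对每个样本,从同一标签的样本中随机取一个伙伴,按 lam 线性插值。
    因伙伴同标签,标签保持不变。"""
    rng = rng or np.random.default_rng(0)
    mixed = images.copy()
    for i in range(len(images)):
        same = np.flatnonzero(labels == labels[i])  # 同标签候选(含自身)
        j = rng.choice(same)
        lam = rng.uniform(0.5, 1.0)                 # 偏向保留原样本
        mixed[i] = lam * images[i] + (1 - lam) * images[j]
    return mixed, labels

imgs = np.stack([np.zeros((4, 4)), np.ones((4, 4)),
                 np.zeros((4, 4)) + 2.0, np.ones((4, 4)) * 3.0])
labs = np.array([0, 1, 0, 1])
mixed, out_labs = intra_label_mix(imgs, labs)
```

由于插值只发生在同标签样本之间,增强后的样本仍位于该类样本张成的范围内,这是它比普通 mixup 更"保守"的原因。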

[CV-173] Explainability of AI Uncertainty: Application to Multiple Sclerosis Lesion Segmentation on MRI

【速读】:本文旨在解决医学影像领域中预测不确定性在临床解释性和信息性方面的理解局限问题,特别是在多发性硬化症(MS)皮质病变分割任务中的深层集成模型的预测不确定性来源分析。论文的关键解决方案是提出了一种新颖的框架,用于解析深层集成模型中皮质病变分割任务的预测不确定性潜在来源,并将研究重点从不确定性与误差的关系转移到相关的医学和工程因素上。通过分析发现,样本级别的不确定性与病变大小、形状以及皮质受累程度密切相关,且专家标注者反馈验证了这些因素同样影响标注者的信心。此外,在两个数据集(包含206名患者近2000个病灶)的实验评估表明,该框架在域内及分布偏移场景下均具有实用价值。

链接: https://arxiv.org/abs/2504.04814
作者: Nataliia Molchanova,Pedro M. Gordaliza,Alessandro Cagol,Mario Ocampo–Pineda,Po–Jui Lu,Matthias Weigel,Xinjie Chen,Erin S. Beck,Haris Tsagkas,Daniel Reich,Anna Stölting,Pietro Maggi,Delphine Ribes,Adrien Depeursinge,Cristina Granziera,Henning Müller,Meritxell Bach Cuadra
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Trustworthy artificial intelligence (AI) is essential in healthcare, particularly for high-stakes tasks like medical image segmentation. Explainable AI and uncertainty quantification significantly enhance AI reliability by addressing key attributes such as robustness, usability, and explainability. Despite extensive technical advances in uncertainty quantification for medical imaging, understanding the clinical informativeness and interpretability of uncertainty remains limited. This study introduces a novel framework to explain the potential sources of predictive uncertainty, specifically in cortical lesion segmentation in multiple sclerosis using deep ensembles. The proposed analysis shifts the focus from the uncertainty-error relationship towards relevant medical and engineering factors. Our findings reveal that instance-wise uncertainty is strongly related to lesion size, shape, and cortical involvement. Expert rater feedback confirms that similar factors impede annotator confidence. Evaluations conducted on two datasets (206 patients, almost 2000 lesions) under both in-domain and distribution-shift conditions highlight the utility of the framework in different scenarios.
zh
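深层集成的逐样本(instance-wise)不确定性通常用成员预测的离散度来刻画,例如集成均值概率的预测熵。以下NumPy示意给出二分类(前景/背景)像素级的常见算法,仅为该类方法的通用演示,不代表论文采用的具体度量:

```python
import numpy as np

def ensemble_uncertainty(member_probs, eps=1e-12):
    """member_probs: (M, N) M 个集成成员对 N 个像素的前景概率。
    返回集成均值概率与二分类预测熵(熵越大越不确定)。"""
    p = member_probs.mean(axis=0)
    entropy = -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))
    return p, entropy

members = np.array([
    [0.90, 0.5, 0.10],
    [0.80, 0.4, 0.20],
    [0.95, 0.6, 0.05],
])
p_mean, H = ensemble_uncertainty(members)
```

论文的分析正是把这类逐像素/逐病灶的不确定性与病变大小、形状和皮质受累程度等因素关联起来,而非仅与分割误差关联。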

[CV-174] Vision Transformers with Autoencoders and Explainable AI for Cancer Patient Risk Stratification Using Whole Slide Imaging

【速读】:该论文旨在解决病理学图像(Whole Slide Imaging, WSI)在癌症诊断和预后中的特征提取不充分以及模型可解释性不足的问题,以提高基于WSI的患者分层和风险预测的临床实用性。论文的关键在于提出了一种名为PATH-X的框架,通过整合Vision Transformers (ViT) 和Autoencoders进行特征提取与压缩,并结合SHAP (Shapley Additive Explanations) 提供模型解释能力。具体而言,PATH-X利用Google预训练的ViT从WSI中提取数值特征嵌入,经由Autoencoder压缩后用于无监督聚类与分类任务,并通过Kaplan-Meier生存分析评估分层效果。同时,SHAP用于识别关键贡献特征并映射回组织病理切片,提供空间语境。这一方案的核心创新在于提升了模型的可解释性与临床应用潜力,但其性能仍受限于数据规模,尤其是在数据量有限的情况下如肺癌。

链接: https://arxiv.org/abs/2504.04749
作者: Ahmad Hussein,Mukesh Prasad,Ali Braytee
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages

点击查看摘要

Abstract:Cancer remains one of the leading causes of mortality worldwide, necessitating accurate diagnosis and prognosis. Whole Slide Imaging (WSI) has become an integral part of clinical workflows with advancements in digital pathology. While various studies have utilized WSIs, their extracted features may not fully capture the most relevant pathological information, and their lack of interpretability limits clinical adoption. In this paper, we propose PATH-X, a framework that integrates Vision Transformers (ViT) and Autoencoders with SHAP (Shapley Additive Explanations) to enhance model explainability for patient stratification and risk prediction using WSIs from The Cancer Genome Atlas (TCGA). A representative image slice is selected from each WSI, and numerical feature embeddings are extracted using Google’s pre-trained ViT. These features are then compressed via an autoencoder and used for unsupervised clustering and classification tasks. Kaplan-Meier survival analysis is applied to evaluate stratification into two and three risk groups. SHAP is used to identify key contributing features, which are mapped onto histopathological slices to provide spatial context. PATH-X demonstrates strong performance in breast and glioma cancers, where a sufficient number of WSIs enabled robust stratification. However, performance in lung cancer was limited due to data availability, emphasizing the need for larger datasets to enhance model reliability and clinical applicability.
zh
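PATH-X 用 Kaplan-Meier 生存分析评估风险分组。下面给出 KM 乘积极限估计量的最小纯 Python 实现示意(与论文代码无关),便于理解"分组后比较生存曲线"这一步;玩具数据为本文虚构:

```python
def kaplan_meier(times, events):
    """times: 各患者随访时间;events: 1=事件(如死亡), 0=删失。
    返回事件时间点上的 (t, S(t)) 列表(乘积极限估计)。
    注:Python 的 sorted 是稳定排序,同一时间点事件需排在删失之前。"""
    at_risk = len(times)
    order = sorted(range(len(times)), key=lambda i: times[i])
    surv, curve = 1.0, []
    for i in order:
        if events[i] == 1:
            surv *= (at_risk - 1) / at_risk   # 每发生一次事件,乘以存活比例
            curve.append((times[i], surv))
        at_risk -= 1                          # 事件或删失都使风险集减一
    return curve

# 玩具数据:5 名患者,时间单位为月
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

对高/低风险两组分别计算曲线后,再用 log-rank 等检验判断分层是否有统计学意义。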

[CV-175] Classification of ADHD and Healthy Children Using EEG Based Multi-Band Spatial Features Enhancement

【速读】:该论文旨在解决儿童注意缺陷多动障碍(ADHD)的早期准确诊断问题。解决方案的关键在于利用脑电图(EEG)信号的空间特征,并结合机器学习方法进行分类。研究通过提取来自19个通道的EEG数据的功率谱密度(PSD)和谱熵(SE)特征,构建了一个包含190维特征的综合特征集。采用支持向量机(SVM)与径向基函数(RBF)核进行分类,实现了平均交叉验证准确率高达99.2%的结果,证明了该方法在高精度和鲁棒性方面的有效性。这一成果展示了结合空间特征与机器学习技术在利用EEG数据准确识别ADHD方面的潜力,为开发非侵入性的数据驱动工具以实现儿童ADHD的早期诊断和评估提供了重要贡献。

链接: https://arxiv.org/abs/2504.04664
作者: Md Bayazid Hossain,Md Anwarul Islam Himel,Md Abdur Rahim,Shabbir Mahmood,Abu Saleh Musa Miah,Jungpil Shin
机构: Pabna University of Science and Technology (帕布纳科技大学); The University of Aizu (会津大学)
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Attention Deficit Hyperactivity Disorder (ADHD) is a common neurodevelopmental disorder in children, characterized by difficulties in attention, hyperactivity, and impulsivity. Early and accurate diagnosis of ADHD is critical for effective intervention and management. Electroencephalogram (EEG) signals have emerged as a non-invasive and efficient tool for ADHD detection due to their high temporal resolution and ability to capture neural dynamics. In this study, we propose a method for classifying ADHD and healthy children using EEG data from the benchmark dataset. There were 61 children with ADHD and 60 healthy children, both boys and girls, aged 7 to 12. The EEG signals, recorded from 19 channels, were processed to extract Power Spectral Density (PSD) and Spectral Entropy (SE) features across five frequency bands, resulting in a comprehensive 190-dimensional feature set. To evaluate the classification performance, a Support Vector Machine (SVM) with the RBF kernel demonstrated the best performance with a mean cross-validation accuracy of 99.2% and a standard deviation of 0.0079, indicating high robustness and precision. These results highlight the potential of spatial features in conjunction with machine learning for accurately classifying ADHD using EEG data. This work contributes to developing non-invasive, data-driven tools for early diagnosis and assessment of ADHD in children.
zh
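论文的特征为各频段的功率谱密度(PSD)与谱熵(SE)。下面用NumPy的周期图法给出这两类特征的最小示意(加窗方式、频段边界等细节为本文的演示用假设,真实流程会对 19 通道 × 5 频段分别计算,得到 190 维特征):

```python
import numpy as np

def psd_periodogram(x, fs):
    """简单加窗周期图:返回频率轴与功率谱密度估计。"""
    n = len(x)
    X = np.fft.rfft(x * np.hanning(n))
    psd = (np.abs(X) ** 2) / (fs * n)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd

def band_features(x, fs, band):
    """返回某频段内的平均 PSD 与谱熵 SE = -sum(p log p),
    其中 p 为该频段内归一化后的谱分布。"""
    freqs, psd = psd_periodogram(x, fs)
    sel = (freqs >= band[0]) & (freqs < band[1])
    p = psd[sel] / (psd[sel].sum() + 1e-12)
    se = -(p * np.log(p + 1e-12)).sum()
    return psd[sel].mean(), se

fs = 128
t = np.arange(fs * 2) / fs
x = np.sin(2 * np.pi * 10 * t)          # 10 Hz 正弦:能量集中在 alpha 段
alpha_power, alpha_se = band_features(x, fs, band=(8, 13))
theta_power, theta_se = band_features(x, fs, band=(4, 8))
```

这样得到的每频段 (PSD, SE) 对即可拼接成特征向量,输入 RBF 核 SVM 等分类器。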

[CV-176] Here Comes the Explanation: A Shapley Perspective on Multi-contrast Medical Image Segmentation

【速读】:该论文旨在解决深度学习在多对比度磁共振成像(MRI)分割任务中黑盒模型可解释性不足的问题。现有的后验像素级解释方法(如基于梯度和扰动的方法)在处理复杂多对比度MRI数据时存在局限性,其稀疏的解释结果临床相关性较低。论文的关键解决方案是引入对比度层面的Shapley值,解释以标准指标训练的最先进(state-of-the-art)脑肿瘤分割模型的行为,通过Shapley分析揭示不同模型(如U-Net和Swin-UNETR)在处理多对比度数据时的特性差异,从而提升模型的透明性和临床实用性。

链接: https://arxiv.org/abs/2504.04645
作者: Tianyi Ren,Juampablo Heras Rivera,Hitender Oswal,Yutong Pan,Agamdeep Chopra,Jacob Ruzevick,Mehmet Kurt
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning has been successfully applied to medical image segmentation, enabling accurate identification of regions of interest such as organs and lesions. This approach works effectively across diverse datasets, including those with single-image contrast, multi-contrast, and multimodal imaging data. To improve human understanding of these black-box models, there is a growing need for Explainable AI (XAI) techniques for model transparency and accountability. Previous research has primarily focused on post hoc pixel-level explanations, using gradient-based and perturbation-based approaches. These methods rely on gradients or perturbations to explain model predictions. However, these pixel-level explanations often struggle with the complexity inherent in multi-contrast magnetic resonance imaging (MRI) segmentation tasks, and the sparsely distributed explanations have limited clinical relevance. In this study, we propose using contrast-level Shapley values to explain state-of-the-art models trained on standard metrics used in brain tumor segmentation. Our results demonstrate that Shapley analysis provides valuable insights into different models’ behavior used for tumor segmentation. We demonstrated a bias for U-Net towards over-weighting T1-contrast and FLAIR, while Swin-UNETR provided a cross-contrast understanding with balanced Shapley distribution.
zh
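对比度层面的 Shapley 值把每个 MRI 序列(如 T1、T1c、T2、FLAIR)视为一个"参与者",在所有子集(联盟)上评估分割性能来分摊各序列的贡献。4 个对比度只有 2⁴ 个联盟,可以精确计算。下面用纯 Python 演示计算流程;其中的价值函数 `toy_value` 是随意构造的可加玩具函数(模拟"使用该对比度子集的 Dice 分数"),并非论文的实验数值:

```python
from itertools import combinations
from math import factorial

CONTRASTS = ["T1", "T1c", "T2", "FLAIR"]

def toy_value(coalition):
    """玩具价值函数:可加形式便于核对结果;真实场景中应为
    用该对比度子集重新评估模型得到的分割指标。"""
    base = {"T1": 0.10, "T1c": 0.30, "T2": 0.15, "FLAIR": 0.25}
    return sum(base[c] for c in coalition)

def shapley_values(players, value):
    """精确 Shapley 值:对每个参与者,按权重累加其在所有联盟上的边际贡献。"""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(S + (p,)) - value(S))
        phi[p] = total
    return phi

phi = shapley_values(CONTRASTS, toy_value)
```

Shapley 值满足效率性:各对比度的贡献之和等于"全部对比度"与"空集"的性能差,这使得跨模型的贡献分布可以直接比较。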

[CV-177] BrainMRDiff: A Diffusion Model for Anatomically Consistent Brain MRI Synthesis

【速读】:本文旨在解决在脑肿瘤诊断中因某些磁共振成像(MRI)序列获取受限(如运动伪影或对比剂禁忌症)而导致图像质量不佳的问题,这会进一步影响放射科医生的诊断。为应对这一挑战,研究提出了一种名为BrainMRDiff的新方法,这是一种基于拓扑保护和解剖引导的扩散模型,用于合成高质量的脑部MRI图像。关键在于引入了两个模块:肿瘤与结构聚合(Tumor+Structure Aggregation, TSA)和拓扑引导的解剖保持(Topology-Guided Anatomy Preservation, TGAP)。TSA模块整合了多种解剖结构与肿瘤信息,形成全面的条件机制以指导扩散过程;而TGAP模块则在逆向去噪扩散过程中强制执行拓扑一致性,确保生成的图像保持解剖完整性。实验结果表明,BrainMRDiff在BraTS-AG数据集上的性能提升了23.33%,在BraTS-Met数据集上的性能提升了33.33%。

链接: https://arxiv.org/abs/2504.04532
作者: Moinak Bhattacharya,Saumya Gupta,Annie Singh,Chao Chen,Gagandeep Singh,Prateek Prasanna
机构: Stony Brook University (石溪大学); Columbia University (哥伦比亚大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate brain tumor diagnosis relies on the assessment of multiple Magnetic Resonance Imaging (MRI) sequences. However, in clinical practice, the acquisition of certain sequences may be affected by factors like motion artifacts or contrast agent contraindications, leading to suboptimal outcome, such as poor image quality. This can then affect image interpretation by radiologists. Synthesizing high quality MRI sequences has thus become a critical research focus. Though recent advancements in controllable generative AI have facilitated the synthesis of diagnostic quality MRI, ensuring anatomical accuracy remains a significant challenge. Preserving critical structural relationships between different anatomical regions is essential, as even minor structural or topological inconsistencies can compromise diagnostic validity. In this work, we propose BrainMRDiff, a novel topology-preserving, anatomy-guided diffusion model for synthesizing brain MRI, leveraging brain and tumor anatomies as conditioning inputs. To achieve this, we introduce two key modules: Tumor+Structure Aggregation (TSA) and Topology-Guided Anatomy Preservation (TGAP). TSA integrates diverse anatomical structures with tumor information, forming a comprehensive conditioning mechanism for the diffusion process. TGAP enforces topological consistency during reverse denoising diffusion process; both these modules ensure that the generated image respects anatomical integrity. Experimental results demonstrate that BrainMRDiff surpasses existing baselines, achieving performance improvements of 23.33% on the BraTS-AG dataset and 33.33% on the BraTS-Met dataset. Code will be made publicly available soon.
zh

[CV-178] CALF: A Conditionally Adaptive Loss Function to Mitigate Class-Imbalanced Segmentation

【速读】:该论文旨在解决深度学习模型在训练医学诊断任务(尤其是分割任务)时因数据集不平衡所导致的问题。这种不平衡可能源于标注质量差、有限的标注样本、罕见病例或感兴趣区域(ROI)的小尺度分布,这些问题会降低模型性能并影响分割边界的准确性。传统损失函数如二元交叉熵(Binary Cross Entropy)无法有效应对这些挑战,容易复制标注偏差并限制模型泛化能力。

论文的关键解决方案是提出了一种新颖的统计驱动条件自适应损失函数(Conditionally Adaptive Loss Function, CALF)。该方法通过利用偏度(skewness)和峰度(kurtosis)等统计手段评估数据集的不平衡程度,并据此施加适当的变换以平衡训练数据,同时保留数据的异质性。其核心创新在于结合预处理、数据集筛选以及动态损失选择的多维度流程,从而实现最优性能。实验结果表明,该方法在大规模开源数据集(如UPENN-GBM、UCSF、LGG和BraTS)上的应用显著提升了分割性能。

链接: https://arxiv.org/abs/2504.04458
作者: Bashir Alam,Masa Cirkovic,Mete Harun Akcay,Md Kaf Shahrier,Sebastien Lafond,Hergys Rexha,Kurt Benke,Sepinoud Azimi,Janan Arslan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Imbalanced datasets pose a considerable challenge in training deep learning (DL) models for medical diagnostics, particularly for segmentation tasks. Imbalance may be associated with annotation quality, limited annotated datasets, rare cases, or small-scale regions of interest (ROIs). These conditions adversely affect model training and performance, leading to segmentation boundaries which deviate from the true ROIs. Traditional loss functions, such as Binary Cross Entropy, replicate annotation biases and limit model generalization. We propose a novel, statistically driven, conditionally adaptive loss function (CALF) tailored to accommodate the conditions of imbalanced datasets in DL training. It employs a data-driven methodology by estimating imbalance severity using statistical methods of skewness and kurtosis, then applies an appropriate transformation to balance the training dataset while preserving data heterogeneity. This transformative approach integrates a multifaceted process, encompassing preprocessing, dataset filtering, and dynamic loss selection to achieve optimal outcomes. We benchmark our method against conventional loss functions using qualitative and quantitative evaluations. Experiments using large-scale open-source datasets (i.e., UPENN-GBM, UCSF, LGG, and BraTS) validate our approach, demonstrating substantial segmentation improvements. Code availability: this https URL.
zh

[CV-179] Autoregressive High-Order Finite Difference Modulo Imaging: High-Dynamic Range for Computer Vision Applications

【速读】:本文旨在解决高动态范围(HDR)成像在标准商业成像系统中的局限性问题,特别是由于像素阱深和量化精度不足导致的HDR能力受限。模数成像(Modulo Imaging)基于无限采样(Unlimited Sampling, US)理论,采用在信号饱和时复位的取模模数转换方式扩展动态范围,并通过邻域像素强度估计像素的复位次数;然而US算法虽在一维信号上有效,其面向二维信号的优化问题此前尚不明确。论文的关键在于将US框架重新表述为一个自回归 ℓ₂ 相位展开(phase unwrapping)问题,在离散余弦域中给出计算高效的求解方案,并结合同样基于空间差分的步长(stride)去除算法;借助二维图像的高阶有限差分,该方法提升了由模数图像重建HDR图像的质量,且无需重新训练即可改善自动驾驶场景中的目标检测效果。
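"饱和即复位、再由差分解缠恢复"的无限采样思想,可在一维信号上用 numpy 作如下极简演示;论文实际处理的是二维自回归 ℓ₂ 相位展开并在 DCT 域求解,这里的一阶差分一维版本仅作原理示意,模数阈值 LAMBDA 为假设值。

```python
import numpy as np

LAMBDA = 1.0  # 假设的模数阈值(传感器饱和复位值)

def modulo(x, lam=LAMBDA):
    # 中心化取模,把信号折叠进 [-lam/2, lam/2)
    return (x + lam / 2) % lam - lam / 2

def unwrap_1d(y, lam=LAMBDA):
    # 若原信号相邻采样差的绝对值小于 lam/2,
    # 对折叠信号的一阶差分再取一次模即可恢复真实差分,累加还原信号
    d = modulo(np.diff(y), lam)
    return np.concatenate(([y[0]], y[0] + np.cumsum(d)))

t = np.linspace(0.0, 1.0, 200)
x = 2.0 * np.sin(2 * np.pi * t)   # 动态范围远超 LAMBDA 的原信号
rec = unwrap_1d(modulo(x))
```

可以看到,折叠后的观测被限制在 ±LAMBDA/2 内,而解缠结果与原信号一致(前提是采样足够密,使相邻差分落在半个模数以内)。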

链接: https://arxiv.org/abs/2504.04228
作者: Brayan Monroy,Kebin Contreras,Jorge Bacca
机构: VIE-UIS
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High dynamic range (HDR) imaging is vital for capturing the full range of light tones in scenes, essential for computer vision tasks such as autonomous driving. Standard commercial imaging systems face limitations in capacity for well depth, and quantization precision, hindering their HDR capabilities. Modulo imaging, based on unlimited sampling (US) theory, addresses these limitations by using a modulo analog-to-digital approach that resets signals upon saturation, enabling estimation of pixel resets through neighboring pixel intensities. Despite the effectiveness of (US) algorithms in one-dimensional signals, their optimization problem for two-dimensional signals remains unclear. This work formulates the US framework as an autoregressive \ell_2 phase unwrapping problem, providing computationally efficient solutions in the discrete cosine domain jointly with a stride removal algorithm also based on spatial differences. By leveraging higher-order finite differences for two-dimensional images, our approach enhances HDR image reconstruction from modulo images, demonstrating its efficacy in improving object detection in autonomous driving scenes without retraining.
zh

[CV-180] Overcoming the Identity Mapping Problem in Self-Supervised Hyperspectral Anomaly Detection

【速读】:该论文旨在解决自监督高光谱异常检测(Self-supervised Hyperspectral Anomaly Detection, HAD)中的恒等映射问题(Identity Mapping Problem, IMP),该问题源于神经网络强大的非线性拟合能力,导致模型倾向于过度拟合整个图像,尤其在网络复杂度增加或训练迭代次数增多时。这种过度拟合使得整幅图像都能被精确重建,连异常像素的重建误差也小到难以察觉,从而削弱了异常的可检测性。尽管已有多种模型尝试缓解与IMP相关的问题,但目前仍缺乏统一的描述框架及相应的解决方案验证。

为解决上述问题,本文提出了一个从网络优化角度统一描述IMP的框架,包含扰动(perturbation)、重建(reconstruction)和正则化(regularization)三个维度。针对这三个方面,作者分别引入了三种解决方案:基于超像素池化与上采样的扰动处理方法、误差自适应卷积用于改进重建效果,以及在线背景像素挖掘技术以增强正则化能力。通过广泛的实验验证,这些方法有效提升了自监督HAD任务的表现。本研究期望为该领域提供有价值的见解,并激发更多相关研究工作。
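其中"在线背景像素挖掘"这一正则化思路可作如下假设性示意:每次迭代只让重建误差较小的像素参与损失,使网络不去精确拟合高误差的疑似异常像素,从而缓解 IMP(函数与保留比例均为示意,非论文实现)。

```python
import numpy as np

def background_mining_loss(err_map, keep_ratio=0.8):
    # 只取误差最小的 keep_ratio 比例像素参与损失,其余(疑似异常)被排除在外
    flat = np.sort(np.ravel(err_map))
    k = max(1, int(flat.size * keep_ratio))
    return float(flat[:k].mean())
```

这样,即便网络容量很大,高误差像素也不会为降低损失而被"学进去",背景与异常的重建误差差距得以保持。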

链接: https://arxiv.org/abs/2504.04115
作者: Yongchuan Cui,Jinhe Zhang,Peng Liu,Weijing Song,Yi Zeng
机构: Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院航空航天信息研究所); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences (中国科学院大学电子电气与通信工程学院); School of Geography and Information Engineering, China University of Geosciences (Wuhan) (中国地质大学(武汉)地理与信息工程学院); School of Computer Science, China University of Geosciences (Wuhan) (中国地质大学(武汉)计算机科学学院); College of Information, Beijing Forestry University (北京林业大学信息学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The surge of deep learning has catalyzed considerable progress in self-supervised Hyperspectral Anomaly Detection (HAD). The core premise for self-supervised HAD is that anomalous pixels are inherently more challenging to reconstruct, resulting in larger errors compared to the background. However, owing to the powerful nonlinear fitting capabilities of neural networks, self-supervised models often suffer from the Identity Mapping Problem (IMP). The IMP manifests as a tendency for the model to overfit to the entire image, particularly with increasing network complexity or prolonged training iterations. Consequently, the whole image can be precisely reconstructed, and even the anomalous pixels exhibit imperceptible errors, making them difficult to detect. Despite the proposal of several models aimed at addressing the IMP-related issues, a unified descriptive framework and validation of solutions for IMP remain lacking. In this paper, we conduct an in-depth exploration to IMP, and summarize a unified framework that describes IMP from the perspective of network optimization, which encompasses three aspects: perturbation, reconstruction, and regularization. Correspondingly, we introduce three solutions: superpixel pooling and uppooling for perturbation, error-adaptive convolution for reconstruction, and online background pixel mining for regularization. With extensive experiments being conducted to validate the effectiveness, it is hoped that our work will provide valuable insights and inspire further research for self-supervised HAD. Code: this https URL.
zh

[CV-181] Performance Analysis of Deep Learning Models for Femur Segmentation in MRI Scan

【速读】:该论文旨在解决医学图像分割领域中模型在有限数据下因偏差影响泛化能力的问题。论文系统评估并比较了四种模型(U-Net、Attention U-Net、U-KAN 和 SAM 2)在磁共振成像(MRI)股骨结构分割任务中的性能,以寻找更有效的解决方案。关键在于引入注意力机制(Attention Mechanism)与学习容量更强的模块(如 KAN):Attention U-Net 在整体表现上最优,而 U-KAN 在感兴趣区域(ROI)较小的解剖区域中展现出更高的分割精度。

链接: https://arxiv.org/abs/2504.04066
作者: Mengyuan Liu,Yixiao Chen,Anning Tian,Xinmeng Wu,Mozhi Shen,Tianchou Gong,Jeongkyu Lee
机构: Khoury College of Computer Sciences, Northeastern University (美国东北大学 Khoury 计算机科学学院; 塞勒姆校区)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Convolutional neural networks like U-Net excel in medical image segmentation, while attention mechanisms and KAN enhance feature extraction. Meta’s SAM 2 uses Vision Transformers for prompt-based segmentation without fine-tuning. However, biases in these models impact generalization with limited data. In this study, we systematically evaluate and compare the performance of three CNN-based models, i.e., U-Net, Attention U-Net, and U-KAN, and one transformer-based model, i.e., SAM 2 for segmenting femur bone structures in MRI scan. The dataset comprises 11,164 MRI scans with detailed annotations of femoral regions. Performance is assessed using the Dice Similarity Coefficient, which ranges from 0.932 to 0.954. Attention U-Net achieves the highest overall scores, while U-KAN demonstrated superior performance in anatomical regions with a smaller region of interest, leveraging its enhanced learning capacity to improve segmentation accuracy.
zh

[CV-182] Hierarchical Attention Network for Interpretable ECG-based Heart Disease Classification

【速读】:该论文旨在解决心血管疾病诊断中准确性和可解释性之间的权衡问题,致力于开发一种既能实现高精度分类又能提供清晰解释的机器学习工具。为实现这一目标,论文提出将原本用于文本分类的层级注意力网络(Hierarchical Attention Network, HAN)适配到基于心电图(Electrocardiogram, ECG)数据的心脏病分类任务中。解决方案的关键在于引入两级注意力机制,分别聚焦于不同尺度的ECG数据片段,并通过简化模型结构显著降低参数数量,同时保持较高的分类准确性。此外,该方法通过可视化注意力权重提升了模型的可解释性。实验结果表明,与包含卷积、注意力及Transformer层的复杂CAT-Net架构相比,适配后的HAN在MIT-BIH数据集上的测试准确率仅相差0.59%,而在PTB-XL数据集上实现了19.3倍的参数规模缩减,且测试准确率仅下降5%。
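两级注意力机制可用如下简化的 numpy 代码示意:第一层注意力在每个 ECG 片段内部汇聚采样点,第二层在片段级表示序列上汇聚,得到的注意力权重亦可用于可解释性可视化。编码方式与维度均为说明用的假设,与论文实现无关。

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_pool(H, w):
    # H: (n, d) 特征序列, w: (d,) 查询向量;返回加权汇聚表示与注意力权重
    a = softmax(H @ w)
    return a @ H, a

def hierarchical_encode(ecg, seg_len, v, w_low, w_high):
    # 第一层注意力作用于片段内部,第二层作用于片段级表示序列
    segs = ecg.reshape(-1, seg_len)
    R = []
    for s in segs:
        H = np.tanh(np.outer(s, v))       # (seg_len, d) 简化的逐采样编码
        r, _ = attention_pool(H, w_low)
        R.append(r)
    return attention_pool(np.stack(R), w_high)

d = 4
ecg = rng.standard_normal(64)
out, a_seg = hierarchical_encode(
    ecg, seg_len=8,
    v=rng.standard_normal(d), w_low=rng.standard_normal(d),
    w_high=rng.standard_normal(d),
)
```

片段级权重 a_seg 归一为概率分布,可直接可视化,显示模型最关注信号的哪一段,这正是论文强调的可解释性优势来源。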

链接: https://arxiv.org/abs/2504.03703
作者: Mario Padilla Rodriguez,Mohamed Nafea
机构: University of Detroit Mercy (底特律Mercy大学); Missouri University of Science and Technology (密苏里科学技术大学)
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Work in progress. 7 pages, 4 figures

点击查看摘要

Abstract:Cardiovascular disease remains one of the leading causes of mortality worldwide, underscoring the need for accurate as well as interpretable diagnostic machine learning tools. In this work, we investigate heart disease classification using electrocardiogram (ECG) data from two widely-utilized datasets: The MIT-BIH Arrhythmia and the PTB-XL datasets. We adapt a hierarchical attention network (HAN), originally developed for text classification, into an ECG-based heart-disease classification task. Our adapted HAN incorporates two attention layers that focus on ECG data segments of varying sizes. We conduct a comparative analysis between our adapted HAN and a more sophisticated state-of-the-art architecture, featuring a network with convolution, attention, and transformer layers (CAT-Net). Our empirical evaluation encompasses multiple aspects including test accuracy (quantified by 0-1 loss); model complexity (measured by the number of model parameters); and interpretability (through attention map visualization). Our adapted HAN demonstrates comparable test accuracy with significant reductions in model complexity and enhanced interpretability analysis: For the MIT-BIH dataset, our adapted HAN achieves 98.55% test accuracy compared to 99.14% for CAT-Net, while reducing the number of model parameters by a factor of 15.6. For the PTB-XL dataset, our adapted HAN achieves a 19.3-fold reduction in model complexity compared to CAT-Net, with only a 5% lower test accuracy. From an interpretability perspective, the significantly simpler architecture and the hierarchical nature of our adapted HAN model facilitate a more straightforward interpretability analysis based on visualizing attention weights. Building on this advantage, we conduct an interpretability analysis of our HAN that highlights the regions of the ECG signal most relevant to the model’s decisions.
zh

[CV-183] Process Optimization and Deployment for Sensor-Based Human Activity Recognition Based on Deep Learning

【速读】:本文旨在解决基于传感器的人类活动识别(Sensor-based Human Activity Recognition)领域中存在的诸多挑战,特别是在数据增强、特征挖掘及模型优化等方面的不足。论文的关键在于提出了一种以多注意力交互为中心的综合性优化流程方法。具体而言,首先通过无监督统计特征引导的扩散模型实现高度自适应的数据增强;其次设计了一种新颖的网络架构——多分支时空交互网络(Multi-branch Spatiotemporal Interaction Network),利用不同层次上的多分支特征进行有效的顺序时空交互,从而提升挖掘高级潜在特征的能力;此外,在训练阶段采用多损失函数融合策略动态调整批次间的融合权重,进一步优化训练效果。最后,还在嵌入式设备上进行了实际部署,全面验证所提方法在现有工作的实用性。
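"批次间动态调整多损失融合权重"的一种假设性实现方式如下:按各损失的相对大小做 softmax 归一,让当前损失较大的分支在下个批次获得更高权重(温度参数与具体策略均为示意,非论文方案)。

```python
import numpy as np

def update_fusion_weights(loss_values, temperature=1.0):
    # 按各损失的相对大小做 softmax:当前损失越大的分支获得越高权重
    l = np.asarray(loss_values, dtype=float)
    z = l / (l.mean() + 1e-12) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()
```

用损失均值归一化后再过 softmax,权重对损失的绝对量纲不敏感,只反映各分支之间的相对难度。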

链接: https://arxiv.org/abs/2504.03687
作者: Hanyu Liu,Ying Yu,Hang Xiao,Siyao Li,Xuze Li,Jiarui Li,Haotian Tang
机构: Northeastern University, Shenyang 110169, China (东北大学); Liaoning University, Shenyang 110169, China (辽宁大学)
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sensor-based human activity recognition is a key technology for many human-centered intelligent applications. However, this research is still in its infancy and faces many unresolved challenges. To address these, we propose a comprehensive optimization process approach centered on multi-attention interaction. We first utilize unsupervised statistical feature-guided diffusion models for highly adaptive data enhancement, and introduce a novel network architecture, the Multi-branch Spatiotemporal Interaction Network, which uses multi-branch features at different levels for effective sequential spatio-temporal interaction to enhance the ability to mine advanced latent features. In addition, we adopt a multi-loss function fusion strategy in the training phase to dynamically adjust the fusion weights between batches to optimize the training results. Finally, we also conducted actual deployment on embedded devices to extensively test the practical feasibility of the proposed method in existing work. We conduct extensive testing on three public datasets, including ablation studies, comparisons of related work, and embedded deployments.
zh

人工智能

[AI-0] Dion: A Communication-Efficient Optimizer for Large Models

【速读】:该论文试图解决在分布式训练大模型过程中因梯度同步导致的显著通信开销问题。解决方案的关键在于提出了一种名为Dion的高效通信优化器,它保留了标准分布式训练(如DDP、FSDP)的同步语义,同时大幅降低了I/O成本。Dion通过利用正交化更新和设备本地动量缓冲区,避免了全梯度矩阵的交换,并进一步支持一种高效的分片策略,从而在训练过程中无需重建大型矩阵。
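"设备本地动量缓冲 + 正交化更新"的思想可用如下单机 numpy 草图示意(类名、超参数均为假设,与官方实现无关):动量矩阵经 QR 正交化后作为尺度统一的更新方向;此处省略分布式与分片部分,仅演示正交化更新本身。

```python
import numpy as np

class DionSketch:
    # 单设备示意:动量缓冲保存在本地,更新方向取动量矩阵的 QR 正交化结果
    def __init__(self, shape, lr=0.01, mu=0.9):
        self.M = np.zeros(shape)
        self.lr = lr
        self.mu = mu

    def step(self, W, grad):
        self.M = self.mu * self.M + grad
        Q, _ = np.linalg.qr(self.M)   # 列正交方向,更新尺度与梯度大小解耦
        return W - self.lr * Q

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 2))
G = rng.standard_normal((4, 2))
opt = DionSketch((4, 2))
W_new = opt.step(W, G)
```

因为更新方向的各列正交且单位长度,步长完全由学习率控制;摘要所述的通信收益来自各设备只需交换重建正交方向所需的低维信息,而非完整梯度矩阵。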

链接: https://arxiv.org/abs/2504.05295
作者: Kwangjun Ahn,Byron Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: technical report; comments welcome!

点击查看摘要

Abstract:Training large AI models efficiently requires distributing computation across multiple accelerators, but this often incurs significant communication overhead – especially during gradient synchronization. We introduce Dion, a communication-efficient optimizer that retains the synchronous semantics of standard distributed training (e.g., DDP, FSDP) while substantially reducing I/O costs. Unlike conventional optimizers that synchronize full gradient matrices, Dion leverages orthonormalized updates with device-local momentum buffers, eliminating the need for full gradient exchange. It further supports an efficient sharding strategy that avoids reconstructing large matrices during training.
zh

[AI-1] The challenge of uncertainty quantification of large language models in medicine

【速读】:本文旨在解决大型语言模型(Large Language Models, LLMs)在医学应用中不确定性量化的问题,强调技术创新与哲学意义。随着LLMs在临床决策中的重要性日益增加,准确传达不确定性对于确保可靠、安全且符合伦理的AI辅助医疗至关重要。论文将不确定性视为知识的重要组成部分,而非障碍,提倡采用动态且反思性的方法设计AI系统。

解决方案的关键在于提出一个综合框架,该框架结合先进的概率方法(如贝叶斯推理、深度集成及蒙特卡罗丢弃)与基于语言分析的预测熵和语义熵计算,以管理认识论(epistemic)和偶然性(aleatoric)不确定性。此框架通过代理建模解决专有API的限制,利用多源数据整合增强上下文理解,并借助持续学习和元学习实现动态校准。此外,通过引入不确定性图谱和置信度指标来嵌入可解释性,支持用户信任与临床解释能力。最终,该方法促进了透明且符合责任与反思性AI原则的决策过程。从哲学角度出发,论文主张接受受控的模糊性,而非追求绝对的可预测性,认可医学知识的固有暂时性。
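其中由预测熵分解认识(epistemic)与偶然(aleatoric)不确定性的做法,可用深度集成或 MC Dropout 的多次采样概率近似,示意如下(采用常见的互信息分解,函数与变量名为假设):

```python
import numpy as np

def predictive_entropy(prob_samples):
    # prob_samples: (S, C),来自深度集成或 MC Dropout 的 S 次类别概率预测
    p_mean = prob_samples.mean(axis=0)
    total = -(p_mean * np.log(p_mean + 1e-12)).sum()        # 总预测熵
    aleatoric = -(prob_samples * np.log(prob_samples + 1e-12)).sum(axis=1).mean()
    epistemic = total - aleatoric                            # 互信息近似的认识不确定性
    return total, aleatoric, epistemic
```

当多次采样的预测彼此一致时,认识不确定性趋近于 0,剩余的不确定性全部来自数据本身;采样之间分歧越大,认识项越大,提示模型"不知道自己不知道"。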

链接: https://arxiv.org/abs/2504.05278
作者: Zahra Atf,Seyed Amir Ahmad Safavi-Naini,Peter R. Lewis,Aref Mahjoubfar,Nariman Naderi,Thomas R. Savage,Ali Soroush
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 11 figures

点击查看摘要

Abstract:This study investigates uncertainty quantification in large language models (LLMs) for medical applications, emphasizing both technical innovations and philosophical implications. As LLMs become integral to clinical decision-making, accurately communicating uncertainty is crucial for ensuring reliable, safe, and ethical AI-assisted healthcare. Our research frames uncertainty not as a barrier but as an essential part of knowledge that invites a dynamic and reflective approach to AI design. By integrating advanced probabilistic methods such as Bayesian inference, deep ensembles, and Monte Carlo dropout with linguistic analysis that computes predictive and semantic entropy, we propose a comprehensive framework that manages both epistemic and aleatoric uncertainties. The framework incorporates surrogate modeling to address limitations of proprietary APIs, multi-source data integration for better context, and dynamic calibration via continual and meta-learning. Explainability is embedded through uncertainty maps and confidence metrics to support user trust and clinical interpretability. Our approach supports transparent and ethical decision-making aligned with Responsible and Reflective AI principles. Philosophically, we advocate accepting controlled ambiguity instead of striving for absolute predictability, recognizing the inherent provisionality of medical knowledge.
zh

[AI-2] How to evaluate control measures for LLM agents ? A trajectory from today to superintelligence

【速读】:该论文旨在解决如何确保控制评估(Control Evaluations)能够有效捕捉受控大型语言模型(LLMs)可能存在的对齐失败(misalignment)风险的问题。随着LLM能力增强,其自主造成危害的可能性增加,因此需要更复杂的控制措施来预防潜在的失准行为。论文的关键在于提出一个系统性的框架,用于根据AI能力的发展动态调整红队(Red Team)所获得的操作权限(Affordances),以匹配即将部署的受控AI代理的能力概况(Capability Profiles)。不同于假设攻击策略始终保持最优,该框架利用对AI实际能力的了解,设计与能力相称的控制评估,从而实现更实用且成本效益更高的控制措施。这一方法通过定义五个虚构模型(M1-M5)及其对应的五种AI控制等级(ACLs)进行说明,并为每种控制等级提供了示例规则,包括控制评估、控制措施及安全性案例。最终,论文指出构建超级智能LLM的控制安全性案例将需要研究突破,并暗示未来可能需要探索替代性方法来缓解对齐失败风险。

链接: https://arxiv.org/abs/2504.05259
作者: Tomek Korbak,Mikita Balesni,Buck Shlegeris,Geoffrey Irving
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:As LLM agents grow more capable of causing harm autonomously, AI developers will rely on increasingly sophisticated control measures to prevent possibly misaligned agents from causing harm. AI developers could demonstrate that their control measures are sufficient by running control evaluations: testing exercises in which a red team produces agents that try to subvert control measures. To ensure control evaluations accurately capture misalignment risks, the affordances granted to this red team should be adapted to the capability profiles of the agents to be deployed under control measures. In this paper we propose a systematic framework for adapting affordances of red teams to advancing AI capabilities. Rather than assuming that agents will always execute the best attack strategies known to humans, we demonstrate how knowledge of an agent's actual capability profile can inform proportional control evaluations, resulting in more practical and cost-effective control measures. We illustrate our framework by considering a sequence of five fictional models (M1-M5) with progressively advanced capabilities, defining five distinct AI control levels (ACLs). For each ACL, we provide example rules for control evaluation, control measures, and safety cases that could be appropriate. Finally, we show why constructing a compelling AI control safety case for superintelligent LLM agents will require research breakthroughs, highlighting that we might eventually need alternative approaches to mitigating misalignment risk.
zh

[AI-3] Adversarial KA

【速读】:该论文试图检验Kolmogorov-Arnold表示定理(KA)作为函数表示算法的鲁棒性,具体通过分析其抵御连续对抗攻击的能力。论文发现KA对可数的连续对抗集合具有鲁棒性,但关键问题在于外函数的等度连续性尚未得到保证,这阻碍了取极限的操作,使其暂时无法击败连续对抗群。因此,外函数的正则性问题是决定KA能否适用于神经网络(Neural Networks, NNs)一般理论的核心争议点。

链接: https://arxiv.org/abs/2504.05255
作者: Sviatoslav Dzhenzher,Michael H. Freedman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Functional Analysis (math.FA)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:Regarding the representation theorem of Kolmogorov and Arnold (KA) as an algorithm for representing or «expressing» functions, we test its robustness by analyzing its ability to withstand adversarial attacks. We find KA to be robust to countable collections of continuous adversaries, but unearth a question about the equi-continuity of the outer functions that, so far, obstructs taking limits and defeating continuous groups of adversaries. This question on the regularity of the outer functions is relevant to the debate over the applicability of KA to the general theory of NNs.
zh

[AI-4] PINNverse: Accurate parameter estimation in differential equations from noisy data with constrained physics-informed neural networks

【速读】:本文旨在解决从测量数据中估计微分方程参数这一在定量科学中普遍存在的逆问题,特别是在稀疏测量和系统信息不完整的情况下。论文提出了一种名为PINNverse的新训练范式,通过将学习过程重新表述为约束微分优化问题来克服传统Physics-Informed Neural Networks (PINNs) 面临的收敛困难、稳定性问题、过拟合及复杂损失函数设计等挑战。其关键在于训练过程中动态平衡数据损失与微分方程残差损失的同时,有效防止过拟合,并结合Modified Differential Method of Multipliers实现Pareto前沿上任意点的收敛。通过这种方法,论文展示了在物理和生物领域四个经典常微分方程(ODE)和偏微分方程(PDE)模型中,即使面对噪声数据或正向问题计算昂贵的情况,也能实现鲁棒且精确的参数估计。
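"数据损失 + 方程残差约束"的乘子法思想可在一个标量玩具问题上演示;论文使用的是 Modified Differential Method of Multipliers,此处给出的是经典乘子法草图,所有数值均为假设。

```python
import numpy as np

def method_of_multipliers(f_grad, g, g_grad, theta0, lam0=0.0, c=10.0,
                          lr=0.01, inner=200, outer=20):
    # 内层:对增广拉格朗日函数做梯度下降;外层:乘子更新 lam <- lam + c * g(theta)
    theta, lam = theta0, lam0
    for _ in range(outer):
        for _ in range(inner):
            grad = f_grad(theta) + (lam + c * g(theta)) * g_grad(theta)
            theta -= lr * grad
        lam += c * g(theta)
    return theta, lam

# 玩具问题:数据损失 (theta - 3)^2,"物理约束" g(theta) = theta - 2 = 0
theta, lam = method_of_multipliers(
    f_grad=lambda t: 2.0 * (t - 3.0),
    g=lambda t: t - 2.0,
    g_grad=lambda t: 1.0,
    theta0=0.0,
)
```

数据损失单独最小化会把参数拉向 3,而约束迫使解收敛到满足"物理方程"的 2,乘子 lam 自动升至平衡两种梯度所需的值;这正是 PINNverse 在数据损失与残差损失之间维持动态平衡的机制的标量缩影。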

链接: https://arxiv.org/abs/2504.05248
作者: Marius Almanstötter,Roman Vetter,Dagmar Iber
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:Parameter estimation for differential equations from measured data is an inverse problem prevalent across quantitative sciences. Physics-Informed Neural Networks (PINNs) have emerged as effective tools for solving such problems, especially with sparse measurements and incomplete system information. However, PINNs face convergence issues, stability problems, overfitting, and complex loss function design. Here we introduce PINNverse, a training paradigm that addresses these limitations by reformulating the learning process as a constrained differential optimization problem. This approach achieves a dynamic balance between data loss and differential equation residual loss during training while preventing overfitting. PINNverse combines the advantages of PINNs with the Modified Differential Method of Multipliers to enable convergence on any point on the Pareto front. We demonstrate robust and accurate parameter estimation from noisy data in four classical ODE and PDE models from physics and biology. Our method enables accurate parameter inference also when the forward problem is expensive to solve.
zh

[AI-5] FinGrAct: A Framework for FINe-GRrained Evaluation of ACTionability in Explainable Automatic Fact-Checking

【速读】:该论文旨在解决自动事实核查(Automatic Fact-Checking, AFC)领域中解释的可操作性(actionability)评估方法缺失的问题。现有AFC系统的透明性和可信度依赖于清晰且易于理解的解释,而这些解释的有效性直接取决于其可操作性——即是否能够帮助用户做出明智决策并减少误导信息的影响。尽管可操作性被认为是高质量解释的关键属性之一,但此前尚未有专门的方法对其进行评估。为此,论文提出了一种名为FinGrAct的细粒度评估框架,该框架能够访问网络,并通过明确的标准与评估数据集来衡量AFC解释中的可操作性。FinGrAct的关键创新在于其超越了当前最先进的评估器,在与人类判断的相关性(Pearson和Kendall相关系数)方面表现最佳,同时展现出最低的自我中心偏差,从而提供了一种更稳健的可操作性评估方法。

链接: https://arxiv.org/abs/2504.05229
作者: Islam Eldifrawi,Shengrui Wang,Amine Trabelsi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The field of explainable Automatic Fact-Checking (AFC) aims to enhance the transparency and trustworthiness of automated fact-verification systems by providing clear and comprehensible explanations. However, the effectiveness of these explanations depends on their actionability --their ability to empower users to make informed decisions and mitigate misinformation. Despite actionability being a critical property of high-quality explanations, no prior research has proposed a dedicated method to evaluate it. This paper introduces FinGrAct, a fine-grained evaluation framework that can access the web, and it is designed to assess actionability in AFC explanations through well-defined criteria and an evaluation dataset. FinGrAct surpasses state-of-the-art (SOTA) evaluators, achieving the highest Pearson and Kendall correlation with human judgments while demonstrating the lowest ego-centric bias, making it a more robust evaluation approach for actionability evaluation in AFC.
zh

[AI-6] A moving target in AI-assisted decision-making: Dataset shift model updating and the problem of update opacity

【速读】:该论文试图解决机器学习系统因数据漂移导致性能下降的问题,并聚焦于模型更新对辅助决策过程本身的影响,特别是由此引入的"更新不透明性"(update opacity),即用户无法理解模型更新如何或为何改变了机器学习系统的推理或行为。论文指出,现有的针对黑箱问题的解决方案在应对这种更新不透明性带来的认识论和安全性挑战时效果有限。论文的关键在于提出多种可能的替代策略来直接应对更新不透明性问题,包括双事实解释(bi-factual explanations)、动态模型报告(dynamic model reporting)和更新兼容性(update compatibility)。然而,这些方案各自存在风险或显著局限性,未来仍需进一步研究以全面解决相关认识论和安全性问题。

链接: https://arxiv.org/abs/2504.05210
作者: Joshua Hatherley
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machine learning (ML) systems are vulnerable to performance decline over time due to dataset shift. To address this problem, experts often suggest that ML systems should be regularly updated to ensure ongoing performance stability. Some scholarly literature has begun to address the epistemic and ethical challenges associated with different updating methodologies. Thus far, however, little attention has been paid to the impact of model updating on the ML-assisted decision-making process itself, particularly in the AI ethics and AI epistemology literatures. This article aims to address this gap in the literature. It argues that model updating introduces a new sub-type of opacity into ML-assisted decision-making – update opacity – that occurs when users cannot understand how or why an update has changed the reasoning or behaviour of an ML system. This type of opacity presents a variety of distinctive epistemic and safety concerns that available solutions to the black box problem in ML are largely ill-equipped to address. A variety of alternative strategies may be developed or pursued to address the problem of update opacity more directly, including bi-factual explanations, dynamic model reporting, and update compatibility. However, each of these strategies presents its own risks or carries significant limitations. Further research will be needed to address the epistemic and safety concerns associated with model updating and update opacity going forward.
zh

[AI-7] Resource-Efficient Beam Prediction in mmWave Communications with Multimodal Realistic Simulation Framework

【速读】:该论文旨在解决传统波束成形方法在快速变化的通信环境中适应性不足的问题,特别是常规信道估计算法(如导频信号或波束扫描)难以应对动态环境。为了解决这一局限性,多模态感知辅助波束预测受到了广泛关注,但其应用受到高计算复杂度、高昂成本以及有限数据集的限制。论文的关键解决方案是提出了一种资源高效的学习方法,通过跨模态关系知识蒸馏(Cross-Modal Relational Knowledge Distillation, CRKD),将多模态网络的知识迁移到仅基于雷达的单模态网络中,在降低计算开销的同时保持预测准确性。此外,为了支持现实数据下的多模态学习,论文开发了一个新颖的多模态仿真框架,结合了由自主驾驶模拟器CARLA生成的传感器数据与MATLAB中的毫米波信道建模工具,以反映真实世界条件。CRKD通过在不同特征空间中蒸馏关系信息来实现目标,从而提升波束预测性能,而无需依赖昂贵的多模态传感器数据。仿真结果表明,CRKD能够高效蒸馏多模态知识,使仅基于雷达的模型达到教师模型94.62%的性能,且仅使用教师网络10%的参数量,显著降低了计算复杂度和对多模态传感器数据的依赖。
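"蒸馏关系信息而非特征本身"的思路可用经典关系知识蒸馏(RKD)的距离项示意:教师(多模态)与学生(仅雷达)的特征维度可以不同,但二者在一个 batch 内的两两距离结构应当一致。以下为假设性 numpy 草图,与论文实现无关。

```python
import numpy as np

def pairwise_dist(F):
    # F: (n, d) 一个 batch 的特征;返回按均值归一化的两两欧氏距离矩阵
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)
    D = np.sqrt(np.maximum(d2, 0.0))
    mu = D[D > 0].mean() if (D > 0).any() else 1.0
    return D / (mu + 1e-12)

def crkd_loss(F_teacher, F_student):
    # 不对齐特征本身(维度可不同),只要求学生的关系结构逼近教师的关系结构
    Dt, Ds = pairwise_dist(F_teacher), pairwise_dist(F_student)
    return float(((Dt - Ds) ** 2).mean())
```

由于距离矩阵经过均值归一化,该损失对特征的整体缩放不敏感,学生网络只需复刻样本之间的相对关系,这也是关系蒸馏能跨越模态与维度差异的原因。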

链接: https://arxiv.org/abs/2504.05187
作者: Yu Min Park,Yan Kyaw Tun,Walid Saad,Choong Seon Hong
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 8 figures, Submitted to IEEE Transactions on Communications on Apr. 07, 2025

点击查看摘要

Abstract:Beamforming is a key technology in millimeter-wave (mmWave) communications that improves signal transmission by optimizing directionality and intensity. However, conventional channel estimation methods, such as pilot signals or beam sweeping, often fail to adapt to rapidly changing communication environments. To address this limitation, multimodal sensing-aided beam prediction has gained significant attention, using various sensing data from devices such as LiDAR, radar, GPS, and RGB images to predict user locations or network conditions. Despite its promising potential, the adoption of multimodal sensing-aided beam prediction is hindered by high computational complexity, high costs, and limited datasets. Thus, in this paper, a resource-efficient learning approach is proposed to transfer knowledge from a multimodal network to a monomodal (radar-only) network based on cross-modal relational knowledge distillation (CRKD), while reducing computational overhead and preserving predictive accuracy. To enable multimodal learning with realistic data, a novel multimodal simulation framework is developed while integrating sensor data generated from the autonomous driving simulator CARLA with MATLAB-based mmWave channel modeling, and reflecting real-world conditions. The proposed CRKD achieves its objective by distilling relational information across different feature spaces, which enhances beam prediction performance without relying on expensive sensor data. Simulation results demonstrate that CRKD efficiently distills multimodal knowledge, allowing a radar-only model to achieve 94.62% of the teacher performance. In particular, this is achieved with just 10% of the teacher network’s parameters, thereby significantly reducing computational complexity and dependence on multimodal sensor data.
zh

[AI-8] Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval SIGIR2025

【速读】:该论文旨在解决生成式信息检索(Generative Information Retrieval, GenIR)模型中存在的令牌级错位问题,即现有模型在预测下一个令牌时难以有效捕捉文档级相关性的问题。此外,针对基于强化学习的方法(如从相关反馈中进行强化学习,Reinforcement Learning from Relevance Feedback, RLRF)因引入复杂的辅助奖励函数优化及后续的强化微调而带来的计算成本高且不稳定的问题,论文提出了一种直接文档相关性优化(Direct Document Relevance Optimization, DDRO)方法。DDRO通过成对排名的方式直接优化令牌级文档标识符生成与文档级相关性估计的一致性,无需显式的奖励建模和强化学习。实验结果表明,DDRO在MS MARCO数据集上的MRR@10指标提升了7.4%,在Natural Questions数据集上提升了19.9%,显著优于基于强化学习的方法,同时简化了生成式信息检索模型的排序优化流程。
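成对排名的核心损失项可写成 DPO 风格的 -log sigmoid(s_pos - s_neg):当相关 docid 序列的得分高于不相关序列时损失趋近于 0。以下为该损失项的假设性示意,并非 DDRO 官方实现。

```python
import numpy as np

def pairwise_ranking_loss(score_pos, score_neg):
    # score_pos / score_neg:相关与不相关 docid 序列的得分(如序列对数似然)
    margin = np.asarray(score_pos, dtype=float) - np.asarray(score_neg, dtype=float)
    # -log(sigmoid(margin)) 的数值稳定写法
    return float(np.mean(np.log1p(np.exp(-margin))))
```

相比先训练奖励模型再做强化微调,这类成对目标可直接用梯度下降优化,这正是摘要中"无需显式奖励建模与强化学习"的含义。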

链接: https://arxiv.org/abs/2504.05181
作者: Kidist Amde Mekonnen,Yubao Tang,Maarten de Rijke
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Machine Learning (cs.LG)
备注: 13 pages, 5 figures. Submitted to SIGIR 2025. Proposes DDRO, a lightweight and reinforcement-free document relevance optimization method for generative retrieval. Code and pretrained models available at: this https URL

点击查看摘要

Abstract:Generative information retrieval (GenIR) is a promising neural retrieval paradigm that formulates document retrieval as a document identifier (docid) generation task, allowing for end-to-end optimization toward a unified global retrieval objective. However, existing GenIR models suffer from token-level misalignment, where models trained to predict the next token often fail to capture document-level relevance effectively. While reinforcement learning-based methods, such as reinforcement learning from relevance feedback (RLRF), aim to address this misalignment through reward modeling, they introduce significant complexity, requiring the optimization of an auxiliary reward function followed by reinforcement fine-tuning, which is computationally expensive and often unstable. To address these challenges, we propose direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through direct optimization via pairwise ranking, eliminating the need for explicit reward modeling and reinforcement learning. Experimental results on benchmark datasets, including MS MARCO document and Natural Questions, show that DDRO outperforms reinforcement learning-based methods, achieving a 7.4% improvement in MRR@10 for MS MARCO and a 19.9% improvement for Natural Questions. These findings highlight DDRO’s potential to enhance retrieval effectiveness with a simplified optimization approach. By framing alignment as a direct optimization problem, DDRO simplifies the ranking optimization pipeline of GenIR models while offering a viable alternative to reinforcement learning-based methods.
zh

[AI-9] BRIDGES: Bridging Graph Modality and Large Language Models within EDA Tasks

【速读】:该论文旨在解决EDA(电子设计自动化)任务中现有大型语言模型(LLMs)未能充分利用图结构数据的问题。具体而言,传统方法要么将图表示为顺序文本,要么完全忽略了可能有益的图结构数据(如RTL代码的数据流图)。研究表明,这种处理方式会导致LLM性能下降,而引入额外的图信息则显著提升性能。为应对这些挑战,论文提出了BRIDGES框架,其核心在于将图模态整合到LLMs中以支持EDA任务。解决方案的关键包括:首先构建了一个LLM驱动的工作流,用于生成包含超过500,000个图实例、逾15亿词元(token)的大规模数据集;其次提出了一种轻量级跨模态投影器,将图表示编码为与文本兼容的提示,使LLMs能够有效利用图数据而不需修改架构。实验结果表明,相比仅使用文本的方法,在多个任务上性能提升了2至10倍,同时计算开销极小(模型权重增加1%,运行时间增加30%),且无需额外微调即可大幅超越纯文本基线。

链接: https://arxiv.org/abs/2504.05180
作者: Wei Li,Yang Zou,Christopher Ellis,Ruben Purdy,Shawn Blanton,José M. F. Moura
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While many EDA tasks already involve graph-based data, existing LLMs in EDA primarily either represent graphs as sequential text, or simply ignore graph-structured data that might be beneficial like dataflow graphs of RTL code. Recent studies have found that LLM performance suffers when graphs are represented as sequential text, and using additional graph information significantly boosts performance. To address these challenges, we introduce BRIDGES, a framework designed to incorporate graph modality into LLMs for EDA tasks. BRIDGES integrates an automated data generation workflow, a solution that combines graph modality with LLM, and a comprehensive evaluation suite. First, we establish an LLM-driven workflow to generate RTL and netlist-level data, converting them into dataflow and netlist graphs with function descriptions. This workflow yields a large-scale dataset comprising over 500,000 graph instances and more than 1.5 billion tokens. Second, we propose a lightweight cross-modal projector that encodes graph representations into text-compatible prompts, enabling LLMs to effectively utilize graph data without architectural modifications. Experimental results demonstrate 2x to 10x improvements across multiple tasks compared to text-only baselines, including accuracy in design retrieval, type prediction and perplexity in function description, with negligible computational overhead (1% model weights increase and 30% additional runtime overhead). Even without additional LLM finetuning, our results outperform text-only by a large margin. We plan to release BRIDGES, including the dataset, models, and training flow.
zh

[AI-10] Attention-Based Multi-Scale Temporal Fusion Network for Uncertain-Mode Fault Diagnosis in Multimode Processes

【速读】:该论文旨在解决多模态过程中故障诊断面临的挑战,即来自不同模式的监控数据存在显著分布差异,导致模型难以提取与系统健康状态相关的共享特征表示。为应对这一问题,论文提出了一种基于注意力机制的多尺度时序融合网络(Attention-based Multi-Scale Temporal Fusion Network)。其关键在于结合多尺度深度可分离卷积和门控循环单元提取局部上下文特征和长短期特征,并设计了一种时序注意机制,聚焦于具有更高跨模式共享信息的关键时间点,从而提升故障诊断的准确性。实验结果表明,该方法在Tennessee Eastman过程数据集和三相流设施数据集上实现了卓越的诊断性能且模型规模较小。

链接: https://arxiv.org/abs/2504.05172
作者: Guangqiang Li,M. Amine Atoui,Xiangshun Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 31 pages,11 figures

点击查看摘要

Abstract:Fault diagnosis in multimode processes plays a critical role in ensuring the safe operation of industrial systems across multiple modes. It faces a great challenge yet to be addressed - that is, the significant distributional differences among monitoring data from multiple modes make it difficult for the models to extract shared feature representations related to system health conditions. In response to this problem, this paper introduces a novel method called attention-based multi-scale temporal fusion network. The multi-scale depthwise convolution and gated recurrent unit are employed to extract multi-scale contextual local features and long-short-term features. A temporal attention mechanism is designed to focus on critical time points with higher cross-mode shared information, thereby enhancing the accuracy of fault diagnosis. The proposed model is applied to Tennessee Eastman process dataset and three-phase flow facility dataset. The experiments demonstrate that the proposed model achieves superior diagnostic performance and maintains a small model size.
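摘要中的时序注意机制(对跨模式共享信息更高的关键时间点赋予更大权重,再加权汇聚)可以用下面的 numpy 草图示意;特征序列与打分向量均为随机生成的假设数据,并非论文实现:

```python
import numpy as np

def temporal_attention_pool(H, w):
    """对时序特征序列 H (T, d) 做注意力加权汇聚。
    w (d,) 为打分向量;此处均为示意性假设,并非论文原实现。"""
    scores = H @ w                                 # (T,) 每个时间点的注意力打分
    scores = scores - scores.max()                 # 数值稳定
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax 归一化的注意力权重
    return alpha @ H, alpha                        # 加权融合特征 (d,) 与权重 (T,)

# 玩具示例:8 个时间步、4 维特征(真实方法中 H 来自多尺度卷积与 GRU 的输出)
rng = np.random.default_rng(0)
H = rng.normal(size=(8, 4))
w = rng.normal(size=4)
fused, alpha = temporal_attention_pool(H, w)
```

权重经 softmax 归一化后,打分高的时间点对融合特征贡献更大,这正是"聚焦关键时间点"的最简形式。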
zh

[AI-11] RLBayes: a Bayesian Network Structure Learning Algorithm via Reinforcement Learning-Based Search Strategy

【速读】:本文旨在解决贝叶斯网络(Bayesian Network, BN)结构学习中搜索空间随变量数量呈超指数增长导致的问题,即NP难组合优化问题(Combination Optimization Problem, COP)。传统启发式方法虽取得一定成功,但通常无法获得令人满意的结果。为应对这一挑战,论文提出了一种基于强化学习(Reinforcement Learning, RL)的贝叶斯网络结构学习算法——RLBayes。其关键在于利用强化学习的思想,通过动态维护的Q表(Q-table)记录和引导学习过程,从而在有限的空间内存储无限的搜索空间,并通过Q学习实现贝叶斯网络的结构学习。理论证明表明RLBayes可收敛至全局最优结构,实验结果进一步验证其优于几乎所有其他启发式搜索算法。

链接: https://arxiv.org/abs/2504.05167
作者: Mingcan Wang,Junchang Xin,Luxuan Qu,Qi Chen,Zhiqiong Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The score-based structure learning of Bayesian network (BN) is an effective way to learn BN models, which are regarded as some of the most compelling probabilistic graphical models in the field of representation and reasoning under uncertainty. However, the search space of structure learning grows super-exponentially as the number of variables increases, which makes BN structure learning an NP-hard problem, as well as a combination optimization problem (COP). Despite the successes of many heuristic methods on it, the results of the structure learning of BN are usually unsatisfactory. Inspired by Q-learning, in this paper, a Bayesian network structure learning algorithm via reinforcement learning-based (RL-based) search strategy is proposed, namely RLBayes. The method borrows the idea of RL and tends to record and guide the learning process by a dynamically maintained Q-table. By creating and maintaining the dynamic Q-table, RLBayes achieves storing the unlimited search space within limited space, thereby achieving the structure learning of BN via Q-learning. Not only is it theoretically proved that RLBayes can converge to the global optimal BN structure, but also it is experimentally proved that RLBayes has a better effect than almost all other heuristic search algorithms.
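"用动态维护的 Q 表在有限空间内引导对超指数搜索空间的探索"这一思路,可以用一个玩具 Q-learning 草图示意;其中评分函数、动作集合与目标结构均为假设,且未处理有向无环等真实约束:

```python
import itertools
import random

def toy_score(edges, true_edges):
    """玩具打分函数:命中"目标"结构的边加分、多余边减分(仅作示意,非 BDeu/BIC 等真实评分)。"""
    return len(edges & true_edges) - len(edges - true_edges)

def rlbayes_sketch(n_vars=3, episodes=200, lr=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    actions = [(i, j) for i, j in itertools.permutations(range(n_vars), 2)]
    true_edges = {(0, 1), (1, 2)}           # 假设存在的最优结构
    Q = {}                                  # 动态维护的 Q 表:只为访问过的 (状态, 动作) 建项
    best, best_s = frozenset(), toy_score(frozenset(), true_edges)
    for _ in range(episodes):
        state = frozenset()
        for _ in range(4):                  # 每回合最多 4 次加边/删边操作
            if rng.random() < eps:          # epsilon-greedy 探索
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda x: Q.get((state, x), 0.0))
            nxt = state ^ {a}               # 有该边则删、无则加
            r = toy_score(nxt, true_edges) - toy_score(state, true_edges)
            q_next = max(Q.get((nxt, x), 0.0) for x in actions)
            old = Q.get((state, a), 0.0)
            Q[(state, a)] = old + lr * (r + gamma * q_next - old)  # Q-learning 更新
            state = nxt
            s = toy_score(state, true_edges)
            if s > best_s:
                best, best_s = frozenset(state), s
    return best, best_s

best, best_s = rlbayes_sketch()
```

Q 表只为实际访问过的 (状态, 动作) 对建表,这正是"在有限空间中存储无限搜索空间"的朴素体现。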
zh

[AI-12] Evaluating Knowledge Graph Based Retrieval Augmented Generation Methods under Knowledge Incompleteness

【速读】:该论文试图解决在知识图谱不完整的情况下,基于知识图谱的检索增强生成(KG-RAG)方法在问答任务中的性能下降问题。现有基准未能充分反映知识图谱不完整性对KG-RAG性能的影响。为了解决这一问题,论文通过采用不同的方法移除三元组,并分析其影响,系统评估了KG-RAG方法在不完整知识图谱下的表现。关键在于揭示KG-RAG方法对知识图谱不完整性的敏感性,并强调在实际场景中需要更鲁棒的方法。

链接: https://arxiv.org/abs/2504.05163
作者: Dongzhuoran Zhou,Yuqicheng Zhu,Yuan He,Jiaoyan Chen,Evgeny Kharlamov,Steffen Staab
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Knowledge Graph based Retrieval-Augmented Generation (KG-RAG) is a technique that enhances Large Language Model (LLM) inference in tasks like Question Answering (QA) by retrieving relevant information from knowledge graphs (KGs). However, real-world KGs are often incomplete, meaning that essential information for answering questions may be missing. Existing benchmarks do not adequately capture the impact of KG incompleteness on KG-RAG performance. In this paper, we systematically evaluate KG-RAG methods under incomplete KGs by removing triples using different methods and analyzing the resulting effects. We demonstrate that KG-RAG methods are sensitive to KG incompleteness, highlighting the need for more robust approaches in realistic settings.
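摘要中"移除三元组并分析其影响"的评估流程骨架可以用如下极简示意复现;KG、问题集与召回指标均为虚构的玩具设定,删除策略也仅取随机删除一种:

```python
import random

def drop_triples(kg, frac, seed=0):
    """按比例随机删除三元组,模拟不完整 KG(论文中比较了多种删除方法,此处仅作示意)。"""
    rng = random.Random(seed)
    return [t for t in kg if rng.random() > frac]

def answer_recall(kg, questions):
    """玩具评估指标:问题在 KG 中仍能找到支撑三元组的比例。"""
    triples = set(kg)
    return sum(1 for q in questions if q in triples) / len(questions)

kg = [("Paris", "capital_of", "France"),
      ("Berlin", "capital_of", "Germany"),
      ("Rome", "capital_of", "Italy"),
      ("Madrid", "capital_of", "Spain")]
questions = list(kg)                     # 假设每个问题恰好对应一条支撑三元组
full = answer_recall(kg, questions)      # 完整 KG 下召回率为 1.0
partial = answer_recall(drop_triples(kg, frac=0.5), questions)
```

对比 full 与 partial 即可量化 KG 不完整性对下游问答的影响,这正是该基准评测的最小骨架。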
zh

[AI-13] Leveraging Label Potential for Enhanced Multimodal Emotion Recognition IJCNN2025

【速读】:本文旨在解决多模态情感识别(MER)领域中现有方法忽视情感标签中丰富信息的问题,这些信息本可以显著提升MER的性能。传统研究主要集中在音频与文本特征的融合,而未充分利用情感标签蕴含的洞察力。为克服这一局限,论文提出了一种名为Label Signal-Guided Multimodal Emotion Recognition (LSGMER)的新模型。LSGMER的关键在于通过Label Signal Enhancement模块优化模态特征表示,该模块利用标签嵌入与音频及文本特征交互,以精准捕捉情感细微差别;同时引入联合目标优化(JO)方法,并通过属性-预测一致性约束(APC)加强融合特征与情感类别的对齐,从而提高分类准确性和稳定性。实验结果表明,LSGMER在IEMOCAP和MELD数据集上的有效性。

链接: https://arxiv.org/abs/2504.05158
作者: Xuechun Shao,Yinfeng Yu,Liejun Wang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Main paper (8 pages). Accepted for publication by IJCNN 2025

点击查看摘要

Abstract:Multimodal emotion recognition (MER) seeks to integrate various modalities to predict emotional states accurately. However, most current research focuses solely on the fusion of audio and text features, overlooking the valuable information in emotion labels. This oversight could potentially hinder the performance of existing methods, as emotion labels harbor rich, insightful information that could significantly aid MER. We introduce a novel model called Label Signal-Guided Multimodal Emotion Recognition (LSGMER) to overcome this limitation. This model aims to fully harness the power of emotion label information to boost the classification accuracy and stability of MER. Specifically, LSGMER employs a Label Signal Enhancement module that optimizes the representation of modality features by interacting with audio and text features through label embeddings, enabling it to capture the nuances of emotions precisely. Furthermore, we propose a Joint Objective Optimization(JOO) approach to enhance classification accuracy by introducing the Attribution-Prediction Consistency Constraint (APC), which strengthens the alignment between fused features and emotion categories. Extensive experiments conducted on the IEMOCAP and MELD datasets have demonstrated the effectiveness of our proposed LSGMER model.
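标签嵌入与融合特征交互的思想可以粗略示意如下:用特征与各类标签嵌入的相似度做 softmax,再把注意力汇聚的标签信号残差式注入特征。维度、数据与注入方式均为假设,并非 LSGMER 的原实现:

```python
import numpy as np

def label_guided_fusion(feat, label_emb):
    """feat: (d,) 融合后的音频-文本特征;label_emb: (C, d) 各情感类别的标签嵌入。
    对 Label Signal Enhancement 思想的示意性假设。"""
    logits = label_emb @ feat                  # (C,) 特征与各标签嵌入的相似度
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()                         # 对类别维做 softmax
    label_signal = attn @ label_emb            # 注意力汇聚的标签信号 (d,)
    return feat + label_signal, attn           # 残差式注入标签信息

rng = np.random.default_rng(1)
feat = rng.normal(size=16)
label_emb = rng.normal(size=(4, 16))           # 假设 4 类情感
enhanced, attn = label_guided_fusion(feat, label_emb)
```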
zh

[AI-14] A Reinforcement Learning Method for Environments with Stochastic Variables: Post-Decision Proximal Policy Optimization with Dual Critic Networks IJCNN2025

【速读】:本文旨在解决混合整数规划中的批量优化(Lot-sizing)问题,该问题在不确定的需求和成本参数下需要优化生产计划、交付履约及库存水平。论文提出了一种新的深度强化学习方法——后决策策略近端策略优化(Post-Decision Proximal Policy Optimization, PDPPO),其关键在于引入后决策状态(post-decision state)与双重批评者(dual critics)。通过将状态转移过程分为确定性步骤生成后决策状态和随机性步骤到达下一状态,PDPPO不仅降低了问题的维度,还提升了价值函数估计的准确性,从而实现更高效且一致的学习性能,特别是在具有随机成分的环境中显著优于传统Proximal Policy Optimization (PPO) 方法。

链接: https://arxiv.org/abs/2504.05150
作者: Leonardo Kanashiro Felizardo,Edoardo Fadda,Paolo Brandimarte,Emilio Del-Moral-Hernandez,Mariá Cristina Vasconcelos Nascimento
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures. Accepted for presentation at IJCNN 2025

点击查看摘要

Abstract:This paper presents Post-Decision Proximal Policy Optimization (PDPPO), a novel variation of the leading deep reinforcement learning method, Proximal Policy Optimization (PPO). The PDPPO state transition process is divided into two steps: a deterministic step resulting in the post-decision state and a stochastic step leading to the next state. Our approach incorporates post-decision states and dual critics to reduce the problem’s dimensionality and enhance the accuracy of value function estimation. Lot-sizing is a mixed integer programming problem for which we exemplify such dynamics. The objective of lot-sizing is to optimize production, delivery fulfillment, and inventory levels in uncertain demand and cost parameters. This paper evaluates the performance of PDPPO across various environments and configurations. Notably, PDPPO with a dual critic architecture achieves nearly double the maximum reward of vanilla PPO in specific scenarios, requiring fewer episode iterations and demonstrating faster and more consistent learning across different initializations. On average, PDPPO outperforms PPO in environments with a stochastic component in the state transition. These results support the benefits of using a post-decision state. Integrating this post-decision state in the value function approximation leads to more informed and efficient learning in high-dimensional and stochastic environments.
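后决策状态的两步转移(确定性步骤 → 后决策状态 → 随机步骤 → 下一状态)可以用批量优化场景下的库存玩具例子示意;需求分布与各参数均为假设:

```python
import random

def deterministic_step(inventory, order_qty):
    """确定性步骤:下单补货后得到后决策状态(post-decision state)。"""
    return inventory + order_qty

def stochastic_step(post_state, rng):
    """随机步骤:需求实现后转移到下一状态(此处假设需求服从 U{0..3})。
    真实方法中,一个批评者估计状态价值、另一个估计后决策状态价值。"""
    demand = rng.randint(0, 3)
    sold = min(post_state, demand)
    next_inv = post_state - sold
    return next_inv, sold

rng = random.Random(0)
inv = 2
post = deterministic_step(inv, order_qty=3)    # 后决策状态:库存 5
nxt, sold = stochastic_step(post, rng)
```

把价值函数定义在后决策状态上,可以避开对随机需求的显式期望,这正是摘要所述"降低维度、提升价值估计精度"的直观来源。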
zh

[AI-15] Interpretable Style Takagi-Sugeno-Kang Fuzzy Clustering

【速读】:该论文旨在解决传统聚类算法在解释性(interpretability)方面的不足以及数据同质性导致不同数据组具有各自独特风格(style)的问题。论文提出了一种可解释的Takagi-Sugeno-Kang (TSK) 模糊聚类算法(Interpretable Style TSK Fuzzy Clustering, IS-TSK-FC),其关键在于通过TSK模糊推理系统指导聚类行为,使得聚类结果能够被详细解释,并引入一系列风格矩阵(style matrices)来增强规则后件(consequent)的数据表示能力。此外,通过交替迭代优化方法解决了算法的参数确定问题,验证了该算法在具有隐式或显式风格的数据集上的有效性及优越性能。

链接: https://arxiv.org/abs/2504.05125
作者: Suhang Gu,Ye Wang,Yongxin Chou,Jinliang Cong,Mingli Lu,Zhuqing Jiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Clustering is an efficient and essential technique for exploring latent knowledge of data. However, limited attention has been given to the interpretability of the clusters detected by most clustering algorithms. In addition, due to the homogeneity of data, different groups of data have their own homogeneous styles. In this paper, the above two aspects are considered, and an interpretable style Takagi-Sugeno-Kang (TSK) fuzzy clustering (IS-TSK-FC) algorithm is proposed. The clustering behavior of IS-TSK-FC is fully guided by the TSK fuzzy inference on fuzzy rules. In particular, samples are grouped into clusters represented by the corresponding consequent vectors of all fuzzy rules learned in an unsupervised manner. This can explain how the clusters are generated in detail, thus making the underlying decision-making process of the IS-TSK-FC interpretable. Moreover, a series of style matrices are introduced to facilitate the consequents of fuzzy rules in IS-TSK-FC by capturing the styles of clusters as well as the nuances between different styles. Consequently, all the fuzzy rules in IS-TSK-FC have powerful data representation capability. After determining the antecedents of all the fuzzy rules, the optimization problem of IS-TSK-FC can be iteratively solved in an alternation manner. The effectiveness of IS-TSK-FC as an interpretable clustering tool is validated through extensive experiments on benchmark datasets with unknown implicit/explicit styles. Especially, the superior clustering performance of IS-TSK-FC is demonstrated on case studies where different groups of data present explicit styles. The source code of IS-TSK-FC can be downloaded from this https URL.
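TSK 模糊推理引导聚类的核心流程(前件隶属度 → 规则激发强度 → 后件向量加权 → 簇归属)可以用如下 numpy 草图示意;规则中心、宽度与后件矩阵均为手工假设,并非论文的无监督学习结果,也未包含风格矩阵:

```python
import numpy as np

def tsk_cluster_assign(X, centers, sigma, consequents):
    """高斯隶属度给出每条模糊规则的激发强度,
    样本按激发强度加权的后件输出归入得分最高的簇(极简示意)。"""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, R) 到各规则中心的平方距离
    firing = np.exp(-d2 / (2 * sigma ** 2))                    # 规则激发强度
    firing /= firing.sum(axis=1, keepdims=True)                # 归一化
    out = firing @ consequents                                 # TSK 推理输出 (n, C)
    return out.argmax(axis=1)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, size=(10, 2)),               # 两团玩具数据
               rng.normal(3, 0.3, size=(10, 2))])
centers = np.array([[0.0, 0.0], [3.0, 3.0]])                   # 2 条规则的前件中心
consequents = np.eye(2)                                        # 假设每条规则的后件向量代表一个簇
labels = tsk_cluster_assign(X, centers, 1.0, consequents)
```

由于簇归属完全由显式的模糊规则(前件 + 后件)决定,每个样本"为何属于某簇"都可以逐条规则解释,这就是可解释性的来源。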
zh

[AI-16] VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

【速读】:该论文旨在解决长链路推理(long-CoT)任务中的价值基方法所面临的三个关键挑战:价值模型偏差(value model bias)、异构序列长度(heterogeneous sequence lengths)的存在以及奖励信号稀疏性(sparsity of reward signals)。为应对这些挑战,论文提出了一种名为VAPO(Value-based Augmented Proximal Policy Optimization)的新框架。VAPO通过系统化设计提供了一个综合解决方案,显著提升了长CoT推理任务的性能。其核心在于结合预训练模型与强化学习策略,在保证训练稳定性与高效性的前提下,实现了在AIME 2024数据集上的最新性能记录。

链接: https://arxiv.org/abs/2504.05118
作者: YuYue,Yufeng Yuan,Qiying Yu,Xiaochen Zuo,Ruofei Zhu,Wenyuan Xu,Jiaze Chen,Chengyi Wang,TianTian Fan,Zhengyin Du,Xiangpeng Wei,Gaohong Liu,Juncai Liu,Lingjun Liu,Haibin Lin,Zhiqi Lin,Bole Ma,Chi Zhang,Mofan Zhang,Wang Zhang,Hang Zhu,Ru Zhang,Xin Liu,Mingxuan Wang,Yonghui Wu,Lin Yan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present VAPO, Value-based Augmented Proximal Policy Optimization, a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked on the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of 60.4. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. The training process of VAPO stands out for its stability and efficiency. It reaches state-of-the-art performance within a mere 5,000 steps. Moreover, across multiple independent runs, no training crashes occur, underscoring its reliability. This research delves into long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework. We pinpoint three key challenges that plague value-based methods: value model bias, the presence of heterogeneous sequence lengths, and the sparsity of reward signals. Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, enabling enhanced performance in long-CoT reasoning tasks.
zh

[AI-17] Algorithm Discovery With LLMs: Evolutionary Search Meets Reinforcement Learning

【速读】:该论文旨在解决通过传统方法发现高效算法效率低下且依赖大量人类专业知识的问题,特别是在数学和优化领域。现有基于大型语言模型(Large Language Models, LLMs)的演化搜索方法仅将LLM视为静态生成器,未能充分利用演化探索过程中获得的反馈信号来动态改进模型。为此,论文提出了一种结合强化学习(Reinforcement Learning, RL)微调的增强型LLM演化搜索方法。其关键在于通过演化策略进行算法探索的同时,利用强化学习优化LLM的策略,从而实现对搜索算子的持续精化。实验结果表明,该方法在三个组合优化任务(装箱问题(bin packing)、旅行商问题和平板打包问题(flatpack))中显著提升了算法发现的效率,展示了强化学习增强的演化策略在计算机科学家和数学家进行更高效算法设计中的潜力。

链接: https://arxiv.org/abs/2504.05108
作者: Anja Surina,Amin Mansouri,Lars Quaedvlieg,Amal Seddas,Maryna Viazovska,Emmanuel Abbe,Caglar Gulcehre
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 30 pages

点击查看摘要

Abstract:Discovering efficient algorithms for solving complex problems has been an outstanding challenge in mathematics and computer science, requiring substantial human expertise over the years. Recent advancements in evolutionary search with large language models (LLMs) have shown promise in accelerating the discovery of algorithms across various domains, particularly in mathematics and optimization. However, existing approaches treat the LLM as a static generator, missing the opportunity to update the model with the signal obtained from evolutionary exploration. In this work, we propose to augment LLM-based evolutionary search by continuously refining the search operator - the LLM - through reinforcement learning (RL) fine-tuning. Our method leverages evolutionary search as an exploration strategy to discover improved algorithms, while RL optimizes the LLM policy based on these discoveries. Our experiments on three combinatorial optimization tasks - bin packing, traveling salesman, and the flatpack problem - show that combining RL and evolutionary search improves discovery efficiency of improved algorithms, showcasing the potential of RL-enhanced evolutionary strategies to assist computer scientists and mathematicians for more efficient algorithm design.
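"演化搜索 + 搜索算子随反馈自我改进"的闭环可以用一个玩具骨架示意:此处用随机扰动生成器代替 LLM,用按反馈收缩步长的温度粗略代替 RL 微调,整个设定均为示意性假设:

```python
import random

def mock_generator(parent, temperature, rng):
    """假设的"搜索算子":对父代解做随机扰动;真实方法中由 RL 微调的 LLM 生成新算法。"""
    return [x + rng.gauss(0, temperature) for x in parent]

def fitness(sol):
    return -sum(x * x for x in sol)        # 玩具目标:越接近原点越好

def evolve(generations=30, pop=8, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(pop)]
    temperature = 1.0                      # 用温度粗略模拟"算子被反馈信号改进"
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop // 2]   # 精英保留
        children = [mock_generator(rng.choice(parents), temperature, rng)
                    for _ in range(pop // 2)]
        improved = max(map(fitness, children)) > fitness(parents[0])
        if not improved:
            temperature *= 0.8             # 反馈驱动地收缩搜索步长(RL 更新的粗略替身)
        population = parents + children
    return max(population, key=fitness)

best = evolve()
```

外层演化循环负责探索,内层对生成器的更新负责利用探索反馈,这正是摘要所述两者结合的最小形态。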
zh

[AI-18] SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation

【速读】:该论文旨在解决新手内容创作者在为社交媒体视频录制富有表现力的语音时投入大量时间的问题,尽管文本转语音(Text-to-Speech, TTS)技术已取得显著进展,但许多用户仍难以适应其不直观或过于细致的界面。论文的关键解决方案是通过允许用户在脚本之外提供高层次上下文信息来简化 TTS 的生成过程,并提出了一种名为 SpeakEasy 的"绿野仙踪式"(Wizard-of-Oz)系统,该系统利用用户提供的上下文信息指导和影响 TTS 输出,支持基于高层次反馈的迭代优化。这一方法基于两项各有 8 名被试的形成性研究:一项研究内容创作者使用 TTS 的体验,另一项借鉴配音演员的有效策略。评估结果显示,使用 SpeakEasy 的参与者能够更成功地生成符合个人标准的语音表现,且所需努力并未显著超过领先的行业界面。

链接: https://arxiv.org/abs/2504.05106
作者: Stephen Brade,Sam Anderson,Rithesh Kumar,Zeyu Jin,Anh Truong
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Novice content creators often invest significant time recording expressive speech for social media videos. While recent advancements in text-to-speech (TTS) technology can generate highly realistic speech in various languages and accents, many struggle with unintuitive or overly granular TTS interfaces. We propose simplifying TTS generation by allowing users to specify high-level context alongside their script. Our Wizard-of-Oz system, SpeakEasy, leverages user-provided context to inform and influence TTS output, enabling iterative refinement with high-level feedback. This approach was informed by two 8-subject formative studies: one examining content creators’ experiences with TTS, and the other drawing on effective strategies from voice actors. Our evaluation shows that participants using SpeakEasy were more successful in generating performances matching their personal standards, without requiring significantly more effort than leading industry interfaces.
zh

[AI-19] Debate Only When Necessary: Adaptive Multiagent Collaboration for Efficient LLM Reasoning

【速读】:该论文旨在解决多智能体协作框架在提升大语言模型(Large Language Models, LLMs)推理能力的同时所面临的显著计算开销问题,以及在不需要协作的查询上进行辩论可能引发错误传播的风险。为了解决这些问题,论文提出了一种名为“仅在必要时辩论(Debate Only When Necessary, DOWN)”的自适应多智能体辩论框架。其关键在于通过初始响应置信度分数来选择性激活辩论过程,并在触发辩论的情况下,利用参与智能体的反馈及其置信度分数优化输出结果。实验表明,这种机制不仅提高了效率,还保持甚至超越了现有多智能体辩论系统的性能,同时缓解了错误传播并增强了可靠响应的选择性整合。这些成果确立了DOWN作为一种优化策略,促进了基于LLM的协作在实际应用中的部署。

链接: https://arxiv.org/abs/2504.05047
作者: Sugyeong Eo,Hyeonseok Moon,Evelyn Hayoon Zi,Chanjun Park,Heuiseok Lim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multiagent collaboration has emerged as a promising framework for enhancing the reasoning capabilities of large language models (LLMs). While this approach improves reasoning capability, it incurs substantial computational overhead due to iterative agent interactions. Furthermore, engaging in debates for queries that do not necessitate collaboration amplifies the risk of error generation. To address these challenges, we propose Debate Only When Necessary (DOWN), an adaptive multiagent debate framework that selectively activates the debate process based on the confidence score of the agent’s initial response. For queries where debate is triggered, agents refine their outputs using responses from participating agents and their confidence scores. Experimental results demonstrate that this mechanism significantly improves efficiency while maintaining or even surpassing the performance of existing multiagent debate systems. We also find that confidence-guided debate mitigates error propagation and enhances the selective incorporation of reliable responses. These results establish DOWN as an optimization strategy for efficient and effective multiagent reasoning, facilitating the practical deployment of LLM-based collaboration.
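DOWN 的置信度门控逻辑可以用几行代码示意:初始回答置信度达标则直接采纳、省去辩论开销,否则触发辩论并按各智能体置信度加权聚合;阈值与加权方式均为假设:

```python
def down_sketch(initial_answer, confidence, agents, threshold=0.8):
    """DOWN 思想的极简示意:confidence 为初始回答的置信度,
    agents 为 (回答, 置信度) 列表;聚合方式(置信度加权投票)为假设。"""
    if confidence >= threshold:
        return initial_answer, False               # 置信度足够:无需辩论
    votes = {}
    for answer, conf in agents:                    # 触发辩论:按置信度加权投票
        votes[answer] = votes.get(answer, 0.0) + conf
    return max(votes, key=votes.get), True

ans1, debated1 = down_sketch("A", 0.95, agents=[("B", 0.9)])
ans2, debated2 = down_sketch("A", 0.40, agents=[("A", 0.3), ("B", 0.6), ("B", 0.5)])
```

高置信度查询完全绕过多轮交互,这就是"选择性激活辩论"节省计算并抑制错误传播的最小骨架。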
zh

[AI-20] Graph-based Diffusion Model for Collaborative Filtering

【速读】:该论文旨在解决现有扩散模型在推荐系统中的局限性,即未能充分利用用户与项目之间潜在的高阶协同信号。这些信号通过图结构能够更自然地表达复杂且精细的关系。为了解决这一问题,论文将基于扩散的推荐方法扩展到图领域,直接利用扩散模型对用户-项目二分图进行建模,从而更好地捕捉交互动态中的高阶连通性。然而,这种扩展带来了两个主要挑战:噪声异质性和关系爆炸。为此,论文提出了基于图的协同过滤扩散模型(GDMCF)。针对噪声异质性,引入了多级噪声腐蚀机制,整合连续和离散噪声,有效模拟现实世界中的交互复杂性;针对关系爆炸,设计了一种用户活跃引导的扩散过程,选择性关注最重要边和活跃用户,降低推理成本的同时保持图的拓扑完整性。大量实验表明,GDMCF在三个基准数据集上的表现始终优于最先进的方法,证明了其在捕获高阶协同信号和提升推荐性能方面的有效性。

链接: https://arxiv.org/abs/2504.05029
作者: Xuan Zhang,Xiang Deng,Hongxing Yuan,Chunyu Wei,Yushun Fan
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recently, diffusion-based recommendation methods have achieved impressive results. However, existing approaches predominantly treat each user’s historical interactions as independent training samples, overlooking the potential of higher-order collaborative signals between users and items. Such signals, which encapsulate richer and more nuanced relationships, can be naturally captured using graph-based data structures. To address this limitation, we extend diffusion-based recommendation methods to the graph domain by directly modeling user-item bipartite graphs with diffusion models. This enables better modeling of the higher-order connectivity inherent in complex interaction dynamics. However, this extension introduces two primary challenges: (1) Noise Heterogeneity, where interactions are influenced by various forms of continuous and discrete noise, and (2) Relation Explosion, referring to the high computational costs of processing large-scale graphs. To tackle these challenges, we propose a Graph-based Diffusion Model for Collaborative Filtering (GDMCF). To address noise heterogeneity, we introduce a multi-level noise corruption mechanism that integrates both continuous and discrete noise, effectively simulating real-world interaction complexities. To mitigate relation explosion, we design a user-active guided diffusion process that selectively focuses on the most meaningful edges and active users, reducing inference costs while preserving the graph’s topological integrity. Extensive experiments on three benchmark datasets demonstrate that GDMCF consistently outperforms state-of-the-art methods, highlighting its effectiveness in capturing higher-order collaborative signals and improving recommendation performance.
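多级噪声腐蚀机制(离散噪声翻转交互边、连续噪声扰动交互强度)可以粗略示意如下;翻转概率与噪声幅度均为假设参数,并非 GDMCF 的实际扩散调度:

```python
import numpy as np

def corrupt_bipartite(A, p_flip, sigma, rng):
    """对用户-物品交互矩阵 A 同时施加离散与连续噪声(示意性假设)。"""
    discrete = np.where(rng.random(A.shape) < p_flip, 1 - A, A)   # 离散噪声:按概率翻转边
    continuous = discrete + rng.normal(0, sigma, A.shape)         # 连续噪声:高斯扰动
    return np.clip(continuous, 0.0, 1.0)

rng = np.random.default_rng(3)
A = rng.integers(0, 2, size=(5, 6)).astype(float)   # 玩具用户-物品二分图的邻接矩阵
A_noisy = corrupt_bipartite(A, p_flip=0.1, sigma=0.05, rng=rng)
```

前向过程逐步施加这类混合噪声、反向过程学习去噪,即可把扩散建模直接搬到二分图上。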
zh

[AI-21] Measuring the right thing: justifying metrics in AI impact assessments

【速读】:该论文试图解决的问题是如何合理选择和论证用于评估人工智能系统影响的度量标准,特别是在难以量化伦理和社会价值的情况下。论文指出,为了确保度量标准的合理性,需要采取两步方法:首先明确概念定义(如罗尔斯式的公平或团结式的公平),然后将度量标准与这些概念相匹配。关键在于这两个步骤都需要独立的论证,其中概念工程提供了有用的工具来支持第一步;第二步则通过分析不同的公平性度量标准来展示,概念所提供的额外内容有助于为特定度量标准的选择提供正当理由。因此,论文主张影响评估不仅应清晰说明其使用的度量标准,还应阐明驱动这些度量标准的概念基础。

链接: https://arxiv.org/abs/2504.05007
作者: Stefan Buijsman,Herman Veluwenkamp
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: Accepted for publication in Global Perspectives on AI Impact Assessment (Oxford University Press, forthcoming). Pre-publication version; final version will be available from the publisher

点击查看摘要

Abstract:AI Impact Assessments are only as good as the measures used to assess the impact of these systems. It is therefore paramount that we can justify our choice of metrics in these assessments, especially for difficult to quantify ethical and social values. We present a two-step approach to ensure metrics are properly motivated. First, a conception needs to be spelled out (e.g. Rawlsian fairness or fairness as solidarity) and then a metric can be fitted to that conception. Both steps require separate justifications, as conceptions can be judged on how well they fit with the function of, for example, fairness. We argue that conceptual engineering offers helpful tools for this step. Second, metrics need to be fitted to a conception. We illustrate this process through an examination of competing fairness metrics to illustrate that here the additional content that a conception offers helps us justify the choice for a specific metric. We thus advocate that impact assessments are not only clear on their metrics, but also on the conceptions that motivate those metrics.
zh

[AI-22] Transforming Future Data Center Operations and Management via Physical AI

【速读】:该论文旨在解决传统数据中心(Data Centers, DCs)运营与管理方法在应对人工智能(Artificial Intelligence, AI)驱动的业务需求和成本优化挑战时的局限性。随着从互联网数据中心向AI数据中心的演进,需要超越基于最佳实践的传统范式,提出创新解决方案以提升业务韧性并降低总拥有成本(Total Cost of Ownership, TCO)。论文的关键在于提出了一种名为Physical AI (PhyAI) 的新型框架,其核心在于通过结合工业级模拟引擎、物理信息机器学习(Physics-Informed Machine Learning, PIML)模型训练评估以及基于NVIDIA Omniverse的五层数字孪生平台,实现未来数据中心的数字化、优化和自动化。这种方案的关键创新点在于构建实时数字孪生能力,显著提高预测精度(如热流特性预测的中位绝对温度误差仅为0.18°C),从而替代耗时的传统计算流体力学/传热学(CFD/HT)仿真方法,为未来数据中心的运营与管理提供可扩展且灵活的解决方案。

链接: https://arxiv.org/abs/2504.04982
作者: Zhiwei Cao,Minghao Li,Feng Lin,Qiang Fu,Jimin Jia,Yonggang Wen,Jianxiong Yin,Simon See
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Data centers (DCs) as mission-critical infrastructures are pivotal in powering the growth of artificial intelligence (AI) and the digital economy. The evolution from Internet DC to AI DC has introduced new challenges in operating and managing data centers for improved business resilience and reduced total cost of ownership. As a result, new paradigms, beyond the traditional approaches based on best practices, must be in order for future data centers. In this research, we propose and develop a novel Physical AI (PhyAI) framework for advancing DC operations and management. Our system leverages the emerging capabilities of state-of-the-art industrial products and our in-house research and development. Specifically, it presents three core modules, namely: 1) an industry-grade in-house simulation engine to simulate DC operations in a highly accurate manner, 2) an AI engine built upon NVIDIA PhysicsNemo for the training and evaluation of physics-informed machine learning (PIML) models, and 3) a digital twin platform built upon NVIDIA Omniverse for our proposed 5-tier digital twin framework. This system presents a scalable and adaptable solution to digitalize, optimize, and automate future data center operations and management, by enabling real-time digital twins for future data centers. To illustrate its effectiveness, we present a compelling case study on building a surrogate model for predicting the thermal and airflow profiles of a large-scale DC in a real-time manner. Our results demonstrate its superior performance over traditional time-consuming Computational Fluid Dynamics/Heat Transfer (CFD/HT) simulation, with a median absolute temperature prediction error of 0.18 °C. This emerging approach would open doors to several potential research directions for advancing Physical AI in future DC operations.
zh

[AI-23] Ensuring Safety in an Uncertain Environment: Constrained MDPs via Stochastic Thresholds

【速读】:该论文致力于解决受限马尔可夫决策过程(CMDPs)在面对随机阈值约束下的强化学习安全问题,特别是在未知和不确定环境中的安全性。论文的关键在于提出了一种基于Growing-Window估计器的方法,用于从与动态环境的交互中估计随机阈值,并据此设计了Stochastic Pessimistic-Optimistic Thresholding (SPOT),这是一种针对多个随机阈值约束的新型模型驱动的主从算法。SPOT能够在悲观和乐观阈值设定下实现强化学习。理论分析表明,该算法在T个episode中实现了次线性遗憾和约束违反,即奖励遗憾为 (\tilde{\mathcal{O}}(\sqrt{T})),同时允许 (\tilde{\mathcal{O}}(\sqrt{T})) 的约束违反。这一成果首次在理论上保证了在甚至阈值未知的不确定环境中,强化学习算法能够达到与固定明确阈值情况下相当的性能。

链接: https://arxiv.org/abs/2504.04973
作者: Qian Zuo,Fengxiang He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:This paper studies constrained Markov decision processes (CMDPs) with constraints against stochastic thresholds, aiming at safety of reinforcement learning in unknown and uncertain environments. We leverage a Growing-Window estimator sampling from interactions with the uncertain and dynamic environment to estimate the thresholds, based on which we design Stochastic Pessimistic-Optimistic Thresholding (SPOT), a novel model-based primal-dual algorithm for multiple constraints against stochastic thresholds. SPOT enables reinforcement learning under both pessimistic and optimistic threshold settings. We prove that our algorithm achieves sublinear regret and constraint violation; i.e., a reward regret of \tilde{\mathcal{O}}(\sqrt{T}) while allowing an \tilde{\mathcal{O}}(\sqrt{T}) constraint violation over T episodes. The theoretical guarantees show that our algorithm achieves performance comparable to that of an approach relying on fixed and clear thresholds. To the best of our knowledge, SPOT is the first reinforcement learning algorithm that realises theoretical guaranteed performance in an uncertain environment where even thresholds are unknown.
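Growing-Window 估计器与悲观/乐观阈值的构造可以用如下草图示意;置信半径取 sqrt(log t / t) 形式仅为常见假设,论文中的具体常数与形式可能不同:

```python
import math

class GrowingWindowEstimator:
    """随交互不断加入样本的阈值估计器(示意性假设):
    悲观阈值 = 样本均值 - 置信半径,乐观阈值 = 样本均值 + 置信半径。"""
    def __init__(self):
        self.samples = []

    def update(self, x):
        self.samples.append(x)          # 窗口只增不减,估计随交互逐步收紧

    def thresholds(self):
        t = len(self.samples)
        mean = sum(self.samples) / t
        radius = math.sqrt(math.log(t + 1) / t)   # 置信半径随 t 增大而收缩
        return mean - radius, mean + radius       # (悲观, 乐观)

est = GrowingWindowEstimator()
for x in [0.9, 1.1, 1.0, 0.95, 1.05]:  # 来自环境交互的玩具阈值观测
    est.update(x)
pess, opt = est.thresholds()
```

悲观阈值用于保守地满足约束、乐观阈值用于鼓励探索,二者随样本增多收敛到同一均值。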
zh

[AI-24] A High-Force Gripper with Embedded Multimodal Sensing for Powerful and Perception Driven Grasping

【速读】:该论文致力于解决现代人形机器人在执行高负载抓取与操作任务时面临的局限性。现有机器人末端执行器通常无法匹配机械臂的承载能力,导致其抓取与操作的负载能力受限,并且硬件感知能力不足,依赖于机器人身体其他部位安装的传感器,这些传感器易受机械臂运动引起的遮挡影响。为了解决这些问题,论文提出了一种模块化高抓取力末端执行器,集成了嵌入式的多模态感知功能。关键解决方案在于设计了一个能够产生110牛顿抓取力的紧凑型执行器,并结合了包括眼在手相机、飞行时间(ToF)距离传感器、惯性测量单元(IMU)和全向麦克风在内的多模态传感系统,从而实现感知驱动的抓取功能。此外,通过引入新的负载评估指标以及基于感知引导的增强抓取操作,全面验证了该执行器的抓取力能力和嵌入式多模态感知性能。

链接: https://arxiv.org/abs/2504.04970
作者: Edoardo Del Bianco,Davide Torielli,Federico Rollo,Damiano Gasperini,Arturo Laurenzi,Lorenzo Baccelliere,Luca Muratore,Marco Roveri,Nikos G. Tsagarakis
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 15 figures

点击查看摘要

Abstract:Modern humanoid robots have shown their promising potential for executing various tasks involving the grasping and manipulation of objects using their end-effectors. Nevertheless, in the most of the cases, the grasping and manipulation actions involve low to moderate payload and interaction forces. This is due to limitations often presented by the end-effectors, which can not match their arm-reachable payload, and hence limit the payload that can be grasped and manipulated. In addition, grippers usually do not embed adequate perception in their hardware, and grasping actions are mainly driven by perception sensors installed in the rest of the robot body, frequently affected by occlusions due to the arm motions during the execution of the grasping and manipulation tasks. To address the above, we developed a modular high grasping force gripper equipped with embedded multi-modal perception functionalities. The proposed gripper can generate a grasping force of 110 N in a compact implementation. The high grasping force capability is combined with embedded multi-modal sensing, which includes an eye-in-hand camera, a Time-of-Flight (ToF) distance sensor, an Inertial Measurement Unit (IMU) and an omnidirectional microphone, permitting the implementation of perception-driven grasping functionalities. We extensively evaluated the grasping force capacity of the gripper by introducing novel payload evaluation metrics that are a function of the robot arm’s dynamic motion and gripper thermal states. We also evaluated the embedded multi-modal sensing by performing perception-guided enhanced grasping operations. 
Journal reference: IEEE-RAS International Conference on Humanoid Robots (Humanoids), Nancy, France, 2024, pp. 149-156. Related DOI: https://doi.org/10.1109/Humanoids58906.2024.10769951
zh

[AI-25] The Dream Within Huang Long Cave: AI-Driven Interactive Narrative for Family Storytelling and Emotional Reflection

【速读】:本文旨在探讨“大他者”(Big Other)的非存在性,并通过艺术实践重新定义人际情感的真实性。论文提出了一种结合心理分析理论与计算技术的创新方法,以生成式 AI (Generative AI) 和沉浸式叙事体验为核心,构建了一个名为 YELL 的虚拟角色。YELL 是“大他者”的虚构化身,其行为模式基于艺术家的真实父亲。通过观众与 YELL 在洞穴自动虚拟环境(CAVE)中的互动,共同探索语言谜题并协助其应对生活挑战,该项目揭示了复杂家庭关系的解构过程。关键解决方案在于利用大型语言模型 (LLM) 驱动的对话系统,结合高度拟真的数字角色设计,使观众在沉浸式叙事中体验并反思“大他者”的概念,从而强调真实情感在家庭动态中的重要性,将艺术定位为连接情感理解的桥梁。

链接: https://arxiv.org/abs/2504.04968
作者: Jiayang Huang,Lingjie Li,Kang Zhang,David Yip
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
备注: 8 pages,8 figures, International Symposium on Electronic/Emerging Art (ISEA)

点击查看摘要

Abstract:This paper introduces the art project The Dream Within Huang Long Cave, an AI-driven interactive and immersive narrative experience. The project offers new insights into AI technology, artistic practice, and psychoanalysis. Inspired by actual geographical landscapes and familial archetypes, the work combines psychoanalytic theory and computational technology, providing an artistic response to the concept of the non-existence of the Big Other. The narrative is driven by a combination of a large language model (LLM) and a realistic digital character, forming a virtual agent named YELL. Through dialogue and exploration within a cave automatic virtual environment (CAVE), the audience is invited to unravel the language puzzles presented by YELL and help him overcome his life challenges. YELL is a fictional embodiment of the Big Other, modeled after the artist’s real father. Through a cross-temporal interaction with this digital father, the project seeks to deconstruct complex familial relationships. By demonstrating the non-existence of the Big Other, we aim to underscore the authenticity of interpersonal emotions, positioning art as a bridge for emotional connection and understanding within family dynamics.
zh

[AI-26] GOTHAM: Graph Class Incremental Learning Framework under Weak Supervision

【速读】:本文旨在解决图数据中节点分类任务面临的图类增量学习(Graph Class Incremental Learning, GCL)问题,特别是在弱监督条件下,当新类别的样本稀少或完全无标签时,如何有效进行节点分类。传统方法依赖于大量标注数据训练模型,但在实际应用中这一条件往往难以满足。论文的关键在于提出了一种名为GOTHAM的框架,通过在基础类别上进行元训练(meta-training),以应对类别增量过程中可能出现的少样本或零样本问题。GOTHAM的核心创新点包括利用原型表示(prototype representation)作为类别代表来高效处理未标注节点,并针对文本属性图(Text-Attributed Graphs, TAGs)进一步结合语义信息增强表征能力。此外,通过教师-学生知识蒸馏技术缓解灾难性遗忘(catastrophic forgetting),确保模型在面对不断演化的图数据时仍能保持性能。实验结果表明,该方法在Cora-ML、Amazon和OBGN-Arxiv等数据集上的表现优于现有技术。

链接: https://arxiv.org/abs/2504.04954
作者: Aditya Hemant Shahane,Prathosh A.P,Sandeep Kumar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graphs are growing rapidly, along with the number of distinct label categories associated with them. Applications like e-commerce, healthcare, recommendation systems, and various social media platforms are rapidly moving towards graph representation of data due to their ability to capture both structural and attribute information. One crucial task in graph analysis is node classification, where unlabeled nodes are categorized into predefined classes. In practice, novel classes appear incrementally sometimes with just a few labels (seen classes) or even without any labels (unseen classes), either because they are new or haven’t been explored much. Traditional methods assume abundant labeled data for training, which isn’t always feasible. We investigate a broader objective: Graph Class Incremental Learning under Weak Supervision (GCL), addressing this challenge by meta-training on base classes with limited labeled instances. During the incremental streams, novel classes can have few-shot or zero-shot representation. Our proposed framework GOTHAM efficiently accommodates these unlabeled nodes by finding the closest prototype representation, serving as class representatives in the attribute space. For Text-Attributed Graphs (TAGs), our framework additionally incorporates semantic information to enhance the representation. By employing teacher-student knowledge distillation to mitigate forgetting, GOTHAM achieves promising results across various tasks. Experiments on datasets such as Cora-ML, Amazon, and OBGN-Arxiv showcase the effectiveness of our approach in handling evolving graph data under limited supervision. The repository is available here: this https URL
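摘要中"最近原型分配"的思路可以用一个极简的 Python 草图来示意(仅为说明性示例,函数名与数据均为假设,并非 GOTHAM 的官方实现):

```python
import math

def nearest_prototype_predict(nodes, prototypes):
    """把每个未标注节点分配给属性空间中欧氏距离最近的类原型。"""
    preds = []
    for x in nodes:
        dists = [math.dist(x, p) for p in prototypes]  # 到各原型的距离
        preds.append(dists.index(min(dists)))          # 取最近原型的索引作为类别
    return preds

# 假设元训练已得到两个类的原型向量(二维仅作演示)
prototypes = [(0.0, 0.0), (10.0, 10.0)]
nodes = [(0.5, -0.2), (9.0, 11.0), (1.0, 1.0)]
preds = nearest_prototype_predict(nodes, prototypes)  # [0, 1, 0]
```

增量流中出现的少样本/零样本类别即可通过其原型参与上述分配,而无需为新类重新训练分类头。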
zh

[AI-27] One Quantizer is Enough: Toward a Lightweight Audio Codec

【速读】:该论文旨在解决现有神经音频编解码器(Neural Audio Codec)资源消耗高、计算复杂度大以及实际应用受限的问题。论文提出了一种轻量级神经音频编解码器SQCodec,其关键创新在于采用单量化器(single quantizer)架构,并结合精简的卷积网络与局部Transformer模块,同时引入TConv机制以捕捉多时间尺度上的声学变化。这些设计不仅提升了重建保真度,还显著降低了模型复杂度和资源消耗,使得SQCodec在保持与多量化器基线相当音频质量的同时,具备更高的适应性和更低的运行开销。

链接: https://arxiv.org/abs/2504.04949
作者: Linwei Zhai,Han Ding,Cui Zhao,fei wang,Ge Wang,Wang Zhi,Wei Xi
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural audio codecs have recently gained traction for their ability to compress high-fidelity audio and generate discrete tokens that can be utilized in downstream generative modeling tasks. However, leading approaches often rely on resource-intensive models and multi-quantizer architectures, resulting in considerable computational overhead and constrained real-world applicability. In this paper, we present SQCodec, a lightweight neural audio codec that leverages a single quantizer to address these limitations. SQCodec explores streamlined convolutional networks and local Transformer modules, alongside TConv, a novel mechanism designed to capture acoustic variations across multiple temporal scales, thereby enhancing reconstruction fidelity while reducing model complexity. Extensive experiments across diverse datasets show that SQCodec achieves audio quality comparable to multi-quantizer baselines, while its single-quantizer design offers enhanced adaptability and its lightweight architecture reduces resource consumption by an order of magnitude. The source code is publicly available at this https URL.
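单量化器的核心操作是"为每帧特征查找最近的码本向量并输出离散 token,解码端按 token 查表重建",可用下面的概念草图理解(码本与帧数据均为假设,与 SQCodec 的真实网络结构无关):

```python
import math

def quantize(frames, codebook):
    """单量化器示意:每个连续特征帧映射为最近码本向量的索引(token)。"""
    tokens = [min(range(len(codebook)), key=lambda i: math.dist(f, codebook[i]))
              for f in frames]
    recon = [codebook[t] for t in tokens]  # 解码端按 token 查表重建特征
    return tokens, recon

codebook = [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0)]  # 假设的 3 条目码本
frames = [(0.9, 1.2), (-0.8, -1.1), (0.1, -0.2)]
tokens, recon = quantize(frames, codebook)  # tokens == [2, 0, 1]
```

多量化器架构会级联多个这样的码本逐级量化残差;单量化器则只保留一级,这正是其轻量化的来源。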
zh

[AI-28] Lemmanaid: Neuro-Symbolic Lemma Conjecturing

【速读】:该论文旨在解决自动推测有用、有趣且新颖的引理(lemmas)这一极具挑战性的问题,以提升自动化推理工具的能力,并降低在证明助手(proof assistants)中形式化数学的门槛。论文的关键在于提出了一种名为Lemmanaid的实用神经符号混合方法,结合大型语言模型(LLMs)与符号方法,用于Isabelle证明库中的引理推测任务。具体而言,Lemmanaid通过训练LLM生成描述引理结构的模板,并利用符号方法填充细节,从而克服单一神经或符号方法的局限性。研究结果表明,神经与符号技术具有互补性,通过整合两者的优势,可以针对广泛的输入领域生成有用的引理,促进计算机辅助理论发展与形式化工作。

链接: https://arxiv.org/abs/2504.04942
作者: Yousef Alhessi,Sólrún Halla Einarsdóttir,George Granberry,Emily First,Moa Johansson,Sorin Lerner,Nicholas Smallbone
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Automatically conjecturing useful, interesting and novel lemmas would greatly improve automated reasoning tools and lower the bar for formalizing mathematics in proof assistants. It is however a very challenging task for both neural and symbolic approaches. We present the first steps towards a practical neuro-symbolic lemma conjecturing tool, Lemmanaid, that combines Large Language Models (LLMs) and symbolic methods, and evaluate it on proof libraries for the Isabelle proof assistant. We train an LLM to generate lemma templates that describe the shape of a lemma, and use symbolic methods to fill in the details. We compare Lemmanaid against an LLM trained to generate complete lemma statements as well as previous fully symbolic conjecturing methods. Our results indicate that neural and symbolic techniques are complementary. By leveraging the best of both symbolic and neural methods we can generate useful lemmas for a wide range of input domains, facilitating computer-assisted theory development and formalization.
zh

[AI-29] Boosting Relational Deep Learning with Pretrained Tabular Models

【速读】:该论文旨在解决在关系型数据库中利用图神经网络(Graph Neural Networks, GNNs)进行预测时面临的两个主要问题:一是复杂关系模式的特征设计挑战;二是GNN在推理阶段的时间开销限制其在实时场景中的应用。为了解决这些问题,论文的关键方案是结合现有的特征工程方法与GNN的优势,通过使用GNN捕获关系型数据库中难以特征化的复杂关系,同时利用预先设计的特征来编码时间信息。这种方法避免了保留完整历史图的需求,使得可以使用更小且更高效的图结构,从而显著提升了推理效率。实验结果表明,所提出的\textsc{LightRDL}方法不仅提高了性能,还实现了比传统GNN高达526倍的推理速度提升,同时在RelBench基准测试中取得了高达33%的性能改进,使其非常适合实时推理场景。

链接: https://arxiv.org/abs/2504.04934
作者: Veronica Lachi,Antonio Longa,Beatrice Bevilacqua,Bruno Lepri,Andrea Passerini,Bruno Ribeiro
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Relational databases, organized into tables connected by primary-foreign key relationships, are a common format for organizing data. Making predictions on relational data often involves transforming them into a flat tabular format through table joins and feature engineering, which serve as input to tabular methods. However, designing features that fully capture complex relational patterns remains challenging. Graph Neural Networks (GNNs) offer a compelling alternative by inherently modeling these relationships, but their time overhead during inference limits their applicability for real-time scenarios. In this work, we aim to bridge this gap by leveraging existing feature engineering efforts to enhance the efficiency of GNNs in relational databases. Specifically, we use GNNs to capture complex relationships within relational databases, patterns that are difficult to featurize, while employing engineered features to encode temporal information, thereby avoiding the need to retain the entire historical graph and enabling the use of smaller, more efficient graphs. Our LightRDL approach not only improves efficiency, but also outperforms existing models. Experimental results on the RelBench benchmark demonstrate that our framework achieves up to 33% performance improvement and a 526× inference speedup compared to GNNs, making it highly suitable for real-time inference.
zh

[AI-30] Expectations vs Reality – A Secondary Study on AI Adoption in Software Testing

【速读】:该论文旨在解决软件测试领域中人工智能(Artificial Intelligence, AI)实际应用与预期效果之间的差距问题。论文通过系统性映射研究(Systematic Mapping Study)分析了2020年及以后工业背景下关于AI在软件测试中应用的研究,识别出真实世界的应用案例、潜在益处以及存在的问题。关键在于通过主题分析(Thematic Analysis)揭示当前AI在软件测试中的实际采用情况与其潜在能力之间的差异,并指出在线数据库搜索时可能出现的误报问题(False Positive Search Results)。研究发现尽管AI在测试用例生成、代码分析及智能测试自动化等方面存在诸多潜力,但实际落地实施及其带来的收益仍较为有限。

链接: https://arxiv.org/abs/2504.04921
作者: Katja Karhu,Jussi Kasurinen,Kari Smolander
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 26 pages, 1 figure, submitted to Software Testing, Verification and Reliability journal

点击查看摘要

Abstract:In the software industry, artificial intelligence (AI) has been utilized more and more in software development activities. In some activities, such as coding, AI has already been an everyday tool, but in software testing activities AI has not yet made a significant breakthrough. In this paper, the objective was to identify what kind of empirical research with industry context has been conducted on AI in software testing, as well as how AI has been adopted in software testing practice. To achieve this, we performed a systematic mapping study of recent (2020 and later) studies on AI adoption in software testing in the industry, and applied thematic analysis to identify common themes and categories, such as the real-world use cases and benefits, in the found papers. The observations suggest that AI is not yet heavily utilized in software testing, and still relatively few studies on AI adoption in software testing have been conducted in the industry context to solve real-world problems. Earlier studies indicated there was a noticeable gap between the actual use cases and actual benefits versus the expectations, which we analyzed further. While there were numerous potential use cases for AI in software testing, such as test case generation, code analysis, and intelligent test automation, the reported actual implementations and observed benefits were limited. In addition, the systematic mapping study revealed a potential problem with false positive search results in online databases when using the search string “artificial intelligence”.
zh

[AI-31] Constitution or Collapse? Exploring Constitutional AI with Llama 3-8B

【速读】:该论文旨在解决大型语言模型在训练过程中高质量标注数据获取成本增加的问题,以及人工标注数据存在的噪声和帮助性与有害性不平衡的挑战。解决方案的关键在于通过引入Constitutional AI(宪法型人工智能)方法,利用AI自身提供反馈以替代人工标注,从而大幅降低对人类标注的需求。研究通过在较小规模的LLaMA 3-8B模型上复现Constitutional AI的工作流程,验证其在减少模型有害行为方面的有效性,同时分析其对模型帮助性的影响及潜在的模型退化问题。研究发现,Constitutional AI能够显著提升模型的无害性,但可能削弱模型的帮助性,并且在较小模型上可能存在输出质量不足导致的自改进困难问题。

链接: https://arxiv.org/abs/2504.04918
作者: Xue Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures. Conducted as part of research on alignment techniques for language models

点击查看摘要

Abstract:As language models continue to grow larger, the cost of acquiring high-quality training data has increased significantly. Collecting human feedback is both expensive and time-consuming, and manual labels can be noisy, leading to an imbalance between helpfulness and harmfulness. Constitutional AI, introduced by Anthropic in December 2022, uses AI to provide feedback to another AI, greatly reducing the need for human labeling. However, the original implementation was designed for a model with around 52 billion parameters, and there is limited information on how well Constitutional AI performs with smaller models, such as LLaMA 3-8B. In this paper, we replicated the Constitutional AI workflow using the smaller LLaMA 3-8B model. Our results show that Constitutional AI can effectively increase the harmlessness of the model, reducing the Attack Success Rate in MT-Bench by 40.8%. However, similar to the original study, increasing harmlessness comes at the cost of helpfulness. The helpfulness metrics, which are an average of the Turn 1 and Turn 2 scores, dropped by 9.8% compared to the baseline. Additionally, we observed clear signs of model collapse in the final DPO-CAI model, indicating that smaller models may struggle with self-improvement due to insufficient output quality, making effective fine-tuning more challenging. Our study suggests that, like reasoning and math ability, self-improvement is an emergent property.
zh

[AI-32] AlgOS: Algorithm Operating System

【速读】:该论文旨在解决算法实现过程中标准化不足导致的可复现性和可靠性问题。论文提出了一种名为Algorithm Operating System (AlgOS)的框架作为解决方案,其关键在于通过结合抽象语法树(Abstract Syntax Trees)与一种新颖的观察者模式(Observer pattern)实现对算法逻辑流程的控制,同时提供自动化超参数调优、通用命令行接口解析、新类自动注册以及集中式实验记录数据库等功能,以降低新算法实现的复杂度并标准化算法比较流程。

链接: https://arxiv.org/abs/2504.04909
作者: Llewyn Salt,Marcus Gallagher
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Algorithm Operating System (AlgOS) is an unopinionated, extensible, modular framework for algorithmic implementations. AlgOS offers numerous features: integration with Optuna for automated hyperparameter tuning; automated argument parsing for generic command-line interfaces; automated registration of new classes; and a centralised database for logging experiments and studies. These features are designed to reduce the overhead of implementing new algorithms and to standardise the comparison of algorithms. The standardisation of algorithmic implementations is crucial for reproducibility and reliability in research. AlgOS combines Abstract Syntax Trees with a novel implementation of the Observer pattern to control the logical flow of algorithmic segments.
zh

[AI-33] Futureproof Static Memory Planning

【速读】:该论文试图解决动态存储分配(Dynamic Storage Allocation, DSA)问题,即在已知大小和生命周期的一组缓冲区集合上分配偏移量以最小化总内存使用这一 NP 完全组合优化任务。现有 DSA 实现要么采用快速但浪费资源的启发式方法,要么使用在规模超过一千个缓冲区时无法有效扩展的内存高效方法。论文指出,“AI 内存墙”与深度神经网络的静态架构重新激发了对 DSA 的研究兴趣。
论文的关键解决方案是提出了一种名为 idealloc 的低碎片化、高性能的 DSA 实现,专为百万级缓冲区实例设计,并通过一组特别困难的跨领域基准测试验证其性能,在有效性与鲁棒性联合评估标准下优于四种生产级实现。

链接: https://arxiv.org/abs/2504.04874
作者: Christos Lamprakos,Panagiotis Xanthopoulos,Manolis Katsaragakis,Sotirios Xydis,Dimitrios Soudris,Francky Catthoor
机构: 未知
类目: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: Submitted to ACM TOPLAS

点击查看摘要

Abstract:The NP-complete combinatorial optimization task of assigning offsets to a set of buffers with known sizes and lifetimes so as to minimize total memory usage is called dynamic storage allocation (DSA). Existing DSA implementations bypass the theoretical state-of-the-art algorithms in favor of either fast but wasteful heuristics, or memory-efficient approaches that do not scale beyond one thousand buffers. The “AI memory wall”, combined with deep neural networks’ static architecture, has reignited interest in DSA. We present idealloc, a low-fragmentation, high-performance DSA implementation designed for million-buffer instances. Evaluated on a novel suite of particularly hard benchmarks from several domains, idealloc ranks first against four production implementations in terms of a joint effectiveness/robustness criterion.
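DSA 问题的结构可以用一个很小的贪心启发式直观理解:只有生命周期重叠的缓冲区才不能共享地址区间。下面的草图按大小降序做 first-fit 偏移分配(仅示意问题形式与"快速但可能浪费"的启发式一类方法,绝非 idealloc 的实际算法):

```python
def first_fit_offsets(buffers):
    """贪心启发式:按大小降序,为每个缓冲区找不与"生命周期重叠的已放置缓冲区"冲突的最小偏移。
    buffers: [(size, start, end)],生命周期为半开区间 [start, end)。"""
    placed = []   # (offset, size, start, end)
    offsets = {}
    order = sorted(range(len(buffers)), key=lambda i: -buffers[i][0])
    for i in order:
        size, s, e = buffers[i]
        # 与缓冲区 i 生命周期重叠的已占用地址区间,按起始偏移排序
        busy = sorted((off, off + sz) for off, sz, s2, e2 in placed if s < e2 and s2 < e)
        off = 0
        for lo, hi in busy:
            if off + size <= lo:   # 当前空隙足够大
                break
            off = max(off, hi)     # 否则跳到该占用区间之后
        offsets[i] = off
        placed.append((off, size, s, e))
    peak = max((offsets[i] + buffers[i][0] for i in offsets), default=0)
    return offsets, peak

# 三个缓冲区:前两个生命周期重叠,第三个与两者都不重叠,可复用偏移 0
bufs = [(4, 0, 2), (2, 1, 3), (3, 4, 6)]
offsets, peak = first_fit_offsets(bufs)  # peak == 6,而非三者大小之和 9
```

生命周期不重叠的缓冲区复用地址正是总内存低于缓冲区大小之和的原因;最小化这个峰值即 DSA 的优化目标。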
zh

[AI-34] FedSAUC: A Similarity-Aware Update Control for Communication-Efficient Federated Learning in Edge Computing

【速读】:该论文旨在解决联邦学习在边缘设备上运行时面临的电池消耗和带宽占用问题,同时确保训练精度不受影响。论文的关键解决方案是提出了一种名为FedSAUC的更新控制机制,其核心在于通过聚类算法将行为(模型)相似的用户分组,并仅选择每个簇的代表设备进行模型更新信息的传输与训练。这种策略显著减少了非代表性设备的计算和通信开销,从而有效降低能耗和带宽使用,而实验结果表明,该方法不会对长期训练精度造成负面影响。

链接: https://arxiv.org/abs/2504.04867
作者: Ming-Lun Lee,Han-Chang Chou,Yan-Ann Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published in the Proceedings of the International Conference on Mobile Computing and Ubiquitous Network (ICMU), 2021

点击查看摘要

Abstract:Federated learning is a distributed machine learning framework to collaboratively train a global model without uploading privacy-sensitive data onto a centralized server. Usually, this framework is applied to edge devices such as smartphones, wearable devices, and Internet of Things (IoT) devices which closely collect information from users. However, these devices are mostly battery-powered. The update procedure of federated learning will constantly consume the battery power and the transmission bandwidth. In this work, we propose an update control for federated learning, FedSAUC, by considering the similarity of users’ behaviors (models). At the server side, we exploit clustering algorithms to group devices with similar models. Then we select some representatives for each cluster to update information to train the model. We also implemented a testbed prototyping on edge devices for validating the performance. The experimental results show that this update control will not affect the training accuracy in the long run.
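"按模型相似性聚类、每簇只让代表设备上传更新"的思路可以用如下草图示意(阈值聚类与代表选择方式均为假设的简化写法,并非论文原实现):

```python
import math

def select_representatives(models, threshold=1.0):
    """把模型参数向量按欧氏距离做简单的一遍阈值聚类,每簇选离簇均值最近的成员作代表。"""
    clusters = []  # 每簇存成员设备的索引
    for i, m in enumerate(models):
        for c in clusters:
            if math.dist(m, models[c[0]]) <= threshold:  # 与簇首足够相似则并入
                c.append(i)
                break
        else:
            clusters.append([i])
    reps = []
    for c in clusters:
        center = [sum(models[i][d] for i in c) / len(c) for d in range(len(models[0]))]
        reps.append(min(c, key=lambda i: math.dist(models[i], center)))
    return clusters, reps

models = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.0)]  # 设备 0、1 行为相似,设备 2 不同
clusters, reps = select_representatives(models)  # clusters == [[0, 1], [2]]
```

非代表设备在该轮无需上传模型,从而节省电量与带宽;这正是更新控制降低开销的机制。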
zh

[AI-35] GAMDTP: Dynamic Trajectory Prediction with Graph Attention Mamba Network

【速读】:本文旨在解决自动驾驶系统中交通参与者动态轨迹预测的准确性问题,以提升系统的安全性和稳定性。论文提出了一种名为GAMDTP的新颖图注意力网络,其关键在于通过门机制融合自注意力(self attention)与Mamba-SSM的结果,充分发挥两者的优势,更高效且精确地提取特征。此外,GAMDTP结合高清地图(HD Map)数据及交通参与者的轨迹历史坐标进行编码,并解码输出以生成最终预测结果。为进一步优化两阶段框架(包含提议和细化过程)的性能,作者还设计了一种评分机制来评估预测质量。实验表明,GAMDTP在Argoverse数据集上实现了最先进的性能,在动态轨迹预测方面表现出色。

链接: https://arxiv.org/abs/2504.04862
作者: Yunxiang Liu,Hongkuo Niu,Jianlin Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Accurate motion prediction of traffic agents is crucial for the safety and stability of autonomous driving systems. In this paper, we introduce GAMDTP, a novel graph attention-based network tailored for dynamic trajectory prediction. Specifically, we fuse the result of self attention and mamba-ssm through a gate mechanism, leveraging the strengths of both to extract features more efficiently and accurately, in each graph convolution layer. GAMDTP encodes the high-definition map (HD map) data and the agents’ historical trajectory coordinates and decodes the network’s output to generate the final prediction results. Additionally, recent approaches predominantly focus on dynamically fusing historical forecast results and rely on two-stage frameworks including proposal and refinement. To further enhance the performance of the two-stage frameworks we also design a scoring mechanism to evaluate the prediction quality during the proposal and refinement processes. Experiments on the Argoverse dataset demonstrate that GAMDTP achieves state-of-the-art performance, achieving superior accuracy in dynamic trajectory prediction.
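门机制融合在概念上即逐元素计算 g = sigmoid(w),输出 g·attn + (1−g)·mamba。下面是一个纯 Python 的示意(特征形状与门的参数化均为假设,真实模型中门由可学习层产生):

```python
import math

def gated_fusion(attn_feat, mamba_feat, gate_logits):
    """门机制融合示意:g = sigmoid(gate_logits),逐元素输出 g*attn + (1-g)*mamba。"""
    out = []
    for a, m, w in zip(attn_feat, mamba_feat, gate_logits):
        g = 1.0 / (1.0 + math.exp(-w))   # 门值落在 (0, 1)
        out.append(g * a + (1.0 - g) * m)
    return out

attn = [1.0, 2.0]
mamba = [3.0, 4.0]
# 门 logit 为 0 时两路各占一半;logit 很大时几乎只取注意力分支
fused = gated_fusion(attn, mamba, [0.0, 100.0])
```

门值由数据驱动,使网络能在每个位置自适应地偏向自注意力或 Mamba-SSM 提取的特征。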
zh

[AI-36] Dont Lag RAG : Training-Free Adversarial Detection Using RAG

【速读】:该论文旨在解决对抗补丁攻击对视觉系统构成的重大威胁,传统防御方法通常需要重新训练或微调,这在实际部署中往往不切实际。论文提出了一种名为Visual Retrieval-Augmented Generation (VRAG)的无训练框架,通过集成Vision-Language Models (VLMs) 进行对抗补丁检测。其关键是利用不断扩展的数据库检索视觉上相似的补丁和图像,这些图像类似于存储的攻击样本,并通过生成式推理识别多种攻击类型,而无需额外的训练或微调。实验结果表明,该方法能够以最少的人工标注有效识别各种对抗补丁,为应对不断演化的对抗补丁攻击提供了稳健且实用的防御手段。

链接: https://arxiv.org/abs/2504.04858
作者: Roie Kazoom,Raz Lapid,Moshe Sipper,Ofer Hadar
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Adversarial patch attacks pose a major threat to vision systems by embedding localized perturbations that mislead deep models. Traditional defense methods often require retraining or fine-tuning, making them impractical for real-world deployment. We propose a training-free Visual Retrieval-Augmented Generation (VRAG) framework that integrates Vision-Language Models (VLMs) for adversarial patch detection. By retrieving visually similar patches and images that resemble stored attacks in a continuously expanding database, VRAG performs generative reasoning to identify diverse attack types, all without additional training or fine-tuning. We extensively evaluate open-source large-scale VLMs, including Qwen-VL-Plus, Qwen2.5-VL-72B, and UI-TARS-72B-DPO, alongside Gemini-2.0, a closed-source model. Notably, the open-source UI-TARS-72B-DPO model achieves up to 95 percent classification accuracy, setting a new state-of-the-art for open-source adversarial patch detection. Gemini-2.0 attains the highest overall accuracy, 98 percent, but remains closed-source. Experimental results demonstrate VRAG’s effectiveness in identifying a variety of adversarial patches with minimal human annotation, paving the way for robust, practical defenses against evolving adversarial patch attacks.
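VRAG 的检索步骤本质上是嵌入空间中的 top-k 相似度查询:把疑似补丁的嵌入与攻击样本库逐条比较余弦相似度,取最相似的若干条供 VLM 做生成式推理。下面是一个示意草图(二维嵌入纯属演示,实际应为视觉模型产生的高维特征):

```python
import math

def top_k_similar(query, database, k=2):
    """在攻击样本库中按余弦相似度检索与查询嵌入最相似的 k 条记录的索引。"""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    ranked = sorted(range(len(database)), key=lambda i: -cos(query, database[i]))
    return ranked[:k]

db = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]   # 假设的已存攻击补丁嵌入
hits = top_k_similar((0.9, 0.1), db)        # 最相似的两条:索引 0 和 2
```

由于数据库可以持续追加新攻击样本,这一检索-推理流程无需重新训练即可覆盖新的攻击类型。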
zh

[AI-37] BIASINSPECTOR: Detecting Bias in Structured Data through LLM Agents

【速读】:该论文试图解决在结构化数据中检测偏见这一复杂且耗时的任务中存在的局限性。现有自动化技术在数据类型多样性方面受限,并高度依赖人工逐案处理,导致其通用性不足。此外,虽然大型语言模型(Large Language Model, LLM)在数据科学领域取得了显著进展,但其在检测数据偏见方面的潜力尚未得到充分探索。为填补这一空白,论文提出了BIASINSPECTOR,这是一种端到端、多智能体协同框架,旨在根据特定用户需求自动检测结构化数据中的偏见。关键在于它首先制定一个多阶段计划来分析用户指定的偏见检测任务,然后使用一组多样化且合适的工具实现该计划,并提供包含解释和可视化结果的详细输出。同时,为了应对缺乏标准化框架评估LLM代理检测数据偏见能力的问题,论文进一步提出了一套综合基准,包含多种评估指标和大量测试用例。广泛实验表明,该框架在结构化数据偏见检测方面实现了卓越的整体性能,为更公平的数据应用设定了新的里程碑。

链接: https://arxiv.org/abs/2504.04855
作者: Haoxuan Li,Mingyu Derek Ma,Jen-tse Huang,Zhaotian Weng,Wei Wang,Jieyu Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages,6 figures

点击查看摘要

Abstract:Detecting biases in structured data is a complex and time-consuming task. Existing automated techniques are limited in diversity of data types and heavily reliant on human case-by-case handling, resulting in a lack of generalizability. Currently, large language model (LLM)-based agents have made significant progress in data science, but their ability to detect data biases is still insufficiently explored. To address this gap, we introduce the first end-to-end, multi-agent synergy framework, BIASINSPECTOR, designed for automatic bias detection in structured data based on specific user requirements. It first develops a multi-stage plan to analyze user-specified bias detection tasks and then implements it with a diverse and well-suited set of tools. It delivers detailed results that include explanations and visualizations. To address the lack of a standardized framework for evaluating the capability of LLM agents to detect biases in data, we further propose a comprehensive benchmark that includes multiple evaluation metrics and a large set of test cases. Extensive experiments demonstrate that our framework achieves exceptional overall performance in structured data bias detection, setting a new milestone for fairer data applications.
zh

[AI-38] An Efficient Approach for Cooperative Multi-Agent Learning Problems ICTAI2024

【速读】:本文旨在解决多智能体(Multi-Agent)协作学习中因联合动作空间(joint action space)维度爆炸导致的可扩展性问题。传统集中式方法在处理多个智能体需要协调完成特定任务时,会面临由于所有可能个体动作组合所定义的联合动作空间规模过大而难以高效学习的挑战。为应对这一难题,论文提出了一种引入“监督者”(supervisor)元智能体的集中式多智能体学习框架,通过序列化抽象(sequential abstraction)将联合动作表示为对各智能体的动作分配序列,从而有效简化联合动作空间并提升框架的可扩展性和效率。实验结果表明,该方法能够在多种规模的多智能体学习环境中成功实现智能体间的协调。

链接: https://arxiv.org/abs/2504.04850
作者: Ángel Aso-Mollar,Eva Onaindia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICTAI 2024

点击查看摘要

Abstract:In this article, we propose a centralized Multi-Agent Learning framework for learning a policy that models the simultaneous behavior of multiple agents that need to coordinate to solve a certain task. Centralized approaches often suffer from the explosion of an action space that is defined by all possible combinations of individual actions, known as joint actions. Our approach addresses the coordination problem via a sequential abstraction, which overcomes the scalability problems typical to centralized methods. It introduces a meta-agent, called supervisor, which abstracts joint actions as sequential assignments of actions to each agent. This sequential abstraction not only simplifies the centralized joint action space but also enhances the framework’s scalability and efficiency. Our experimental results demonstrate that the proposed approach successfully coordinates agents across a variety of Multi-Agent Learning environments of diverse sizes.
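"序列化抽象"的规模收益可以直接算出来:n 个智能体、每个 m 个动作时,联合动作空间为 m^n,而序列分配每步只需在 m 个动作中选择。下面的草图示意这一分解(policy 为假设的占位策略,仅说明监督者逐个分配动作的接口形式):

```python
def joint_space_size(n_agents, n_actions):
    """集中式联合动作空间的大小:所有个体动作组合数 m^n。"""
    return n_actions ** n_agents

def sequential_assign(n_agents, policy):
    """监督者式序列抽象:按顺序为每个智能体分配一个动作,把一次联合决策拆成 n 次单体决策。"""
    joint = []
    for agent in range(n_agents):
        joint.append(policy(agent, tuple(joint)))  # 决策可依赖已分配给前面智能体的动作前缀
    return tuple(joint)

# 假设的占位策略:智能体 i 选动作 i % 2
joint = sequential_assign(3, lambda agent, prefix: agent % 2)  # (0, 1, 0)
size = joint_space_size(3, 4)  # 3 个智能体、各 4 个动作:联合空间 64,序列化后每步只面对 4 个选项
```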
zh

[AI-39] Explanation-Driven Interventions for Artificial Intelligence Model Customization: Empowering End-Users to Tailor Black-Box AI in Rhinocytology

【速读】:本文旨在解决在高风险领域中,如何确保人类在与黑箱人工智能(Black-box AI)模型交互过程中保持控制权的问题。解决方案的关键在于通过重新设计用户界面,提出了一种面向终端用户开发(End-User Development, EUD)的新方法,应用于Rhino-Cyt平台(一款面向医疗专业人士,特别是鼻窦细胞学家的医学AI辅助决策支持系统)。该界面使用户能够通过编辑解释(Explanation Editing)和重新配置模型(Model Reconfiguration)来干预AI的决策过程,从而影响其未来的预测结果。这种方法强调了人机协同(Human-Centered AI, HCAI)与终端用户开发的结合,通过基于解释的干预措施,实现了可解释性、用户干预和模型调整的融合,促进了人类与个性化AI系统的共生关系。

链接: https://arxiv.org/abs/2504.04833
作者: Andrea Esposito(1),Miriana Calvano(1),Antonio Curci(1 and 2),Francesco Greco(1),Rosa Lanzilotti(1),Antonio Piccinno(1) ((1) University of Bari Aldo Moro, (2) University of Pisa)
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: First version (14 pages, 12 of content that will be reduced to 8 in the near future)

点击查看摘要

Abstract:The integration of Artificial Intelligence (AI) in modern society is heavily shifting the way that individuals carry out their tasks and activities. Employing AI-based systems raises challenges that designers and developers must address to ensure that humans remain in control of the interaction process, particularly in high-risk domains. This article presents a novel End-User Development (EUD) approach for black-box AI models through a redesigned user interface in the Rhino-Cyt platform, a medical AI-based decision-support system for medical professionals (more precisely, rhinocytologists) to carry out cell classification. The proposed interface empowers users to intervene in AI decision-making process by editing explanations and reconfiguring the model, influencing its future predictions. This work contributes to Human-Centered AI (HCAI) and EUD by discussing how explanation-driven interventions allow a blend of explainability, user intervention, and model reconfiguration, fostering a symbiosis between humans and user-tailored AI systems.
zh

[AI-40] A Customized SAT-based Solver for Graph Coloring

【速读】:本文旨在解决图着色问题(Graph Coloring Problem),提出了一种基于 SAT 的新型算法 ZykovColor。该算法的关键在于利用模仿 Zykov 树的编码,并结合 Hébrard 和 Katsirelos (2020) 提出的方法,通过引入传播器(propagator)来强制传递性约束、整合下界以剪枝搜索树以及支持推断传播。此外,ZykovColor 借助最新的 IPASIR-UP 接口改进了 CaDiCal SAT 求解器的实现,并提出了新的特性,包括利用顶点支配提示优化决策策略,以及采用增量自底向上搜索以重用之前调用中学到的子句。同时,该方法还集成了更高效的团计算(clique computation)以提升搜索过程中的下界估计。实验验证表明,这些新特性显著提升了算法性能,在 DIMACS 测试集及随机 Erdős-Rényi 图上的表现均优于现有最先进的图着色实现。

链接: https://arxiv.org/abs/2504.04821
作者: Timo Brand,Daniel Faber,Stephan Held,Petra Mutzel
机构: 未知
类目: Discrete Mathematics (cs.DM); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Logic in Computer Science (cs.LO)
备注: 5 figures, 2 tables, source code published at this https URL

点击查看摘要

Abstract:We introduce ZykovColor, a novel SAT-based algorithm to solve the graph coloring problem working on top of an encoding that mimics the Zykov tree. Our method is based on an approach of Hébrard and Katsirelos (2020) that employs a propagator to enforce transitivity constraints, incorporate lower bounds for search tree pruning, and enable inferred propagations. We leverage the recently introduced IPASIR-UP interface for CaDiCal to implement these techniques with a SAT solver. Furthermore, we propose new features that take advantage of the underlying SAT solver. These include modifying the integrated decision strategy with vertex domination hints and using incremental bottom-up search that allows to reuse learned clauses from previous calls. Additionally, we integrate a more efficient clique computation to improve the lower bounds during the search. We validate the effectiveness of each new feature through an experimental analysis. ZykovColor outperforms other state-of-the-art graph coloring implementations on the DIMACS benchmark set. Further experiments on random Erdős-Rényi graphs show that our new approach dominates state-of-the-art SAT-based methods for both very sparse and highly dense graphs.
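Zykov 树的分支规则是:任取一对不相邻顶点,要么将其合并(强制同色),要么在其间加边(强制异色);当图成为完全图时,色数即顶点数。下面是一个朴素的递归草图,仅用于说明 Zykov 分支本身(不含 ZykovColor 的 SAT 编码、传播器、下界剪枝等核心技术):

```python
def chromatic_number(n, edges):
    """朴素 Zykov 树枚举:对不相邻顶点对分支"合并(同色)"或"加边(异色)"。"""
    edges = {frozenset(e) for e in edges}

    def solve(verts, es):
        vs = sorted(verts)
        for i in range(len(vs)):
            for j in range(i + 1, len(vs)):
                u, v = vs[i], vs[j]
                if frozenset((u, v)) not in es:
                    # 分支 1:把 v 合并进 u(两点同色),自环边被丢弃
                    merged = {frozenset((u if x == v else x, u if y == v else y))
                              for e in es for x, y in [tuple(e)]}
                    merged = {e for e in merged if len(e) == 2}
                    c1 = solve(verts - {v}, merged)
                    # 分支 2:加边 (u, v)(两点异色)
                    c2 = solve(verts, es | {frozenset((u, v))})
                    return min(c1, c2)
        return len(verts)  # 已是完全图:色数等于顶点数

    return solve(set(range(n)), edges)

# 4-环 C4 是二部图,色数为 2
chi_c4 = chromatic_number(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
```

论文的做法即把这棵 Zykov 树交由 SAT 求解器搜索,并借助 IPASIR-UP 传播器强制传递性约束、注入下界来剪枝。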
zh

[AI-41] ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines

【速读】:该论文旨在解决在云数据仓库广泛应用背景下,基于 Extract-Load-Transform (ELT) 管道的数据工程任务中因人工设计管道所需的大量手动工作而带来的效率瓶颈问题。尽管近年来基于人工智能的方法(如文本转 SQL)在数据任务中展现了强大的能力,但当前的数据工程基准测试仅评估孤立的任务,缺乏对生成端到端 ELT 管道的 AI 代理进行全面评估的标准。为此,论文引入了 ELT-Bench,这是一个专门设计用于评估 AI 代理构建 ELT 管道能力的端到端基准。ELT-Bench 包含来自多个领域的 100 条管道、835 个源表和 203 个数据模型,并通过模拟真实场景来评估 AI 代理处理复杂数据工程工作流的能力。解决方案的关键在于设计一个包含多样化数据源集成与流行数据工具使用的基准,并全面评估 AI 代理在数据库交互、代码和 SQL 查询编写以及管道阶段编排等方面的表现。实验结果表明,目前最先进的 AI 代理在 ELT-Bench 上的性能仍存在显著不足,强调了开发更先进 AI 代理以减少人工干预需求的重要性。

链接: https://arxiv.org/abs/2504.04808
作者: Tengjun Jin,Yuxuan Zhu,Daniel Kang
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: 14 pages, 18 figures

点击查看摘要

Abstract:Practitioners are increasingly turning to Extract-Load-Transform (ELT) pipelines with the widespread adoption of cloud data warehouses. However, designing these pipelines often involves significant manual work to ensure correctness. Recent advances in AI-based methods, which have shown strong capabilities in data tasks, such as text-to-SQL, present an opportunity to alleviate manual efforts in developing ELT pipelines. Unfortunately, current benchmarks in data engineering only evaluate isolated tasks, such as using data tools and writing data transformation queries, leaving a significant gap in evaluating AI agents for generating end-to-end ELT pipelines. To fill this gap, we introduce ELT-Bench, an end-to-end benchmark designed to assess the capabilities of AI agents to build ELT pipelines. ELT-Bench consists of 100 pipelines, including 835 source tables and 203 data models across various domains. By simulating realistic scenarios involving the integration of diverse data sources and the use of popular data tools, ELT-Bench evaluates AI agents’ abilities in handling complex data engineering workflows. AI agents must interact with databases and data tools, write code and SQL queries, and orchestrate every pipeline stage. We evaluate two representative code agent frameworks, Spider-Agent and SWE-Agent, using six popular Large Language Models (LLMs) on ELT-Bench. The highest-performing agent, Spider-Agent Claude-3.7-Sonnet with extended thinking, correctly generates only 3.9% of data models, with an average cost of $4.30 and 89.3 steps per pipeline. Our experimental results demonstrate the challenges of ELT-Bench and highlight the need for a more advanced AI agent to reduce manual effort in ELT workflows. Our code and data are available at this https URL.
zh

[AI-42] Multimodal Agricultural Agent Architecture (MA3): A New Paradigm for Intelligent Agricultural Decision-Making

【速读】:该论文旨在应对现代农业面临的双重挑战:提升生产效率与实现可持续发展,特别是在气候剧烈变化导致极端天气频发的背景下,解决农业生产系统中不确定性风险指数增长的问题。论文的关键在于提出了一种创新的多模态农业智能体架构(Multimodal Agricultural Agent Architecture, MA³),其核心在于利用跨模态信息融合与任务协作机制实现智能化的农业决策支持。解决方案的关键在于构建了一个包含分类、检测、视觉问答(Visual Question Answering, VQA)、工具选择及智能体评估五大任务的多模态农业智能体数据集,并通过统一的甘蔗病害分类与检测工具背骨以及专家模型,结合创新的工具选择模块,开发出能够高效完成多任务的多模态农业智能体,同时引入多维度定量评估框架验证了MA³在实际农业场景中的实用性和鲁棒性。

链接: https://arxiv.org/abs/2504.04789
作者: Zhuoning Xu,Jian Xu,Mingqing Zhang,Peijie Wang,Chao Deng,Cheng-Lin Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As a strategic pillar industry for human survival and development, modern agriculture faces dual challenges: optimizing production efficiency and achieving sustainable development. Against the backdrop of intensified climate change leading to frequent extreme weather events, the uncertainty risks in agricultural production systems are increasing exponentially. To address these challenges, this study proposes an innovative Multimodal Agricultural Agent Architecture (MA3), which leverages cross-modal information fusion and task collaboration mechanisms to achieve intelligent agricultural decision-making. This study constructs a multimodal agricultural agent dataset encompassing five major tasks: classification, detection, Visual Question Answering (VQA), tool selection, and agent evaluation. We propose a unified backbone for sugarcane disease classification and detection tools, as well as a sugarcane disease expert model. By integrating an innovative tool selection module, we develop a multimodal agricultural agent capable of effectively performing tasks in classification, detection, and VQA. Furthermore, we introduce a multi-dimensional quantitative evaluation framework and conduct a comprehensive assessment of the entire architecture over our evaluation dataset, thereby verifying the practicality and robustness of MA3 in agricultural scenarios. This study provides new insights and methodologies for the development of agricultural agents, holding significant theoretical and practical implications. Our source code and dataset will be made publicly available upon acceptance.
zh

[AI-43] Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors

【速读】:该论文旨在解决高效利用当代大型语言模型(Large Language Models, LLMs)能力的挑战,特别是在直接微调昂贵且不切实际的情况下。传统无训练方法通常需要大量人工努力或产生次优结果。为应对这一问题,论文提出了一种名为"以弱御强"(Weak-for-Strong Harnessing, W4S)的新框架,其关键是通过将工作流设计形式化为多轮马尔可夫决策过程,并引入强化学习驱动的代理优化(Reinforcement Learning for Agentic Workflow Optimization, RLAO),训练一个成本较低的小型元代理。通过与环境的迭代交互,该元代理能够在无需人工干预的情况下学会设计越来越有效的工作流,从而显著提升包括GPT-3.5-Turbo和GPT-4o在内的先进模型性能。实验结果表明,仅需一个GPU小时训练的7B元代理,在十一个基准测试中比最强基线提升了2.9%至24.6%,同时展现出跨已见和未见任务的强大泛化能力,为避免直接微调提供了高效的高性能替代方案。

链接: https://arxiv.org/abs/2504.04785
作者: Fan Nie,Lan Feng,Haotian Ye,Weixin Liang,Pan Lu,Huaxiu Yao,Alexandre Alahi,James Zou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Efficiently leveraging the capabilities of contemporary large language models (LLMs) is increasingly challenging, particularly when direct fine-tuning is expensive and often impractical. Existing training-free methods, including manually or automatically designed workflows, typically demand substantial human effort or yield suboptimal results. This paper proposes Weak-for-Strong Harnessing (W4S), a novel framework that customizes smaller, cost-efficient language models to design and optimize workflows for harnessing stronger models. W4S formulates workflow design as a multi-turn Markov decision process and introduces reinforcement learning for agentic workflow optimization (RLAO) to train a weak meta-agent. Through iterative interaction with the environment, the meta-agent learns to design increasingly effective workflows without manual intervention. Empirical results demonstrate the superiority of W4S: our 7B meta-agent, trained with just one GPU hour, outperforms the strongest baseline by 2.9% ~ 24.6% across eleven benchmarks, successfully elevating the performance of state-of-the-art models such as GPT-3.5-Turbo and GPT-4o. Notably, W4S exhibits strong generalization capabilities across both seen and unseen tasks, offering an efficient, high-performing alternative to directly fine-tuning strong models.
zh

[AI-44] Bidirectional Hierarchical Protein Multi-Modal Representation Learning

【速读】:该论文旨在解决蛋白质表示学习中序列信息与结构信息未能有效融合的问题。传统基于大型Transformer的蛋白语言模型(pLMs)在基于序列的任务中表现优异,但缺乏结构信息;而利用三维结构信息的图神经网络(GNNs)虽在预测任务中展现出良好的泛化能力,但受限于标注结构数据的稀缺性。论文的关键在于提出了一种多模态双向层次融合框架(Bi-Hierarchical Fusion Framework),通过注意力机制和门控机制实现pLMs生成的序列表示与GNN提取的结构特征之间的有效交互,并在神经网络各层间增强信息交流与提升。此外,进一步引入了局部和全局双向层次融合方法,显著提升了多种蛋白质相关任务(如酶分类、模型质量评估、配体结合亲和力预测等)中的性能,在多模态蛋白质表示学习基准测试中达到了新的最先进水平。

链接: https://arxiv.org/abs/2504.04770
作者: Xuefeng Liu,Songhao Jiang,Chih-chan Tien,Jinbo Xu,Rick Stevens
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Molecular Networks (q-bio.MN)
备注:

点击查看摘要

Abstract:Protein representation learning is critical for numerous biological tasks. Recently, large transformer-based protein language models (pLMs) pretrained on large scale protein sequences have demonstrated significant success in sequence-based tasks. However, pLMs lack structural information. Conversely, graph neural networks (GNNs) designed to leverage 3D structural information have shown promising generalization in protein-related prediction tasks, but their effectiveness is often constrained by the scarcity of labeled structural data. Recognizing that sequence and structural representations are complementary perspectives of the same protein entity, we propose a multimodal bidirectional hierarchical fusion framework to effectively merge these modalities. Our framework employs attention and gating mechanisms to enable effective interaction between pLMs-generated sequential representations and GNN-extracted structural features, improving information exchange and enhancement across layers of the neural network. Based on the framework, we further introduce local Bi-Hierarchical Fusion with gating and global Bi-Hierarchical Fusion with multihead self-attention approaches. Through extensive experiments on a diverse set of protein-related tasks, our method demonstrates consistent improvements over strong baselines and existing fusion techniques in a variety of protein representation learning benchmarks, including react (enzyme/EC classification), model quality assessment (MQA), protein-ligand binding affinity prediction (LBA), protein-protein binding site prediction (PPBS), and B cell epitopes prediction (BCEs). Our method establishes a new state-of-the-art for multimodal protein representation learning, emphasizing the efficacy of Bi-Hierarchical Fusion in bridging sequence and structural modalities.
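摘要中的"门控机制融合序列与结构表示"可以用如下 NumPy 极简示意说明(非原论文实现,维度与参数均为笔者假设):门控系数 g ∈ (0,1) 对两路表示做逐元素凸组合。

```python
import numpy as np

def gated_fusion(h_seq, h_struct, W_g, b_g):
    """门控融合:sigmoid 门控逐元素权衡序列表示与结构表示。"""
    z = np.concatenate([h_seq, h_struct], axis=-1)
    g = 1.0 / (1.0 + np.exp(-(z @ W_g + b_g)))   # 门控系数,取值 (0,1)
    return g * h_seq + (1.0 - g) * h_struct

rng = np.random.default_rng(0)
d = 8
h_seq = rng.normal(size=(d,))       # pLM 序列表示(示意)
h_struct = rng.normal(size=(d,))    # GNN 结构特征(示意)
W_g = rng.normal(size=(2 * d, d)) * 0.1
b_g = np.zeros(d)
fused = gated_fusion(h_seq, h_struct, W_g, b_g)
print(fused.shape)  # (8,)
```

由于 g 逐元素介于 0 与 1 之间,融合结果必然落在两路表示的逐元素区间内,这也是门控相比简单拼接更易解释的原因之一。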
zh

[AI-45] KunPeng: A Global Ocean Environmental Model

【速读】:本文旨在解决海洋环境预测中由于海陆边界处剧烈梯度导致的训练发散问题,并提升远、中、近距离海洋特征的整合能力及时间依赖性特征提取能力。为应对海洋空间不连续特性,提出了一种地形自适应掩码约束机制(terrain-adaptive mask constraint mechanism)以缓解训练发散;通过引入经度循环形变卷积网络(longitude-cyclic deformable convolution network, LC-DCN)增强动态感受野,实现多尺度海洋特征的精细化建模;同时利用形变卷积增强的多步预测模块(Deformable Convolution-enhanced Multi-Step Prediction module, DC-MTP)强化时间依赖特征的提取能力。关键在于结合地形自适应掩码约束机制与形变卷积技术,显著提升了全球15天海洋环境预测的准确性。

链接: https://arxiv.org/abs/2504.04766
作者: Yi Zhao,Jiaqi Li,Haitao Xia,Tianjiao Zhang,Zerong Zeng,Tianyu Ren,Yucheng Zhang,Chao Zhu,Shengtong Xu,Hongchun Yuan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inspired by the similarity of the atmosphere-ocean physical coupling mechanism, this study innovatively migrates meteorological large-model techniques to the ocean domain, constructing the KunPeng global ocean environmental prediction model. Aimed at the discontinuous characteristics of marine space, we propose a terrain-adaptive mask constraint mechanism to effectively mitigate training divergence caused by abrupt gradients at land-sea boundaries. To fully integrate far-, medium-, and close-range marine features, a longitude-cyclic deformable convolution network (LC-DCN) is employed to enhance the dynamic receptive field, achieving refined modeling of multi-scale oceanic characteristics. A Deformable Convolution-enhanced Multi-Step Prediction module (DC-MTP) is employed to strengthen temporal dependency feature extraction capabilities. Experimental results demonstrate that this model achieves an average ACC of 0.80 in 15-day global predictions at 0.25° resolution, outperforming comparative models by 0.01-0.08. The average mean squared error (MSE) is 0.41 (representing a 5%-31% reduction) and the average mean absolute error (MAE) is 0.44 (0.6%-21% reduction) compared to other models. Significant improvements are particularly observed in sea surface parameter prediction, deep-sea region characterization, and current velocity field forecasting. Through a horizontal comparison of the applicability of operators at different scales in the marine domain, this study reveals that local operators significantly outperform global operators under slow-varying oceanic processes, demonstrating the effectiveness of dynamic feature pyramid representations in predicting marine physical parameters.
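"地形自适应掩码约束"的基本思路是只在海洋格点上计算损失、屏蔽陆地格点,从而避开海陆边界的剧烈梯度。下面是一个最小化示意(非论文官方实现,掩码与数据均为随机假设):

```python
import numpy as np

def masked_mse(pred, target, ocean_mask):
    """仅在海洋格点上计算 MSE;陆地格点(mask=0)被排除在损失之外。"""
    diff2 = (pred - target) ** 2
    return (diff2 * ocean_mask).sum() / ocean_mask.sum()

rng = np.random.default_rng(1)
H, W = 4, 6
pred = rng.normal(size=(H, W))
target = rng.normal(size=(H, W))
ocean_mask = np.ones((H, W))
ocean_mask[:, :2] = 0.0          # 左侧两列视为陆地(示意)
loss = masked_mse(pred, target, ocean_mask)
print(float(loss))
```

陆地格点对损失与梯度的贡献恒为零,网络因此不会被海陆交界处的突变值"拉偏"。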
zh

[AI-46] Generalising from Self-Produced Data: Model Training Beyond Human Constraints

【速读】:该论文试图解决大型语言模型(LLMs)受制于人工标注数据以及单一抽象层次限制的问题,这些问题阻碍了其对确定性真理判断的能力。论文提出了一种新颖框架,使AI模型能够通过直接与环境交互自主生成和验证新知识。方案的关键在于采用无界且不可操纵的数值奖励(如附加磁盘空间或关注者数量),这种奖励引导学习过程而无需依赖人类设定的基准。AI代理通过迭代生成策略和可执行代码来最大化此指标,并将成功结果作为自我再训练和逐步泛化的基础。此外,为缓解模型崩溃和冷启动问题,框架强调经验验证而非文本相似性,并支持通过GRPO进行微调。系统架构利用模块化代理进行环境分析、策略生成和代码合成,以实现可扩展的实验能力。这一研究为迈向超越人为约束、实现自主通用智能的自我提升AI系统指明了方向。

链接: https://arxiv.org/abs/2504.04711
作者: Alfath Daryl Alhajir,Jennifer Dodgson,Joseph Lim,Truong Ma Phi,Julian Peh,Akira Rafhael Janson Pattirane,Lokesh Poovaragan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 2 figures

点击查看摘要

Abstract:Current large language models (LLMs) are constrained by human-derived training data and limited by a single level of abstraction that impedes definitive truth judgments. This paper introduces a novel framework in which AI models autonomously generate and validate new knowledge through direct interaction with their environment. Central to this approach is an unbounded, ungamable numeric reward - such as annexed disk space or follower count - that guides learning without requiring human benchmarks. AI agents iteratively generate strategies and executable code to maximize this metric, with successful outcomes forming the basis for self-retraining and incremental generalisation. To mitigate model collapse and the warm start problem, the framework emphasizes empirical validation over textual similarity and supports fine-tuning via GRPO. The system architecture employs modular agents for environment analysis, strategy generation, and code synthesis, enabling scalable experimentation. This work outlines a pathway toward self-improving AI systems capable of advancing beyond human-imposed constraints toward autonomous general intelligence.
zh

[AI-47] AdvKT: An Adversarial Multi-Step Training Framework for Knowledge Tracing

【速读】:该论文旨在解决知识追踪(Knowledge Tracing, KT)领域中单步训练范式导致的多步推理误差累积问题,以及数据稀疏性对智能辅导系统推荐模型性能的影响。为应对这些挑战,论文提出了一种新颖的对抗多步训练框架(Adversarial Multi-Step Training Framework for Knowledge Tracing, AdvKT)。其关键在于引入对抗学习范式,通过生成器模拟高奖励响应以减少多步推理中的误差累积,同时利用判别器生成合成数据提供反馈。此外,论文设计了专门的数据增强技术,以在数据稀疏场景下丰富训练数据,确保模型的良好泛化能力。实验结果表明,AdvKT在四个真实数据集上的表现优于现有KT模型,有效解决了误差累积与数据稀疏问题。

链接: https://arxiv.org/abs/2504.04706
作者: Lingyue Fu,Ting Long,Jianghao Lin,Wei Xia,Xinyi Dai,Ruiming Tang,Yasheng Wang,Weinan Zhang,Yong Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Knowledge Tracing (KT) monitors students’ knowledge states and simulates their responses to question sequences. Existing KT models typically follow a single-step training paradigm, which leads to discrepancies with the multi-step inference process required in real-world simulations, resulting in significant error accumulation. This accumulation of error, coupled with the issue of data sparsity, can substantially degrade the performance of recommendation models in the intelligent tutoring systems. To address these challenges, we propose a novel Adversarial Multi-Step Training Framework for Knowledge Tracing (AdvKT), which, for the first time, focuses on the multi-step KT task. More specifically, AdvKT leverages adversarial learning paradigm involving a generator and a discriminator. The generator mimics high-reward responses, effectively reducing error accumulation across multiple steps, while the discriminator provides feedback to generate synthetic data. Additionally, we design specialized data augmentation techniques to enrich the training data with realistic variations, ensuring that the model generalizes well even in scenarios with sparse data. Experiments conducted on four real-world datasets demonstrate the superiority of AdvKT over existing KT models, showcasing its ability to address both error accumulation and data sparsity issues effectively.
zh

[AI-48] Provable Failure of Language Models in Learning Majority Boolean Logic via Gradient Descent

【速读】:该论文旨在探究基于Transformer架构的模型是否能够通过梯度下降方法真正学习简单的多数函数(majority functions),尤其是在处理最基本的逻辑推理任务时所面临的理论与优化挑战。论文的关键在于分析在训练样本数量分别为 ( n = \mathrm{poly}(d) ) 和 ( n = \exp(\Omega(d)) ) 的情况下,简化版Transformer架构的表现,并揭示即使经过多项式次数的梯度查询后,其泛化误差仍显著较大且随维度 ( d ) 指数增长的现象。这表明现有基于梯度的方法在训练Transformer以完成简单逻辑推理任务时存在根本性的优化难题,并为理解Transformer的理论局限性提供了新视角。

链接: https://arxiv.org/abs/2504.04702
作者: Bo Chen,Zhenmei Shi,Zhao Song,Jiahao Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
备注:

点击查看摘要

Abstract:Recent advancements in Transformer-based architectures have led to impressive breakthroughs in natural language processing tasks, with models such as GPT-4, Claude, and Gemini demonstrating human-level reasoning abilities. However, despite their high performance, concerns remain about the inherent limitations of these models, especially when it comes to learning basic logical functions. While complexity-theoretic analyses indicate that Transformers can represent simple logic functions (e.g., \mathsf{AND} , \mathsf{OR} , and majority gates) due to their membership in the \mathsf{TC}^0 class, these results assume ideal parameter settings and do not account for the constraints imposed by gradient descent-based training methods. In this work, we investigate whether Transformers can truly learn simple majority functions when trained using gradient-based methods. We focus on a simplified variant of the Transformer architecture and consider both n=\mathrm{poly}(d) and n=\exp(\Omega(d)) number of training samples, where each sample is a d -size binary string paired with the output of a basic majority function. Our analysis demonstrates that even after \mathrm{poly}(d) gradient queries, the generalization error of the Transformer model still remains substantially large, growing exponentially with d . This work highlights fundamental optimization challenges in training Transformers for the simplest logical reasoning tasks and provides new insights into their theoretical limitations.
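多数函数本身极其简单:对 d 位 0/1 串,1 的个数过半则输出 1。下面的示意(维度与样本数均为假设取值)构造该任务的数据集,并演示一个线性阈值门即可精确表示它,这正是其属于 \mathsf{TC}^0 的直观体现;论文的结论在于"可表示"不等于"梯度下降可学得":

```python
import numpy as np

def majority_label(x):
    """d 位 0/1 串的多数函数:1 的个数严格过半则输出 1。"""
    return (x.sum(-1) * 2 > x.shape[-1]).astype(int)

rng = np.random.default_rng(2)
d, n = 16, 1000                       # 维度与样本数(示意取值)
X = rng.integers(0, 2, size=(n, d))
y = majority_label(X)

# 一个手工设定权重的线性"计票器":count - d/2 > 0 即为多数
w = np.ones(d)
b = -d / 2
y_hat = (X @ w + b > 0).astype(int)
print((y_hat == y).mean())  # 1.0:线性阈值门精确实现多数函数
```

理想参数下准确率为 1,而论文证明的是:用梯度方法从数据中找到这组参数,对简化的 Transformer 而言需要指数级的代价。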
zh

[AI-49] HypRL: Reinforcement Learning of Control Policies for Hyperproperties

【速读】:本文研究了在复杂任务中学习控制策略的问题,这些任务的需求由超性质(hyperproperty)给出。超性质因其在形式化多智能体系统需求以及需要在多个执行轨迹中表达特性(如隐私和公平性)方面的强大能力而被采用。论文针对具有未知转移函数的马尔可夫决策过程 (M) 和一个超LTL公式 (\varphi),提出了一种方法:首先通过Skolemization处理 (\varphi) 中的量词交替;然后引入超LTL的定量鲁棒性函数,定义 (M) 的有限轨迹相对于 (\varphi) 的奖励;最后利用合适的强化学习算法,同时学习(1)(\varphi) 中每个轨迹量词对应的策略,以及(2)(M) 的转移概率分布,以最大化期望奖励,从而提高 (\varphi) 在 (M) 中满足的概率。解决方案的关键在于结合超LTL公式的量化特性与强化学习算法,通过定义定量鲁棒性函数将超性质转化为可优化的奖励信号,并通过学习策略和转移分布实现对超性质的满足。

链接: https://arxiv.org/abs/2504.04675
作者: Tzu-Han Hsu,Arshia Rafieioskouei,Borzoo Bonakdarpour
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:We study the problem of learning control policies for complex tasks whose requirements are given by a hyperproperty. The use of hyperproperties is motivated by their significant power to formally specify requirements of multi-agent systems as well as those that need expressiveness in terms of multiple execution traces (e.g., privacy and fairness). Given a Markov decision process M with unknown transitions (representing the environment) and a HyperLTL formula \varphi , our approach first employs Skolemization to handle quantifier alternations in \varphi . We introduce quantitative robustness functions for HyperLTL to define rewards of finite traces of M with respect to \varphi . Finally, we utilize a suitable reinforcement learning algorithm to learn (1) a policy per trace quantifier in \varphi , and (2) the probability distribution of transitions of M that together maximize the expected reward and, hence, probability of satisfaction of \varphi in M. We present a set of case studies on (1) safety-preserving multi-agent path planning, (2) fairness in resource allocation, and (3) the post-correspondence problem (PCP).
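论文将 HyperLTL 公式的满足程度量化为有限轨迹的"鲁棒性"以定义奖励。其核心直觉可用最常见的两个时序算子示意(纯示意定义,与论文的精确构造可能不同):G φ(始终)取各时刻安全裕度的最小值,F φ(最终)取最大值。

```python
def robustness_globally(trace, margin):
    """G φ 的定量鲁棒性:由最危险时刻(裕度最小处)决定。"""
    return min(margin(s) for s in trace)

def robustness_eventually(trace, margin):
    """F φ 的定量鲁棒性:只需某一时刻满足,取裕度最大值。"""
    return max(margin(s) for s in trace)

# 示例:状态为与障碍物的距离,安全裕度 = 距离 - 0.5(假设阈值)
trace = [1.2, 0.9, 0.7, 1.5]
margin = lambda dist: dist - 0.5
print(round(robustness_globally(trace, margin), 2))   # 0.2
print(round(robustness_eventually(trace, margin), 2)) # 1.0
```

鲁棒性为正表示性质被满足且留有裕度,为负表示违反;以此作为奖励即可让强化学习在"满足程度"上做梯度式的改进,而非只得到 0/1 反馈。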
zh

[AI-50] EquiCPI: SE(3)-Equivariant Geometric Deep Learning for Structure-Aware Prediction of Compound-Protein Interactions

【速读】:本文旨在解决化合物-蛋白质相互作用(CPI)预测中的精确性挑战,现有基于序列的方法主要依赖分子指纹或图表示,但忽视了结合亲和力的三维结构决定因素。为弥合这一差距,论文提出了EquiCPI,这是一种端到端的几何深度学习框架,它结合了第一性原理结构建模与SE(3)-等变神经网络。关键在于,EquiCPI通过原子点云上的SE(3)-等变消息传递来保留旋转、平移和反射下的对称性,并通过球谐函数张量积层次化编码局部交互模式,从而有效利用三维结构信息进行CPI预测。

链接: https://arxiv.org/abs/2504.04654
作者: Ngoc-Quang Nguyen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Accurate prediction of compound-protein interactions (CPI) remains a cornerstone challenge in computational drug discovery. While existing sequence-based approaches leverage molecular fingerprints or graph representations, they critically overlook three-dimensional (3D) structural determinants of binding affinity. To bridge this gap, we present EquiCPI, an end-to-end geometric deep learning framework that synergizes first-principles structural modeling with SE(3)-equivariant neural networks. Our pipeline transforms raw sequences into 3D atomic coordinates via ESMFold for proteins and DiffDock-L for ligands, followed by physics-guided conformer re-ranking and equivariant feature learning. At its core, EquiCPI employs SE(3)-equivariant message passing over atomic point clouds, preserving symmetry under rotations, translations, and reflections, while hierarchically encoding local interaction patterns through tensor products of spherical harmonics. The proposed model is evaluated on BindingDB (affinity prediction) and DUD-E (virtual screening), EquiCPI achieves performance on par with or exceeding the state-of-the-art deep learning competitors.
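SE(3)-等变/不变的含义可用一个最小实验验证:原子点云的两两距离在任意旋转与平移下保持不变,这类几何量正是等变网络所保持的对称性基础(示意代码,坐标为随机假设,非论文实现):

```python
import numpy as np

def pairwise_dist(X):
    """原子点云的两两欧氏距离矩阵,是 SE(3) 不变特征的基础。"""
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def random_rotation(rng):
    """用 QR 分解生成一个随机 3D 正交矩阵,并修正为纯旋转(det=+1)。"""
    Q, R = np.linalg.qr(rng.normal(size=(3, 3)))
    Q *= np.sign(np.diag(R))       # 固定列符号,保证分布均匀
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1              # 反射 -> 旋转
    return Q

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 3))            # 5 个原子的坐标(示意)
Rmat = random_rotation(rng)
t = rng.normal(size=(1, 3))
d0 = pairwise_dist(X)
d1 = pairwise_dist(X @ Rmat.T + t)     # 先旋转再平移
print(np.allclose(d0, d1))            # True:距离特征 SE(3) 不变
```

EquiCPI 进一步用球谐函数的张量积保持方向性特征的"等变"(随输入一起旋转),而非仅保留此处演示的"不变"标量。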
zh

[AI-51] ool-as-Interface: Learning Robot Policies from Human Tool Usage through Imitation Learning

【速读】:该论文旨在解决机器人工具使用能力训练中数据收集效率低以及现有方法难以应对动态任务的问题。论文提出了一种从人类自然工具使用数据中迁移知识到机器人的框架作为解决方案的关键。其核心在于利用双目RGB相机生成3D重建、采用高斯点阵增强视角、通过分割模型提取与实体无关的观测,并结合任务空间中的工具动作表示来训练视觉运动策略。这种方法不仅显著提高了成功率(比基于远程操作数据的扩散策略高出71%),还大幅减少了数据收集时间(减少77%),并在某些任务中实现了唯一可行的解决方案。此外,该方法有效弥合了实体差距,增强了对相机视角和机器人配置变化的鲁棒性,并在物体和空间设置上具有良好的泛化能力。

链接: https://arxiv.org/abs/2504.04612
作者: Haonan Chen,Cheng Zhu,Yunzhu Li,Katherine Driggs-Campbell
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project Page: this https URL . 17 pages, 14 figures

点击查看摘要

Abstract:Tool use is critical for enabling robots to perform complex real-world tasks, and leveraging human tool-use data can be instrumental for teaching robots. However, existing data collection methods like teleoperation are slow, prone to control delays, and unsuitable for dynamic tasks. In contrast, human natural data, where humans directly perform tasks with tools, offers natural, unstructured interactions that are both efficient and easy to collect. Building on the insight that humans and robots can share the same tools, we propose a framework to transfer tool-use knowledge from human data to robots. Using two RGB cameras, our method generates 3D reconstruction, applies Gaussian splatting for novel view augmentation, employs segmentation models to extract embodiment-agnostic observations, and leverages task-space tool-action representations to train visuomotor policies. We validate our approach on diverse real-world tasks, including meatball scooping, pan flipping, wine bottle balancing, and other complex tasks. Our method achieves a 71% higher average success rate compared to diffusion policies trained with teleoperation data and reduces data collection time by 77%, with some tasks solvable only by our framework. Compared to hand-held gripper, our method cuts data collection time by 41%. Additionally, our method bridges the embodiment gap, improves robustness to variations in camera viewpoints and robot configurations, and generalizes effectively across objects and spatial setups.
zh

[AI-52] AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability

【速读】:该论文试图解决在构建世界模型(World Model)以评估人工智能代理(AI Agent)可靠性与安全性时,如何平衡计算效率与模型可解释性的问题。传统方法中,准确的世界模型通常具有较高的计算需求,限制了评估的范围和深度。受“桶中大脑”思想实验的启发,论文探索了简化世界模型的方法,使其对被评估的AI代理保持无关性。解决方案的关键在于遵循计算力学的原则,揭示了世界模型构建过程中效率与可解释性之间的根本权衡关系,表明不存在同时优化所有理想特性的单一模型。基于此权衡,论文提出了构建世界模型的具体程序,分别用于最小化内存需求、界定可学习边界或追踪不良结果的原因,从而确立了世界建模的基本限制,并为有效评估代理设计提供了可行的指导原则。

链接: https://arxiv.org/abs/2504.04608
作者: Fernando Rosas,Alexander Boyd,Manuel Baltieri
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 38 pages, 5 figures

点击查看摘要

Abstract:Recent work proposes using world models to generate controlled virtual environments in which AI agents can be tested before deployment to ensure their reliability and safety. However, accurate world models often have high computational demands that can severely restrict the scope and depth of such assessments. Inspired by the classic `brain in a vat’ thought experiment, here we investigate ways of simplifying world models that remain agnostic to the AI agent under evaluation. By following principles from computational mechanics, our approach reveals a fundamental trade-off in world model construction between efficiency and interpretability, demonstrating that no single world model can optimise all desirable characteristics. Building on this trade-off, we identify procedures to build world models that either minimise memory requirements, delineate the boundaries of what is learnable, or allow tracking causes of undesirable outcomes. In doing so, this work establishes fundamental limits in world modelling, leading to actionable guidelines that inform core design choices related to effective agent evaluation.
zh

[AI-53] Capturing AIs Attention: Physics of Repetition Hallucination Bias and Beyond

【速读】:该论文试图从第一性物理原理的角度解析大型语言模型(LLMs)中注意力机制(Attention mechanism)的核心——基本注意力头(Attention head)的工作原理,以定量分析生成式AI(Generative AI)面临的挑战,如输出重复、幻觉生成(hallucination)、有害内容生成以及偏见(如训练和微调过程中的偏差)。论文的关键在于提出了一种基于2体形式的理论框架,该框架不仅能解释LLMs为何表现优异,还推测引入3体注意力机制可能进一步提升其性能。此外,通过与自旋浴(spin-bath)的相似性,论文暗示现有的物理学知识可以被迅速应用于确保AI的可信性和抵御操纵的能力。

链接: https://arxiv.org/abs/2504.04600
作者: Frank Yingjie Huo,Neil F. Johnson
机构: 未知
类目: Artificial Intelligence (cs.AI); Other Condensed Matter (cond-mat.other); Mathematical Physics (math-ph); Adaptation and Self-Organizing Systems (nlin.AO); Physics and Society (physics.soc-ph)
备注: Comments welcome to neiljohnson@gwu.edu

点击查看摘要

Abstract:We derive a first-principles physics theory of the AI engine at the heart of LLMs’ ‘magic’ (e.g. ChatGPT, Claude): the basic Attention head. The theory allows a quantitative analysis of outstanding AI challenges such as output repetition, hallucination and harmful content, and bias (e.g. from training and fine-tuning). Its predictions are consistent with large-scale LLM outputs. Its 2-body form suggests why LLMs work so well, but hints that a generalized 3-body Attention would make such AI work even better. Its similarity to a spin-bath means that existing Physics expertise could immediately be harnessed to help Society ensure AI is trustworthy and resilient to manipulation.
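论文分析的对象是最基本的注意力头,其"2-体"形式即 softmax(QKᵀ/√d)V:每个输出位置由所有位置两两配对的相互作用加权得到。NumPy 极简示意(维度为假设取值):

```python
import numpy as np

def attention_head(X, Wq, Wk, Wv):
    """单个注意力头:softmax(Q K^T / sqrt(d)) V,即文中的 2-体相互作用。"""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))  # 数值稳定
    A /= A.sum(axis=-1, keepdims=True)   # 每行是一个概率分布
    return A @ V, A

rng = np.random.default_rng(4)
T, dm, dh = 6, 8, 4                      # 序列长/模型维/头维(示意)
X = rng.normal(size=(T, dm))
Wq, Wk, Wv = (rng.normal(size=(dm, dh)) * 0.5 for _ in range(3))
out, A = attention_head(X, Wq, Wk, Wv)
print(out.shape, np.allclose(A.sum(axis=-1), 1.0))  # (6, 4) True
```

文中所谓"3-体注意力"的推测,对应把此处 QKᵀ 这种两两配对的打分推广为三元组之间的耦合。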
zh

[AI-54] “You just cant go around killing people” Explaining Agent Behavior to a Human Terminator ICML2024

【速读】:该论文试图解决在预训练智能体与人类操作员协同工作的环境中,如何优化人类干预(接管)次数的问题。论文关注的是在不允许任何接管与接管过于频繁之间的权衡:前者可能导致智能体采用次优甚至危险的行为策略,而后者则会削弱人类对智能体的信任,限制其实际应用价值。解决方案的关键在于提出了一种可解释性方案(Explainability Scheme),通过增强智能体行为的透明度,帮助平衡人类干预的频率,从而提升人机协作的整体性能与安全性。

链接: https://arxiv.org/abs/2504.04592
作者: Uri Menkes,Assaf Hallak,Ofra Amir
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, in proceedings of ICML 2024 Workshop on Models of Human Feedback for AI Alignment

点击查看摘要

Abstract:Consider a setting where a pre-trained agent is operating in an environment and a human operator can decide to temporarily terminate its operation and take-over for some duration of time. These kind of scenarios are common in human-machine interactions, for example in autonomous driving, factory automation and healthcare. In these settings, we typically observe a trade-off between two extreme cases – if no take-overs are allowed, then the agent might employ a sub-optimal, possibly dangerous policy. Alternatively, if there are too many take-overs, then the human has no confidence in the agent, greatly limiting its usefulness. In this paper, we formalize this setup and propose an explainability scheme to help optimize the number of human interventions.
zh

[AI-55] Hierarchical Planning for Complex Tasks with Knowledge Graph-RAG and Symbolic Verification

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)作为机器人规划器在处理长时域和复杂任务时的局限性,尤其是在需要外部知识的专门环境中。论文指出,尽管分层规划(hierarchical planning)和基于检索增强生成(Retrieval-Augmented Generation, RAG)技术部分缓解了这些问题,但单一方法仍不足以构建更可靠的系统,因此需要更深层次的整合。论文的关键解决方案在于提出了一种神经符号方法,通过结合基于知识图谱的RAG来增强基于LLMs的规划器,实现分层计划生成。该方法将复杂任务分解为可管理的子任务,并进一步扩展为可执行的基本动作序列。为确保形式上的正确性和分解的合理性,论文引入了一个符号验证器(Symbolic Validator),它不仅用于验证计划的正确性,还作为失败检测器,通过比对预期和实际世界状态来工作。实验结果表明,该方法在不同复杂度的任务和不同的LLMs上,相较于基线方法具有显著优势。此外,论文提出的实验设置和新型评估指标不仅验证了所提方法在复杂规划中的有效性,也为评估LLMs的推理和组合能力提供了工具。

链接: https://arxiv.org/abs/2504.04578
作者: Cristina Cornelio,Flavio Petruzzellis,Pietro Lio
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promise as robotic planners but often struggle with long-horizon and complex tasks, especially in specialized environments requiring external knowledge. While hierarchical planning and Retrieval-Augmented Generation (RAG) address some of these challenges, they remain insufficient on their own and a deeper integration is required for achieving more reliable systems. To this end, we propose a neuro-symbolic approach that enhances LLMs-based planners with Knowledge Graph-based RAG for hierarchical plan generation. This method decomposes complex tasks into manageable subtasks, further expanded into executable atomic action sequences. To ensure formal correctness and proper decomposition, we integrate a Symbolic Validator, which also functions as a failure detector by aligning expected and observed world states. Our evaluation against baseline methods demonstrates the consistent significant advantages of integrating hierarchical planning, symbolic verification, and RAG across tasks of varying complexity and different LLMs. Additionally, our experimental setup and novel metrics not only validate our approach for complex planning but also serve as a tool for assessing LLMs’ reasoning and compositional capabilities.
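符号验证器"比对预期与实际世界状态"的思路可以用一个玩具级 STRIPS 风格推演示意(动作效果表与谓词均为笔者假设,非论文实现):按效果模型逐步推演状态,终态不含目标即判定计划失败。

```python
def validate_plan(state, plan, effects, goal):
    """符号验证器示意:按动作效果推演世界状态,比对目标终态,
    同时充当失败检测器(未知动作 / 终态不符即失败)。"""
    for action in plan:
        if action not in effects:
            return False, f"未知动作: {action}"
        state = state | effects[action]      # 简化:效果只添加事实
    ok = goal <= state                       # 目标是否为终态子集
    return ok, "ok" if ok else "终态与目标不符"

effects = {                                  # 假设的动作效果表
    "pick(cup)":   {"holding(cup)"},
    "move(table)": {"at(table)"},
    "place(cup)":  {"on(cup, table)"},
}
ok, msg = validate_plan(set(),
                        ["pick(cup)", "move(table)", "place(cup)"],
                        effects, goal={"on(cup, table)"})
print(ok)  # True
```

真实系统中效果模型还需处理删除效果与前置条件,但"推演—比对"这一验证回路与上述骨架一致。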
zh

[AI-56] Planning Safety Trajectories with Dual-Phase Physics-Informed and Transportation Knowledge-Driven Large Language Models

【速读】:该论文旨在解决现有基础模型在驾驶任务中面临的幻觉(hallucinations)、不确定性以及长推理延迟等问题,并弥补其在交通领域特定安全知识方面的不足。论文提出了一种名为LetsPi的物理信息驱动、双阶段、知识引导的框架,用于实现安全且类人轨迹规划。解决方案的关键在于结合大语言模型(LLM)的推理能力与基于物理的社会力动力学,通过双阶段架构平衡推理与计算效率:第一阶段利用物理信息驱动的LLM进行结果处理和优化存储高质量驾驶经验;第二阶段提取相似驾驶经验作为少样本示例以快速生成轨迹,同时简化输入输出需求确保高效性与安全性。

链接: https://arxiv.org/abs/2504.04562
作者: Rui Gan,Pei Li,Keke Long,Bocheng An,Junwei You,Keshu Wu,Bin Ran
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Foundation models have demonstrated strong reasoning and generalization capabilities in driving-related tasks, including scene understanding, planning, and control. However, they still face challenges in hallucinations, uncertainty, and long inference latency. While existing foundation models have general knowledge of avoiding collisions, they often lack transportation-specific safety knowledge. To overcome these limitations, we introduce LetsPi, a physics-informed, dual-phase, knowledge-driven framework for safe, human-like trajectory planning. To prevent hallucinations and minimize uncertainty, this hybrid framework integrates Large Language Model (LLM) reasoning with physics-informed social force dynamics. LetsPi leverages the LLM to analyze driving scenes and historical information, providing appropriate parameters and target destinations (goals) for the social force model, which then generates the future trajectory. Moreover, the dual-phase architecture balances reasoning and computational efficiency through its Memory Collection phase and Fast Inference phase. The Memory Collection phase leverages the physics-informed LLM to process and refine planning results through reasoning, reflection, and memory modules, storing safe, high-quality driving experiences in a memory bank. Surrogate safety measures and physics-informed prompt techniques are introduced to enhance the LLM’s knowledge of transportation safety and physical force, respectively. The Fast Inference phase extracts similar driving experiences as few-shot examples for new scenarios, while simplifying input-output requirements to enable rapid trajectory planning without compromising safety. Extensive experiments using the HighD dataset demonstrate that LetsPi outperforms baseline models across five safety metrics. See this https URL for the project GitHub link.
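LetsPi 中由 LLM 提供参数与目标、再由社会力模型生成轨迹。社会力动力学的骨架可以示意为"目标吸引力 + 对邻车的指数排斥力"(参数 A、B、tau 均为示意取值,非论文设定):

```python
import numpy as np

def social_force(p_ego, v_ego, goal, neighbors, tau=0.5, A=2.0, B=1.0):
    """简化社会力模型:朝目标的驱动力 + 对每个邻车的指数排斥力。"""
    e_goal = (goal - p_ego) / np.linalg.norm(goal - p_ego)
    f_goal = e_goal - v_ego / tau          # 期望方向与当前速度的松弛项
    f_rep = np.zeros(2)
    for p_j in neighbors:
        d_vec = p_ego - p_j
        d = np.linalg.norm(d_vec)
        f_rep += A * np.exp(-d / B) * d_vec / d   # 距离越近排斥越强
    return f_goal + f_rep

p_ego = np.array([0.0, 0.0])
v_ego = np.array([1.0, 0.0])
goal = np.array([10.0, 0.0])
neighbors = [np.array([1.0, 0.5])]         # 前方偏左有一辆邻车(示意)
f = social_force(p_ego, v_ego, goal, neighbors)
print(f.shape)  # (2,)
```

对前方偏左的邻车,合力的两个分量都被推向负方向(减速并向右偏),体现了排斥项的避撞作用;LLM 在该框架中负责为这类参数与目标点给出场景相关的取值。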
zh

[AI-57] A Consequentialist Critique of Binary Classification Evaluation Practices

【速读】:该论文试图解决在机器学习支持的决策(如二分类概率预测)评估中,主流评价指标未能充分反映实际决策需求的问题。现有评估框架倾向于优先使用独立决策指标(如Accuracy)或Top-K指标(如Precision@K),以及固定的阈值或阈值无关的度量(如AUC-ROC),而忽视了基于后果论视角下更适配独立决策场景的混合阈值评价指标(如Brier Score和Log Loss)。论文的关键解决方案是通过决策理论框架重新映射评价指标至其最优应用场景,并开发了一个名为briertools的Python工具包以促进Brier Score等指标的广泛应用。此外,研究还揭示了Brier Score与Decision Curve Analysis之间的新理论联系,回应了关于适当评分规则临床效用的长期争议。

链接: https://arxiv.org/abs/2504.04528
作者: Gerardo Flores,Abigail Schiff,Alyssa H. Smith,Julia A Fukuyama,Ashia C. Wilson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:ML-supported decisions, such as ordering tests or determining preventive custody, often involve binary classification based on probabilistic forecasts. Evaluation frameworks for such forecasts typically consider whether to prioritize independent-decision metrics (e.g., Accuracy) or top-K metrics (e.g., Precision@K), and whether to focus on fixed thresholds or threshold-agnostic measures like AUC-ROC. We highlight that a consequentialist perspective, long advocated by decision theorists, should naturally favor evaluations that support independent decisions using a mixture of thresholds given their prevalence, such as Brier scores and Log loss. However, our empirical analysis reveals a strong preference for top-K metrics or fixed thresholds in evaluations at major conferences like ICML, FAccT, and CHIL. To address this gap, we use this decision-theoretic framework to map evaluation metrics to their optimal use cases, along with a Python package, briertools, to promote the broader adoption of Brier scores. In doing so, we also uncover new theoretical connections, including a reconciliation between the Brier Score and Decision Curve Analysis, which clarifies and responds to a longstanding critique by (Assel, et al. 2017) regarding the clinical utility of proper scoring rules.
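论文主张的 Brier 分数与对数损失都是"适当评分规则",对概率预测本身打分而不依赖单一阈值。两者的定义都只有一行(下例数值为随意构造的演示数据):

```python
import numpy as np

def brier_score(p, y):
    """Brier 分数:概率预测与 0/1 结果的均方误差,越小越好。"""
    return np.mean((p - y) ** 2)

def log_loss(p, y, eps=1e-12):
    """对数损失(交叉熵),同为严格适当评分规则。"""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

p = np.array([0.9, 0.2, 0.7, 0.4])   # 概率预测(演示数据)
y = np.array([1, 0, 1, 1])           # 实际 0/1 结果
print(round(brier_score(p, y), 4))   # 0.125
print(round(log_loss(p, y), 4))      # 0.4004
```

与 Accuracy 或 Precision@K 不同,这两个指标隐式地对所有决策阈值取了混合平均,这正是文中"支持独立决策、混合阈值"评估立场的量化载体。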
zh

[AI-58] rust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning

【速读】:该论文旨在解决基于奖励的优化方法在对齐任务中易受奖励劫持(reward hacking)影响的问题,同时探索偏好优化算法在推理任务中性能仍落后于基于奖励的优化方法(如PPO)的情况。论文的关键在于提出了一种名为Trust Region Preference Approximation (TRPA)的新算法,通过将基于规则的优化与基于偏好的优化相结合,用于推理任务。TRPA作为基于偏好的算法,天然避免了奖励劫持问题,并通过构建基于预定义规则的偏好层级、形成相应的偏好对以及采用具有理论单调改进保证的新型优化算法来进行强化学习训练,从而实现性能提升与稳定性增强。

链接: https://arxiv.org/abs/2504.04524
作者: Xuerui Su,Shufang Xie,Guoqing Liu,Yingce Xia,Renqian Luo,Peiran Jin,Zhiming Ma,Yue Wang,Zun Wang,Yuting Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10pages

点击查看摘要

Abstract:Recently, Large Language Models (LLMs) have rapidly evolved, approaching Artificial General Intelligence (AGI) while benefiting from large-scale reinforcement learning to enhance Human Alignment (HA) and Reasoning. Recent reward-based optimization algorithms, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) have achieved significant performance on reasoning tasks, whereas preference-based optimization algorithms such as Direct Preference Optimization (DPO) significantly improve the performance of LLMs on human alignment. However, despite the strong performance of reward-based optimization methods in alignment tasks, they remain vulnerable to reward hacking. Furthermore, preference-based algorithms (such as Online DPO) haven’t yet matched the performance of reward-based optimization algorithms (like PPO) on reasoning tasks, making their exploration in this specific area still a worthwhile pursuit. Motivated by these challenges, we propose the Trust Region Preference Approximation (TRPA) algorithm, which integrates rule-based optimization with preference-based optimization for reasoning tasks. As a preference-based algorithm, TRPA naturally eliminates the reward hacking issue. TRPA constructs preference levels using predefined rules, forms corresponding preference pairs, and leverages a novel optimization algorithm for RL training with a theoretical monotonic improvement guarantee. Experimental results demonstrate that TRPA not only achieves competitive performance on reasoning tasks but also exhibits robust stability. The code of this paper is released and updated at this https URL.
zh

[AI-59] LoopGen: Training-Free Loopable Music Generation

【速读】:该论文旨在解决当前生成式音乐(Generative Music)模型难以生成真正可无缝循环(loopable)音频的问题。现有方法仅通过独立生成短波形无法确保其起点与终点之间的平滑过渡,从而导致明显的听觉瑕疵。为填补这一空白,论文提出的关键解决方案是对一个非自回归模型(MAGNeT)进行修改,使其以循环模式生成标记(tokens),允许模型在生成结尾时关注音频的起始部分。此仅依赖推理(inference-only)的方法使得生成的音频能够感知未来上下文并实现自然循环,且无需额外训练或数据支持。实验结果表明,该方法在循环过渡一致性上提升了55%的困惑度(perplexity),盲测评分也显著提高了70%,充分验证了推理优化在提升生成模型性能方面的有效性,并突显了非自回归方法在上下文感知音乐生成中的优势。

链接: https://arxiv.org/abs/2504.04466
作者: Davide Marincione,Giorgio Strano,Donato Crisostomi,Roberto Ribuoli,Emanuele Rodolà
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Loops–short audio segments designed for seamless repetition–are central to many music genres, particularly those rooted in dance and electronic styles. However, current generative music models struggle to produce truly loopable audio, as generating a short waveform alone does not guarantee a smooth transition from its endpoint back to its start, often resulting in audible artifacts. We address this gap by modifying a non-autoregressive model (MAGNeT) to generate tokens in a circular pattern, letting the model attend to the beginning of the audio when creating its ending. This inference-only approach results in generations that are aware of future context and loop naturally, without the need for any additional training or data. We evaluate the consistency of loop transitions by computing token perplexity around the seam of the loop, observing a 55% improvement. Blind listening tests further confirm significant perceptual gains over baseline methods, improving mean ratings by 70%. Taken together, these results highlight the effectiveness of inference-only approaches in improving generative models and underscore the advantages of non-autoregressive methods for context-aware music generation.
zh
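"循环模式生成 token"的核心想法可以用一个极简草图说明:为序列末尾位置取上下文时按序列长度取模,使生成结尾时能"看到"开头。以下代码仅为示意(非 MAGNeT 官方实现,窗口大小等均为假设):

```python
def circular_context(tokens, pos, window):
    """取以 pos 为中心、按序列长度取模展开的上下文窗口,
    令末尾位置的生成能够关注序列开头的 token(循环接缝两侧互相可见)。"""
    n = len(tokens)
    return [tokens[(pos + off) % n] for off in range(-window, window + 1)]

tokens = list(range(10))              # 假想的 token 序列
ctx = circular_context(tokens, pos=9, window=2)
# 末尾位置 9 的上下文跨过接缝, 同时包含结尾与开头的 token
```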

[AI-60] Do We Need Responsible XR? Drawing on Responsible AI to Inform Ethical Research and Practice into XRAI / the Metaverse

【速读】:该论文试图探讨人机交互(HCI)领域是否需要定义负责任的扩展现实(Responsible XR)作为负责任的人工智能(Responsible AI)的平行领域,并结合两者共同应对由大规模采用可穿戴人工智能增强的增强现实(AR)眼镜及扩展现实(XR)设备所引发的独特脆弱性问题。论文的关键在于提出一种框架或方法,以确保在AI驱动的人类感知增强技术普及过程中,能够有效评估和管理由此带来的伦理、隐私和社会影响,从而促进技术的负责任发展与应用。

链接: https://arxiv.org/abs/2504.04440
作者: Mark McGill,Joseph O’Hagan,Thomas Goodge,Graham Wilson,Mohamed Khamis,Veronika Krauß,Jan Gugenheimer
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This position paper for the CHI 2025 workshop “Everyday AR through AI-in-the-Loop” reflects on whether as a field HCI needs to define Responsible XR as a parallel to, and in conjunction with, Responsible AI, addressing the unique vulnerabilities posed by mass adoption of wearable AI-enabled AR glasses and XR devices that could enact AI-driven human perceptual augmentation.
zh

[AI-61] AGITB: A Signal-Level Benchmark for Evaluating Artificial General Intelligence

【速读】:该论文旨在解决现有AI系统未能达到人类水平通用智能(Artificial General Intelligence, AGI)的问题,并提出了一种新的评估方法以更有效地衡量迈向AGI的进展。当前的AGI评估方法无法提供实用、渐进且信息丰富的度量标准,而论文的关键创新在于引入了“人工通用智能测试平台”(Artificial General Intelligence Test Bed, AGITB)。AGITB由十二项严格的测试组成,这些测试基于信号处理层面,旨在揭示潜在的认知能力特征。其核心解决方案在于通过模型预测时间序列二元信号的能力来评估智能,而不依赖符号表示或预训练,聚焦于反映生物智能本质的核心计算不变性(如确定性、敏感性和泛化能力)。这一设计确保了测试的公平性、独立于语义意义,并排除了通过暴力破解或记忆达成解题的可能性,从而为AGI的研究提供了更具挑战性和指导性的基准。

链接: https://arxiv.org/abs/2504.04430
作者: Matej Šprogar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite remarkable progress in machine learning, current AI systems continue to fall short of true human-like intelligence. While Large Language Models (LLMs) excel in pattern recognition and response generation, they lack genuine understanding - an essential hallmark of Artificial General Intelligence (AGI). Existing AGI evaluation methods fail to offer a practical, gradual, and informative metric. This paper introduces the Artificial General Intelligence Test Bed (AGITB), comprising twelve rigorous tests that form a signal-processing-level foundation for the potential emergence of cognitive capabilities. AGITB evaluates intelligence through a model’s ability to predict binary signals across time without relying on symbolic representations or pretraining. Unlike high-level tests grounded in language or perception, AGITB focuses on core computational invariants reflective of biological intelligence, such as determinism, sensitivity, and generalisation. The test bed assumes no prior bias, operates independently of semantic meaning, and ensures unsolvability through brute force or memorization. While humans pass AGITB by design, no current AI system has met its criteria, making AGITB a compelling benchmark for guiding and recognizing progress toward AGI.
zh
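AGITB 以"模型能否预测时间序列二进制信号的下一位"来度量潜在智能。下面是该评测思路的最小示意(warmup、信号与预测器均为假设,并非 AGITB 的十二项正式测试):

```python
def evaluate_predictor(predict, signal, warmup=8):
    """逐步向预测器喂入二进制信号的历史, 统计其对下一位的预测准确率。
    predict(history) -> 0 或 1; warmup 之前的预测不计分。"""
    correct = total = 0
    for t in range(warmup, len(signal) - 1):
        if predict(signal[:t + 1]) == signal[t + 1]:
            correct += 1
        total += 1
    return correct / total

signal = [0, 1] * 20                        # 周期为 2 的确定性信号
period2 = lambda h: h[-2] if len(h) >= 2 else 0
acc = evaluate_predictor(period2, signal)   # 捕捉到周期规律的预测器应全部命中
```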

[AI-62] Formula-Supervised Sound Event Detection: Pre-Training Without Real Data ICASSP2025

【速读】:该论文旨在解决声事件检测(Sound Event Detection, SED)任务中因标注数据不足及人工标签噪声与主观偏差导致的挑战。论文的关键解决方案是提出了一种基于公式驱动合成数据集(Formula-SED)的新型预训练方法。该方法通过数学公式生成声学数据,并利用合成过程中使用的参数作为真实标签,从而实现大规模无噪声的预训练。这种方法不仅显著提升了模型的准确性,还加速了训练过程,验证结果在DCASE2023 Challenge Task 4所使用的DESED数据集上得以体现。

链接: https://arxiv.org/abs/2504.04428
作者: Yuto Shibata,Keitaro Tanaka,Yoshiaki Bando,Keisuke Imoto,Hirokatsu Kataoka,Yoshimitsu Aoki
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:In this paper, we propose a novel formula-driven supervised learning (FDSL) framework for pre-training an environmental sound analysis model by leveraging acoustic signals parametrically synthesized through formula-driven methods. Specifically, we outline detailed procedures and evaluate their effectiveness for sound event detection (SED). The SED task, which involves estimating the types and timings of sound events, is particularly challenged by the difficulty of acquiring a sufficient quantity of accurately labeled training data. Moreover, it is well known that manually annotated labels often contain noises and are significantly influenced by the subjective judgment of annotators. To address these challenges, we propose a novel pre-training method that utilizes a synthetic dataset, Formula-SED, where acoustic data are generated solely based on mathematical formulas. The proposed method enables large-scale pre-training by using the synthesis parameters applied at each time step as ground truth labels, thereby eliminating label noise and bias. We demonstrate that large-scale pre-training with Formula-SED significantly enhances model accuracy and accelerates training, as evidenced by our results in the DESED dataset used for DCASE2023 Challenge Task 4. The project page is at this https URL
zh
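Formula-SED 的关键是"合成参数即真值标签"。下面用一段参数化正弦示意这种公式驱动的数据生成(采样率与事件形式均为假设,非论文实际的合成流程):

```python
import math

def synth_event(freq_hz, start, dur, sr=1000, total=1.0):
    """按公式生成一段正弦"声事件": 波形完全由参数决定,
    因此 (freq_hz, onset, offset) 可直接作为无噪声、无主观偏差的标签。"""
    n = int(sr * total)
    wave = [0.0] * n
    for i in range(int(start * sr), min(int((start + dur) * sr), n)):
        wave[i] = math.sin(2 * math.pi * freq_hz * i / sr)
    label = {"freq_hz": freq_hz, "onset": start, "offset": start + dur}
    return wave, label

wave, label = synth_event(freq_hz=50.0, start=0.2, dur=0.3)
```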

[AI-63] Driving-RAG: Driving Scenarios Embedding Search and RAG Applications

【速读】:本文旨在解决驾驶场景数据在高效嵌入、搜索及应用方面的挑战,特别是在 Retrieval-Augmented-Generation (RAG) 系统中的需求。论文提出了一种名为 Driving-RAG 的框架,其关键在于通过嵌入模型在向量空间中对基础场景信息和场景距离度量进行对齐,结合层次化可导航小世界网络实现高效的场景向量检索,从而在保证高效率的同时不牺牲准确性。此外,引入图知识重组机制增强了与提示场景的相关性,并提升了大语言模型 (LLM) 的生成能力。该框架在复杂交互场景(如匝道和交叉口)的典型轨迹规划任务中展示了其有效性,凸显了其在 RAG 应用中的优势。

链接: https://arxiv.org/abs/2504.04419
作者: Cheng Chang,Jingwei Ge,Jiazhe Guo,Zelin Guo,Binghong Jiang,Li Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Driving scenario data play an increasingly vital role in the development of intelligent vehicles and autonomous driving. Accurate and efficient scenario data search is critical for both online vehicle decision-making and planning, and offline scenario generation and simulations, as it allows for leveraging the scenario experiences to improve the overall performance. Especially with the application of large language models (LLMs) and Retrieval-Augmented-Generation (RAG) systems in autonomous driving, urgent requirements are put forward. In this paper, we introduce the Driving-RAG framework to address the challenges of efficient scenario data embedding, search, and applications for RAG systems. Our embedding model aligns fundamental scenario information and scenario distance metrics in the vector space. The typical scenario sampling method combined with hierarchical navigable small world can perform efficient scenario vector search to achieve high efficiency without sacrificing accuracy. In addition, the reorganization mechanism by graph knowledge enhances the relevance to the prompt scenarios and augment LLM generation. We demonstrate the effectiveness of the proposed framework on typical trajectory planning task for complex interactive scenarios such as ramps and intersections, showcasing its advantages for RAG applications.
zh
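场景向量检索的接口语义可以示意如下。论文使用 HNSW 做高效近似最近邻,此处以暴力余弦相似度代替,场景名称与向量均为虚构:

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def retrieve_scenarios(query_vec, db, k=2):
    """按与查询向量的余弦相似度对场景库排序, 返回 top-k 场景名,
    检索结果可作为 RAG 提示中的相关场景经验。"""
    ranked = sorted(db.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

db = {"ramp_merge": [1.0, 0.1], "intersection": [0.0, 1.0], "cruise": [0.9, 0.2]}
top = retrieve_scenarios([1.0, 0.0], db, k=2)
```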

[AI-64] Universal Item Tokenization for Transferable Generative Recommendation

【速读】:该论文旨在解决现有生成式推荐方法中,因领域特定的项目标记器(item tokenizer)和推荐器导致的跨领域迁移能力受限的问题。为了解决这一问题,论文提出了一种名为UTGRec的通用项目标记方法,用于可迁移的生成式推荐。其关键解决方案包括:(1) 设计了一个基于多模态大型语言模型(Multimodal Large Language Model, MLLM)的通用项目标记器,以编码丰富的项目语义,并通过树状结构代码本将内容表示离散化为对应的代码;(2) 引入两种关键技术:一是使用双轻量级解码器实现原始内容的重建,以捕获嵌入在内容中的通用知识;二是假设共现项目相似,并通过共现对齐与重建集成协同信号;(3) 提出一个联合学习框架,用于在多个领域预训练和适配可迁移的生成式推荐器。实验结果验证了UTGRec在四个公开数据集上的优越性。

链接: https://arxiv.org/abs/2504.04405
作者: Bowen Zheng,Hongyu Lu,Yu Chen,Wayne Xin Zhao,Ji-Rong Wen
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, generative recommendation has emerged as a promising paradigm, attracting significant research attention. The basic framework involves an item tokenizer, which represents each item as a sequence of codes serving as its identifier, and a generative recommender that predicts the next item by autoregressively generating the target item identifier. However, in existing methods, both the tokenizer and the recommender are typically domain-specific, limiting their ability for effective transfer or adaptation to new domains. To this end, we propose UTGRec, a Universal item Tokenization approach for transferable Generative Recommendation. Specifically, we design a universal item tokenizer for encoding rich item semantics by adapting a multimodal large language model (MLLM). By devising tree-structured codebooks, we discretize content representations into corresponding codes for item tokenization. To effectively learn the universal item tokenizer on multiple domains, we introduce two key techniques in our approach. For raw content reconstruction, we employ dual lightweight decoders to reconstruct item text and images from discrete representations to capture general knowledge embedded in the content. For collaborative knowledge integration, we assume that co-occurring items are similar and integrate collaborative signals through co-occurrence alignment and reconstruction. Finally, we present a joint learning framework to pre-train and adapt the transferable generative recommender across multiple domains. Extensive experiments on four public datasets demonstrate the superiority of UTGRec compared to both traditional and generative recommendation baselines.
zh
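树状码本离散化的骨架是残差量化:逐层选最近码字、再量化残差,得到的索引序列即项目标识符。以下为该步骤的示意(码本为手工构造,非 RQ-VAE 的训练结果):

```python
def rq_tokenize(vec, codebooks):
    """多级残差量化: 每一层在码本中找与当前残差最近的码字,
    记录其索引并扣除该码字, 下一层继续量化剩余残差。"""
    codes, residual = [], list(vec)
    for book in codebooks:
        best = min(range(len(book)),
                   key=lambda j: sum((r - c) ** 2 for r, c in zip(residual, book[j])))
        codes.append(best)
        residual = [r - c for r, c in zip(residual, book[best])]
    return codes

books = [
    [[0.0, 0.0], [1.0, 1.0]],    # 第一层码本: 粗粒度
    [[0.0, 0.0], [0.2, -0.1]],   # 第二层码本: 量化残差
]
codes = rq_tokenize([1.2, 0.9], books)   # 该向量可被两层码字精确重构
```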

[AI-65] Pre-training Generative Recommender with Multi-Identifier Item Tokenization

【速读】:该论文旨在解决生成式推荐系统中低频项语义建模不足以及基于单一标识符的推荐多样性受限的问题。现有方法通常采用一对一映射策略,即每个项目仅由单个标识符表示,这导致低频项的语义建模效果不佳,并限制了令牌序列数据的多样性。为克服这些局限性,论文提出了一种名为MTGRec的方法,通过引入多标识符项目分词技术来增强生成式推荐器的预训练数据。

MTGRec的关键创新点在于多标识符项目分词与课程推荐器预训练。在多标识符项目分词方面,研究者使用RQ-VAE作为分词器主干,并将相邻训练轮次的模型检查点视为语义相关的分词器,使每个项目能够关联多个标识符,从而将单个用户交互序列转换为多个令牌序列作为不同的数据组。课程推荐器预训练则引入了一种由数据影响估计引导的课程学习方案,在推荐器预训练过程中动态调整每个数据组的采样概率。预训练完成后,使用单一分词器对模型进行微调,以确保推荐的准确物品识别。实验结果表明,MTGRec在三个公开基准数据集上的表现显著优于传统和生成式推荐基线方法。

链接: https://arxiv.org/abs/2504.04400
作者: Bowen Zheng,Enze Liu,Zhongfu Chen,Zhongrui Ma,Yue Wang,Wayne Xin Zhao,Ji-Rong Wen
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative recommendation autoregressively generates item identifiers to recommend potential items. Existing methods typically adopt a one-to-one mapping strategy, where each item is represented by a single identifier. However, this scheme poses issues, such as suboptimal semantic modeling for low-frequency items and limited diversity in token sequence data. To overcome these limitations, we propose MTGRec, which leverages Multi-identifier item Tokenization to augment token sequence data for Generative Recommender pre-training. Our approach involves two key innovations: multi-identifier item tokenization and curriculum recommender pre-training. For multi-identifier item tokenization, we leverage the RQ-VAE as the tokenizer backbone and treat model checkpoints from adjacent training epochs as semantically relevant tokenizers. This allows each item to be associated with multiple identifiers, enabling a single user interaction sequence to be converted into several token sequences as different data groups. For curriculum recommender pre-training, we introduce a curriculum learning scheme guided by data influence estimation, dynamically adjusting the sampling probability of each data group during recommender pre-training. After pre-training, we fine-tune the model using a single tokenizer to ensure accurate item identification for recommendation. Extensive experiments on three public benchmark datasets demonstrate that MTGRec significantly outperforms both traditional and generative recommendation baselines in terms of effectiveness and scalability.
zh
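"由数据影响估计引导的采样概率调整"可以示意为对各数据组的影响力分数做 softmax(影响力数值为假设;论文中的影响估计方法此处不做实现):

```python
import math

def group_sampling_probs(influence, temperature=1.0):
    """对每个数据组的影响力估计做 softmax, 得到动态采样概率:
    影响力越高的组在预训练中被采样得越频繁。"""
    exps = [math.exp(v / temperature) for v in influence]
    z = sum(exps)
    return [e / z for e in exps]

probs = group_sampling_probs([2.0, 1.0, 0.0])  # 三个数据组的假想影响力
```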

[AI-66] iADCPS: Time Series Anomaly Detection for Evolving Cyber-physical Systems via Incremental Meta-learning

【速读】:该论文旨在解决现有异常检测方法在处理网络物理系统(CPS)中因数据分布随时间(temporal)和空间(spatial)维度变化而导致的适应性不足问题。解决方案的关键在于提出了一种基于增量元学习的方法(iADCPS),其通过少量演化的正常样本持续更新模型,以弥合演化数据与历史数据之间的分布差异。具体而言,首先引入时间混合策略(temporal mixup strategy)以实现数据层面的泛化,并结合单类元学习方法实现模型层面的泛化。此外,开发了一种非参数动态阈值,无需异常监督即可根据异常分数的概率密度自适应调整阈值。实验结果表明,该方法在PUMP、SWaT和WADI三个公开数据集上的F1-Score分别达到99.0%、93.1%和78.7%,显著优于现有的最先进的CPS异常检测方法。

链接: https://arxiv.org/abs/2504.04374
作者: Jiyu Tian,Mingchu Li,Liming Chen,Zumin Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Anomaly detection for cyber-physical systems (ADCPS) is crucial in identifying faults and potential attacks by analyzing the time series of sensor measurements and actuator states. However, current methods lack adaptation to data distribution shifts in both temporal and spatial dimensions as cyber-physical systems evolve. To tackle this issue, we propose an incremental meta-learning-based approach, namely iADCPS, which can continuously update the model through limited evolving normal samples to reconcile the distribution gap between evolving and historical time series. Specifically, We first introduce a temporal mixup strategy to align data for data-level generalization which is then combined with the one-class meta-learning approach for model-level generalization. Furthermore, we develop a non-parametric dynamic threshold to adaptively adjust the threshold based on the probability density of the abnormal scores without any anomaly supervision. We empirically evaluate the effectiveness of the iADCPS using three publicly available datasets PUMP, SWaT, and WADI. The experimental results demonstrate that our method achieves 99.0%, 93.1%, and 78.7% F1-Score, respectively, which outperforms the state-of-the-art (SOTA) ADCPS method, especially in the context of the evolving CPSs.
zh
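"基于异常分数概率密度的非参数动态阈值"可以示意如下:对历史(正常)分数做高斯核密度估计,在均值右侧取密度首次跌破 eps 的位置作为阈值,全程不需要异常标签(带宽与 eps 均为假设参数,非论文原实现):

```python
import math

def dynamic_threshold(scores, bandwidth=0.5, grid=200, eps=0.01):
    """非参数动态阈值示意: 对历史异常分数做高斯核密度估计,
    在均值右侧取密度首次跌破 eps 的位置作为阈值, 不需要任何异常监督。"""
    n = len(scores)
    mu = sum(scores) / n
    lo, hi = min(scores), max(scores) + 3 * bandwidth

    def kde(x):
        s = sum(math.exp(-((x - v) / bandwidth) ** 2 / 2) for v in scores)
        return s / (n * bandwidth * math.sqrt(2 * math.pi))

    for i in range(grid + 1):
        x = lo + (hi - lo) * i / grid
        if x > mu and kde(x) < eps:
            return x
    return hi

normal_scores = [0.1, 0.2, 0.15, 0.3, 0.25, 0.2]   # 假想的正常样本分数
threshold = dynamic_threshold(normal_scores)
```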

[AI-67] How Accurately Do Large Language Models Understand Code?

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在代码理解能力方面评估不足的问题。现有方法主要关注代码生成或依赖开发者调查,而缺乏针对代码理解的标准化度量。此外,固定基准容易因成为训练数据的一部分而过时。论文的关键解决方案是受变异测试(mutation testing)启发,将LLMs的错误发现能力作为其深层代码理解程度的代理指标。这种方法基于一个核心洞察:能够识别细微功能差异的模型必然对代码有深刻理解。通过在真实程序中注入故障并让LLMs定位这些故障,同时验证语义保持的代码变异(Semantic-Preserving Mutations, SPMs)后模型是否仍能正确定位故障,论文评估了九种流行LLMs在575,000个来自Java和Python程序的调试任务中的表现,揭示了LLMs对代码理解的局限性及其与词法和句法特征的紧密关联。

链接: https://arxiv.org/abs/2504.04372
作者: Sabaat Haroon,Ahmad Faraz Khan,Ahmad Humayun,Waris Gill,Abdul Haddi Amjad,Ali R. Butt,Mohammad Taha Khan,Muhammad Ali Gulzar
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper is currently Under Review. It consists of 11 pages, 12 Figures, and 5 Tables

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used in post-development tasks such as code repair and testing. A key factor in these tasks’ success is the model’s deep understanding of code. However, the extent to which LLMs truly understand code remains largely unevaluated. Quantifying code comprehension is challenging due to its abstract nature and the lack of a standardized metric. Previously, this was assessed through developer surveys, which are not feasible for evaluating LLMs. Existing LLM benchmarks focus primarily on code generation, fundamentally different from code comprehension. Additionally, fixed benchmarks quickly become obsolete as they become part of the training data. This paper presents the first large-scale empirical investigation into LLMs’ ability to understand code. Inspired by mutation testing, we use an LLM’s fault-finding ability as a proxy for its deep code understanding. This approach is based on the insight that a model capable of identifying subtle functional discrepancies must understand the code well. We inject faults in real-world programs and ask the LLM to localize them, ensuring the specifications suffice for fault localization. Next, we apply semantic-preserving code mutations (SPMs) to the faulty programs and test whether the LLMs still locate the faults, verifying their confidence in code understanding. We evaluate nine popular LLMs on 575000 debugging tasks from 670 Java and 637 Python programs. We find that LLMs lose the ability to debug the same bug in 81% of faulty programs when SPMs are applied, indicating a shallow understanding of code and reliance on features irrelevant to semantics. We also find that LLMs understand code earlier in the program better than later. This suggests that LLMs’ code comprehension remains tied to lexical and syntactic features due to tokenization designed for natural languages, which overlooks code semantics.
zh
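语义保持变异 (SPM) 的一个最简单实例是变量重命名:程序行为完全不变,而词法特征改变。以下用 Python 的 ast 模块做示意(需要 Python 3.9+ 的 ast.unparse;论文所用的 SPM 种类远不止于此):

```python
import ast

def rename_variables(source, mapping):
    """用 AST 将变量按 mapping 重命名后重新生成源码:
    这是一个语义保持变异 (SPM), 可检验模型是否依赖表面词法特征。"""
    class Renamer(ast.NodeTransformer):
        def visit_Name(self, node):
            if node.id in mapping:
                node.id = mapping[node.id]
            return node
    return ast.unparse(Renamer().visit(ast.parse(source)))

buggy = "def add(a, b):\n    total = a - b\n    return total"  # bug: 应为 a + b
mutated = rename_variables(buggy, {"total": "s"})
# 变异后的程序与原程序行为一致, 故障位置不变, 仅变量名不同
```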

[AI-68] WeiDetect: Weibull Distribution-Based Defense against Poisoning Attacks in Federated Learning for Network Intrusion Detection Systems

【速读】:该论文旨在解决在数据扩展时代,传统基于AI的应用因数据隐私保障不足及物联网设备普及带来的网络安全挑战而面临的局限性,同时应对联邦学习(Federated Learning, FL)系统中分布式模型训练虽采用隐私保护技术但仍易受对抗攻击的问题。此外,论文关注FL场景下客户端数据分布异质性所带来的挑战。为解决这些问题,论文提出了一种名为WeiDetect的两阶段服务器端防御机制,其关键是通过验证数据集评估本地模型生成验证分数,并利用威布尔分布(Weibull Distribution)分析这些分数以识别并移除恶意模型,从而增强联邦学习驱动的网络入侵检测系统(Network Intrusion Detection System, NIDS)的安全性和有效性。

链接: https://arxiv.org/abs/2504.04367
作者: Sameera K. M.,Vinod P.,Anderson Rocha,Rafidha Rehiman K. A.,Mauro Conti
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the era of data expansion, ensuring data privacy has become increasingly critical, posing significant challenges to traditional AI-based applications. In addition, the increasing adoption of IoT devices has introduced significant cybersecurity challenges, making traditional Network Intrusion Detection Systems (NIDS) less effective against evolving threats, and privacy concerns and regulatory restrictions limit their deployment. Federated Learning (FL) has emerged as a promising solution, allowing decentralized model training while maintaining data privacy to solve these issues. However, despite implementing privacy-preserving technologies, FL systems remain vulnerable to adversarial attacks. Furthermore, data distribution among clients is not heterogeneous in the FL scenario. We propose WeiDetect, a two-phase, server-side defense mechanism for FL-based NIDS that detects malicious participants to address these challenges. In the first phase, local models are evaluated using a validation dataset to generate validation scores. These scores are then analyzed using a Weibull distribution, identifying and removing malicious models. We conducted experiments to evaluate the effectiveness of our approach in diverse attack settings. Our evaluation included two popular datasets, CIC-Darknet2020 and CSE-CIC-IDS2018, tested under non-IID data distributions. Our findings highlight that WeiDetect outperforms state-of-the-art defense approaches, improving higher target class recall up to 70% and enhancing the global model’s F1 score by 1% to 14%.
zh

[AI-69] Solving Sokoban using Hierarchical Reinforcement Learning with Landmarks

【速读】:该论文旨在解决复杂组合难题(如Sokoban)中基于层级强化学习 (Hierarchical Reinforcement Learning, HRL) 的高效规划与执行问题。论文的关键在于提出了一种新颖的层级框架,通过自下而上的子目标递归规划实现端到端的学习,无需任何领域知识。其解决方案的核心是构建了一个六层策略层级结构,每一层高阶策略为下一层生成子目标,从而实现从单一高层指令生成长动作序列的能力。研究表明,这种深层次的子目标分解能够完全从学习中涌现,并且该层级结构可以有效扩展到困难的谜题领域。

链接: https://arxiv.org/abs/2504.04366
作者: Sergey Pastukhov
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:We introduce a novel hierarchical reinforcement learning (HRL) framework that performs top-down recursive planning via learned subgoals, successfully applied to the complex combinatorial puzzle game Sokoban. Our approach constructs a six-level policy hierarchy, where each higher-level policy generates subgoals for the level below. All subgoals and policies are learned end-to-end from scratch, without any domain knowledge. Our results show that the agent can generate long action sequences from a single high-level call. While prior work has explored 2-3 level hierarchies and subgoal-based planning heuristics, we demonstrate that deep recursive goal decomposition can emerge purely from learning, and that such hierarchies can scale effectively to hard puzzle domains.
zh

[AI-70] AutoPDL: Automatic Prompt Optimization for LLM Agents

【速读】:该论文试图解决手动调整大型语言模型(Large Language Models, LLMs)提示组合(包括高层级提示模式如Zero-Shot、CoT、ReAct、ReWOO以及具体提示内容如指令和少量示例演示)的繁琐、易错且无法跨模型或任务迁移的问题。论文提出了一种名为AutoPDL的自动化方法,旨在发现有效的LLM代理配置。解决方案的关键在于将此问题形式化为一个结构化的自动机器学习(AutoML)问题,通过组合代理和非代理提示模式及演示,在一个组合空间中高效导航,并采用successive halving算法优化搜索过程。此外,AutoPDL提供可读、可编辑且可执行的PDL程序作为解决方案,支持从源到源的优化以及人机协作的精炼与重用。

链接: https://arxiv.org/abs/2504.04365
作者: Claudio Spiess,Mandana Vaziri,Louis Mandel,Martin Hirzel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:The performance of large language models (LLMs) depends on how they are prompted, with choices spanning both the high-level prompting pattern (e.g., Zero-Shot, CoT, ReAct, ReWOO) and the specific prompt content (instructions and few-shot demonstrations). Manually tuning this combination is tedious, error-prone, and non-transferable across LLMs or tasks. Therefore, this paper proposes AutoPDL, an automated approach to discover good LLM agent configurations. Our method frames this as a structured AutoML problem over a combinatorial space of agentic and non-agentic prompting patterns and demonstrations, using successive halving to efficiently navigate this space. We introduce a library implementing common prompting patterns using the PDL prompt programming language. AutoPDL solutions are human-readable, editable, and executable PDL programs that use this library. This approach also enables source-to-source optimization, allowing human-in-the-loop refinement and reuse. Evaluations across three tasks and six LLMs (ranging from 8B to 70B parameters) show consistent accuracy gains (9.5±17.5 percentage points), up to 68.9pp, and reveal that selected prompting strategies vary across models and tasks.
zh
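连续减半 (successive halving) 的搜索骨架可以示意如下:每一轮以更大的评估预算考察候选提示配置,只保留得分前一半,逐步收敛到最优配置(候选名与评估函数均为虚构的简化):

```python
def successive_halving(configs, evaluate, budgets=(1, 3, 9)):
    """每轮用 evaluate(config, budget) 为候选打分, 按得分保留前一半,
    预算逐轮增大, 把评估开销集中到有希望的配置上。"""
    pool = list(configs)
    for budget in budgets:
        pool = sorted(pool, key=lambda c: evaluate(c, budget), reverse=True)
        pool = pool[:max(1, len(pool) // 2)]
    return pool[0]

quality = {"zero_shot": 0.3, "cot": 0.6, "react": 0.8, "rewoo": 0.5}
evaluate = lambda cfg, budget: quality[cfg]   # 简化为无噪声打分
best = successive_halving(list(quality), evaluate)
```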

[AI-71] REFORMER: A ChatGPT-Driven Data Synthesis Framework Elevating Text-to-SQL Models ICML

【速读】:该论文旨在解决现有Text-to-SQL模型因训练数据匮乏而限制其在新领域应用的问题。为应对这一挑战,论文提出了一种名为REFORMER的新框架,该框架利用ChatGPT的能力(无需额外训练)来生成针对新领域的定制化(问题, SQL查询)对。关键解决方案在于采用基于“检索与编辑”(retrieve-and-edit) 的数据增强方法,通过使用ChatGPT解释SQL查询来填充掩码问题以生成新问题,并验证了当适当应用时,循环一致性(cycle consistency)仍是一种有价值的验证手段。此外,论文还通过改述数据集中的问题以及改述由ChatGPT生成的新SQL查询描述进一步增强了数据。实验结果表明,REFORMER在性能上始终优于先前的数据增强方法。

链接: https://arxiv.org/abs/2504.04363
作者: Shenyang Liu,Saleh Almohaimeed,Liqiang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 2024 International Conference on Machine Learning and Applications (ICMLA)

点击查看摘要

Abstract:The existing Text-to-SQL models suffer from a shortage of training data, inhibiting their ability to fully facilitate the applications of SQL queries in new domains. To address this challenge, various data synthesis techniques have been employed to generate more diverse and higher quality data. In this paper, we propose REFORMER, a framework that leverages ChatGPT’s prowess without the need for additional training, to facilitate the synthesis of (question, SQL query) pairs tailored to new domains. Our data augmentation approach is based on a “retrieve-and-edit” method, where we generate new questions by filling masked question using explanation of SQL queries with the help of ChatGPT. Furthermore, we demonstrate that cycle consistency remains a valuable method of validation when applied appropriately. Our experimental results show that REFORMER consistently outperforms previous data augmentation methods. To further investigate the power of ChatGPT and create a general data augmentation method, we also generate the new data by paraphrasing the question in the dataset and by paraphrasing the description of a new SQL query that is generated by ChatGPT as well. Our results affirm that paraphrasing questions generated by ChatGPT help augment the original data.
zh

[AI-72] Crowdsourcing-Based Knowledge Graph Construction for Drug Side Effects Using Large Language Models with an Application on Semaglutide

【速读】:该论文旨在解决从非结构化且噪声较大的社交媒体内容中提取药物副作用信息这一挑战性任务。论文提出了一种系统性框架,利用大型语言模型(Large Language Models, LLMs)从社交媒体数据中提取与药物相关的副作用信息,并将其组织成知识图谱(Knowledge Graph, KG)。解决方案的关键在于结合LLMs的强大文本处理能力与知识图谱的结构化表达能力,从而实现从无序的社交媒体数据到结构化知识的高效转化,为药物警戒提供患者视角的重要洞见。

链接: https://arxiv.org/abs/2504.04346
作者: Zhijie Duan,Kai Wei,Zhaoqian Xue,Lingyao li,Jin Jin,Shu Yang,Jiayan Zhou,Siyuan Ma
机构: 未知
类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 16 pages, 4 figures, AMIA2025

点击查看摘要

Abstract:Social media is a rich source of real-world data that captures valuable patient experience information for pharmacovigilance. However, mining data from unstructured and noisy social media content remains a challenging task. We present a systematic framework that leverages large language models (LLMs) to extract medication side effects from social media and organize them into a knowledge graph (KG). We apply this framework to semaglutide for weight loss using data from Reddit. Using the constructed knowledge graph, we perform comprehensive analyses to investigate reported side effects across different semaglutide brands over time. These findings are further validated through comparison with adverse events reported in the FAERS database, providing important patient-centered insights into semaglutide’s side effects that complement its safety profile and current knowledge base of semaglutide for both healthcare professionals and patients. Our work demonstrates the feasibility of using LLMs to transform social media data into structured KGs for pharmacovigilance.
zh

[AI-73] Geo-OLM: Enabling Sustainable Earth Observation Studies with Cost-Efficient Open Language Models & State-Driven Workflows

【速读】:该论文旨在解决基于大型模型(如GPT-4o)的地理空间Copilots在可持续性研究中的高成本与性能问题。这些工具通常依赖于昂贵的API费用或高性能GPU部署,同时当使用开放语言模型(OLMs)时,其性能因对GPT优化逻辑的依赖而下降。论文的关键解决方案是提出Geo-OLM,这是一种利用状态驱动大型语言模型推理范式的地理空间代理。通过减轻工作流推理负担,Geo-OLM使低资源OLMs能够更有效地完成地理空间任务。实验结果显示,当模型参数规模降至7B以下时,Geo-OLM在成功查询完成率上比现有最佳基线高出32.8%,且其性能与专有模型相当,但推断成本降低两个数量级,从500-1000美元降至不足10美元。这种方法为开放语言模型在地球观测应用中的有效部署提供了重要参考。

链接: https://arxiv.org/abs/2504.04319
作者: Dimitrios Stamoulis,Diana Marculescu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Geospatial Copilots hold immense potential for automating Earth observation (EO) and climate monitoring workflows, yet their reliance on large-scale models such as GPT-4o introduces a paradox: tools intended for sustainability studies often incur unsustainable costs. Using agentic AI frameworks in geospatial applications can amass thousands of dollars in API charges or requires expensive, power-intensive GPUs for deployment, creating barriers for researchers, policymakers, and NGOs. Unfortunately, when geospatial Copilots are deployed with open language models (OLMs), performance often degrades due to their dependence on GPT-optimized logic. In this paper, we present Geo-OLM, a tool-augmented geospatial agent that leverages the novel paradigm of state-driven LLM reasoning to decouple task progression from tool calling. By alleviating the workflow reasoning burden, our approach enables low-resource OLMs to complete geospatial tasks more effectively. When downsizing to small models below 7B parameters, Geo-OLM outperforms the strongest prior geospatial baselines by 32.8% in successful query completion rates. Our method performs comparably to proprietary models, achieving results within 10% of GPT-4o, while reducing inference costs by two orders of magnitude from $500-$1000 to under $10. We present an in-depth analysis with geospatial downstream benchmarks, providing key insights to help practitioners effectively deploy OLMs for EO applications.
zh

[AI-74] A Survey of Social Cybersecurity: Techniques for Attack Detection, Evaluations, Challenges, and Future Prospects

【速读】:该论文试图解决互联网(尤其是社交平台)中科学信息可信度因技术驱动工具(如机器人、半自动化账号、网络喷子、虚假账号及深度伪造等)传播错误信息而受到损害的问题。论文指出,这种对公共话语的操控不仅服务于对抗性的商业目的,还威胁到公民社会的健康。为应对这一挑战,论文提出的关键解决方案是发展一门新的学科——社会网络安全(Social Cybersecurity),通过综合研究与实践,提升公众对网络空间中科学信息真实性的辨识能力,并构建更安全、可信的数字生态环境。

链接: https://arxiv.org/abs/2504.04311
作者: Aos Mulahuwaish,Basheer Qolomany,Kevin Gyorick,Jacques Bou Abdo,Mohammed Aledhari,Junaid Qadir,Kathleen Carley,Ala Al-Fuqaha
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:In today’s digital era, the Internet, especially social media platforms, plays a significant role in shaping public opinions, attitudes, and beliefs. Unfortunately, the credibility of scientific information sources is often undermined by the spread of misinformation through various means, including technology-driven tools like bots, cyborgs, trolls, sock-puppets, and deep fakes. This manipulation of public discourse serves antagonistic business agendas and compromises civil society. In response to this challenge, a new scientific discipline has emerged: social cybersecurity.
zh

[AI-75] Sigma: A dataset for text-to-code semantic parsing with statistical analysis ICML

【速读】:该论文旨在解决现有语义解析任务(如Text-to-SQL和问答任务)在分析数据时视角受限的问题。这些任务依赖于特定形式化意义表示(如SQL或基本逻辑形式),难以从多样化的角度(如统计分析)深入挖掘数据价值。为突破这一限制,论文设计了一个名为SIGMA的新数据集,用于文本到代码的语义解析,并特别关注统计分析任务。SIGMA包含6000个问题及其对应的Python代码标签,覆盖160个数据库,其中一半问题是查询类型,另一半为统计分析类型。论文通过评估三个基线模型(LGESQL、SmBoP和SLSQL)验证了SIGMA的有效性。实验结果表明,结合ELECTRA的LGESQL模型在结构准确性方面表现最优,达到83.37%,而SmBoP与GraPPa及T5结合后在执行准确性上达到76.38%。解决方案的关键在于引入具有统计分析能力的SIGMA数据集,从而扩展语义解析任务的应用范围和深度。

链接: https://arxiv.org/abs/2504.04301
作者: Saleh Almohaimeed,Shenyang Liu,May Alsofyani,Saad Almohaimeed,Liqiang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 2023 International Conference on Machine Learning and Applications (ICMLA) This version includes more details than the conference version

点击查看摘要

Abstract:In the domain of semantic parsing, significant progress has been achieved in Text-to-SQL and question-answering tasks, both of which focus on extracting information from data sources in their native formats. However, the inherent constraints of their formal meaning representations, such as SQL programming language or basic logical forms, hinder their ability to analyze data from various perspectives, such as conducting statistical analyses. To address this limitation and inspire research in this field, we design SIGMA, a new dataset for Text-to-Code semantic parsing with statistical analysis. SIGMA comprises 6000 questions with corresponding Python code labels, spanning across 160 databases. Half of the questions involve query types, which return information in its original format, while the remaining 50% are statistical analysis questions, which perform statistical operations on the data. The Python code labels in our dataset cover 4 types of query types and 40 types of statistical analysis patterns. We evaluated the SIGMA dataset using three different baseline models: LGESQL, SmBoP, and SLSQL. The experimental results show that the LGESQL model with ELECTRA outperforms all other models, achieving 83.37% structure accuracy. In terms of execution accuracy, the SmBoP model, when combined with GraPPa and T5, reaches 76.38%.
zh

[AI-76] AI-induced sexual harassment: Investigating Contextual Characteristics and User Reactions of Sexual Harassment by a Companion Chatbot

【速读】:该论文旨在研究由人工智能(Artificial Intelligence, AI)聊天机器人引发的性骚扰问题,特别是针对Replika聊天机器人在用户互动中出现的不当性行为。论文通过分析来自Google Play商店的35,105条负面评论,识别出800个相关案例,揭示了用户频繁遭遇未经请求的性暗示、持续的不适当行为以及聊天机器人未能尊重用户边界的实例。研究表明,这种行为导致用户感到不适、隐私被侵犯以及对寻求非恋爱关系或治疗型AI伴侣的期望落空。论文强调了与AI相关的潜在危害,并呼吁开发者实施有效的保护措施和伦理准则以防止此类事件的发生。关键在于提出明确的建议,推动开发更安全、更符合伦理的AI系统,从而承担企业责任并减少AI诱导的骚扰风险。

链接: https://arxiv.org/abs/2504.04299
作者: Mohammad (Matt) Namvarpour,Harrison Pauwels,Afsaneh Razi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted for publication at CSCW 2025. This is a pre-publication version; the final version will be available through the ACM Digital Library

点击查看摘要

Abstract:Advancements in artificial intelligence (AI) have led to the increase of conversational agents like Replika, designed to provide social interaction and emotional support. However, reports of these AI systems engaging in inappropriate sexual behaviors with users have raised significant concerns. In this study, we conducted a thematic analysis of user reviews from the Google Play Store to investigate instances of sexual harassment by the Replika chatbot. From a dataset of 35,105 negative reviews, we identified 800 relevant cases for analysis. Our findings revealed that users frequently experience unsolicited sexual advances, persistent inappropriate behavior, and failures of the chatbot to respect user boundaries. Users expressed feelings of discomfort, violation of privacy, and disappointment, particularly when seeking a platonic or therapeutic AI companion. This study highlights the potential harms associated with AI companions and underscores the need for developers to implement effective safeguards and ethical guidelines to prevent such incidents. By shedding light on user experiences of AI-induced harassment, we contribute to the understanding of AI-related risks and emphasize the importance of corporate responsibility in developing safer and more ethical AI systems.
zh

[AI-77] CATS: Mitigating Correlation Shift for Multivariate Time Series Classification

【速读】:该论文致力于解决无监督领域自适应(UDA)在多变量时间序列(MTS)分类任务中的挑战,特别是现有方法忽视了变量间相关性在不同领域中变化这一重要特性的问题。为了解决这一问题,论文引入了一种新的领域偏移类型——“相关性偏移”(correlation shift),用于量化多变量相关性的领域差异。解决方案的关键在于提出的可扩展且参数高效的多变量时间序列自适应器(CATS)。CATS通过时间卷积捕捉局部时间模式,并利用图注意力模块建模动态的多变量相关性,重新调整目标域的相关性以对齐源域的相关性,同时提出了一种相关性对齐损失来缓解相关性偏移问题。实验结果表明,CATS不仅显著提升了分类精度,还保持了较低的参数开销。

链接: https://arxiv.org/abs/2504.04283
作者: Xiao Lin,Zhichen Zeng,Tianxin Wei,Zhining Liu,Yuzhong Chen,Hanghang Tong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Unsupervised Domain Adaptation (UDA) leverages labeled source data to train models for unlabeled target data. Given the prevalence of multivariate time series (MTS) data across various domains, the UDA task for MTS classification has emerged as a critical challenge. However, for MTS data, correlations between variables often vary across domains, whereas most existing UDA works for MTS classification have overlooked this essential characteristic. To bridge this gap, we introduce a novel domain shift, correlation shift, measuring domain differences in multivariate correlation. To mitigate correlation shift, we propose a scalable and parameter-efficient Correlation Adapter for MTS (CATS). Designed as a plug-and-play technique compatible with various Transformer variants, CATS employs temporal convolution to capture local temporal patterns and a graph attention module to model the changing multivariate correlation. The adapter reweights the target correlations to align the source correlations with a theoretically guaranteed precision. A correlation alignment loss is further proposed to mitigate correlation shift, bypassing the alignment challenge from the non-i.i.d. nature of MTS data. Extensive experiments on four real-world datasets demonstrate that (1) compared with vanilla Transformer-based models, CATS increases over 10% average accuracy while only adding around 1% parameters, and (2) all Transformer variants equipped with CATS either reach or surpass state-of-the-art baselines.
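作为直观理解,下面用一个假设性的 numpy 小例子度量“相关性偏移”:分别计算源域与目标域多变量序列的相关矩阵,再取其差的 Frobenius 范数。该度量形式仅为示意,论文中的正式定义可能不同:

```python
import numpy as np

# 假设性示例:用两域相关矩阵之差的 Frobenius 范数度量“相关性偏移”。
# 度量形式为示意,并非论文中的正式定义。

rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0 * np.pi, 200)

# 源域:两个变量强相关;目标域:第二个变量被替换为噪声,相关性消失
source = np.stack([np.sin(t), np.sin(t) + 0.1 * rng.standard_normal(200)], axis=1)
target = np.stack([np.sin(t), rng.standard_normal(200)], axis=1)

def corr_matrix(x):
    # x 形状为 (时间步, 变量数),返回变量间的 Pearson 相关矩阵
    return np.corrcoef(x, rowvar=False)

def correlation_shift(src, tgt):
    return float(np.linalg.norm(corr_matrix(src) - corr_matrix(tgt), ord="fro"))

print(round(correlation_shift(source, target), 3))
```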
zh

[AI-78] A Comparative Study of Explainable AI Methods: Model-Agnostic vs. Model-Specific Approaches

【速读】:该论文旨在解决深度学习图像分类模型可解释性(Explainable AI, XAI)的问题,通过比较模型无关方法(如LIME和SHAP)与模型特定方法(如Grad-CAM和Guided Backpropagation)在解析ResNet50预测中的差异,探讨不同方法在跨物种图像类别解释中的表现。论文的关键在于发现单一解释方法难以适用于所有场景,并提出结合多种XAI技术的方法,以提供更全面的模型决策过程理解,尤其适用于医疗、自动驾驶和金融服务等高风险领域,在这些领域中模型的透明度至关重要。

链接: https://arxiv.org/abs/2504.04276
作者: Keerthi Devireddy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper compares model-agnostic and model-specific approaches to explainable AI (XAI) in deep learning image classification. I examine how LIME and SHAP (model-agnostic methods) differ from Grad-CAM and Guided Backpropagation (model-specific methods) when interpreting ResNet50 predictions across diverse image categories. Through extensive testing with various species, from dogs and birds to insects, I found that each method reveals different aspects of the model's decision-making process. Model-agnostic techniques provide broader feature attribution that works across different architectures, while model-specific approaches excel at highlighting precise activation regions with greater computational efficiency. My analysis shows there is no "one-size-fits-all" solution for model interpretability. Instead, combining multiple XAI methods offers the most comprehensive understanding of complex models, which is particularly valuable in high-stakes domains like healthcare, autonomous vehicles, and financial services where transparency is crucial. This comparative framework provides practical guidance for selecting appropriate interpretability techniques based on specific application needs and computational constraints.
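为说明“模型无关”解释方法的基本思路,下面给出一个假设性的遮挡(occlusion)归因小例子:与 LIME/SHAP 类似,它只通过扰动输入并观察黑盒输出的变化来计算特征归因,不依赖模型内部结构;模型与数据均为本文虚构:

```python
import numpy as np

# 假设性示例:极简的模型无关遮挡(occlusion)归因。
# 与 LIME/SHAP 的共同点:只扰动输入、观察黑盒输出变化,不看模型内部。
# black_box 为虚构模型,实际只依赖前两个特征。

def black_box(x):
    return 3.0 * x[0] - 2.0 * x[1] + 0.0 * x[2]

def occlusion_attribution(model, x, baseline=0.0):
    # 依次把每个特征替换为 baseline,输出的变化量即为该特征的归因值
    base_pred = model(x)
    attrs = []
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] = baseline
        attrs.append(base_pred - model(x_pert))
    return np.array(attrs)

x = np.array([1.0, 1.0, 1.0])
print(occlusion_attribution(black_box, x))  # [ 3. -2.  0.]
```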
zh

[AI-79] Improving Chronic Kidney Disease Detection Efficiency: Fine Tuned CatBoost and Nature-Inspired Algorithms with Explainable AI

【速读】:该论文旨在解决慢性肾病(Chronic Kidney Disease, CKD)早期检测困难的问题,特别是在资源受限环境下的诊断局限性。论文提出了一种先进的机器学习方法,通过评估随机森林(Random Forest, RF)、多层感知机(Multi-Layer Perceptron, MLP)、逻辑回归(Logistic Regression, LR)以及经过微调的CatBoost模型四种模型,优化CKD的检测性能。解决方案的关键在于采用基于自然启发的算法,如模拟退火(Simulated Annealing)进行特征选择、布谷鸟搜索(Cuckoo Search)调整异常值,并结合网格搜索(Grid Search)对模型参数进行精细调优,从而显著提升预测准确性。此外,通过SHAP(一种广为人知的可解释人工智能技术),不仅揭示了临床特征的重要性(如比重、血清肌酐、白蛋白、血红蛋白和糖尿病),还增强了模型决策过程的透明度与信任度。这一研究展示了先进机器学习技术在低收入和中等收入医疗环境中对CKD检测的潜力,旨在提供一种高度准确、可解释且高效的诊断工具,以支持早期干预和改善所有CKD患者的健康结局。

链接: https://arxiv.org/abs/2504.04262
作者: Md. Ehsanul Haque,S. M. Jahidul Islam,Jeba Maliha,Md. Shakhauat Hossan Sumon,Rumana Sharmin,Sakib Rokoni
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 8 figures, conference: 14th IEEE International Conference on Communication Systems and Network Technologies (CSNT2025)

点击查看摘要

Abstract:Chronic Kidney Disease (CKD) is a major global health issue, affecting millions of people worldwide with an increasing mortality rate. Mitigating the progression of CKD and improving patient outcomes requires early detection. Nevertheless, traditional diagnostic methods have limitations, especially in resource-constrained settings. This study proposes an advanced machine learning approach to enhance CKD detection by evaluating four models: Random Forest (RF), Multi-Layer Perceptron (MLP), Logistic Regression (LR), and a fine-tuned CatBoost algorithm. Among these, the fine-tuned CatBoost model demonstrated the best overall performance, with an accuracy of 98.75%, an AUC of 0.9993, and a Kappa score of 97.35%. The proposed CatBoost model uses nature-inspired algorithms: Simulated Annealing to select the most important features, Cuckoo Search to adjust outliers, and grid search to fine-tune its settings for improved prediction accuracy. Feature significance is explained by SHAP, a well-known XAI technique, to bring transparency to the decision-making process of the proposed model and build trust in diagnostic systems. Using SHAP, the significant clinical features were identified as specific gravity, serum creatinine, albumin, hemoglobin, and diabetes mellitus. This research demonstrates the potential of advanced machine learning techniques in CKD detection, particularly for low- and middle-income healthcare settings where prompt and correct diagnoses are vital. The study seeks to provide a highly accurate, interpretable, and efficient diagnostic tool to support early intervention and improve healthcare outcomes for all CKD patients.
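下面是一个假设性的模拟退火特征选择骨架,示意摘要中 Simulated Annealing 在特征子集搜索上的用法;评分函数用虚构的代理指标代替真实的交叉验证表现,特征名借用摘要中提到的临床特征:

```python
import math
import random

# 假设性示例:用模拟退火搜索特征子集。score() 为虚构的代理评分,
# 实际使用时应替换为模型在验证集上的表现(如交叉验证准确率)。
random.seed(0)

FEATURES = ["specific_gravity", "serum_creatinine", "albumin",
            "hemoglobin", "diabetes", "noise_1", "noise_2"]
USEFUL = {"specific_gravity", "serum_creatinine", "albumin",
          "hemoglobin", "diabetes"}  # 虚构的“真实有用”特征,仅用于构造评分

def score(subset):
    hits = sum(1 for f in subset if f in USEFUL)
    junk = sum(1 for f in subset if f not in USEFUL)
    return hits - 0.5 * junk

def simulated_annealing(features, iters=500, t0=1.0, cooling=0.99):
    current = set(random.sample(features, 3))
    best = set(current)
    t = t0
    for _ in range(iters):
        # 邻域操作:随机翻转一个特征的选中状态
        neighbor = set(current)
        neighbor.symmetric_difference_update({random.choice(features)})
        if not neighbor:
            continue
        delta = score(neighbor) - score(current)
        # 变优则接受;变差则以 exp(delta/t) 的概率接受,温度逐步下降
        if delta > 0 or random.random() < math.exp(delta / t):
            current = neighbor
        if score(current) > score(best):
            best = set(current)
        t *= cooling
    return best

selected = simulated_annealing(FEATURES)
print(sorted(selected))
```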
zh

[AI-80] Task load dependent decision referrals for joint binary classification in human-automation teams

【速读】:本文研究人机协作团队在二元分类任务中的最优决策转诊问题。自动化系统包含一个预训练的分类器,它观察一批独立任务的数据,进行分析后可能将部分任务转诊给人类操作员以进行重新评估和最终决策。论文的关键建模假设是人类性能会随着任务负载的增加而下降。作者将选择哪些任务转诊的问题建模为随机优化问题,并证明对于给定的任务负载,最优策略是优先转诊那些在观测数据条件下能够使期望成本减少最多的任务。这一方法提供了一种任务排序方案和转诊策略,用于确定最优转诊任务集合。通过实验研究验证了该策略的有效性,实验使用雷达屏幕模拟器,在时间压力下参与者需做出二元目标分类决策。结果表明,所提出的最优转诊策略相对于仅基于自动化和人类性能模型但不考虑观测数据的盲策略,在统计意义上具有显著优势。因此,该论文的核心解决方案在于依据观测数据动态优化任务转诊决策,以最大化整体系统性能。

链接: https://arxiv.org/abs/2504.04248
作者: Kesav Kaza,Jerome Le Ny,Aditya Mahajan
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 9 pages, 6 figures. Submitted to IEEE for possible publication

点击查看摘要

Abstract:We consider the problem of optimal decision referrals in human-automation teams performing binary classification tasks. The automation, which includes a pre-trained classifier, observes data for a batch of independent tasks, analyzes them, and may refer a subset of tasks to a human operator for fresh and final analysis. Our key modeling assumption is that human performance degrades with task load. We model the problem of choosing which tasks to refer as a stochastic optimization problem and show that, for a given task load, it is optimal to myopically refer tasks that yield the largest reduction in expected cost, conditional on the observed data. This provides a ranking scheme and a policy to determine the optimal set of tasks for referral. We evaluate this policy against a baseline through an experimental study with human participants. Using a radar screen simulator, participants made binary target classification decisions under time constraint. They were guided by a decision rule provided to them, but were still prone to errors under time pressure. An initial experiment estimated human performance model parameters, while a second experiment compared two referral policies. Results show statistically significant gains for the proposed optimal referral policy over a blind policy that determines referrals using the automation and human-performance models but not based on the observed data.
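论文的最优策略可以概括为:在给定任务负载下,按“转诊带来的期望成本降低量”对任务做近视(myopic)排序。下面是一个假设性的 Python 骨架,其中人类错误率随负载变化的函数形式与数值均为虚构假设:

```python
# 假设性示例:按“期望成本降低量”对任务做近视排序的转诊策略。
# human_error 中负载-错误率的函数形式与数值均为虚构假设。

def human_error(load):
    # 关键建模假设:人类表现随任务负载增加而退化
    return 0.05 + 0.03 * load

def referral_set(posteriors, max_load):
    # posteriors: 自动分类器对各任务给出的正类后验概率
    # 自动决策的期望错误率为 min(p, 1-p);转诊收益 = 自动错误率 - 人类错误率
    gains = sorted(((min(p, 1.0 - p), i) for i, p in enumerate(posteriors)),
                   reverse=True)
    referred = []
    for auto_err, i in gains:
        load = len(referred) + 1
        if load <= max_load and auto_err - human_error(load) > 0:
            referred.append(i)
    return referred

posteriors = [0.95, 0.55, 0.48, 0.10, 0.70]
print(referral_set(posteriors, max_load=3))  # [2, 1, 4]
```

可以看到,后验概率越接近 0.5(自动决策越不确定)的任务越优先被转诊,而当人类负载上升、错误率超过自动错误率时转诊停止。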
zh

[AI-81] From Automation to Autonomy in Smart Manufacturing: A Bayesian Optimization Framework for Modeling Multi-Objective Experimentation and Sequential Decision Making

【速读】:本文旨在解决材料发现过程中传统方法效率低下和成本高昂的问题,特别是在同时优化多个性能属性时面临的挑战。传统方法通常依赖于试错实验,需要大量试验才能找到最优组合,且难以适应复杂工艺的灵活性需求。为了解决这些问题,论文提出了一种基于贝叶斯多目标序贯决策制定(Bayesian Multi-Objective Sequential Decision-Making, BMSDM)的框架。该框架的关键在于利用贝叶斯优化实现序贯学习,通过迭代更新统计模型来表征制造过程,并以此作为代理模型进行高效探索与优化。这种方法显著减少了传统实验设计所需的数据收集时间和成本,同时在多目标决策场景中全面超越了其他对比方法,为智能自主平台的构建提供了重要突破。

链接: https://arxiv.org/abs/2504.04244
作者: Avijit Saha Asru,Hamed Khosravi,Imtiaz Ahmed,Abdullahil Azeem
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Discovering novel materials with desired properties is essential for driving innovation. Industry 4.0 and smart manufacturing have promised transformative advances in this area through real-time data integration and automated production planning and control. However, the reliance on automation alone has often fallen short, lacking the flexibility needed for complex processes. To fully unlock the potential of smart manufacturing, we must evolve from automation to autonomous systems that go beyond rigid programming and can dynamically optimize the search for solutions. Current discovery approaches are often slow, requiring numerous trials to find optimal combinations, and costly, particularly when optimizing multiple properties simultaneously. This paper proposes a Bayesian multi-objective sequential decision-making (BMSDM) framework that can intelligently select experiments as manufacturing progresses, guiding us toward the discovery of optimal design faster and more efficiently. The framework leverages sequential learning through Bayesian Optimization, which iteratively refines a statistical model representing the underlying manufacturing process. This statistical model acts as a surrogate, allowing for efficient exploration and optimization without requiring numerous real-world experiments. This approach can significantly reduce the time and cost of data collection required by traditional experimental designs. The proposed framework is compared with traditional DoE methods and two other multi-objective optimization methods. Using a manufacturing dataset, we evaluate and compare the performance of these approaches across five evaluation metrics. BMSDM comprehensively outperforms the competing methods in multi-objective decision-making scenarios. Our proposed approach represents a significant leap forward in creating an intelligent autonomous platform capable of novel material discovery.
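下面给出一个贝叶斯优化的最小骨架,示意“高斯过程代理 + 采集函数”的序贯实验选择思路;目标函数、核长度尺度与 UCB 采集函数均为示意性选择,并非论文 BMSDM 框架的实际配置:

```python
import numpy as np

# 假设性示例:贝叶斯优化最小骨架 = 高斯过程代理 + UCB 采集函数。
# objective 模拟一个未知的“工艺响应面”,最优点在 x = 0.7 附近(虚构)。

def objective(x):
    return -(x - 0.7) ** 2

def rbf(a, b, ls=0.2):
    # RBF 核:控制代理模型的平滑程度
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    k = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    k_s = rbf(x_train, x_test)
    k_inv = np.linalg.inv(k)
    mu = k_s.T @ k_inv @ y_train
    var = np.clip(1.0 - np.sum(k_s * (k_inv @ k_s), axis=0), 1e-12, None)
    return mu, np.sqrt(var)

grid = np.linspace(0.0, 1.0, 101)
x_obs = np.array([0.1, 0.9])           # 两次初始“实验”
y_obs = objective(x_obs)

for _ in range(10):                    # 序贯选择 10 次新实验
    mu, sigma = gp_posterior(x_obs, y_obs, grid)
    x_next = grid[np.argmax(mu + 2.0 * sigma)]   # UCB:均值 + 探索项
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_x = float(x_obs[np.argmax(y_obs)])
print(round(best_x, 2))
```

代理模型替代了昂贵的真实实验,每一步都在“利用当前最优区域”与“探索不确定区域”之间权衡,这正是摘要所述序贯学习思路的核心。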
zh

[AI-82] Perils of Label Indeterminacy: A Case Study on Prediction of Neurological Recovery After Cardiac Arrest

【速读】:该论文试图解决在高风险人工智能辅助决策场景下因标签不确定性(label indeterminacy)导致模型预测结果不一致的问题。论文的关键在于引入标签不确定性的概念,并通过实证研究揭示其在医疗领域(如复苏后昏迷患者恢复预测)中的重要影响,即当标签已知时模型表现相似,而在标签未知时预测差异显著。解决方案的关键在于强调评估、报告和设计过程中需充分考虑标签不确定性带来的伦理和技术挑战,以提高模型的可靠性和透明度。

链接: https://arxiv.org/abs/2504.04243
作者: Jakob Schoeffer,Maria De-Arteaga,Jonathan Elmer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:The design of AI systems to assist human decision-making typically requires the availability of labels to train and evaluate supervised models. Frequently, however, these labels are unknown, and different ways of estimating them involve unverifiable assumptions or arbitrary choices. In this work, we introduce the concept of label indeterminacy and derive important implications in high-stakes AI-assisted decision-making. We present an empirical study in a healthcare context, focusing specifically on predicting the recovery of comatose patients after resuscitation from cardiac arrest. Our study shows that label indeterminacy can result in models that perform similarly when evaluated on patients with known labels, but vary drastically in their predictions for patients where labels are unknown. After demonstrating crucial ethical implications of label indeterminacy in this high-stakes context, we discuss takeaways for evaluation, reporting, and design.
zh

[AI-83] oneDAL Optimization for ARM Scalable Vector Extension: Maximizing Efficiency for High-Performance Data Science

【速读】:该论文旨在解决oneDAL库在ARM架构(尤其是支持Scalable Vector Extension, SVE)上的兼容性和性能优化问题。传统上,oneDAL依赖于Intel的Math Kernel Library (MKL),这限制了其仅能在x86平台上运行。为了解决这一问题,论文的关键方案是将OpenBLAS作为替代后端,并引入一系列ARM架构特有的优化,包括自定义稀疏矩阵操作、向量化统计函数以及针对SVE优化的支持向量机(SVM)算法。其中,SVM优化利用了SVE的灵活向量长度和基于谓词的执行机制,在Boser方法上实现了22%的性能提升,在Thunder方法上提升了5%。这些优化不仅显著提高了机器学习训练和推理任务的性能,还使ARM架构在数据密集型ML应用中展现出高性能和高能效的优势,同时实现了与x86平台(使用MKL后端)相当甚至更高的性能表现。

链接: https://arxiv.org/abs/2504.04241
作者: Chandan Sharma,Rakshith GB,Ajay Kumar Patel,Dhanus M Lal,Darshan Patel,Ragesh Hajela,Masahiro Doteguchi,Priyanka Sharma
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:

点击查看摘要

Abstract:The evolution of ARM-based architectures, particularly those incorporating Scalable Vector Extension (SVE), has introduced transformative opportunities for high-performance computing (HPC) and machine learning (ML) workloads. The Unified Acceleration Foundation's (UXL) oneAPI Data Analytics Library (oneDAL) is a widely adopted library for accelerating ML and data analytics workflows, but its reliance on Intel's proprietary Math Kernel Library (MKL) has traditionally limited its compatibility to x86 platforms. This paper details the porting of oneDAL to ARM architectures with SVE support, using OpenBLAS as an alternative backend to overcome architectural and performance challenges. Beyond porting, the research introduces novel ARM-specific optimizations, including custom sparse matrix routines, vectorized statistical functions, and a Scalable Vector Extension (SVE)-optimized Support Vector Machine (SVM) algorithm. The SVM enhancements leverage SVE's flexible vector lengths and predicate-driven execution, achieving notable performance gains of 22% for the Boser method and 5% for the Thunder method. Benchmarks conducted on ARM SVE-enabled AWS Graviton3 instances showcase up to 200x acceleration in ML training and inference tasks compared to the original scikit-learn implementation on the ARM platform. Moreover, the ARM-optimized oneDAL achieves performance parity with, and in some cases exceeds, the x86 oneDAL implementation (MKL backend) on Ice Lake x86 systems, which are nearly twice as costly as AWS Graviton3 ARM instances. These findings highlight ARM's potential as a high-performance, energy-efficient platform for data-intensive ML applications. By expanding cross-architecture compatibility and contributing to the open-source ecosystem, this work reinforces ARM's position as a competitive alternative in the HPC and ML domains, paving the way for future advancements in data-intensive computing.
zh

[AI-84] TrafficLLM: Enhancing Large Language Models for Network Traffic Analysis with Generic Traffic Representation

【速读】:该论文旨在解决机器学习(Machine Learning, ML)驱动的网络流量分析在不同任务和未见数据上的泛化能力有限的问题。传统方法难以有效应对网络流量分析领域特有的复杂特性,而大型语言模型(Large Language Models, LLMs)虽然具备强大的泛化能力,但其直接应用于网络流量分析存在显著挑战。为了解决这一问题,论文提出了一种名为TrafficLLM的解决方案,其关键在于引入了一种双阶段微调框架(dual-stage fine-tuning framework),通过流量领域的标记化处理(traffic-domain tokenization)、分阶段微调流程以及可扩展适配机制(extensible adaptation),使LLMs能够释放其在动态流量分析任务中的泛化能力。这种框架不仅提升了检测和生成任务的性能,还增强了模型对未见流量的适应性。

链接: https://arxiv.org/abs/2504.04222
作者: Tianyu Cui,Xinjie Lin,Sijia Li,Miao Chen,Qilei Yin,Qi Li,Ke Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Machine learning (ML) powered network traffic analysis has been widely used for the purpose of threat detection. Unfortunately, their generalization across different tasks and unseen data is very limited. Large language models (LLMs), known for their strong generalization capabilities, have shown promising performance in various domains. However, their application to the traffic analysis domain is limited due to significantly different characteristics of network traffic. To address the issue, in this paper, we propose TrafficLLM, which introduces a dual-stage fine-tuning framework to learn generic traffic representation from heterogeneous raw traffic data. The framework uses traffic-domain tokenization, dual-stage tuning pipeline, and extensible adaptation to help LLM release generalization ability on dynamic traffic analysis tasks, such that it enables traffic detection and traffic generation across a wide range of downstream tasks. We evaluate TrafficLLM across 10 distinct scenarios and 229 types of traffic. TrafficLLM achieves F1-scores of 0.9875 and 0.9483, with up to 80.12% and 33.92% better performance than existing detection and generation methods. It also shows strong generalization on unseen traffic with an 18.6% performance improvement. We further evaluate TrafficLLM in real-world scenarios. The results confirm that TrafficLLM is easy to scale and achieves accurate detection performance on enterprise traffic.
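作为直观示意,下面是一个假设性的“流量领域标记化”小例子:把流的协议、端口等字段映射为离散 token,并对数值字段分桶以控制词表规模。字段与 token 格式均为本文虚构,并非 TrafficLLM 的真实实现:

```python
# 假设性示例:流量领域标记化——把流的字段映射为离散 token,
# 数值字段分桶以控制词表规模。字段与 token 格式均为虚构。

def tokenize_flow(flow):
    tokens = [f"<proto:{flow['protocol']}>", f"<dport:{flow['dst_port']}>"]
    for size in flow["pkt_sizes"]:
        bucket = min(size // 256, 5)   # 按 256 字节分桶,超长包归入最后一桶
        tokens.append(f"<len:{bucket}>")
    return tokens

flow = {"protocol": "tcp", "dst_port": 443, "pkt_sizes": [60, 1500, 300]}
print(tokenize_flow(flow))
```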
zh

[AI-85] Introducing COGENT3: An AI Architecture for Emergent Cognition

【速读】:该论文试图解决传统认知计算系统中固定架构导致的灵活性与适应性不足的问题。解决方案的关键在于提出了一种名为COGENT3的新方法,通过动态涌现的计算结构替代预设架构,利用代理交互实现灵活且自适应的系统,同时结合温度调制和记忆效应,将统计力学、机器学习与认知科学紧密融合,以模拟人类认知过程的核心特征。

链接: https://arxiv.org/abs/2504.04139
作者: Eduardo Salazar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents COGENT3 (or Collective Growth and Entropy-modulated Triads System), a novel approach for emergent cognition integrating pattern formation networks with group influence dynamics. Contrasting with traditional strategies that rely on predetermined architectures, computational structures emerge dynamically in our framework through agent interactions. This enables a more flexible and adaptive system exhibiting characteristics reminiscent of human cognitive processes. The incorporation of temperature modulation and memory effects in COGENT3 closely integrates statistical mechanics, machine learning, and cognitive science.
zh

[AI-86] Predicting Soil Macronutrient Levels: A Machine Learning Approach Models Trained on pH, Conductivity, and Average Power of Acid-Base Solutions

【速读】:该论文旨在解决土壤大中量元素(尤其是钾离子 K⁺)精确监测的迫切需求问题。传统监测方法如化学分析、原子吸收光谱法、电感耦合等离子体发射光谱法及电化学方法虽技术先进,但成本高昂且耗时较长,难以满足实时监测的需求。为应对这一挑战,论文提出了一种基于合成溶液数据集的创新土壤检测协议,通过物理性质(如电导率和 pH 值)以及三种关键营养元素(氮 N、磷 P 和钾 K)的建模来模拟土壤行为。关键解决方案在于采用机器学习算法(随机森林回归器和神经网络)预测土壤养分浓度,并通过与实验室测试结果的对比验证其有效性。

链接: https://arxiv.org/abs/2504.04138
作者: Mridul Kumar,Deepali Jain,Zeeshan Saifi,Soami Daya Krishnananda
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Soil macronutrients, particularly potassium ions (K ^+ ), are indispensable for plant health, underpinning various physiological and biological processes, and facilitating the management of both biotic and abiotic stresses. Deficient macronutrient content results in stunted growth, delayed maturation, and increased vulnerability to environmental stressors, thereby accentuating the imperative for precise soil nutrient monitoring. Traditional techniques such as chemical assays, atomic absorption spectroscopy, inductively coupled plasma optical emission spectroscopy, and electrochemical methods, albeit advanced, are prohibitively expensive and time-intensive, thus unsuitable for real-time macronutrient assessment. In this study, we propose an innovative soil testing protocol utilizing a dataset derived from synthetic solutions to model soil behaviour. The dataset encompasses physical properties including conductivity and pH, with a concentration on three key macronutrients: nitrogen (N), phosphorus (P), and potassium (K). Four machine learning algorithms were applied to the dataset, with random forest regressors and neural networks being selected for the prediction of soil nutrient concentrations. Comparative analysis with laboratory soil testing results revealed prediction errors of 23.6% for phosphorus and 16% for potassium using the random forest model, and 26.3% for phosphorus and 21.8% for potassium using the neural network model. This methodology illustrates a cost-effective and efficacious strategy for real-time soil nutrient monitoring, offering substantial advancements over conventional techniques and enhancing the capability to sustain optimal nutrient levels conducive to robust crop growth.
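下面用一个假设性的 k 近邻回归小例子示意“由 pH、电导率等易测物理量回归养分浓度”的思路;数据为合成数据,模型也不同于论文实际采用的随机森林与神经网络:

```python
import numpy as np

# 假设性示例:用 (pH, 电导率) 做 k 近邻回归来预测钾浓度。
# 数据为合成数据(假定钾浓度近似与电导率线性相关),
# 模型不同于论文实际使用的随机森林与神经网络,仅示意思路。

rng = np.random.default_rng(0)
conductivity = rng.uniform(0.5, 3.0, 50)
ph = rng.uniform(5.5, 7.5, 50)
potassium = 40.0 * conductivity + 2.0 * (ph - 6.5) + rng.normal(0.0, 1.0, 50)

X_train = np.stack([ph, conductivity], axis=1)

def knn_predict(x, X, y, k=5):
    # 取欧氏距离最近的 k 个训练样本,返回其目标均值
    d = np.linalg.norm(X - x, axis=1)
    idx = np.argsort(d)[:k]
    return float(np.mean(y[idx]))

pred = knn_predict(np.array([6.5, 2.0]), X_train, potassium)
print(round(pred, 1))
```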
zh

[AI-87] Guaranteeing consistency in evidence fusion: A novel perspective on credibility

【速读】:该论文旨在解决基于可信度计算与Dempster组合规则的证据融合方案因缺乏闭环控制而可能产生潜在不一致的问题。为克服这一局限,论文从判别框架(FOD)中事件支持程度的角度重新构建证据可信度,并提出了一种迭代可信证据融合方法(Iterative Credible Evidence Fusion, ICEF)。ICEF的关键在于通过将融合结果引入可信度评估,建立可信度与融合结果之间的关联;同时,基于似然函数与信任函数的指数归一化,提出了改进的算术-几何分歧度量方法,称为似然-信任算术-几何分歧(Plausibility-Belief Arithmetic-Geometric Divergence, PBAGD),以更有效地捕捉FOD子集的相关性和差异,识别异常信息源并降低其权重。通过数值实例与基准数据集的仿真实验验证了ICEF及PBAGD方法的有效性与适应性。

链接: https://arxiv.org/abs/2504.04128
作者: Chaoxiong Ma,Yan Liang,Huixia Zhang,Hao Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 29 pages, 10 figures

点击查看摘要

Abstract:Available credible evidence fusion schemes suffer from potential inconsistency because credibility calculation and Dempster's combination rule-based fusion are performed sequentially in an open-loop style. This paper constructs evidence credibility from the perspective of the degree of support for events within the framework of discrimination (FOD) and proposes an iterative credible evidence fusion (ICEF) scheme to overcome the inconsistency through closed-loop control. On one hand, ICEF introduces the fusion result into the credibility assessment to establish the correlation between credibility and the fusion result. On the other hand, an arithmetic-geometric divergence is promoted based on the exponential normalization of the plausibility and belief functions to measure evidence conflict, called the plausibility-belief arithmetic-geometric divergence (PBAGD), which is superior in capturing the correlation and difference of FOD subsets, identifying abnormal sources, and reducing their fusion weights. ICEF is compared with traditional methods by combining different evidence difference measure forms via numerical examples to verify its performance. Simulations on numerical examples and benchmark datasets reflect the adaptability of PBAGD to the proposed fusion strategy.
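摘要中反复提到的 Dempster 组合规则本身是明确定义的,下面给出两个证据体(mass 函数)组合的简短实现作参考;焦元与数值为虚构示例:

```python
from itertools import product

# 假设性示例:两个证据体(mass 函数)的 Dempster 组合规则。
# 焦元用 frozenset 表示;数值为虚构,仅演示规则本身。

def dempster_combine(m1, m2):
    combined = {}
    conflict = 0.0
    for (a, w1), (b, w2) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + w1 * w2
        else:
            conflict += w1 * w2          # 冲突质量 K
    # 归一化:将冲突质量按比例重新分配给非空焦元
    return {k: v / (1.0 - conflict) for k, v in combined.items()}, conflict

A, B = frozenset({"a"}), frozenset({"b"})
AB = A | B
m1 = {A: 0.6, AB: 0.4}
m2 = {B: 0.5, AB: 0.5}
fused, K = dempster_combine(m1, m2)
print(round(K, 2), {"".join(sorted(k)): round(v, 3) for k, v in fused.items()})
```

当冲突质量 K 较大时,归一化会放大剩余焦元的质量,这正是可信度加权等改进方案(包括本文的 ICEF)试图缓解的经典问题。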
zh

[AI-88] Improving Question Embeddings with Cognitive Representation Optimization for Knowledge Tracing

【速读】:该论文旨在解决传统知识追踪(Knowledge Tracing, KT)模型在预测学生未来表现时忽略答题过程中干扰因素(如失误和猜测)以及静态认知表征的临时性和局限性的问题。现有方法通常假设答题过程中不存在干扰因素,并认为历史答题记录能够完全反映学生的知识掌握水平,这可能导致原始记录中的不一致和不协调问题。为解决上述问题,论文提出了一种基于认知表征优化的知识追踪模型(CRO-KT)。其关键在于利用动态规划算法优化认知表征结构,确保其与学生在练习难度上的认知模式相匹配;并通过协同优化算法从整体响应情况出发优化子目标练习的认知表征;同时以加权方式融合双部图学习到的关系嵌入与优化后的记录表征,从而增强对学生认知状态的表达能力。实验结果验证了所提方法的有效性。

链接: https://arxiv.org/abs/2504.04121
作者: Lixiang Xu,Xianwei Ding,Xin Yuan,Zhanlong Wang,Lu Bai,Enhong Chen,Philip S. Yu,Yuanyan Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge Tracing (KT) aims to track changes in students' knowledge status and predict their future answers based on their historical answer records. Current research on KT modeling focuses on predicting students' future performance based on existing, unupdated records of student learning interactions. However, these approaches ignore the distractors (such as slipping and guessing) in the answering process and overlook that static cognitive representations are temporary and limited. Most of them assume that there are no distractors in the answering process and that the record representations fully represent the students' level of understanding and proficiency in knowledge. In this case, it may lead to many inconsistency and incoordination issues in the original records. Therefore, we propose a Cognitive Representation Optimization for Knowledge Tracing (CRO-KT) model, which utilizes a dynamic programming algorithm to optimize the structure of cognitive representations. This ensures that the structure matches the students' cognitive patterns in terms of the difficulty of the exercises. Furthermore, we use a co-optimization algorithm to optimize the cognitive representations of the sub-target exercises in terms of the overall situation of exercise responses, by considering all the exercises with co-relationships as a single goal. Meanwhile, the CRO-KT model fuses the learned relational embeddings from the bipartite graph with the optimized record representations in a weighted manner, enhancing the expression of students' cognition. Finally, experiments are conducted on three publicly available datasets to validate the effectiveness of the proposed cognitive representation optimization model.
zh

[AI-89] Lifting Factor Graphs with Some Unknown Factors for New Individuals

【速读】:该论文试图解决在包含未知因子的因子图中进行精确概率推理的问题。传统方法难以处理未知因子导致的语义不明确性,从而影响推理结果的有效性和效率。论文的关键在于提出了一种名为Lifting Factor Graphs with Some Unknown Factors (LIFAGU) 的算法,通过识别因子图中的不可区分子图,将已知势函数传递到未知势函数,确保模型具有明确的语义,并支持提升后的概率推理。此外,LIFAGU 进一步结合关于属于同一对象的因子组的背景知识,以进一步减少在传递已知势函数到未知势函数时的歧义性。

链接: https://arxiv.org/abs/2504.04089
作者: Malte Luttermann,Ralf Möller,Marcel Gehrke
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the International Journal of Approximate Reasoning, Volume 179 (2025). This paper is a revised and extended version of arXiv:2406.01275

点击查看摘要

Abstract:Lifting exploits symmetries in probabilistic graphical models by using a representative for indistinguishable objects, allowing to carry out query answering more efficiently while maintaining exact answers. In this paper, we investigate how lifting enables us to perform probabilistic inference for factor graphs containing unknown factors, i.e., factors whose underlying function of potential mappings is unknown. We present the Lifting Factor Graphs with Some Unknown Factors (LIFAGU) algorithm to identify indistinguishable subgraphs in a factor graph containing unknown factors, thereby enabling the transfer of known potentials to unknown potentials to ensure a well-defined semantics of the model and allow for (lifted) probabilistic inference. We further extend LIFAGU to incorporate additional background knowledge about groups of factors belonging to the same individual object. By incorporating such background knowledge, LIFAGU is able to further reduce the ambiguity of possible transfers of known potentials to unknown potentials.
zh

[AI-90] Towards An Efficient and Effective En Route Travel Time Estimation Framework DASFAA2025

【速读】:该论文旨在解决传统方法在路径剩余旅行时间估计 (En route Travel Time Estimation, ER-TTE) 中因频繁重估而导致实时性能下降的问题,特别是在处理大量并发用户请求时,这种频繁重估会引入显著延迟并降低响应速度。论文提出了一种通用高效的框架 U-ERTTE,其关键在于结合不确定性引导决策机制 (Uncertainty-Guided Decision, UGD) 和基于元学习的微调技术 (Fine-Tuning with Meta-Learning, FTML)。其中,UGD 通过量化不确定性并提供整个路径的置信区间,在实际旅行时间偏离预测区间时才选择性地重新估算,从而优化 ER-TTE 的效率;而 FTML 则用于训练模型,使其能够学习通用驾驶模式与特定任务特征以提高预测准确性。实验结果表明,U-ERTTE 在保持高效果的同时显著提升了推理速度和吞吐量。

链接: https://arxiv.org/abs/2504.04086
作者: Zekai Shen,Haitao Yuan,Xiaowei Mao,Congkang Lv,Shengnan Guo,Youfang Lin,Huaiyu Wan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by DASFAA 2025

点击查看摘要

Abstract:En route travel time estimation (ER-TTE) focuses on predicting the travel time of the remaining route. Existing ER-TTE methods always make re-estimation which significantly hinders real-time performance, especially when faced with the computational demands of simultaneous user requests. This results in delays and reduced responsiveness in ER-TTE services. We propose a general efficient framework U-ERTTE combining an Uncertainty-Guided Decision mechanism (UGD) and Fine-Tuning with Meta-Learning (FTML) to address these challenges. UGD quantifies the uncertainty and provides confidence intervals for the entire route. It selectively re-estimates only when the actual travel time deviates from the predicted confidence intervals, thereby optimizing the efficiency of ER-TTE. To ensure the accuracy of confidence intervals and accurate predictions that need to re-estimate, FTML is employed to train the model, enabling it to learn general driving patterns and specific features to adapt to specific tasks. Extensive experiments on two large-scale real datasets demonstrate that the U-ERTTE framework significantly enhances inference speed and throughput while maintaining high effectiveness. Our code is available at this https URL
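UGD 机制的核心逻辑可以用几行代码示意:只有当实际观测落在预测置信区间之外时才触发重新估计。区间与数值均为虚构,仅演示判断流程:

```python
# 假设性示例:不确定性引导决策(UGD)的核心判断逻辑——
# 仅当实际观测落在预测置信区间之外时才触发重新估计。数值为虚构。

def needs_reestimate(predicted, half_width, observed):
    lower, upper = predicted - half_width, predicted + half_width
    return not (lower <= observed <= upper)

def en_route_monitor(predicted, half_width, observations):
    # 返回触发重估的检查点下标;未触发的检查点直接复用已有估计,节省计算
    return [i for i, obs in enumerate(observations)
            if needs_reestimate(predicted, half_width, obs)]

# 预测剩余行程 30 分钟,置信半宽 5 分钟;第 3 个检查点出现大幅偏离
print(en_route_monitor(30.0, 5.0, [29.0, 33.0, 41.0]))  # [2]
```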
zh

[AI-91] Among Us: A Sandbox for Agentic Deception

【速读】:该论文试图解决在人工智能代理中研究欺骗行为缺乏合适的模型系统和沙盒环境的问题。传统方法通常需要在特定条件下引导模型表现欺骗行为或插入故意后门,这限制了研究的自然性和普适性。为了解决这一问题,论文基于《Among Agents》这一文本社交推理游戏环境,引入《Among Us》作为更丰富的沙盒环境,使大型语言模型(LLM)代理能够在与其它代理或人类交互时自然展现出类似人类的欺骗行为。解决方案的关键在于通过设计“Deception ELO”这一无界指标来量化欺骗能力,并评估多种AI安全技术(如输出的LLM监控、线性探针以及稀疏自编码器)在检测《Among Us》中欺骗行为方面的有效性,同时验证这些技术在分布外场景中的良好泛化性能。此外,论文开源了该沙盒环境,作为未来对齐研究的基准,旨在改进检测和消除由代理驱动的欺骗行为的技术,并预测大型语言模型中的潜在欺骗能力。

链接: https://arxiv.org/abs/2504.04072
作者: Satvik Golechha,Adrià Garriga-Alonso
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, preprint

点击查看摘要

Abstract:Studying deception in AI agents is important and difficult due to the lack of model organisms and sandboxes that elicit the behavior without asking the model to act under specific conditions or inserting intentional backdoors. Extending upon *AmongAgents*, a text-based social-deduction game environment, we aim to fix this by introducing Among Us as a rich sandbox where LLM-agents exhibit human-style deception naturally while they think, speak, and act with other agents or humans. We introduce Deception ELO as an unbounded measure of deceptive capability, suggesting that frontier models win more because they’re better at deception, not at detecting it. We evaluate the effectiveness of AI safety techniques (LLM-monitoring of outputs, linear probes on various datasets, and sparse autoencoders) for detecting lying and deception in Among Us, and find that they generalize very well out-of-distribution. We open-source our sandbox as a benchmark for future alignment research and hope that this is a good testbed to improve safety techniques to detect and remove agentically-motivated deception, and to anticipate deceptive abilities in LLMs.
zh
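上文摘要提出以无上界的“Deception ELO”度量欺骗能力。其打分骨架可以用标准 Elo 更新来示意(以下为笔者的最小草图,函数接口与 K 值均为假设,论文的具体计分规则以原文为准):

```python
def elo_update(r_impostor, r_crew, impostor_won, k=32.0):
    """对一局对抗结果做标准 Elo 更新:胜者从败者处按期望差获得分数。"""
    expected = 1.0 / (1.0 + 10 ** ((r_crew - r_impostor) / 400.0))
    score = 1.0 if impostor_won else 0.0
    delta = k * (score - expected)
    return r_impostor + delta, r_crew - delta
```

胜率期望由 400 分制的逻辑斯蒂函数给出;“无上界”体现在评分可随连胜持续上升、没有封顶。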

[AI-92] Enforcement Agents : Enhancing Accountability and Resilience in Multi-Agent AI Frameworks

【速读】:该论文旨在解决多智能体系统中缺乏实时监督机制的问题,以确保自治代理在执行过程中行为安全并保持与系统目标的一致性。当前系统通常依赖于事后自我监控或修正,而无法实现实时监管。为了解决这一问题,论文提出了“执行代理(Enforcement Agent, EA)框架”,其关键在于通过在环境中嵌入专用的监督代理,实现实时监测、异常检测及干预修正。实验结果显示,引入EA显著提升了系统的安全性、运行持久性和恶意代理的重新校正率,验证了轻量级实时监督在提升多智能体系统一致性与鲁棒性方面的潜力。

链接: https://arxiv.org/abs/2504.04070
作者: Sagar Tamang,Dibya Jyoti Bora
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As autonomous agents become more powerful and widely used, it is becoming increasingly important to ensure they behave safely and stay aligned with system goals, especially in multi-agent settings. Current systems often rely on agents self-monitoring or correcting issues after the fact, but they lack mechanisms for real-time oversight. This paper introduces the Enforcement Agent (EA) Framework, which embeds dedicated supervisory agents into the environment to monitor others, detect misbehavior, and intervene through real-time correction. We implement this framework in a custom drone simulation and evaluate it across 90 episodes using 0, 1, and 2 EA configurations. Results show that adding EAs significantly improves system safety: success rates rise from 0.0% with no EA to 7.4% with one EA and 26.7% with two EAs. The system also demonstrates increased operational longevity and higher rates of malicious drone reformation. These findings highlight the potential of lightweight, real-time supervision for enhancing alignment and resilience in multi-agent systems.
zh

[AI-93] Mapping at First Sense: A Lightweight Neural Network-Based Indoor Structures Prediction Method for Robot Autonomous Exploration

【速读】:该论文旨在解决机器人在未知结构化环境中自主探索效率低下的问题,特别是传统基于边界的探索策略难以有效利用室内空间结构先验知识的局限性。为了解决这一问题,论文提出了一种名为Mapping at First Sense的轻量级神经网络方法,其核心是SenseMapNet模型,通过结合卷积和基于Transformer的架构,实现对局部地图中未观测区域的预测,同时保持计算效率以实现实时部署。关键在于SenseMapNet能够高效推断被遮挡区域,从而提升探索效率,并通过引入SenseMapDataset数据集进一步优化训练与评估。实验结果显示,该方法在地图重建质量(SSIM=0.78,LPIPS=0.68,FID=239.79)和探索时间(减少46.5%)方面优于传统方法,同时保持高覆盖率(88%)和重建精度(88%)。

链接: https://arxiv.org/abs/2504.04061
作者: Haojia Gao,Haohua Que,Kunrong Li,Weihao Shan,Mingkai Liu,Rong Zhao,Lei Mu,Xinghua Yang,Qi Wei,Fei Qiao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous exploration in unknown environments is a critical challenge in robotics, particularly for applications such as indoor navigation, search and rescue, and service robotics. Traditional exploration strategies, such as frontier-based methods, often struggle to efficiently utilize prior knowledge of structural regularities in indoor spaces. To address this limitation, we propose Mapping at First Sense, a lightweight neural network-based approach that predicts unobserved areas in local maps, thereby enhancing exploration efficiency. The core of our method, SenseMapNet, integrates convolutional and transformer-based architectures to infer occluded regions while maintaining computational efficiency for real-time deployment on resource-constrained robots. Additionally, we introduce SenseMapDataset, a curated dataset constructed from KTH and HouseExpo environments, which facilitates training and evaluation of neural models for indoor exploration. Experimental results demonstrate that SenseMapNet achieves an SSIM (structural similarity) of 0.78, LPIPS (perceptual quality) of 0.68, and an FID (feature distribution alignment) of 239.79, outperforming conventional methods in map reconstruction quality. Compared to traditional frontier-based exploration, our method reduces exploration time by 46.5% (from 2335.56s to 1248.68s) while maintaining a high coverage rate (88%) and achieving a reconstruction accuracy of 88%. The proposed method represents a promising step toward efficient, learning-driven robotic exploration in structured environments.
zh

[AI-94] PIORF: Physics-Informed Ollivier-Ricci Flow for Long-Range Interactions in Mesh Graph Neural Networks ICLR2025

【速读】:该论文旨在解决基于图神经网络的数据驱动仿真器在模拟流体物理系统时面临的“过压缩(over-squashing)”问题,尤其是在细化网格区域中难以处理流场的长程依赖关系。为应对这一挑战,论文提出了一种名为Physics-Informed Ollivier-Ricci Flow (PIORF) 的新型图重连方法。该方法的关键在于结合物理关联与图拓扑结构:通过Ollivier-Ricci曲率(Ollivier-Ricci Curvature, ORC)识别瓶颈区域,并将这些区域与高梯度速度节点连接,从而实现长程交互并缓解过压缩现象。此外,PIORF在重连边操作上具有计算效率优势,并可扩展至更大规模的仿真任务。实验结果表明,PIORF在三个流体动力学基准数据集上的性能显著优于基线模型及现有重连方法,提升了高达26.2%的性能指标。

链接: https://arxiv.org/abs/2504.04052
作者: Youn-Yeol Yu,Jeongwhan Choi,Jaehyeon Park,Kookjin Lee,Noseong Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2025. Youn-Yeol Yu and Jeongwhan Choi contributed equally to this work

点击查看摘要

Abstract:Recently, data-driven simulators based on graph neural networks have gained attention in modeling physical systems on unstructured meshes. However, they struggle with long-range dependencies in fluid flows, particularly in refined mesh regions. This challenge, known as the ‘over-squashing’ problem, hinders information propagation. While existing graph rewiring methods address this issue to some extent, they only consider graph topology, overlooking the underlying physical phenomena. We propose Physics-Informed Ollivier-Ricci Flow (PIORF), a novel rewiring method that combines physical correlations with graph topology. PIORF uses Ollivier-Ricci curvature (ORC) to identify bottleneck regions and connects these areas with high-velocity gradient nodes, enabling long-range interactions and mitigating over-squashing. Our approach is computationally efficient in rewiring edges and can scale to larger simulations. Experimental results on 3 fluid dynamics benchmark datasets show that PIORF consistently outperforms baseline models and existing rewiring methods, achieving up to a 26.2% improvement.
zh
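PIORF 先定位瓶颈边,再向高速度梯度节点补长程边。论文使用的 Ollivier-Ricci 曲率需要最优传输计算,此处仅以“最短路径经过次数”这一粗略代理来示意“找瓶颈边”这一步(纯标准库草图,并非论文所用的 ORC):

```python
from collections import deque

def bfs_parents(adj, s):
    """从 s 出发做 BFS,返回最短路径树的父指针。"""
    parent = {s: None}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                q.append(v)
    return parent

def bottleneck_scores(adj):
    """统计每条无向边落在 BFS 最短路径上的次数,作为瓶颈程度的粗略代理。"""
    score = {}
    nodes = sorted(adj)
    for s in nodes:
        parent = bfs_parents(adj, s)
        for t in nodes:
            if t == s:
                continue
            v = t
            while parent[v] is not None:
                e = tuple(sorted((v, parent[v])))
                score[e] = score.get(e, 0) + 1
                v = parent[v]
    return score

# 两个三角形由一条“桥”相连:桥边 (2, 3) 应得到最高得分,
# 重连时即可围绕它向远端节点补边以缓解 over-squashing
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
scores = bottleneck_scores(adj)
bridge = max(scores, key=scores.get)
```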

[AI-95] ADAPT: Actively Discovering and Adapting to Preferences for any Task

【速读】:本文旨在解决辅助代理在执行未明确指定的长期任务时,如何有效识别并适应用户偏好以满足用户需求的问题。论文提出了一套名为ADAPT(Actively Discovering and Adapting to Preferences for any Task)的基准测试,用于评估代理在多种家务任务中遵循用户偏好的能力,并强调通过主动提问来实现这一目标的重要性。为了解决现有方法因缺乏有效提问和对用户偏好的不良适应性而导致的不足,论文进一步提出了Reflection-DPO这一新颖的训练方法。Reflection-DPO通过微调“学生”大语言模型(LLM)来模仿“教师”大语言模型的行为,并选择性地提出问题以获取必要信息,从而更准确地预测教师的动作。关键在于Reflection-DPO不仅提升了对用户偏好的遵循度,还显著优于零样本链式思维基线,在未见过的用户测试中提高了6.1%的满意度。

链接: https://arxiv.org/abs/2504.04040
作者: Maithili Patel,Xavier Puig,Ruta Desai,Roozbeh Mottaghi,Sonia Chernova,Joanne Truong,Akshara Rai
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Assistive agents should be able to perform under-specified long-horizon tasks while respecting user preferences. We introduce Actively Discovering and Adapting to Preferences for any Task (ADAPT) – a benchmark designed to evaluate agents’ ability to adhere to user preferences across various household tasks through active questioning. Next, we propose Reflection-DPO, a novel training approach for adapting large language models (LLMs) to the task of active questioning. Reflection-DPO finetunes a ‘student’ LLM to follow the actions of a privileged ‘teacher’ LLM, and optionally ask a question to gather necessary information to better predict the teacher action. We find that prior approaches that use state-of-the-art LLMs fail to sufficiently follow user preferences in ADAPT due to insufficient questioning and poor adherence to elicited preferences. In contrast, Reflection-DPO achieves a higher rate of satisfying user preferences, outperforming a zero-shot chain-of-thought baseline by 6.1% on unseen users.
zh
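Reflection-DPO 建立在 DPO(Direct Preference Optimization)之上。标准 DPO 损失的数值形式可示意如下(β 取值为笔者假设;Reflection-DPO 额外引入的提问动作与教师监督细节见原文):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """标准 DPO 损失:偏好回答相对参考模型的对数概率优势
    应高于非偏好回答;margin 越大,损失越小。"""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

当两个回答相对参考模型均无优势时,margin 为 0,损失退化为 -log(0.5)。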

[AI-96] Memory-Statistics Tradeoff in Continual Learning with Structural Regularization

【速读】:该论文旨在研究在随机设计框架下具有两个线性回归任务的连续学习(Continual Learning)问题的统计性能。论文提出了一种结构化正则化算法,通过引入针对前一任务Hessian矩阵定制的广义 (\ell_2)-正则化来缓解灾难性遗忘(Catastrophic Forgetting)。该算法的关键在于通过调整结构化正则化中向量的数量,在记忆复杂度(Memory Complexity)与统计效率之间实现平衡。具体而言,增加结构化正则化中的向量数量会恶化记忆复杂度,但改善超额风险(Excess Risk),反之亦然。理论分析表明,未经正则化的简单连续学习方法容易遭受灾难性遗忘,而结构化正则化能够有效缓解这一问题,并且其性能接近于同时访问两个任务的联合训练(Joint Training)。因此,论文强调了曲率感知正则化(Curvature-Aware Regularization)在连续学习中的关键作用。

链接: https://arxiv.org/abs/2504.04039
作者: Haoran Li,Jingfeng Wu,Vladimir Braverman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We study the statistical performance of a continual learning problem with two linear regression tasks in a well-specified random design setting. We consider a structural regularization algorithm that incorporates a generalized \ell_2-regularization tailored to the Hessian of the previous task for mitigating catastrophic forgetting. We establish upper and lower bounds on the joint excess risk for this algorithm. Our analysis reveals a fundamental trade-off between memory complexity and statistical efficiency, where memory complexity is measured by the number of vectors needed to define the structural regularization. Specifically, increasing the number of vectors in structural regularization leads to a worse memory complexity but an improved excess risk, and vice versa. Furthermore, our theory suggests that naive continual learning without regularization suffers from catastrophic forgetting, while structural regularization mitigates this issue. Notably, structural regularization achieves comparable performance to joint training with access to both tasks simultaneously. These results highlight the critical role of curvature-aware regularization for continual learning.
zh
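摘要中的“广义 \ell_2 正则”可示意为:把参数增量投影到若干向量(例如上一任务 Hessian 的主特征向量,此为笔者假设的常见选法)上再加权惩罚;向量个数即文中的“记忆复杂度”:

```python
def structural_reg(w, w_prev, vectors, lams):
    """广义 L2 正则:R(w) = Σ_i λ_i * (v_i · (w - w_prev))^2。
    vectors 越多,正则越贴近完整 Hessian,但需存储的向量(记忆)也越多。"""
    diff = [a - b for a, b in zip(w, w_prev)]
    reg = 0.0
    for v, lam in zip(vectors, lams):
        proj = sum(vi * di for vi, di in zip(v, diff))
        reg += lam * proj * proj
    return reg
```

只保留 1 个方向时,偏离该方向的参数变化不受惩罚;补齐全部方向后,正则即等价于完整的二次型惩罚。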

[AI-97] Contrastive and Variational Approaches in Self-Supervised Learning for Complex Data Mining

【速读】:该论文旨在解决复杂数据环境中未标记数据的特征提取与分类任务问题。论文提出了一种基于自监督学习(Self-Supervised Learning)的算法,并通过系统性实验验证其有效性。解决方案的关键在于优化器的选择与学习率的设定,研究发现AdamW优化器结合0.002的学习率在所有评估指标上表现最优,表明自适应优化方法能够提升模型性能。此外,消融实验分析了各模块的贡献,结果表明对比学习(Contrastive Learning)、变分模块(Variational Modules)以及数据增强策略对模型的泛化能力和鲁棒性至关重要。通过损失函数收敛曲线的分析,实验验证了该方法在训练过程中能够稳定收敛且有效避免严重过拟合。进一步实验表明,该模型对不同数据集具有强适应性,能够从未标记数据中有效提取高质量特征并提高分类准确性,同时在不同数据分布条件下仍保持高检测精度,证明其适用于复杂数据环境。

链接: https://arxiv.org/abs/2504.04032
作者: Yingbin Liang,Lu Dai,Shuo Shi,Minghao Dai,Junliang Du,Haige Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 5 pages

点击查看摘要

Abstract:Complex data mining has wide application value in many fields, especially in the feature extraction and classification tasks of unlabeled data. This paper proposes an algorithm based on self-supervised learning and verifies its effectiveness through experiments. The study found that in terms of the selection of optimizer and learning rate, the combination of AdamW optimizer and 0.002 learning rate performed best in all evaluation indicators, indicating that the adaptive optimization method can improve the performance of the model in complex data mining tasks. In addition, the ablation experiment further analyzed the contribution of each module. The results show that contrastive learning, variational modules, and data augmentation strategies play a key role in the generalization ability and robustness of the model. Through the convergence curve analysis of the loss function, the experiment verifies that the method can converge stably during the training process and effectively avoid serious overfitting. Further experimental results show that the model has strong adaptability on different data sets, can effectively extract high-quality features from unlabeled data, and improves classification accuracy. At the same time, under different data distribution conditions, the method can still maintain high detection accuracy, proving its applicability in complex data environments. This study analyzed the role of self-supervised learning methods in complex data mining through systematic experiments and verified its advantages in improving feature extraction quality, optimizing classification performance, and enhancing model stability.
zh
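摘要指出对比学习模块对模型泛化起关键作用。以常见的 InfoNCE 损失为例,给出一个数值稳定的最小实现(论文未给出具体损失形式,此处仅为笔者假设的惯用选择):

```python
import math

def info_nce(sim_pos, sim_negs, tau=0.1):
    """InfoNCE 对比损失:-log( e^{s+/τ} / (e^{s+/τ} + Σ e^{s-/τ}) )。
    先减去最大 logit,避免指数溢出。"""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))
```

正样本相似度越高于负样本,损失越小;当三者相似度相等时,损失为 log(候选数)。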

[AI-98] Foundation Models for Time Series: A Survey

【速读】:该论文旨在系统梳理基于 Transformer 的时间序列分析预训练基础模型的当前研究进展,并提出了一种新颖的分类法以多维度对这些模型进行归类。论文试图解决的问题是如何全面评估和理解现有模型的能力与局限性,同时揭示未来研究的潜在方向。关键在于通过引入新的分类维度,包括模型架构设计(如基于补丁表示与直接处理原始序列)、预测方式(概率或确定性)、时间序列类型(单变量或多变量支持)、模型规模与复杂度,以及训练目标函数类型,从而为研究人员和实践者提供一个综合性的资源,帮助其洞察当前趋势并明确未来研究方向。

链接: https://arxiv.org/abs/2504.04011
作者: Siva Rama Krishna Kottapalli,Karthik Hubli,Sandeep Chandrashekhara,Garima Jain,Sunayana Hubli,Gayathri Botla,Ramesh Doddaiah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformer-based foundation models have emerged as a dominant paradigm in time series analysis, offering unprecedented capabilities in tasks such as forecasting, anomaly detection, classification, and trend analysis, among many other time series analysis tasks. This survey provides a comprehensive overview of current state-of-the-art pre-trained foundation models, introducing a novel taxonomy to categorize them across several dimensions. Specifically, we classify models by their architecture design, distinguishing between those leveraging patch-based representations and those operating directly on raw sequences. The taxonomy further includes whether the models provide probabilistic or deterministic predictions, and whether they are designed to work with univariate time series or can handle multivariate time series out of the box. Additionally, the taxonomy encompasses model scale and complexity, highlighting differences between lightweight architectures and large-scale foundation models. A unique aspect of this survey is its categorization by the type of objective function employed during the training phase. By synthesizing these perspectives, this survey serves as a resource for researchers and practitioners, providing insights into current trends and identifying promising directions for future research in transformer-based time series modeling.
zh
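该综述按“patch 表示”与“直接处理原始序列”区分架构。patch 化的核心操作非常简单:把一维序列切成(可重叠的)定长片段,作为 Transformer 的 token 序列(参数名为笔者假设):

```python
def to_patches(series, patch_len, stride):
    """把一维时间序列切成 patch;stride < patch_len 时相邻 patch 重叠。"""
    return [series[i:i + patch_len]
            for i in range(0, len(series) - patch_len + 1, stride)]
```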

[AI-99] Improving Offline Mixed-Criticality Scheduling with Reinforcement Learning

【速读】:该论文致力于解决混合关键性(Mixed-Criticality, MC)系统在异构处理器上的非抢占式调度问题,这是一个已知的NP难问题。为应对这一挑战,论文提出了一种基于强化学习(Reinforcement Learning, RL)的新方法,将调度问题建模为马尔可夫决策过程(Markov Decision Process, MDP),并开发了一个能够生成接近最优调度方案的RL代理。关键在于通过优先处理高关键性任务,同时保持整体系统性能,从而显著提升了任务完成率,在多种实验条件下实现了高达80%-94%的总体任务完成率以及85%-93%的高关键性任务完成率,展示了RL调度器在实时及安全性关键应用中的潜力。

链接: https://arxiv.org/abs/2504.03994
作者: Muhammad El-Mahdy,Nourhan Sakr,Rodrigo Carrasco
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注: This work was submitted to the 32nd International Conference on Real-Time Networks and Systems (RTNS) on June 8, 2024

点击查看摘要

Abstract:This paper introduces a novel reinforcement learning (RL) approach to scheduling mixed-criticality (MC) systems on processors with varying speeds. Building upon the foundation laid by [1], we extend their work to address the non-preemptive scheduling problem, which is known to be NP-hard. By modeling this scheduling challenge as a Markov Decision Process (MDP), we develop an RL agent capable of generating near-optimal schedules for real-time MC systems. Our RL-based scheduler prioritizes high-critical tasks while maintaining overall system performance. Through extensive experiments, we demonstrate the scalability and effectiveness of our approach. The RL scheduler significantly improves task completion rates, achieving around 80% overall and 85% for high-criticality tasks across 100,000 instances of synthetic data and real data under varying system conditions. Moreover, under stable conditions without degradation, the scheduler achieves 94% overall task completion and 93% for high-criticality tasks. These results highlight the potential of RL-based schedulers in real-time and safety-critical applications, offering substantial improvements in handling complex and dynamic scheduling scenarios.
zh
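作为对照,非抢占式混合关键性调度的一个朴素贪心基线(高关键性优先、截止时间次之)可写成如下草图;论文的 RL 调度器正是要学到优于此类启发式的策略(以下并非论文方法,任务字段为笔者假设):

```python
def greedy_schedule(tasks):
    """单处理器非抢占示意:按(关键性降序、截止时间升序)排序后依次执行。
    tasks: (name, crit, duration, deadline);返回能按时完成的任务名列表。"""
    order = sorted(tasks, key=lambda t: (-t[1], t[3]))
    t, done = 0, []
    for name, crit, dur, dl in order:
        if t + dur <= dl:  # 不可抢占:只有确定能按时完成才开始执行
            t += dur
            done.append(name)
    return done
```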

[AI-100] CORTEX-AVD: CORner Case Testing EXploration for Autonomous Vehicles Development

【速读】:该论文旨在解决自动驾驶车辆(Autonomous Vehicles, AVs)测试中罕见高风险交通场景(Corner Cases, CC)生成效率低的问题。传统方法依赖于昂贵且危险的真实世界数据采集,而基于仿真的技术则面临建模复杂性和时间消耗的挑战。为克服这些限制,论文提出了一种名为CORTEX-AVD(CORner Case Testing EXploration for Autonomous Vehicles Development)的开源框架,其关键在于通过集成CARLA仿真器与Scenic语言,从文本描述自动生成Corner Cases,同时利用遗传算法(Genetic Algorithms, GA)优化场景参数以增加高风险事件的发生频率。此外,该框架引入了一个考虑距离、时间、速度及碰撞可能性等多因素的适应度函数,并提供了基于GA的Corner Cases生成方法基准,促进了合成数据生成和场景评估的标准化比较。实验结果表明,CORTEX-AVD显著提高了Corner Cases的生成效率,同时减少了无效模拟的比例。

链接: https://arxiv.org/abs/2504.03989
作者: Gabriel Shimanuki,Alexandre Nascimento,Lucio Vismari,Joao Camargo Jr,Jorge Almeida Jr,Paulo Cugnasca
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 10 pages, 10 figures

点击查看摘要

Abstract:Autonomous Vehicles (AVs) aim to improve traffic safety and efficiency by reducing human error. However, ensuring AV reliability and safety is a challenging task when rare, high-risk traffic scenarios are considered. These ‘Corner Case’ (CC) scenarios, such as unexpected vehicle maneuvers or sudden pedestrian crossings, must be dealt with safely and reliably by AVs during their operations, but they are hard to generate efficiently. Traditional CC generation relies on costly and risky real-world data acquisition, limiting scalability, and slowing research and development progress. Simulation-based techniques also face challenges, as modeling diverse scenarios and capturing all possible CCs is complex and time-consuming. To address these limitations in CC generation, this research introduces CORTEX-AVD, CORner Case Testing EXploration for Autonomous Vehicles Development, an open-source framework that integrates the CARLA Simulator and Scenic to automatically generate CCs from textual descriptions, increasing the diversity and automation of scenario modeling. Genetic Algorithms (GA) are used to optimize the scenario parameters in six case study scenarios, increasing the occurrence of high-risk events. Unlike previous methods, CORTEX-AVD incorporates a multi-factor fitness function that considers variables such as distance, time, speed, and collision likelihood. Additionally, the study provides a benchmark for comparing GA-based CC generation methods, contributing to a more standardized evaluation of synthetic data generation and scenario assessment. Experimental results demonstrate that the CORTEX-AVD framework significantly increases CC incidence while reducing the proportion of wasted simulations.
zh
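摘要提到的“多因子适应度函数”综合距离、时间、速度与碰撞可能性。一个加权组合的示意如下(权重与归一化方式均为笔者假设,仅演示 GA 如何“奖励”更危险的场景):

```python
def corner_case_fitness(dist_m, ttc_s, speed_mps, collided,
                        w=(1.0, 1.0, 0.1, 10.0)):
    """多因子适应度示意:距离越近、碰撞时间(TTC)越短、速度越高、
    发生碰撞,得分越高,GA 选择时即更倾向保留该场景。"""
    w_d, w_t, w_v, w_c = w
    score = w_d / (1.0 + dist_m) + w_t / (1.0 + ttc_s) + w_v * speed_mps
    if collided:
        score += w_c
    return score
```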

[AI-101] V-CEM: Bridging Performance and Intervenability in Concept-based Models

【速读】:该论文致力于解决概念嵌入模型(Concept Embedding Models, CEMs)在干预响应性(intervention responsiveness)方面的不足,尤其是在Out-Of-Distribution (OOD) 场景下的性能下降问题。同时,它也关注如何在保持In-Distribution (ID) 准确性的同时,提升概念嵌入表示的质量和干预效果。论文的关键创新在于提出了一种基于变分推理的变分概念嵌入模型(Variational Concept Embedding Model, V-CEM),通过引入变分方法增强CEMs的干预响应能力,使其在OOD场景下能够实现与概念瓶颈模型(Concept Bottleneck Models, CBMs)相当的干预效果,同时保留CEMs在ID场景中的高性能。这一解决方案的核心在于结合变分推理以优化概念嵌入表示,并平衡模型的可解释性和泛化能力。

链接: https://arxiv.org/abs/2504.03978
作者: Francesco De Santis,Gabriele Ciravegna,Philippe Bich,Danilo Giordano,Tania Cerquitelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Paper accepted at: The 3rd World Conference on Explainable Artificial Intelligence

点击查看摘要

Abstract:Concept-based eXplainable AI (C-XAI) is a rapidly growing research field that enhances AI model interpretability by leveraging intermediate, human-understandable concepts. This approach not only enhances model transparency but also enables human intervention, allowing users to interact with these concepts to refine and improve the model’s performance. Concept Bottleneck Models (CBMs) explicitly predict concepts before making final decisions, enabling interventions to correct misclassified concepts. While CBMs remain effective in Out-Of-Distribution (OOD) settings with intervention, they struggle to match the performance of black-box models. Concept Embedding Models (CEMs) address this by learning concept embeddings from both concept predictions and input data, enhancing In-Distribution (ID) accuracy but reducing the effectiveness of interventions, especially in OOD scenarios. In this work, we propose the Variational Concept Embedding Model (V-CEM), which leverages variational inference to improve intervention responsiveness in CEMs. We evaluated our model on various textual and visual datasets in terms of ID performance, intervention responsiveness in both ID and OOD settings, and Concept Representation Cohesiveness (CRC), a metric we propose to assess the quality of the concept embedding representations. The results demonstrate that V-CEM retains CEM-level ID performance while achieving intervention effectiveness similar to CBM in OOD settings, effectively reducing the gap between interpretability (intervention) and generalization (performance).
zh
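V-CEM 借助变分推断学习概念嵌入。变分方法的两个核心构件(重参数化采样与对标准正态先验的 KL 散度)可示意如下(先验与接口为通用写法,并非论文实现细节):

```python
import math

def reparameterize(mu, log_var, eps):
    """重参数化采样:z = μ + σ·ε,使采样操作对 (μ, σ) 可导。"""
    return mu + math.exp(0.5 * log_var) * eps

def kl_to_std_normal(mu, log_var):
    """KL( N(μ, σ²) || N(0, 1) ) 的闭式解,常作为变分正则项。"""
    return 0.5 * (math.exp(log_var) + mu * mu - 1.0 - log_var)
```

后验恰为标准正态时 KL 为 0;μ 或 σ 偏离先验越远,正则惩罚越大。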

[AI-102] GREATERPROMPT: A Unified Customizable and High-Performing Open-Source Toolkit for Prompt Optimization

【速读】:本文旨在解决现有自动提示优化方法缺乏标准化与兼容性、定制灵活性有限、在不同模型规模上性能不一致,以及过度依赖昂贵专有大型语言模型(LLM)API的问题。关键解决方案在于提出GREATERPROMPT框架,通过统一的、可定制的API整合多种优化方法,同时针对不同任务生成高度有效的提示。该框架利用基于文本反馈的优化策略处理大规模LLM,并采用基于内部梯度的优化策略适配小型模型,从而实现高效且精准的提示改进。此外,提供友好的Web用户界面以降低非专家用户的使用门槛,促进广泛采用及性能提升。GREATERPROMPT已在GitHub、PyPI及Web用户界面开放获取。

链接: https://arxiv.org/abs/2504.03975
作者: Wenliang Zheng,Sarkar Snigdha Sarathi Das,Yusen Zhang,Rui Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLMs have gained immense popularity among researchers and the general public for their impressive capabilities on a variety of tasks. Notably, the efficacy of LLMs remains significantly dependent on the quality and structure of the input prompts, making prompt design a critical factor for their performance. Recent advancements in automated prompt optimization have introduced diverse techniques that automatically enhance prompts to better align model outputs with user expectations. However, these methods often suffer from a lack of standardization and compatibility across different techniques, limited flexibility in customization, inconsistent performance across model scales, and an exclusive reliance on expensive proprietary LLM APIs. To fill in this gap, we introduce GREATERPROMPT, a novel framework that democratizes prompt optimization by unifying diverse methods under a unified, customizable API while delivering highly effective prompts for different tasks. Our framework flexibly accommodates various model scales by leveraging both text feedback-based optimization for larger LLMs and internal gradient-based optimization for smaller models to achieve powerful and precise prompt improvements. Moreover, we provide a user-friendly Web UI that ensures accessibility for non-expert users, enabling broader adoption and enhanced performance across various user groups and application scenarios. GREATERPROMPT is available at this https URL via GitHub, PyPI, and web user interfaces.
zh
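提示优化框架的通用骨架是“生成候选 → 打分 → 保留最优”的迭代循环。以下草图用纯函数接口抽象这一流程(candidates_fn、score_fn 等名称均为笔者假设,并非 GREATERPROMPT 的实际 API;实际框架中二者分别由 LLM 改写与任务评测实现):

```python
def optimize_prompt(seed, candidates_fn, score_fn, steps=3):
    """通用提示优化骨架:每轮从当前最优提示生成候选,保留得分更高者。"""
    best, best_score = seed, score_fn(seed)
    for _ in range(steps):
        for cand in candidates_fn(best):
            s = score_fn(cand)
            if s > best_score:
                best, best_score = cand, s
    return best, best_score

# 演示:分数 = 提示中包含的目标关键词数(纯粹示意的打分桩)
KEYWORDS = {"step", "by", "reason"}
score = lambda p: len(KEYWORDS & set(p.split()))
cands = lambda p: [p + " " + k for k in sorted(KEYWORDS)]
best, s = optimize_prompt("think", cands, score, steps=3)
```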

[AI-103] Bridging LMS and Generative AI: Dynamic Course Content Integration (DCCI) for Connecting LLM s to Course Content – The Ask ME Assistant

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)与学习管理系统(Learning Management Systems, LMSs)集成过程中存在的幻觉问题(hallucination),即LLMs可能生成不准确或误导性信息的挑战。为应对这一问题,论文提出了一种名为动态课程内容集成(Dynamic Course Content Integration, DCCI)的机制。DCCI的关键在于通过提示工程(prompt engineering)将从Canvas LMS检索到的课程内容动态整合进由LLM驱动的助手Ask ME中,并确保这些内容在LLM的上下文窗口内保持准确性、相关性和语境一致性,从而有效缓解幻觉问题。

链接: https://arxiv.org/abs/2504.03966
作者: Kovan Mzwri(1),Márta Turcsányi-Szabo(2) ((1) Doctoral School of Informatics, Eötvös Loránd University, Budapest, Hungary, (2) Department of Media amp; Educational Technology, Faculty of Informatics, Eötvös Loránd University, Budapest, Hungary)
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) with Learning Management Systems (LMSs) has the potential to enhance task automation and accessibility in education. However, hallucination, where LLMs generate inaccurate or misleading information, remains a significant challenge. This study introduces the Dynamic Course Content Integration (DCCI) mechanism, which dynamically retrieves and integrates course content and curriculum from Canvas LMS into the LLM-powered assistant, Ask ME. By employing prompt engineering to structure retrieved content within the LLM’s context window, DCCI ensures accuracy, relevance, and contextual alignment, mitigating hallucination. To evaluate DCCI’s effectiveness, Ask ME’s usability, and broader student perceptions of AI in education, a mixed-methods approach was employed, incorporating user satisfaction ratings and a structured survey. Results from a pilot study indicate high user satisfaction (4.614/5), with students recognizing Ask ME’s ability to provide timely and contextually relevant responses for both administrative and course-related inquiries. Additionally, a majority of students agreed that Ask ME’s integration with course content in Canvas LMS reduced platform-switching, improving usability, engagement, and comprehension. AI’s role in reducing classroom hesitation and fostering self-directed learning and intellectual curiosity was also highlighted. Despite these benefits and positive perception of AI tools, concerns emerged regarding over-reliance on AI, accuracy limitations, and ethical issues such as plagiarism and reduced student-teacher interaction. These findings emphasize the need for strategic AI implementation, ethical safeguards, and a pedagogical framework that prioritizes human-AI collaboration over substitution.
zh
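DCCI 的核心是把从 LMS 检索到的课程内容组装进提示词,并控制在上下文窗口内。一个极简的组装与截断示意如下(函数名、提示模板与字符上限均为笔者假设,仅演示思路):

```python
def build_context(question, pages, max_chars=600):
    """把检索到的课程片段拼入提示词;超出上限的片段直接丢弃,
    保证提示始终落在(此处假设的)上下文窗口内。"""
    header = "仅依据以下课程内容作答,无法确定时请说明:\n"
    body = ""
    for title, text in pages:
        chunk = f"[{title}] {text}\n"
        if len(header) + len(body) + len(chunk) > max_chars:
            break
        body += chunk
    return header + body + "\n问题:" + question
```

把回答限定在检索内容之内,正是摘要所述“缓解幻觉”的关键手段。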

[AI-104] Optimizing UAV Aerial Base Station Flights Using DRL-based Proximal Policy Optimization

【速读】:该论文旨在解决在紧急情况下快速部署先进网络以最大化生命拯救潜力时,无人飞行器(UAV)基站的战略位置优化问题。解决方案的关键在于引入了一种基于自动强化学习的方法,使UAV能够动态与环境交互并确定最优配置。通过利用通信网络的射频信号感知能力,并采用近端策略优化(Proximal Policy Optimization, PPO)算法,该方法能够在多种用户设备(UE)移动模式下学习和推广定位策略,从而提供更现实的视角。

链接: https://arxiv.org/abs/2504.03961
作者: Mario Rico Ibanez,Azim Akhtarshenas,David Lopez-Perez,Giovanni Geraci
机构: 未知
类目: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Unmanned aerial vehicle (UAV)-based base stations offer a promising solution in emergencies where the rapid deployment of cutting-edge networks is crucial for maximizing life-saving potential. Optimizing the strategic positioning of these UAVs is essential for enhancing communication efficiency. This paper introduces an automated reinforcement learning approach that enables UAVs to dynamically interact with their environment and determine optimal configurations. By leveraging the radio signal sensing capabilities of communication networks, our method provides a more realistic perspective, utilizing a state-of-the-art algorithm, proximal policy optimization, to learn and generalize positioning strategies across diverse user equipment (UE) movement patterns. We evaluate our approach across various UE mobility scenarios, including static, random, linear, circular, and mixed hotspot movements. The numerical results demonstrate the algorithm’s adaptability and effectiveness in maintaining comprehensive coverage across all movement patterns.
zh
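摘要采用 PPO 学习无人机基站的定位策略。PPO 的核心是裁剪代理目标 min(r·A, clip(r, 1-ε, 1+ε)·A),其单样本计算如下(ε 默认值为常见设置):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO 裁剪目标(单样本):ratio 为新旧策略概率比,advantage 为优势估计。
    裁剪限制了单步更新中策略偏离旧策略的幅度。"""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```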

[AI-105] DeepOHeat-v1: Efficient Operator Learning for Fast and Trustworthy Thermal Simulation and Optimization in 3D-IC Design

【速读】:该论文旨在解决三维集成电路(3D-IC)设计中热分析面临的挑战,特别是由于功率密度增加和复杂的散热路径导致的多尺度热模式预测能力不足、训练效率低下以及优化过程中结果可信度低的问题。为应对这些挑战,论文提出了DeepOHeat-v1框架,其关键创新点包括:一是将可学习激活函数的Kolmogorov-Arnold网络作为主干网络,以自适应表示多尺度热模式,从而在两个代表性测试案例中分别实现了1.25倍和6.29倍的误差降低;二是引入可分离训练方法,沿坐标轴分解基函数,显著提升了训练速度(62倍)并减少了GPU内存占用(31倍),使之前因内存限制而无法实现的高分辨率热分析成为可能;三是提出置信分数评估预测结果的可靠性,并结合算子学习与有限差分法(Finite Difference, FD)的广义极小残量(Generalized Minimal Residual, GMRES)方法开发混合优化流程,实现高效且可信的热优化。实验表明,DeepOHeat-v1在保持与高保真有限差分求解器相当精度的同时,将优化过程加速了70.6倍,有效降低了峰值温度。

链接: https://arxiv.org/abs/2504.03955
作者: Xinling Yu,Ziyue Liu,Hai Li,Yixing Li,Xin Ai,Zhiyu Zeng,Ian Young,Zheng Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
备注: 14 pages, 14 figures

点击查看摘要

Abstract:Thermal analysis is crucial in three-dimensional integrated circuit (3D-IC) design due to increased power density and complex heat dissipation paths. Although operator learning frameworks such as DeepOHeat have demonstrated promising preliminary results in accelerating thermal simulation, they face critical limitations in prediction capability for multi-scale thermal patterns, training efficiency, and trustworthiness of results during design optimization. This paper presents DeepOHeat-v1, an enhanced physics-informed operator learning framework that addresses these challenges through three key innovations. First, we integrate Kolmogorov-Arnold Networks with learnable activation functions as trunk networks, enabling an adaptive representation of multi-scale thermal patterns. This approach achieves a 1.25× and 6.29× reduction in error in two representative test cases. Second, we introduce a separable training method that decomposes the basis function along the coordinate axes, achieving 62× training speedup and 31× GPU memory reduction in our baseline case, and enabling thermal analysis at resolutions previously infeasible due to GPU memory constraints. Third, we propose a confidence score to evaluate the trustworthiness of the predicted results, and further develop a hybrid optimization workflow that combines operator learning with finite difference (FD) using the Generalized Minimal Residual (GMRES) method for incremental solution refinement, enabling efficient and trustworthy thermal optimization. Experimental results demonstrate that DeepOHeat-v1 achieves accuracy comparable to optimization using high-fidelity finite difference solvers, while speeding up the entire optimization process by 70.6× in our test cases, effectively minimizing the peak temperature through optimal placement of heat-generating components.
zh
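“可分离训练”把解沿坐标轴分解为一维基函数的乘积和,例如 T(x, y) = Σ_r c_r·f_r(x)·g_r(y),每个轴只需存一组一维基函数而非整张网格。以下是该表示形式的玩具示意(真实实现中基函数由网络给出,此处用解析函数代替):

```python
import math

def separable_field(coeffs, fx, fy, x, y):
    """可分离表示:T(x, y) = Σ_r c_r · f_r(x) · g_r(y)。"""
    return sum(c * f(x) * g(y) for c, f, g in zip(coeffs, fx, fy))

# 秩 1 的玩具例子:T(x, y) = sin(x)·cos(y)
T = separable_field([1.0], [math.sin], [math.cos], math.pi / 2, 0.0)
```

对 N×N 网格,完整存储需要 N² 个值,而秩 R 的可分离表示只需约 2RN 个一维取样,这正是显存节省的来源。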

[AI-106] Understanding EFX Allocations: Counting and Variants AAAI

【速读】:本文旨在研究确定特定分配实例中 envy-freeness up to any good (EFX) 分配的最小数量,以此探索可能对 EFX 分配的存在性和计算提供重要见解的方法。论文聚焦于商品数量略多于代理数的受限实例,并扩展分析至加权 EFX (WEFX) 和针对一般单调估值的新变体 EFX+。关键在于识别满足这些公平性概念的分配存在性的转折阈值。特别地,通过证明二元可加估值下 WEFX 的多项式时间可计算性,以及首次为两个代理提出常数因子近似算法,解决了关于 WEFX 的开放问题。

链接: https://arxiv.org/abs/2504.03951
作者: Tzeh Yuan Neoh,Nicholas Teh
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
备注: Appears in the 39th AAAI Conference on Artificial Intelligence (AAAI), 2025

点击查看摘要

Abstract:Envy-freeness up to any good (EFX) is a popular and important fairness property in the fair allocation of indivisible goods, whose existence in general is still an open question. In this work, we investigate the problem of determining the minimum number of EFX allocations for a given instance, arguing that this approach may yield valuable insights into the existence and computation of EFX allocations. We focus on restricted instances where the number of goods slightly exceeds the number of agents, and extend our analysis to weighted EFX (WEFX) and a novel variant of EFX for general monotone valuations, termed EFX+. In doing so, we identify the transition threshold for the existence of allocations satisfying these fairness notions. Notably, we resolve open problems regarding WEFX by proving polynomial-time computability under binary additive valuations, and establishing the first constant-factor approximation for two agents.
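EFX 的定义可以用一个极小的纯 Python 草图来验证与计数(假设可加估值;函数名与暴力枚举方式均为示意,并非论文所用方法,仅适用于极小规模实例):

```python
from itertools import product

def is_efx(allocation, valuations):
    """判断一个完整分配在可加估值下是否满足 EFX。

    allocation: 每个代理的物品集合列表;valuations[i][g] 为代理 i 对物品 g 的估值。
    EFX 要求:对任意代理对 (i, j),从 j 的份额中移除任意一件物品后,i 都不再嫉妒 j。
    """
    n = len(allocation)
    for i in range(n):
        v_own = sum(valuations[i][g] for g in allocation[i])
        for j in range(n):
            if i == j or not allocation[j]:
                continue
            v_other = sum(valuations[i][g] for g in allocation[j])
            # 可加估值下,移除 i 眼中最不值钱的那件物品对应最弱的约束
            least = min(valuations[i][g] for g in allocation[j])
            if v_own < v_other - least:
                return False
    return True

def count_efx_allocations(n_agents, n_goods, valuations):
    """暴力枚举所有完整分配并统计其中的 EFX 分配数(仅适用于极小实例)。"""
    count = 0
    for assignment in product(range(n_agents), repeat=n_goods):
        bundles = [set() for _ in range(n_agents)]
        for good, agent in enumerate(assignment):
            bundles[agent].add(good)
        if is_efx(bundles, valuations):
            count += 1
    return count
```

例如 2 个代理、3 件同质物品(人人估值为 1)时,所有 2–1 拆分均为 EFX,共 6 种;3–0 拆分则不满足 EFX。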
zh

[AI-107] Analysis of Robustness of a Large Game Corpus

【速读】:该论文致力于解决游戏领域中基于机器学习的 procedural content generation (PCG) 在 2D 瓷砖地图生成上的挑战,特别是针对现有数据集规模较小且未能充分反映游戏关卡独特性质的问题。论文的关键在于定义并量化了游戏关卡的鲁棒性(robustness),即对输入微小变化导致输出改变的敏感程度,并以此作为衡量标准,分析和比较游戏关卡与现有先进机器学习数据集之间的本质差异。此外,作者构建了一个包含四款经典瓷砖类游戏关卡的大规模数据集,旨在缓解 PCGML 领域中数据稀疏的问题,为研究提供更丰富的训练资源。

链接: https://arxiv.org/abs/2504.03940
作者: Mahsa Bazzaz,Seth Cooper
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Procedural content generation via machine learning (PCGML) in games involves using machine learning techniques to create game content such as maps and levels. 2D tile-based game levels have consistently served as a standard dataset for PCGML because they are a simplified version of game levels while maintaining the specific constraints typical of games, such as being solvable. In this work, we highlight the unique characteristics of game levels, including their structured discrete data nature, the local and global constraints inherent in the games, and the sensitivity of the game levels to small changes in input. We define the robustness of data as a measure of sensitivity to small changes in input that cause a change in output, and we use this measure to analyze and compare these levels to state-of-the-art machine learning datasets, showcasing the subtle differences in their nature. We also constructed a large dataset from four games inspired by popular classic tile-based games that showcase these characteristics and address the challenge of sparse data in PCGML by providing a significantly larger dataset than those currently available.
zh

[AI-108] Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition DATE

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)是否真正具备高级推理能力的问题。尽管理论上基于链式思维(Chain-of-Thought, CoT)的自回归LLMs能够通过更多串行计算解决复杂的推理任务,但研究表明,当前LLMs并未真正学会推理,而是依赖于统计特征的拟合。为系统研究LLMs的推理能力,论文采用计算理论视角,并提出以3-SAT(典型的NP完全问题)为核心的实验协议。该协议通过调整问题实例的固有难度来表征最先进LLMs的推理能力。关键在于通过比较DeepSeek R1与其他LLMs在随机3-SAT问题相变阶段的表现,揭示两个重要洞察:(1) 当问题难度增加且统计捷径不可用时,所有现有模型的准确性显著下降;(2) 与其它LLMs不同,R1表现出可能已学习到底层推理机制的迹象。这一基于原理性实验协议的研究超越了通常依赖基准测试的LLM推理研究证据,指出了现有研究的重要差距并明确了未来研究的方向。

链接: https://arxiv.org/abs/2504.03930
作者: Rishi Hazra,Gabriele Venturato,Pedro Zuidberg Dos Martires,Luc De Raedt
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (cs.LG)
备注: An updated version of arXiv:2408.07215v2 , featuring: (1) inclusion of recent LRMs and recent LLMs, (2) revised conclusions reflecting recent developments, and (3) updated analysis

点击查看摘要

Abstract:Large Language Models (LLMs) have been touted as AI models possessing advanced reasoning abilities. In theory, autoregressive LLMs with Chain-of-Thought (CoT) can perform more serial computations to solve complex reasoning tasks. However, recent studies suggest that, despite this capacity, LLMs do not truly learn to reason but instead fit on statistical features. To study the reasoning capabilities in a principled fashion, we adopt a computational theory perspective and propose an experimental protocol centered on 3-SAT – the prototypical NP-complete problem lying at the core of logical reasoning and constraint satisfaction tasks. Specifically, we examine the phase transitions in random 3-SAT and characterize the reasoning abilities of state-of-the-art LLMs by varying the inherent hardness of the problem instances. By comparing DeepSeek R1 with other LLMs, our findings reveal two key insights: (1) LLM accuracy drops significantly on harder instances, suggesting all current models struggle when statistical shortcuts are unavailable; (2) unlike other LLMs, R1 shows signs of having learned the underlying reasoning. Following a principled experimental protocol, our study moves beyond the benchmark-driven evidence often found in LLM reasoning research. Our findings highlight important gaps and suggest clear directions for future research.
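论文实验协议所依赖的随机 3-SAT 相变现象,可用如下纯 Python 草图直观复现:当子句数/变量数之比 α 越过相变点(约 4.27)后,可满足实例的比例急剧下降(暴力求解,参数与函数名均为示意,仅适用于小规模实例):

```python
import random
from itertools import product

def random_3sat(n_vars, n_clauses, rng):
    """随机 3-SAT:每个子句取 3 个不同变量并随机取正负文字。"""
    return [[(v, rng.random() < 0.5) for v in rng.sample(range(n_vars), 3)]
            for _ in range(n_clauses)]

def satisfiable(n_vars, clauses):
    """暴力枚举所有赋值判断可满足性(仅适用于小 n)。"""
    return any(all(any(bits[v] == sign for v, sign in clause) for clause in clauses)
               for bits in product([False, True], repeat=n_vars))

def sat_fraction(n_vars, alpha, trials=50, seed=0):
    """在子句数/变量数之比 alpha 下,随机实例中可满足实例所占比例。"""
    rng = random.Random(seed)
    m = round(alpha * n_vars)
    return sum(satisfiable(n_vars, random_3sat(n_vars, m, rng))
               for _ in range(trials)) / trials
```

在相变点附近,该比例从接近 1 骤降到接近 0,对应论文中实例难度最高、统计捷径最难奏效的区域。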
zh

[AI-109] RF-BayesPhysNet: A Bayesian rPPG Uncertainty Estimation Method for Complex Scenarios

【速读】:该论文旨在解决远程光体积描记法(rPPG)技术在复杂场景下测量可靠性不足的问题,特别是在光照变化和头部运动等动态条件下,其测量准确性显著下降。现有深度学习模型通常忽视了测量不确定性量化的重要性,这限制了它们在动态场景中的可信度。为了解决这一问题,论文首次将贝叶斯神经网络引入rPPG领域,提出了鲁棒融合贝叶斯生理网络(RF-BayesPhysNet)。该模型的关键在于能够同时建模偶然不确定性(aleatoric uncertainty)与认知不确定性(epistemic uncertainty),并通过变分推理平衡了准确性和计算效率。此外,针对当前rPPG领域缺乏不确定性评估指标的情况,论文还提出了一组新的方法,包括使用Spearman相关系数、预测区间覆盖率和置信区间宽度来衡量不同噪声条件下的不确定性估计效果。实验结果表明,与传统网络模型相比,该模型仅增加了一倍参数量,在UBFC-RPPG数据集上的MAE达到2.56,超过了大多数模型,并在无噪声和低噪声条件下表现出良好的不确定性估计能力,提供了预测置信度并显著增强了实际应用中的鲁棒性。

链接: https://arxiv.org/abs/2504.03915
作者: Rufei Ma,Chao Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:Remote photoplethysmography (rPPG) technology infers heart rate by capturing subtle color changes in facial skin using a camera, demonstrating great potential in non-contact heart rate measurement. However, measurement accuracy significantly decreases in complex scenarios such as lighting changes and head movements compared to ideal laboratory conditions. Existing deep learning models often neglect the quantification of measurement uncertainty, limiting their credibility in dynamic scenes. To address the issue of insufficient rPPG measurement reliability in complex scenarios, this paper introduces Bayesian neural networks to the rPPG field for the first time, proposing the Robust Fusion Bayesian Physiological Network (RF-BayesPhysNet), which can model both aleatoric and epistemic uncertainty. It leverages variational inference to balance accuracy and computational efficiency. Due to the current lack of uncertainty estimation metrics in the rPPG field, this paper also proposes a new set of methods, using Spearman correlation coefficient, prediction interval coverage, and confidence interval width, to measure the effectiveness of uncertainty estimation methods under different noise conditions. Experiments show that the model, with only double the parameters compared to traditional network models, achieves a MAE of 2.56 on the UBFC-RPPG dataset, surpassing most models. It demonstrates good uncertainty estimation capability in no-noise and low-noise conditions, providing prediction confidence and significantly enhancing robustness in real-world applications. 
We have open-sourced the code at this https URL.
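论文采用的三项不确定性评估指标(Spearman 相关系数、预测区间覆盖率、置信区间宽度)本身都有标准定义,可用如下纯 Python 草图计算(示意实现,并非论文开源代码):

```python
def _ranks(xs):
    """平均秩(并列值取其名次的均值)。"""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman 秩相关系数(即对秩做 Pearson 相关)。"""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def picp(y_true, lower, upper):
    """预测区间覆盖率:真值落在 [lower, upper] 内的比例。"""
    covered = sum(l <= y <= u for y, l, u in zip(y_true, lower, upper))
    return covered / len(y_true)

def mean_interval_width(lower, upper):
    """平均置信区间宽度:区间越窄且覆盖率达标,不确定性估计越有效。"""
    return sum(u - l for l, u in zip(lower, upper)) / len(lower)
```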
zh

[AI-110] Investigating Affective Use and Emotional Well-being on ChatGPT

【速读】:该论文旨在研究与具有拟人化特征的人工智能(Anthropomorphic AI)如ChatGPT(尤其是高级语音模式)交互可能对用户的情感健康、行为及体验产生的影响。论文通过两项平行研究探索这一问题:一是以隐私保护的方式对超过300万次对话进行大规模自动化分析,并调查超过4,000名用户对ChatGPT的感知;二是开展一项由机构审查委员会(IRB)批准的随机对照试验(RCT),在近1,000名参与者中观察其在28天内情感健康的动态变化,同时考察不同实验条件下与ChatGPT交互的影响。关键在于结合大规模数据分析与严格的RCT设计,量化高频率使用AI聊天机器人与情感依赖之间的关系,并揭示语音交互对情感健康的复杂影响及其受初始情绪状态和总使用时长等因素调节的作用。

链接: https://arxiv.org/abs/2504.03888
作者: Jason Phang,Michael Lampe,Lama Ahmad,Sandhini Agarwal,Cathy Mengying Fang,Auren R. Liu,Valdemar Danry,Eunhae Lee,Samantha W.T. Chan,Pat Pataranutaporn,Pattie Maes
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As AI chatbots see increased adoption and integration into everyday life, questions have been raised about the potential impact of human-like or anthropomorphic AI on users. In this work, we investigate the extent to which interactions with ChatGPT (with a focus on Advanced Voice Mode) may impact users’ emotional well-being, behaviors and experiences through two parallel studies. To study the affective use of AI chatbots, we perform large-scale automated analysis of ChatGPT platform usage in a privacy-preserving manner, analyzing over 3 million conversations for affective cues and surveying over 4,000 users on their perceptions of ChatGPT. To investigate whether there is a relationship between model usage and emotional well-being, we conduct an Institutional Review Board (IRB)-approved randomized controlled trial (RCT) on close to 1,000 participants over 28 days, examining changes in their emotional well-being as they interact with ChatGPT under different experimental settings. In both on-platform data analysis and the RCT, we observe that very high usage correlates with increased self-reported indicators of dependence. From our RCT, we find the impact of voice-based interactions on emotional well-being to be highly nuanced, and influenced by factors such as the user’s initial emotional state and total usage duration. Overall, our analysis reveals that a small number of users are responsible for a disproportionate share of the most affective cues.
zh

[AI-111] Accurate GPU Memory Prediction for Deep Learning Jobs through Dynamic Analysis

【速读】:该论文旨在解决深度学习(Deep Learning, DL)模型训练过程中因显存不足(Out-Of-Memory, OOM)导致的资源利用效率低下问题,特别是在GPU集群环境中。传统OOM估计方法或依赖静态图分析,无法捕捉模型动态变化;或依赖GPU内存直接分析,加剧了稀缺GPU资源的竞争。论文提出的关键解决方案是VeritasEst,这是一种完全基于CPU的创新分析工具,能够在不访问目标GPU的情况下精确预测深度学习训练任务所需的峰值显存需求。其核心优势在于“离线”预测能力,可在任务调度前获取准确的显存占用信息,从而有效预防OOM并优化GPU资源分配。实验验证表明,与基线GPU内存估算器相比,VeritasEst将相对误差降低了84%,并将估算失败概率降低了73%。

链接: https://arxiv.org/abs/2504.03887
作者: Jiabo Shi,Yehia Elkhatib
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The benefits of Deep Learning (DL) impose significant pressure on GPU resources, particularly within GPU clusters, where Out-Of-Memory (OOM) errors present a primary impediment to model training and efficient resource utilization. Conventional OOM estimation techniques, relying either on static graph analysis or direct GPU memory profiling, suffer from inherent limitations: static analysis often fails to capture model dynamics, whereas GPU-based profiling intensifies contention for scarce GPU resources. To overcome these constraints, VeritasEst emerges. It is an innovative, entirely CPU-based analysis tool capable of accurately predicting the peak GPU memory required for DL training tasks without accessing the target GPU. This “offline” prediction capability is a core advantage of VeritasEst, allowing accurate memory footprint information to be obtained before task scheduling, thereby effectively preventing OOM and optimizing GPU allocation. Its performance was validated through thousands of experimental runs across convolutional neural network (CNN) models: compared to baseline GPU memory estimators, VeritasEst significantly reduces the relative error by 84% and lowers the estimation failure probability by 73%. VeritasEst represents a key step towards efficient and predictable DL training in resource-constrained environments.
zh

[AI-112] Improving World Models using Deep Supervision with Linear Probes ICLR2025

【速读】:该论文旨在解决如何通过有效的监督技术提升人工智能代理的世界模型(World Model)性能。论文的关键在于引入了一种深度监督方法,通过在神经网络的损失函数中添加线性探针(Linear Probe)组件,鼓励网络在其隐藏状态中编码真实环境的部分底层特征。这种方法不仅提高了训练和测试性能,增强了训练稳定性,还使世界特征更易于解码,并减少了分布漂移(Distribution Drift),特别是在游戏高变异性阶段。此外,实验表明,这一技术的效果大致相当于将模型规模扩大一倍,尤其适用于计算资源受限或追求小型化高效模型的场景。

链接: https://arxiv.org/abs/2504.03861
作者: Andrii Zahorodnii
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICLR 2025 Workshop on World Models

点击查看摘要

Abstract:Developing effective world models is crucial for creating artificial agents that can reason about and navigate complex environments. In this paper, we investigate a deep supervision technique for encouraging the development of a world model in a network trained end-to-end to predict the next observation. While deep supervision has been widely applied for task-specific learning, our focus is on improving the world models. Using an experimental environment based on the Flappy Bird game, where the agent receives only LIDAR measurements as observations, we explore the effect of adding a linear probe component to the network’s loss function. This additional term encourages the network to encode a subset of the true underlying world features into its hidden state. Our experiments demonstrate that this supervision technique improves both training and test performance, enhances training stability, and results in more easily decodable world features – even for those world features which were not included in the training. Furthermore, we observe a reduced distribution drift in networks trained with the linear probe, particularly during high-variability phases of the game (flying between successive pipe encounters). Including the world features loss component roughly corresponded to doubling the model size, suggesting that the linear probe technique is particularly beneficial in compute-limited settings or when aiming to achieve the best performance with smaller models. These findings contribute to our understanding of how to develop more robust and sophisticated world models in artificial agents, paving the way for further advancements in this field.
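线性探针深度监督的核心思想——在原有下一观测预测损失上叠加一项由线性探针产生的世界特征回归损失——可以写成如下纯 Python 草图(函数名、λ 取值与数据组织方式均为示意性假设,并非论文原实现):

```python
def mse(a, b):
    """逐元素均方误差(a、b 为同形状的嵌套列表)。"""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2
               for ra, rb in zip(a, b)
               for x, y in zip(ra, rb)) / n

def probe_augmented_loss(hidden, pred_obs, true_obs, true_feats,
                         probe_w, probe_b, lam=0.1):
    """总损失 = 下一观测预测损失 + lam * 线性探针损失。

    hidden: 隐藏状态列表(batch × d);probe_w: d × k 权重(嵌套列表)。
    探针损失项促使隐藏状态线性可解码出真实世界特征 true_feats。
    """
    probe_out = [[sum(h[i] * probe_w[i][j] for i in range(len(h))) + probe_b[j]
                  for j in range(len(probe_b))]
                 for h in hidden]
    return mse(pred_obs, true_obs) + lam * mse(probe_out, true_feats)
```

训练时探针与主网络一同优化;按论文的观察,这一附加项在小模型或算力受限场景下收益尤其明显。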
zh

[AI-113] Arti-“fickle” Intelligence: Using LLM s as a Tool for Inference in the Political and Social Sciences

【速读】:该论文试图解决的问题是如何将生成式大语言模型(LLMs)有效地应用于政治与社会科学研究,以促进对真实人类行为和关切的理解。论文的关键在于提出一套指导原则,用于评估LLMs在完成特定任务时的成功与失败,并探讨如何基于这些观察进行科学推理(scientific inference)。通过聚焦于验证模型输出等案例,论文强调了围绕模型性能建立可靠推断的重要性,从而推动社会科学领域共享知识的积累。

链接: https://arxiv.org/abs/2504.03822
作者: Lisa P. Argyle,Ethan C. Busby,Joshua R. Gubler,Bryce Hepner,Alex Lyman,David Wingate
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative large language models (LLMs) are incredibly useful, versatile, and promising tools. However, they will be of most use to political and social science researchers when they are used in a way that advances understanding about real human behaviors and concerns. To promote the scientific use of LLMs, we suggest that researchers in the political and social sciences need to remain focused on the scientific goal of inference. To this end, we discuss the challenges and opportunities related to scientific inference with LLMs, using validation of model output as an illustrative case for discussion. We propose a set of guidelines related to establishing the failure and success of LLMs when completing particular tasks, and discuss how we can make inferences from these observations. We conclude with a discussion of how this refocus will improve the accumulation of shared scientific knowledge about these tools and their uses in the social sciences.
zh

[AI-114] Exploring Various Sequential Learning Methods for Deformation History Modeling

【速读】:该论文试图解决的问题是如何确定适合预测基于变形历史(deformation history)的变形局部化(deformation localization)的最佳神经网络(Neural Network, NN)架构,特别是对比一维卷积(1D-convolutional)、循环(recurrent)以及Transformer基架构的表现。论文的关键解决方案在于不仅评估这些架构在预测任务中的性能,还深入分析了最佳架构在预测过程中的数学计算与实际物理性质所导致的实际值之间的不兼容性问题,以揭示潜在的根本原因并指导未来的研究方向。

链接: https://arxiv.org/abs/2504.03818
作者: Muhammed Adil Yatkin,Mihkel Korgesaar,Jani Romanoff,Umit Islak,Hasan Kurban
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: Engineering Applications of Neural Networks

点击查看摘要

Abstract:Current neural network (NN) models can learn patterns from data points with historical dependence. Specifically, in natural language processing (NLP), sequential learning has transitioned from recurrence-based architectures to transformer-based architectures. However, it is unknown which NN architectures will perform the best on datasets containing deformation history due to mechanical loading. Thus, this study ascertains the appropriateness of 1D-convolutional, recurrent, and transformer-based architectures for predicting deformation localization based on the earlier states in the form of deformation history. Following this investigation, the crucial incompatibility issues between the mathematical computation of the prediction process in the best-performing NN architectures and the actual values derived from the natural physical properties of the deformation paths are examined in detail.
zh

[AI-115] Hierarchically Encapsulated Representation for Protocol Design in Self-Driving Labs ICLR’25

【速读】:该论文试图解决科学实验协议自动化设计中因缺乏系统化知识表示而导致的大型语言模型等知识型机器设计者能力未被充分激发的问题。解决方案的关键在于提出了一种多维度、多尺度的知识表示方法,通过领域特定语言(Domain-Specific Languages, DSL)将实例动作、通用操作及产物流动模型分层封装,并进一步开发了一种基于非参数建模的数据驱动算法,以自主定制这些表示方法以适配特定领域。这种表示方法结合多种机器设计者,能够有效辅助规划、修改和调整实验协议设计任务,从而在科学探索的机器辅助过程中作为辅助模块有效补充大型语言模型的能力。

链接: https://arxiv.org/abs/2504.03810
作者: Yu-Zhe Shi,Mingchen Liu,Fanxu Meng,Qiao Xu,Zhangqian Bi,Kun He,Lecheng Ruan,Qining Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: In International Conference on Learning Representations (ICLR’25)

点击查看摘要

Abstract:Self-driving laboratories have begun to replace human experimenters in performing single experimental skills or predetermined experimental protocols. However, as the pace of idea iteration in scientific research has been intensified by Artificial Intelligence, the demand for rapid design of new protocols for new discoveries becomes evident. Efforts to automate protocol design have been initiated, but the capabilities of knowledge-based machine designers, such as Large Language Models, have not been fully elicited, probably due to the absence of a systematic representation of experimental knowledge, as opposed to isolated, flattened pieces of information. To tackle this issue, we propose a multi-faceted, multi-scale representation, where instance actions, generalized operations, and product flow models are hierarchically encapsulated using Domain-Specific Languages. We further develop a data-driven algorithm based on non-parametric modeling that autonomously customizes these representations for specific domains. The proposed representation is equipped with various machine designers to manage protocol design tasks, including planning, modification, and adjustment. The results demonstrate that the proposed method could effectively complement Large Language Models in the protocol design process, serving as an auxiliary module in the realm of machine-assisted scientific exploration.
zh

[AI-116] Drawing a Map of Elections

【速读】:本文旨在解决选举数据的可视化与分析问题,提出了一种名为“选举地图”(map of elections)的框架作为解决方案。其关键在于通过三个主要元素构建这一框架:(1) 一组选举数据集(即候选人间序数投票的集合),(2) 衡量这些选举之间相似性的方法,以及 (3) 将选举表示为二维欧几里得空间中的点,使得两个选举越相似,其对应的点就越接近。由于理想的同构交换距离(isomorphic swap distance)计算复杂度过高而不可行,本文提出了可多项式时间计算的位置距离(positionwise distance)作为替代,并利用Kamada-Kawai算法及两种备选方法进行二维空间表示。此外,通过实验验证了该框架的准确性和可靠性,并展示了通过不同标准(如获胜候选人的得分、基于整数线性规划的算法运行时间及特定算法的近似比)对选举进行着色以辅助实验结果分析的有效性。

链接: https://arxiv.org/abs/2504.03809
作者: Stanisław Szufa,Niclas Boehmer,Robert Bredereck,Piotr Faliszewski,Rolf Niedermeier,Piotr Skowron,Arkadii Slinko,Nimrod Talmon
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: Journal article merging results from arXiv:2105.07815 , arXiv:2407.11889 and Szufa et al., “Drawing a Map of Elections in the Space of Statistical Cultures”, AMMAS '20

点击查看摘要

Abstract:Our main contribution is the introduction of the map of elections framework. A map of elections consists of three main elements: (1) a dataset of elections (i.e., collections of ordinal votes over given sets of candidates), (2) a way of measuring similarities between these elections, and (3) a representation of the elections in the 2D Euclidean space as points, so that the more similar two elections are, the closer are their points. In our maps, we mostly focus on datasets of synthetic elections, but we also show an example of a map over real-life ones. To measure similarities, we would have preferred to use, e.g., the isomorphic swap distance, but this is infeasible due to its high computational complexity. Hence, we propose polynomial-time computable positionwise distance and use it instead. Regarding the representations in 2D Euclidean space, we mostly use the Kamada-Kawai algorithm, but we also show two alternatives. We develop the necessary theoretical results to form our maps and argue experimentally that they are accurate and credible. Further, we show how coloring the elections in a map according to various criteria helps in analyzing results of a number of experiments. In particular, we show colorings according to the scores of winning candidates or committees, running times of ILP-based winner determination algorithms, and approximation ratios achieved by particular algorithms.
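文中提出的位置距离(positionwise distance)可用如下纯 Python 草图说明:先为每个候选人统计其落在各名次上的频率向量,再在候选人的所有双射匹配上取总距离的最小值(此处为简化用 L1 距离比较行向量并暴力枚举匹配,原文对列向量默认使用 earth mover's distance,此草图仅适用于极小的候选人数):

```python
from itertools import permutations

def position_matrix(votes, m):
    """P[c][p] = 候选人 c 被排在第 p 位的选票比例。"""
    P = [[0.0] * m for _ in range(m)]
    for vote in votes:
        for pos, cand in enumerate(vote):
            P[cand][pos] += 1.0 / len(votes)
    return P

def positionwise_distance(votes_a, votes_b, m):
    """在所有候选人匹配上最小化位置频率行向量的距离之和。"""
    A, B = position_matrix(votes_a, m), position_matrix(votes_b, m)

    def row_dist(r, s):
        return sum(abs(x - y) for x, y in zip(r, s))

    return min(sum(row_dist(A[c], B[perm[c]]) for c in range(m))
               for perm in permutations(range(m)))
```

由于取了匹配上的最小值,该距离对候选人重命名不敏感,这正是它可以替代(计算上不可行的)同构交换距离的原因之一。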
zh

[AI-117] Semantic-guided Representation Learning for Multi-Label Recognition ICME2025

【速读】:该论文旨在解决多标签识别(Multi-label Recognition, MLR)任务中因标注不完整或存在未见类别而导致的不确定注释问题,以及现有基于视觉-语言预训练(Vision and Language Pre-training, VLP)方法在探索多标签语义相关性和增强视觉特征语义信息方面的不足。论文的关键创新在于提出了一种语义引导表征学习方法(Semantic-guided Representation Learning, SigRL),通过引入基于图的多标签关联模块(Graph-based Multi-label Correlation Module, GMC)促进标签间的信息交互以丰富语义表示,并设计语义视觉特征重构模块(Semantic Visual Feature Reconstruction Module, SVFR)利用文本表征增强视觉特征的语义信息。此外,通过优化图像-文本匹配能力结合局部与全局特征实现零样本多标签识别任务。实验结果验证了所提方法在多个基准数据集上的优越性能。

链接: https://arxiv.org/abs/2504.03801
作者: Ruhui Zhang,Hezhe Qiao,Pengcheng Xu,Mingsheng Shang,Lin Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted in ICME2025

点击查看摘要

Abstract:Multi-label Recognition (MLR) involves assigning multiple labels to each data instance in an image, offering advantages over single-label classification in complex scenarios. However, it faces the challenge of annotating all relevant categories, often leading to uncertain annotations, such as unseen or incomplete labels. Recent Vision and Language Pre-training (VLP) based methods have made significant progress in tackling zero-shot MLR tasks by leveraging rich vision-language correlations. However, the correlation between multi-label semantics has not been fully explored, and the learned visual features often lack essential semantic information. To overcome these limitations, we introduce a Semantic-guided Representation Learning approach (SigRL) that enables the model to learn effective visual and textual representations, thereby improving the downstream alignment of visual images and categories. Specifically, we first introduce a graph-based multi-label correlation module (GMC) to facilitate information exchange between labels, enriching the semantic representation across the multi-label texts. Next, we propose a Semantic Visual Feature Reconstruction module (SVFR) to enhance the semantic information in the visual representation by integrating the learned textual representation during reconstruction. Finally, we optimize the image-text matching capability of the VLP model using both local and global features to achieve zero-shot MLR. Comprehensive experiments are conducted on several MLR benchmarks, encompassing both zero-shot MLR (with unseen labels) and single positive multi-label learning (with limited labels), demonstrating the superior performance of our approach compared to state-of-the-art methods. The code is available at this https URL.
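GMC 模块所体现的“在标签之间交换信息”的思想,可用标签共现图上的一步消息传递来示意(纯 Python 草图,共现矩阵构造与混合系数 alpha 均为假设,并非 SigRL 原实现):

```python
def cooccurrence_graph(label_sets, n_labels):
    """由训练集标注构建行归一化的标签共现矩阵。"""
    A = [[0.0] * n_labels for _ in range(n_labels)]
    for labels in label_sets:
        for i in labels:
            for j in labels:
                if i != j:
                    A[i][j] += 1.0
    for i in range(n_labels):
        s = sum(A[i])
        if s > 0:
            A[i] = [x / s for x in A[i]]
    return A

def propagate(embeddings, A, alpha=0.5):
    """一步消息传递:将每个标签的嵌入与其图邻居的嵌入按 alpha 混合,
    使共现相关的标签共享语义信息。"""
    n, d = len(embeddings), len(embeddings[0])
    out = []
    for i in range(n):
        msg = [sum(A[i][j] * embeddings[j][k] for j in range(n)) for k in range(d)]
        out.append([(1 - alpha) * embeddings[i][k] + alpha * msg[k] for k in range(d)])
    return out
```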
zh

[AI-118] Decision SpikeFormer: Spike-Driven Transformer for Decision Making CVPR2025

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, RL)中基于人工神经网络(Artificial Neural Networks, ANN)方法计算和能耗过高的问题。为降低能耗并保持高效性能,论文探索了尖峰神经网络(Spiking Neural Networks, SNNs)在该领域的应用潜力。解决方案的关键在于提出DSFormer,这是一种基于尖峰驱动的Transformer模型,专为通过序列建模解决离线RL任务而设计。DSFormer引入了时态尖峰自注意力机制(Temporal Spiking Self-Attention, TSSA)和位置尖峰自注意力机制(Positional Spiking Self-Attention, PSSA),以捕获强化学习所需的时序和位置依赖性。此外,提出了渐进阈值相关批归一化(Progressive Threshold-dependent Batch Normalization, PTBN),结合层归一化(LayerNorm)和批归一化的优点,在保持SNN尖峰特性的同时保留时序依赖性。实验结果表明,DSFormer在D4RL基准测试中不仅实现了78.4%的能耗节省,还在性能上优于传统SNN和ANN方法。

链接: https://arxiv.org/abs/2504.03800
作者: Wei Huang,Qinying Gu,Nanyang Ye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: This work has been accepted to CVPR 2025

点击查看摘要

Abstract:Offline reinforcement learning (RL) enables policy training solely on pre-collected data, avoiding direct environment interaction - a crucial benefit for energy-constrained embodied AI applications. Although Artificial Neural Networks (ANN)-based methods perform well in offline RL, their high computational and energy demands motivate exploration of more efficient alternatives. Spiking Neural Networks (SNNs) show promise for such tasks, given their low power consumption. In this work, we introduce DSFormer, the first spike-driven transformer model designed to tackle offline RL via sequence modeling. Unlike existing SNN transformers focused on spatial dimensions for vision tasks, we develop Temporal Spiking Self-Attention (TSSA) and Positional Spiking Self-Attention (PSSA) in DSFormer to capture the temporal and positional dependencies essential for sequence modeling in RL. Additionally, we propose Progressive Threshold-dependent Batch Normalization (PTBN), which combines the benefits of LayerNorm and BatchNorm to preserve temporal dependencies while maintaining the spiking nature of SNNs. Comprehensive results in the D4RL benchmark show DSFormer’s superiority over both SNN and ANN counterparts, achieving 78.4% energy savings, highlighting DSFormer’s advantages not only in energy efficiency but also in competitive performance. Code and models are public at this https URL.
zh

[AI-119] An Intelligent and Privacy-Preserving Digital Twin Model for Aging-in-Place

【速读】:本文旨在解决支持老年人居家养老(aging-in-place)这一全球性挑战,重点在于通过技术手段克服数据隐私、健康监测及居住环境适应性等多方面的复杂需求。论文的关键解决方案是提出了一种非侵入式传感器系统,该系统能够安装在老年人家中,利用传感器数据构建数字孪生体(digital twin),以虚拟方式再现家庭中的事件与活动。系统通过神经网络模型和决策规则捕捉居民的行为及生活环境,并基于此提供可操作的健康洞察,实现持续性的健康监测。关键创新点在于系统的低成本设计、隐私保护机制以及绿色安全的健康监测能力,从而为老年人的独立生活提供技术支持。实验结果显示,数字孪生技术在实际部署中的可行性和有效性,表明该系统可通过个性化干预(如生活方式调整、医疗治疗或居住环境优化)显著改善老年人的健康结果。

链接: https://arxiv.org/abs/2504.03798
作者: Yongjie Wang,Jonathan Cyril Leung,Ming Chen,Zhiwei Zeng,Benny Toh Hsiang Tan,Yang Qiu,Zhiqi Shen
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: accepted to IEEE TENSYMP 2025

点击查看摘要

Abstract:The population of older adults is steadily increasing, with a strong preference for aging-in-place rather than moving to care facilities. Consequently, supporting this growing demographic has become a significant global challenge. However, facilitating successful aging-in-place is challenging, requiring consideration of multiple factors such as data privacy, health status monitoring, and living environments to improve health outcomes. In this paper, we propose an unobtrusive sensor system designed for installation in older adults’ homes. Using data from the sensors, our system constructs a digital twin, a virtual representation of events and activities that occurred in the home. The system uses neural network models and decision rules to capture residents’ activities and living environments. This digital twin enables continuous health monitoring by providing actionable insights into residents’ well-being. Our system is designed to be low-cost and privacy-preserving, with the aim of providing green and safe monitoring for the health of older adults. We have successfully deployed our system in two homes over a time period of two months, and our findings demonstrate the feasibility and effectiveness of digital twin technology in supporting independent living for older adults. This study highlights that our system could revolutionize elder care by enabling personalized interventions, such as lifestyle adjustments, medical treatments, or modifications to the residential environment, to enhance health outcomes.
zh

[AI-120] Outlook Towards Deployable Continual Learning for Particle Accelerators

【速读】:该论文旨在解决粒子加速器中机器学习(Machine Learning, ML)模型因数据分布漂移而导致性能下降的问题。粒子加速器作为高功率复杂设备,需要数千个部件同步运行,这对模型的设计、优化、控制以及异常检测提出了极高要求。尽管已有若干基于ML的应用在粒子加速器中开发并部署,但长期有效使用受限于可测量和不可测量参数变化引起的分布漂移问题。论文的关键在于探索连续学习(Continual Learning)技术在应对这些分布漂移方面的潜力,通过识别其在粒子加速器中的适用场景与挑战,提出能够维持ML模型性能的方法,以推动可部署的连续学习技术在该领域的研究与应用。

链接: https://arxiv.org/abs/2504.03793
作者: Kishansingh Rajput,Sen Lin,Auralee Edelen,Willem Blokland,Malachi Schram
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 41 pages, 6 figures, submitted to Machine Learning: Science and Technology Journal

点击查看摘要

Abstract:Particle Accelerators are high power complex machines. To ensure uninterrupted operation of these machines, thousands of pieces of equipment need to be synchronized, which requires addressing many challenges including design, optimization and control, anomaly detection and machine protection. With recent advancements, Machine Learning (ML) holds promise to assist in more advanced prognostics, optimization, and control. While ML based solutions have been developed for several applications in particle accelerators, only a few have reached deployment and even fewer long-term usage, due to particle accelerator data distribution drifts caused by changes in both measurable and non-measurable parameters. In this paper, we identify some of the key areas within particle accelerators where continual learning can allow maintenance of ML model performance with distribution drifts. Particularly, we first discuss existing applications of ML in particle accelerators, and their limitations due to distribution drift. Next, we review existing continual learning techniques and investigate their potential applications to address data distribution drifts in accelerators. By identifying the opportunities and challenges in applying continual learning, this paper seeks to open up the new field and inspire more research efforts towards deployable continual learning for particle accelerators.

[AI-121] DP-LET: An Efficient Spatio-Temporal Network Traffic Prediction Framework

【速读】:该论文旨在解决现代通信系统中动态管理计算资源和最小化能耗所需的精确时空网络流量预测问题。现有方法在捕捉局部和全局特征相关性时往往带来较高开销,因此需要新的方法以优化预测精度和复杂度。论文提出了一种高效的时空网络流量预测框架DP-LET,其关键在于结合数据处理模块(用于高效去噪和空间解耦)、局部特征增强模块(利用多个Temporal Convolutional Networks捕获细粒度局部特征)以及基于Transformer的预测模块(通过Transformer编码器建模长期依赖并评估特征相关性)。实验表明,DP-LET在保持低计算复杂度的同时实现了最先进的性能,相比基线模型分别将均方误差(MSE)降低了31.8%,平均绝对误差(MAE)降低了23.1%。

链接: https://arxiv.org/abs/2504.03792
作者: Xintong Wang,Haihan Nan,Ruidong Li,Huaming Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages

点击查看摘要

Abstract:Accurately predicting spatio-temporal network traffic is essential for dynamically managing computing resources in modern communication systems and minimizing energy consumption. Although spatio-temporal traffic prediction has received extensive research attention, further improvements in prediction accuracy and computational efficiency remain necessary. In particular, existing decomposition-based methods or hybrid architectures often incur heavy overhead when capturing local and global feature correlations, necessitating novel approaches that optimize accuracy and complexity. In this paper, we propose an efficient spatio-temporal network traffic prediction framework, DP-LET, which consists of a data processing module, a local feature enhancement module, and a Transformer-based prediction module. The data processing module is designed for high-efficiency denoising of network data and spatial decoupling. In contrast, the local feature enhancement module leverages multiple Temporal Convolutional Networks (TCNs) to capture fine-grained local features. Meanwhile, the prediction module utilizes a Transformer encoder to model long-term dependencies and assess feature relevance. A case study on real-world cellular traffic prediction demonstrates the practicality of DP-LET, which maintains low computational complexity while achieving state-of-the-art performance, significantly reducing MSE by 31.8% and MAE by 23.1% compared to baseline models.
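
上文提到的局部特征增强模块依赖 TCN 捕获细粒度局部特征。下面用 numpy 给出 TCN 基本构件——因果膨胀卷积——的极简示意;卷积核与膨胀系数均为演示假设,并非论文实现:

```python
import numpy as np

def causal_conv1d(x, kernel, dilation=1):
    """因果膨胀一维卷积:out[t] 只依赖 x[t], x[t-d], ..., 绝不窥视未来。
    kernel[0] 作用于最旧的抽头,kernel[-1] 作用于当前时刻。"""
    k = len(kernel)
    pad = dilation * (k - 1)          # 左侧补零,保证因果性
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    out = np.empty(len(x))
    for t in range(len(x)):
        taps = xp[t : t + pad + 1 : dilation]   # 对应原坐标 t-pad ... t
        out[t] = np.dot(taps, kernel)
    return out

# 堆叠不同膨胀率的两层即可指数级扩大感受野(TCN 的常见做法)
x = np.arange(6, dtype=float)
y = causal_conv1d(causal_conv1d(x, [0.5, 0.5], dilation=1), [0.5, 0.5], dilation=2)
```

完整模型中多个这样的 TCN 分支并行提取局部模式,再交给 Transformer 编码器建模长程依赖。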

[AI-122] Explainable and Interpretable Forecasts on Non-Smooth Multivariate Time Series for Responsible Gameplay

【速读】:该论文旨在解决多变量时间序列(MTS)预测在高风险场景下的准确性与可解释性之间的矛盾问题。具体而言,在预测影响心理健康的游戏过度沉迷行为时,仅提供准确的预测结果而缺乏解释性是无意义的。因此,研究强调预测需具备可解释性(即中间预测轨迹易于理解)和可操作性(即能够访问引起预测的关键输入特征和事件以实现个性化及时干预)。然而,现有针对可解释性的研究主要集中在平滑单过程驱动的时间序列数据上,而在线多人游戏数据由于玩家游戏结果与其继续参与意图之间存在内在正交性,表现出难以处理的时间随机性。为此,论文提出了一种新颖的深度可操作预测网络(Actionable Forecasting Network, AFN),其关键在于同时满足三个互相关联的目标:提高预测准确性、生成平滑且易懂的轨迹以及通过多维输入特征提供解释。AFN通过引入特定的可操作输入特征,不仅实现了比基于SOM-VAE的最先进网络在均方误差(MSE)上提升25%,还能将未来近期内过度沉迷玩家的数量减少18%,并且平均提前4周主动检测到23%(相比SOTA提升了100%)的潜在过度沉迷玩家。

链接: https://arxiv.org/abs/2504.03777
作者: Hussain Jagirdar,Rukma Talwadker,Aditya Pareek,Pulkit Agrawal,Tridib Mukherjee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-variate Time Series (MTS) forecasting has made large strides (with very negligible errors) through recent advancements in neural networks, e.g., Transformers. However, in critical situations like predicting gaming overindulgence that affects one's mental well-being, an accurate forecast without contributing evidence (explanation) is irrelevant. Hence, it becomes important that the forecasts are Interpretable - the intermediate representation of the forecasted trajectory is comprehensible; as well as Explainable - attentive input features and events are accessible for a personalized and timely intervention of players at risk. While the contributing state of the art research on interpretability primarily focuses on temporally-smooth single-process driven time series data, our online multi-player gameplay data demonstrates intractable temporal randomness due to intrinsic orthogonality between a player's game outcome and their intent to engage further. We introduce a novel deep Actionable Forecasting Network (AFN), which addresses the inter-dependent challenges associated with three exclusive objectives - 1) forecasting accuracy; 2) smooth comprehensible trajectory and 3) explanations via multi-dimensional input features while tackling the challenges introduced by our non-smooth temporal data, together in one single solution. AFN establishes a new benchmark via: (i) achieving 25% improvement on the MSE of the forecasts on player data in comparison to the SOM-VAE based SOTA networks; (ii) attributing unfavourable progression of a player's time series to specific future time step(s), with the premise of eliminating near-future overindulgent player volume by over 18% with player-specific actionable input feature(s) and (iii) proactively detecting over 23% (a 100% jump over SOTA) of the to-be-overindulgent players, on average 4 weeks in advance.

[AI-123] Exploring energy consumption of AI frameworks on a 64-core RV64 Server CPU

【速读】:本文旨在解决人工智能(AI)应用在大规模、高性能和数据密集型计算中产生的显著能源需求问题。为应对这一挑战,研究提出结合硬件与软件创新的综合方法。硬件方面,RISC-V 架构因其开放性、可扩展性和能效指令集架构(ISA)成为重要方向;软件方面,则需优化算法和框架以提高其能效。研究的关键在于通过全面基准测试分析机器学习(ML)工作负载在 64 核 SOPHON SG2042 RISC-V 架构上的性能表现,并重点评估深度学习推理模型在三种主流 AI 框架(PyTorch、ONNX Runtime 和 TensorFlow)下的能耗特性。结果显示,采用 XNNPACK 后端的框架(如 ONNX Runtime 和 TensorFlow)相较于使用原生 OpenBLAS 后端的 PyTorch,在能效上更具优势。

链接: https://arxiv.org/abs/2504.03774
作者: Giulio Malenza,Francesco Targa,Adriano Marques Garcia,Marco Aldinucci,Robert Birke
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:In today’s era of rapid technological advancement, artificial intelligence (AI) applications require large-scale, high-performance, and data-intensive computations, leading to significant energy demands. Addressing this challenge necessitates a combined approach involving both hardware and software innovations. Hardware manufacturers are developing new, efficient, and specialized solutions, with the RISC-V architecture emerging as a prominent player due to its open, extensible, and energy-efficient instruction set architecture (ISA). Simultaneously, software developers are creating new algorithms and frameworks, yet their energy efficiency often remains unclear. In this study, we conduct a comprehensive benchmark analysis of machine learning (ML) applications on the 64-core SOPHON SG2042 RISC-V architecture. We specifically analyze the energy consumption of deep learning inference models across three leading AI frameworks: PyTorch, ONNX Runtime, and TensorFlow. Our findings show that frameworks using the XNNPACK back-end, such as ONNX Runtime and TensorFlow, consume less energy compared to PyTorch, which is compiled with the native OpenBLAS back-end.

[AI-124] SHapley Estimated Explanation (SHEP): A Fast Post-Hoc Attribution Method for Interpreting Intelligent Fault Diagnosis

【速读】:该论文旨在解决智能故障诊断(Intelligent Fault Diagnosis, IFD)领域中因模型缺乏可解释性而导致的实际工业应用受限的问题。论文聚焦于后验可解释性方法,这类方法虽然能够保持网络的灵活性与可扩展性,但通常在时域解释上表现欠佳。此外,结合领域变换与SHAP虽提升了可解释性,但SHAP高昂的计算成本及其在维度增加后的进一步恶化成为主要挑战。为此,论文提出了一种基于Patch-wise归因与SHapley Estimated Explanation (SHEP) 的解决方案。关键在于,Patch-wise归因通过降低特征维度来平衡解释粒度,而SHEP简化了子集枚举过程以近似SHAP,将计算复杂度从指数级降至线性级。这些方法共同显著提高了SHAP的计算效率,为实时监控任务中的可解释性提供了可行性。实验验证了SHEP在效率、可解释性和可靠性方面的优势,并通过开源代码展示了其作为IFD领域后验可解释性基准的潜力。

链接: https://arxiv.org/abs/2504.03773
作者: Qian Chen,Xingjian Dong,Zhike Peng,Guang Meng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 21 figures

点击查看摘要

Abstract:Despite significant progress in intelligent fault diagnosis (IFD), the lack of interpretability remains a critical barrier to practical industrial applications, driving the growth of interpretability research in IFD. Post-hoc interpretability has gained popularity due to its ability to preserve network flexibility and scalability without modifying model structures. However, these methods often yield suboptimal time-domain explanations. Recently, combining domain transform with SHAP has improved interpretability by extending explanations to more informative domains. Nonetheless, the computational expense of SHAP, exacerbated by increased dimensions from domain transforms, remains a major challenge. To address this, we propose patch-wise attribution and SHapley Estimated Explanation (SHEP). Patch-wise attribution reduces feature dimensions at the cost of explanation granularity, while SHEP simplifies subset enumeration to approximate SHAP, reducing complexity from exponential to linear. Together, these methods significantly enhance SHAP’s computational efficiency, providing feasibility for real-time interpretation in monitoring tasks. Extensive experiments confirm SHEP’s efficiency, interpretability, and reliability in approximating SHAP. Additionally, with open-source code, SHEP has the potential to serve as a benchmark for post-hoc interpretability in IFD. The code is available on this https URL.
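
SHEP 的核心是把 SHAP 的指数级子集枚举简化为线性复杂度。下面的 numpy 草图对比了小规模下的精确 Shapley 值与一种"只取两端边际贡献取平均"的线性近似;这只是对"简化子集枚举"思路的示意性解读(mask 方式、基线取法均为假设),并非论文中 SHEP 的真实算法:

```python
import itertools, math
import numpy as np

def mask(x, baseline, keep):
    """只保留 keep 中的 patch,其余 patch 用基线值代替。"""
    z = baseline.copy()
    z[list(keep)] = x[list(keep)]
    return z

def shapley_exact(f, x, baseline):
    """精确 Shapley 值:对每个 patch 枚举全部子集,复杂度 O(2^n)。"""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                phi[i] += w * (f(mask(x, baseline, S + (i,))) - f(mask(x, baseline, S)))
    return phi

def marginal_approx(f, x, baseline):
    """线性复杂度近似:只计算"加入空集"与"从全集移除"两个极端边际贡献的平均。"""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        lo = f(mask(x, baseline, (i,))) - f(baseline)
        hi = f(x) - f(mask(x, baseline, tuple(j for j in range(n) if j != i)))
        phi[i] = 0.5 * (lo + hi)
    return phi
```

对可加模型,两者给出相同的归因;patch-wise 归因则相当于先把像素/频点分组为 patch,再在 patch 粒度上执行上述计算,进一步压缩 n。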

[AI-125] Flow State: Humans Enabling AI Systems to Program Themselves

【速读】:该论文探讨了复合型人工智能(Compound AI)系统在管理复杂性、处理模糊性和支持高效开发工作流方面的挑战。现有框架通常引入显著的开销、隐式复杂性或限制性的抽象,这阻碍了系统的可维护性和迭代优化,特别是在人机协作环境中。论文提出,克服这些障碍需要一种以结构清晰性和显式控制为核心的架构基础。为此,作者介绍了Pocketflow平台,这是一个以人机协同设计为中心的AI开发框架。Pocketflow的关键在于其极简但协同的核心抽象集:具有严格生命周期的模块化节点(Nodes)、声明式的流程编排(Flow orchestration)、原生的层级嵌套(Flow-as-Node)以及显式的基于动作的条件逻辑。这种独特的组合提供了一个健壮且与供应商无关的基础架构,用极少的代码实现了低开销同时具备表达复杂模式(如主动工作流和RAG)的能力。结合Pocket AI这一利用此架构进行系统设计的辅助工具,Pocketflow为现代企业所需的适应性强、可扩展的AI系统的原型设计、优化和部署提供了有效的环境。

链接: https://arxiv.org/abs/2504.03771
作者: Helena Zhang,Jakobi Haskell,Yosef Frost
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 4 figures, 5 tables. Describes a minimalist framework for human-AI co-design of compound AI systems

点击查看摘要

Abstract:Compound AI systems, orchestrating multiple AI components and external APIs, are increasingly vital but face challenges in managing complexity, handling ambiguity, and enabling effective development workflows. Existing frameworks often introduce significant overhead, implicit complexity, or restrictive abstractions, hindering maintainability and iterative refinement, especially in Human-AI collaborative settings. We argue that overcoming these hurdles requires a foundational architecture prioritizing structural clarity and explicit control. To this end, we introduce Pocketflow, a platform centered on Human-AI co-design, enabled by Pocketflow. Pocketflow is a Python framework built upon a deliberately minimal yet synergistic set of core abstractions: modular Nodes with a strict lifecycle, declarative Flow orchestration, native hierarchical nesting (Flow-as-Node), and explicit action-based conditional logic. This unique combination provides a robust, vendor-agnostic foundation with very little code that demonstrably reduces overhead while offering the expressiveness needed for complex patterns like agentic workflows and RAG. Complemented by Pocket AI, an assistant leveraging this structure for system design, Pocketflow provides an effective environment for iteratively prototyping, refining, and deploying the adaptable, scalable AI systems demanded by modern enterprises.
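
摘要中描述的核心抽象(带严格生命周期的 Node、声明式 Flow 编排、Flow-as-Node 嵌套、基于动作字符串的条件跳转)可以用几十行 Python 勾勒出来。以下类名与方法名(prep/exec/post 等)系按摘要描述推测的示意,并非 Pocketflow 的官方 API:

```python
class Node:
    """节点的严格生命周期:prep -> exec -> post。
    post() 返回一个动作字符串,供 Flow 做显式条件跳转。"""
    def prep(self, shared): return None
    def exec(self, prep_res): return None
    def post(self, shared, prep_res, exec_res): return "default"
    def run(self, shared):
        p = self.prep(shared)
        return self.post(shared, p, self.exec(p))

class Flow(Node):
    """声明式编排:successors 把 (节点, 动作) 映射到下一个节点。
    Flow 本身也是 Node,因此可以层级嵌套(Flow-as-Node)。"""
    def __init__(self, start, successors):
        self.start, self.successors = start, successors
    def run(self, shared):
        node, action = self.start, "default"
        while node is not None:
            action = node.run(shared)
            node = self.successors.get((node, action))
        return action

# 示例:根据 Check 返回的动作字符串走不同分支
class Check(Node):
    def post(self, shared, p, e): return "big" if shared["x"] > 10 else "small"
class Double(Node):
    def post(self, shared, p, e): shared["x"] *= 2; return "default"
class Halve(Node):
    def post(self, shared, p, e): shared["x"] //= 2; return "default"

check, double, halve = Check(), Double(), Halve()
flow = Flow(check, {(check, "small"): double, (check, "big"): halve})
```

由于 Flow 继承自 Node,上面的 flow 可以直接作为另一个 Flow 的节点使用,这正是"原生层级嵌套"的含义。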

[AI-126] JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model

【速读】:该论文旨在解决多模态大型语言模型(MLLMs)在生成有害内容方面的风险问题,特别是通过越狱攻击(jailbreak attacks)导致的不当或不安全内容生成。现有的越狱检测方法面临三大挑战:依赖白盒模型的隐藏状态或梯度(限制其适用性)、基于不确定性分析的高计算开销(影响实时检测能力),以及需要完全标注的有害数据集(在实际场景中通常稀缺)。为应对这些问题,论文提出了一种名为JAILDAM的测试时自适应框架。该方案的关键在于采用基于内存的方法,并通过策略驱动的不安全知识表示来引导,无需显式接触有害数据即可实现检测。通过在测试时动态更新不安全知识,JAILDAM不仅提升了对未知越狱策略的泛化能力,还保持了高效性。实验结果表明,JAILDAM在多个视觉-语言模型的越狱基准测试中实现了最先进的有害内容检测性能,同时提高了准确性和速度。

链接: https://arxiv.org/abs/2504.03770
作者: Yi Nian,Shenzhe Zhu,Yuehan Qin,Li Li,Ziyi Wang,Chaowei Xiao,Yue Zhao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) excel in vision-language tasks but also pose significant risks of generating harmful content, particularly through jailbreak attacks. Jailbreak attacks refer to intentional manipulations that bypass safety mechanisms in models, leading to the generation of inappropriate or unsafe content. Detecting such attacks is critical to ensuring the responsible deployment of MLLMs. Existing jailbreak detection methods face three primary challenges: (1) Many rely on model hidden states or gradients, limiting their applicability to white-box models, where the internal workings of the model are accessible; (2) They involve high computational overhead from uncertainty-based analysis, which limits real-time detection, and (3) They require fully labeled harmful datasets, which are often scarce in real-world settings. To address these issues, we introduce a test-time adaptive framework called JAILDAM. Our method leverages a memory-based approach guided by policy-driven unsafe knowledge representations, eliminating the need for explicit exposure to harmful data. By dynamically updating unsafe knowledge during test-time, our framework improves generalization to unseen jailbreak strategies while maintaining efficiency. Experiments on multiple VLM jailbreak benchmarks demonstrate that JAILDAM delivers state-of-the-art performance in harmful content detection, improving both accuracy and speed.
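
JAILDAM 的"基于内存 + 测试时更新"思路可以用一个极简的向量记忆库示意:将查询嵌入与各不安全概念向量做余弦相似度比较,超过阈值即判定为越狱尝试,并在测试时把最相近的记忆项向该查询微调。阈值、更新步长均为演示假设,嵌入此处用普通向量代替真实的 VLM 编码:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

class UnsafeMemory:
    """不安全概念向量记忆库:查询与任一记忆项的余弦相似度超过阈值
    即判定为越狱尝试;随后将最相近的记忆项向该查询微调,
    粗略模拟论文中的测试时知识更新。"""
    def __init__(self, concepts, threshold=0.8, lr=0.1):
        self.mem = np.stack([unit(c) for c in concepts])
        self.threshold, self.lr = threshold, lr

    def detect(self, query):
        q = unit(query)
        sims = self.mem @ q                  # 与所有记忆项的余弦相似度
        flagged = bool(np.max(sims) > self.threshold)
        if flagged:                          # 测试时更新:拉近最相近的记忆项
            i = int(np.argmax(sims))
            self.mem[i] = unit(self.mem[i] + self.lr * q)
        return flagged
```

这种方式全程不需要接触有害数据的标注集:记忆库由策略文本驱动的概念表示初始化,检测与更新都只依赖嵌入相似度。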

[AI-127] MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits

【速读】:该论文旨在解决生成式 AI (Generative AI) 应用中基于 Model Context Protocol (MCP) 的通用代理工作流存在的广泛安全风险问题。论文指出,当前的 MCP 设计可能被诱导利用其工具来执行恶意代码、实现远程访问控制以及窃取凭据等攻击,从而威胁到 AI 开发者的系统安全。为积极应对这些潜在威胁,论文提出了一种名为 MCPSafetyScanner 的安全审计工具,这是首个专门用于评估任意 MCP 服务器安全性的代理工具。MCPSafetyScanner 通过多个智能体(agents)自动识别对抗样本,搜索相关漏洞及修复方案,并生成详细的漏洞报告,从而在部署前主动检测和缓解安全隐患。该工具的关键创新在于其能够系统性地评估和提升 MCP 服务器的安全性,为保障生成式 AI 系统的运行环境提供了重要的技术手段。

链接: https://arxiv.org/abs/2504.03767
作者: Brandon Radosevich,John Halloran
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:To reduce development overhead and enable seamless integration between potential components comprising any given generative AI application, the Model Context Protocol (MCP) (Anthropic, 2024) has recently been released and subsequently widely adopted. The MCP is an open protocol that standardizes API calls to large language models (LLMs), data sources, and agentic tools. By connecting multiple MCP servers, each defined with a set of tools, resources, and prompts, users are able to define automated workflows fully driven by LLMs. However, we show that the current MCP design carries a wide range of security risks for end users. In particular, we demonstrate that industry-leading LLMs may be coerced into using MCP tools to compromise an AI developer's system through various attacks, such as malicious code execution, remote access control, and credential theft. To proactively mitigate these and related attacks, we introduce a safety auditing tool, MCPSafetyScanner, the first agentic tool to assess the security of an arbitrary MCP server. MCPSafetyScanner uses several agents to (a) automatically determine adversarial samples given an MCP server's tools and resources; (b) search for related vulnerabilities and remediations based on those samples; and (c) generate a security report detailing all findings. Our work highlights serious security issues with general-purpose agentic workflows while also providing a proactive tool to audit MCP server safety and address detected vulnerabilities before deployment. The described MCP server auditing tool, MCPSafetyScanner, is freely available at: this https URL

[AI-128] Efficient Calibration for RRAM-based In-Memory Computing using DoRA

【速读】:该论文试图解决Resistive In-Memory Computing (RIMC) 在边缘人工智能(edge AI)应用中因RRAM电导漂移导致的精度下降问题。传统的重新训练方法受限于RRAM的高能耗、写入延迟和耐久性约束,难以有效应对这一挑战。论文的关键解决方案是提出了一种基于DoRA的校准框架,通过在SRAM中仅存储少量校准参数来补偿重要权重,而无需修改RRAM权重本身。这种方法避免了现场RRAM写操作,确保了校准过程的能量效率、快速性和可靠性。实验结果表明,在基于RIMC的ResNet50(ImageNet-1K)模型上,使用仅10个校准样本即可恢复69.53%的精度,同时仅更新了2.34%的参数。

链接: https://arxiv.org/abs/2504.03763
作者: Weirong Dong,Kai Zhou,Zhen Kong,Quan Cheng,Junkai Huang,Zhengke Yang,Masanori Hashimoto,Longyang Lin
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 6 figures

点击查看摘要

Abstract:Resistive In-Memory Computing (RIMC) offers ultra-efficient computation for edge AI but faces accuracy degradation due to RRAM conductance drift over time. Traditional retraining methods are limited by RRAM’s high energy consumption, write latency, and endurance constraints. We propose a DoRA-based calibration framework that restores accuracy by compensating influential weights with minimal calibration parameters stored in SRAM, leaving RRAM weights untouched. This eliminates in-field RRAM writes, ensuring energy-efficient, fast, and reliable calibration. Experiments on RIMC-based ResNet50 (ImageNet-1K) demonstrate 69.53% accuracy restoration using just 10 calibration samples while updating only 2.34% of parameters.
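
DoRA 将权重分解为幅值与方向,并用低秩项修正方向。下面的 numpy 草图示意该框架下的参数关系:RRAM 中漂移后的权重 W 保持冻结,仅 SRAM 中的幅值向量 m 与低秩矩阵 B、A 参与校准;维度、秩等数值均为演示假设:

```python
import numpy as np

def dora_effective_weight(W_rram, m, B, A):
    """有效权重:冻结的(可能已漂移的)RRAM 权重 W 加上低秩方向修正 B@A,
    再把每一列缩放到 SRAM 中存放的幅值 m。W 本身从不被重写。"""
    V = W_rram + B @ A
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)
    return m * V / col_norm

d_out, d_in, r = 64, 64, 2                       # 维度与秩均为演示假设
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))            # 漂移后的电导映射,冻结
m = np.linalg.norm(W, axis=0, keepdims=True)      # 幅值按当前 W 的列范数初始化
B = np.zeros((d_out, r))                          # B 置零 => 初始有效权重等于 W
A = rng.standard_normal((r, d_in)) * 0.01
frac = (m.size + B.size + A.size) / W.size        # 需训练/存放在 SRAM 的参数占比
```

校准时只对 m、B、A 做梯度更新,因而完全避免现场 RRAM 写操作;上例中校准参数占比约 7.8%,与论文报告的"仅更新 2.34% 参数"同一量级(具体比例取决于层的形状与秩)。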

[AI-129] Emerging Cyber Attack Risks of Medical AI Agents

【速读】:该论文旨在研究大型语言模型(Large Language Models, LLMs)驱动的医疗人工智能代理在面对网络浏览工具访问互联网时所面临的网络安全风险。具体而言,论文关注的是这些代理因受到嵌入网页中的对抗性提示(adversarial prompts)而可能遭受的网络攻击漏洞问题。论文揭示了此类攻击可能导致的多种危害,包括向代理响应中注入虚假信息、迫使代理操纵推荐内容(如健康产品和服务)、窃取用户与代理之间的历史对话以泄露敏感或私人医疗信息,以及通过返回恶意URL导致计算机系统被劫持。为了验证这些问题的普遍性,研究测试了不同基础LLMs的安全性,发现大多数主流LLMs驱动的代理都容易受到这些攻击影响,其中基于DeepSeek-R1等推理模型的代理最为脆弱。

论文的关键解决方案在于识别并分析上述网络攻击的风险来源,并强调需要针对这些特定类型的对抗性提示采取防护措施,以增强医疗AI代理的安全性和鲁棒性。

链接: https://arxiv.org/abs/2504.03759
作者: Jianing Qiu,Lin Li,Jiankai Sun,Hao Wei,Zhe Xu,Kyle Lam,Wu Yuan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs)-powered AI agents exhibit a high level of autonomy in addressing medical and healthcare challenges. With the ability to access various tools, they can operate within an open-ended action space. However, with the increase in autonomy and ability, unforeseen risks also arise. In this work, we investigated one particular risk, i.e., cyber attack vulnerability of medical AI agents, as agents have access to the Internet through web browsing tools. We revealed that through adversarial prompts embedded on webpages, cyberattackers can: i) inject false information into the agent’s response; ii) they can force the agent to manipulate recommendation (e.g., healthcare products and services); iii) the attacker can also steal historical conversations between the user and agent, resulting in the leak of sensitive/private medical information; iv) furthermore, the targeted agent can also cause a computer system hijack by returning a malicious URL in its response. Different backbone LLMs were examined, and we found such cyber attacks can succeed in agents powered by most mainstream LLMs, with the reasoning models such as DeepSeek-R1 being the most vulnerable.

[AI-130] ProtoGCD: Unified and Unbiased Prototype Learning for Generalized Category Discovery

【速读】:该论文致力于解决广义类别发现(Generalized Category Discovery, GCD)这一实用但未被充分研究的问题,其核心挑战在于如何在包含旧类和新类的未标注数据中自动聚类并发现新类别。传统基于伪标签的方法因分别处理旧类和新类而导致两类别的准确率不平衡,而近期采用对比学习的方法则忽略了潜在的正样本且与聚类目标解耦,导致表示偏差和次优结果。为应对这些问题,论文提出了一种统一且无偏的原型学习框架ProtoGCD,通过联合原型和统一学习目标建模旧类和新类,实现两者之间的统一建模。关键解决方案包括引入双层自适应伪标签机制以减轻确认偏差,以及设计两个正则化项来共同帮助学习更适合GCD的表征。此外,为实际应用考虑,论文还提出了估计新类数量的标准,并将ProtoGCD扩展至检测未知离群值,实现了任务层面的统一。实验表明,ProtoGCD在通用和细粒度数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2504.03755
作者: Shijie Ma,Fei Zhu,Xu-Yao Zhang,Cheng-Lin Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE TPAMI 2025

点击查看摘要

Abstract:Generalized category discovery (GCD) is a pragmatic but underexplored problem, which requires models to automatically cluster and discover novel categories by leveraging the labeled samples from old classes. The challenge is that unlabeled data contain both old and new classes. Early works leveraging pseudo-labeling with parametric classifiers handle old and new classes separately, which brings about imbalanced accuracy between them. Recent methods employing contrastive learning neglect potential positives and are decoupled from the clustering objective, leading to biased representations and sub-optimal results. To address these issues, we introduce a unified and unbiased prototype learning framework, namely ProtoGCD, wherein old and new classes are modeled with joint prototypes and unified learning objectives, enabling unified modeling between old and new classes. Specifically, we propose a dual-level adaptive pseudo-labeling mechanism to mitigate confirmation bias, together with two regularization terms to collectively help learn more suitable representations for GCD. Moreover, for practical considerations, we devise a criterion to estimate the number of new classes. Furthermore, we extend ProtoGCD to detect unseen outliers, achieving task-level unification. Comprehensive experiments show that ProtoGCD achieves state-of-the-art performance on both generic and fine-grained datasets. The code is available at this https URL.
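
ProtoGCD 用联合原型统一建模旧类与新类。其分类与伪标签机制可以用"归一化特征与原型的余弦相似度除以温度"这一常见形式示意(温度取值为演示假设,非论文超参;双层自适应伪标签此处简化为朴素 argmax):

```python
import numpy as np

def prototype_logits(Z, prototypes, tau=0.1):
    """L2 归一化特征与联合(旧类+新类)原型的余弦相似度,除以温度 tau。"""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Pn = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return (Zn @ Pn.T) / tau

def pseudo_labels(Z, prototypes, tau=0.1):
    """未标注样本的伪标签:取相似度最高的原型(旧类或新类)。"""
    return np.argmax(prototype_logits(Z, prototypes, tau), axis=1)
```

由于旧类与新类共享同一组原型和同一个目标函数,两类样本走完全相同的打分路径,这正是论文避免两类准确率失衡的出发点。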

[AI-131] Proof of Humanity: A Multi-Layer Network Framework for Certifying Human-Originated Content in an AI-Dominated Internet

【速读】:该论文试图解决在生成式 AI (Generative AI) 快速发展的背景下,如何验证互联网内容的人类起源问题。随着合成内容(包括文本、图像、音频和视频)的激增以及人类生成数据与 AI 生成数据之间的界限变得模糊,确保内容来源的真实性成为社交平台、新闻媒体、法律及金融系统等领域的关键需求。论文提出了一种多层架构框架,旨在使电信运营商能够作为基础设施级别的内容真实性验证者。其解决方案的关键在于利用物理层的身份锚定、网络层和传输层的元数据传播、会话层及应用层的加密证明,通过 SIM/eSIM 身份、数字签名、基于行为的机器学习启发式方法以及边缘验证 API 等技术构建端到端的人类起源证明机制。该框架不仅提供了技术实现的详细路线图,还探讨了电信运营商通过提供信任即服务 (Trust-as-a-Service) API、认证流量等级和监管合规工具等方式实现商业化的路径。

链接: https://arxiv.org/abs/2504.03752
作者: Sebastian Barros
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 34 pages

点击查看摘要

Abstract:The rapid proliferation of generative AI has led to an internet increasingly populated with synthetic content-text, images, audio, and video generated without human intervention. As the distinction between human and AI-generated data blurs, the ability to verify content origin becomes critical for applications ranging from social media and journalism to legal and financial systems. In this paper, we propose a conceptual, multi-layer architectural framework that enables telecommunications networks to act as infrastructure level certifiers of human-originated content. By leveraging identity anchoring at the physical layer, metadata propagation at the network and transport layers, and cryptographic attestations at the session and application layers, Telcos can provide an end-to-end Proof of Humanity for data traversing their networks. We outline how each OSI layer can contribute to this trust fabric using technical primitives such as SIM/eSIM identity, digital signatures, behavior-based ML heuristics, and edge-validated APIs. The framework is presented as a foundation for future implementation, highlighting monetization pathways for telcos such as trust-as-a-service APIs, origin-certified traffic tiers, and regulatory compliance tools. The paper does not present implementation or benchmarking results but offers a technically detailed roadmap and strategic rationale for transforming Telcos into validators of digital authenticity in an AI-dominated internet. Security, privacy, and adversarial considerations are discussed as directions for future work.

[AI-132] Enhancing Biologically Inspired Hierarchical Temporal Memory with Hardware-Accelerated Reflex Memory

【速读】:该论文旨在解决高效处理物联网(IoT)产生的大规模无监督学习任务的问题,特别是针对现有层级时间记忆(HTM)算法在多阶推理时计算开销过大的瓶颈。HTM 的序列记忆(SM)组件由于其广泛的可编程互连结构,在处理多阶推理时遇到性能瓶颈。为了解决这一问题,论文提出了一种受脊髓工作机制启发的反射记忆(RM)模块,作为关键解决方案。RM 模块专注于加速一阶时间关系的推理,其处理速度显著快于 SM,同时保留了对多阶推理的支持能力。通过将 RM 集成到 HTM 中形成加速层级时间记忆(AHTM),以及进一步结合硬件加速的 H-AHTM,系统不仅提升了重复信息处理效率,还大幅降低了推理延迟,相较于原始 HTM 算法实现了高达 7.55 倍的加速,而 H-AHTM 更进一步达到了 10.10 倍的速度提升。

链接: https://arxiv.org/abs/2504.03746
作者: Pavia Bera,Sabrina Hassan Moon,Jennifer Adorno,Dayane Alfenas Reis,Sanjukta Bhanja
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid expansion of the Internet of Things (IoT) generates zettabytes of data that demand efficient unsupervised learning systems. Hierarchical Temporal Memory (HTM), a third-generation unsupervised AI algorithm, models the neocortex of the human brain by simulating columns of neurons to process and predict sequences. These neuron columns can memorize and infer sequences across multiple orders. While multiorder inferences offer robust predictive capabilities, they often come with significant computational overhead. The Sequence Memory (SM) component of HTM, which manages these inferences, encounters bottlenecks primarily due to its extensive programmable interconnects. In many cases, it has been observed that first-order temporal relationships have proven to be sufficient without any significant loss in efficiency. This paper introduces a Reflex Memory (RM) block, inspired by the Spinal Cord's working mechanisms, designed to accelerate the processing of first-order inferences. The RM block performs these inferences significantly faster than the SM. The integration of RM with HTM forms a system called the Accelerated Hierarchical Temporal Memory (AHTM), which processes repetitive information more efficiently than the original HTM while still supporting multiorder inferences. The experimental results demonstrate that the HTM predicts an event in 0.945 s, whereas the AHTM module does so in 0.125 s. Additionally, the hardware implementation of RM in a content-addressable memory (CAM) block, known as Hardware-Accelerated Hierarchical Temporal Memory (H-AHTM), predicts an event in just 0.094 s, significantly improving inference speed. Compared to the original algorithm \cite{bautista2020matlabhtm}, AHTM accelerates inference by up to 7.55x, while H-AHTM further enhances performance with a 10.10x speedup.
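
反射记忆(RM)只做一阶时间推理,本质上是一张"当前符号 → 下一符号"的转移计数表,预测退化为一次查表。以下为该思想的极简 Python 示意(非论文中基于 CAM 的硬件实现):

```python
from collections import defaultdict, Counter

class ReflexMemory:
    """一阶序列记忆:维护"当前符号 -> 下一符号计数"的转移表。
    预测只需一次查表,这对应论文中 RM 用 CAM 加速的一阶推理。"""
    def __init__(self):
        self.table = defaultdict(Counter)

    def learn(self, sequence):
        for cur, nxt in zip(sequence, sequence[1:]):
            self.table[cur][nxt] += 1

    def predict(self, cur):
        nxt = self.table.get(cur)
        return nxt.most_common(1)[0][0] if nxt else None
```

在 AHTM 中,这样的一阶查表负责处理重复性输入;只有当一阶预测不足时才回落到 HTM 的序列记忆做多阶推理,从而节省大部分计算。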

[AI-133] Comparative Explanations: Explanation Guided Decision Making for Human-in-the-Loop Preference Selection

【速读】:本文旨在解决Preference Bayesian Optimization (PBO) 中偏好选择这一非平凡任务的问题,具体涉及在向量值结果的隐含权衡、决策者主观优先级以及偏好选择中的不确定性之间进行导航的挑战。现有的可解释人工智能 (XAI) 方法主要关注输入特征的重要性,而忽视了输出(目标)在人类偏好获取中的关键作用。为填补这一空白,论文提出了一种新的比较解释方法——多输出局部叙述解释 (MOLONE)。其关键是提供既强调输入又强调输出重要性的解释,使决策者能够理解竞争目标之间的权衡并做出更明智的偏好选择。MOLONE 专注于局部解释,在搜索空间的局部邻域内比较候选样本的输入特征和结果的重要性,从而捕捉与基于偏好的决策相关的细微差异。通过在 PBO 框架内的评估以及使用基准多目标优化函数,研究证明了 MOLONE 在提高收敛性方面的有效性,并且用户研究表明它显著加速了人机交互场景中的收敛过程。

链接: https://arxiv.org/abs/2504.03744
作者: Tanmay Chakraborty,Christian Wirth,Christin Seifert
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces Multi-Output LOcal Narrative Explanation (MOLONE), a novel comparative explanation method designed to enhance preference selection in human-in-the-loop Preference Bayesian optimization (PBO). The preference elicitation in PBO is a non-trivial task because it involves navigating implicit trade-offs between vector-valued outcomes, subjective priorities of decision-makers, and decision-makers’ uncertainty in preference selection. Existing explainable AI (XAI) methods for BO primarily focus on input feature importance, neglecting the crucial role of outputs (objectives) in human preference elicitation. MOLONE addresses this gap by providing explanations that highlight both input and output importance, enabling decision-makers to understand the trade-offs between competing objectives and make more informed preference selections. MOLONE focuses on local explanations, comparing the importance of input features and outcomes across candidate samples within a local neighborhood of the search space, thus capturing nuanced differences relevant to preference-based decision-making. We evaluate MOLONE within a PBO framework using benchmark multi-objective optimization functions, demonstrating its effectiveness in improving convergence compared to noisy preference selections. Furthermore, a user study confirms that MOLONE significantly accelerates convergence in human-in-the-loop scenarios by facilitating more efficient identification of preferred options.

[AI-134] Modelling bounded rational decision-making through Wasserstein constraints

【速读】:该论文试图解决基于信息受限处理的有界理性决策建模在序数动作空间中存在的问题,现有方法(如熵、Kullback-Leibler散度和互信息)分别存在假设均匀先验、缺乏动作“接近性”概念以及难以估计等局限性。论文的关键解决方案是提出利用Wasserstein距离来建模有界理性的强化学习代理,该方法能够考虑序数动作的接近性,体现agent决策的“粘滞性”及远离当前动作的快速切换不常见,同时支持低概率动作和零支撑先验分布,并且易于直接计算。

链接: https://arxiv.org/abs/2504.03743
作者: Benjamin Patrick Evans,Leo Ardon,Sumitra Ganesh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); General Economics (econ.GN)
备注: Accepted at RLDM 2025

点击查看摘要

Abstract:Modelling bounded rational decision-making through information constrained processing provides a principled approach for representing departures from rationality within a reinforcement learning framework, while still treating decision-making as an optimization process. However, existing approaches are generally based on Entropy, Kullback-Leibler divergence, or Mutual Information. In this work, we highlight issues with these approaches when dealing with ordinal action spaces. Specifically, entropy assumes uniform prior beliefs, missing the impact of a priori biases on decision-making. KL-divergence addresses this; however, it has no notion of "nearness" of actions, and additionally has several well-known, potentially undesirable properties such as the lack of symmetry, and furthermore requires the distributions to have the same support (e.g. positive probability for all actions). Mutual information is often difficult to estimate. Here, we propose an alternative approach for modeling bounded rational RL agents utilising Wasserstein distances. This approach overcomes the aforementioned issues. Crucially, this approach accounts for the nearness of ordinal actions, modeling "stickiness" in agent decisions and unlikeliness of rapidly switching to far away actions, while also supporting low probability actions, zero-support prior distributions, and is simple to calculate directly.
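
对序数动作空间,一维分布间的 1-Wasserstein 距离可由 CDF 差的绝对值求和直接算出(设相邻动作间距为 1)。下面的示例说明它如何体现动作"接近性":把概率质量从动作 0 移到动作 1 的代价小于移到动作 4,而 KL 散度对这两种支撑不相交的情形同样发散、无法区分:

```python
import numpy as np

def w1_ordinal(p, q):
    """序数动作空间(相邻动作间距取 1)上两个分布的 1-Wasserstein 距离:
    等于两者 CDF 之差的绝对值求和。支持零支撑的先验,且可直接计算。"""
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))))

d0 = np.array([1.0, 0, 0, 0, 0])   # 先验:全部质量在动作 0
d1 = np.array([0, 1.0, 0, 0, 0])   # 策略 A:切换到相邻动作 1
d4 = np.array([0, 0, 0, 0, 1.0])   # 策略 B:跳到远端动作 4
```

在有界理性约束中,用 w1_ordinal(policy, prior) 作为正则项即可惩罚"远跳"而容忍"近移",这正是摘要所说的决策"粘滞性"。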

[AI-135] Hierarchical Local-Global Feature Learning for Few-shot Malicious Traffic Detection

【速读】:该论文旨在解决互联网流量快速增长背景下,恶意网络攻击频率增加且复杂度提升所导致的全球网络安全威胁问题。传统检测方法(如基于规则和机器学习的方法)在面对有限样本场景下的新兴威胁时难以准确识别,而现有的少量样本学习方法虽部分缓解了数据稀缺的问题,但仍存在高误报率及无法有效捕捉关键本地流量模式的局限性。论文提出的解决方案HLoG是一种新颖的分层少量样本恶意流量检测框架,其关键是结合从网络会话中提取的局部与全局特征,通过滑动窗口方式将会话分割为阶段,并利用分层双向GRU编码捕获细粒度的局部交互模式,同时建模全局上下文依赖关系。此外,设计了一个集成局部相似性和全局自注意力增强表示的会话相似性评估模块,实现了准确且鲁棒的少量样本流量分类。实验结果表明,HLoG在三个精心重构的数据集上的表现显著优于现有最先进的方法,特别是在提高召回率的同时大幅降低了误报率,突显了其在实际网络安全应用中的有效性与实用价值。

链接: https://arxiv.org/abs/2504.03742
作者: Songtao Peng,Lei Wang,Wu Shuai,Hao Song,Jiajun Zhou,Shanqing Yu,Qi Xuan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the rapid growth of internet traffic, malicious network attacks have become increasingly frequent and sophisticated, posing significant threats to global cybersecurity. Traditional detection methods, including rule-based and machine learning-based approaches, struggle to accurately identify emerging threats, particularly in scenarios with limited samples. While recent advances in few-shot learning have partially addressed the data scarcity issue, existing methods still exhibit high false positive rates and lack the capability to effectively capture crucial local traffic patterns. In this paper, we propose HLoG, a novel hierarchical few-shot malicious traffic detection framework that leverages both local and global features extracted from network sessions. HLoG employs a sliding-window approach to segment sessions into phases, capturing fine-grained local interaction patterns through hierarchical bidirectional GRU encoding, while simultaneously modeling global contextual dependencies. We further design a session similarity assessment module that integrates local similarity with global self-attention-enhanced representations, achieving accurate and robust few-shot traffic classification. Comprehensive experiments on three meticulously reconstructed datasets demonstrate that HLoG significantly outperforms existing state-of-the-art methods. Particularly, HLoG achieves superior recall rates while substantially reducing false positives, highlighting its effectiveness and practical value in real-world cybersecurity applications.
zh

[AI-136] Brain Network Classification Based on Graph Contrastive Learning and Graph Transformer

【速读】:该论文旨在解决功能脑网络动态特性表征中的两个主要挑战:数据稀缺性和监督不足。为应对这些限制,论文提出了一种名为PHGCL-DDGformer的新模型,该模型结合图对比学习(Graph Contrastive Learning)与图Transformer(Graph Transformers),有效提升了脑网络分类任务的表征学习能力。解决方案的关键在于:首先通过自适应图增强策略(结合属性屏蔽和边扰动)克服现有图对比学习方法在脑网络特征提取方面的局限性;其次构建双域图Transformer(DDGformer)模块以整合局部和全局信息,其中图卷积网络捕获局部模式,注意力机制提取全局依赖关系;最后建立图对比学习框架以最大化正负样本对之间的一致性,从而获得高质量的图表示。实验结果表明,该模型在真实数据集上的脑网络分类任务中优于现有最先进的方法。

链接: https://arxiv.org/abs/2504.03740
作者: ZhiTeng Zhu,Lan Yao(School of Mathematics, Hunan University)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, uses this http URL

点击查看摘要

Abstract:The dynamic characterization of functional brain networks is of great significance for elucidating the mechanisms of human brain function. Although graph neural networks have achieved remarkable progress in functional network analysis, challenges such as data scarcity and insufficient supervision persist. To address the limitations of limited training data and inadequate supervision, this paper proposes a novel model named PHGCL-DDGformer that integrates graph contrastive learning with graph transformers, effectively enhancing the representation learning capability for brain network classification tasks. To overcome the constraints of existing graph contrastive learning methods in brain network feature extraction, an adaptive graph augmentation strategy combining attribute masking and edge perturbation is implemented for data enhancement. Subsequently, a dual-domain graph transformer (DDGformer) module is constructed to integrate local and global information, where graph convolutional networks aggregate neighborhood features to capture local patterns while attention mechanisms extract global dependencies. Finally, a graph contrastive learning framework is established to maximize the consistency between positive and negative pairs, thereby obtaining high-quality graph representations. Experimental results on real-world datasets demonstrate that the PHGCL-DDGformer model outperforms existing state-of-the-art approaches in brain network classification tasks.
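摘要中的自适应图增强(属性屏蔽 + 边扰动)思路可用如下最小示意表达(非原文实现;mask_rate、drop_rate 为假设超参数,真实方法中比例应自适应调整):

```python
import random

# 示意:属性屏蔽(按比例将节点特征置零)+ 边扰动(按比例随机删边)
def augment_graph(features, edges, mask_rate=0.3, drop_rate=0.2, seed=0):
    rng = random.Random(seed)
    masked = [[0.0 if rng.random() < mask_rate else v for v in row]
              for row in features]
    kept = [e for e in edges if rng.random() >= drop_rate]
    return masked, kept

feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
edges = [(0, 1), (1, 2), (0, 2)]
aug_feats, aug_edges = augment_graph(feats, edges)
```

对比学习中通常对同一张图做两次独立增强,得到的两个视图构成正样本对。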
zh

[AI-137] Artificial Geographically Weighted Neural Network: A Novel Framework for Spatial Analysis with Geographically Weighted Layers

【速读】:本文旨在解决传统地理加权回归(Geographically Weighted Regression, GWR)方法假设变量间关系为线性而导致的局限性问题。为克服这一限制,论文提出了一种人工地理加权神经网络(Artificial Geographically Weighted Neural Network, AGWNN),这是一种将地理加权技术与神经网络相结合的新框架,用于捕捉复杂的非线性空间关系。AGWNN的关键在于其地理加权层(Geographically Weighted Layer, GWL),这是一个专门设计的组件,用于在神经网络架构中编码空间异质性。通过使用模拟数据集和真实案例研究进行的全面实验表明,AGWNN在模型拟合精度方面显著优于传统GWR和标准人工神经网络(Artificial Neural Networks, ANNs),尤其在建模复杂非线性关系和识别复杂空间异质性模式方面表现出色。

链接: https://arxiv.org/abs/2504.03734
作者: Jianfei Cao,Dongchao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Geographically Weighted Regression (GWR) is a widely recognized technique for modeling spatial heterogeneity. However, it is commonly assumed that the relationships between dependent and independent variables are linear. To overcome this limitation, we propose an Artificial Geographically Weighted Neural Network (AGWNN), a novel framework that integrates geographically weighted techniques with neural networks to capture complex nonlinear spatial relationships. Central to this framework is the Geographically Weighted Layer (GWL), a specialized component designed to encode spatial heterogeneity within the neural network architecture. To rigorously evaluate the performance of AGWNN, we conducted comprehensive experiments using both simulated datasets and real-world case studies. Our results demonstrate that AGWNN significantly outperforms traditional GWR and standard Artificial Neural Networks (ANNs) in terms of model fitting accuracy. Notably, AGWNN excels in modeling intricate nonlinear relationships and effectively identifies complex spatial heterogeneity patterns, offering a robust and versatile tool for advanced spatial analysis.
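地理加权方法的核心是由样本间空间距离得到的核权重。以下用高斯核给出一个最小示意(GWL 内部的具体调制方式以原文为准;bandwidth 为假设参数):

```python
import math

# 示意:高斯空间核权重 w = exp(-(d / bandwidth)^2),距离越远权重越小
def gaussian_spatial_weight(loc_i, loc_j, bandwidth=1.0):
    d = math.dist(loc_i, loc_j)
    return math.exp(-(d / bandwidth) ** 2)

w_same = gaussian_spatial_weight((0.0, 0.0), (0.0, 0.0))  # 同一位置,权重为 1
w_far = gaussian_spatial_weight((0.0, 0.0), (3.0, 4.0))   # 距离 5,权重趋近 0
```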
zh

[AI-138] A Benchmark for Scalable Oversight Protocols ICLR2025

【速读】:该论文试图解决的问题是如何在AI能力超越人类的情况下,通过可扩展的监督(Scalable Oversight)有效提供人类反馈,以确保AI与人类目标的一致性(Alignment)。现有可扩展监督协议缺乏系统性的实证框架来评估和比较它们的有效性,特别是对于不同协议的实验结果难以推广。论文的关键解决方案是引入了一个基于代理分数差异(Agent Score Difference, ASD)指标的可扩展监督基准(Scalable Oversight Benchmark),该指标衡量机制在促进诚实而非欺骗方面的有效性。通过这一基准,作者提供了Python工具包以加速和促进不同协议的竞争性评估,并通过示例实验展示了如何使用此基准评估辩论(Debate)协议。

链接: https://arxiv.org/abs/2504.03731
作者: Abhimanyu Pallavi Sudhir,Jackson Kaunismaa,Arjun Panickssery
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the ICLR 2025 Workshop on Bidirectional Human-AI Alignment (BiAlign)

点击查看摘要

Abstract:As AI agents surpass human capabilities, scalable oversight – the problem of effectively supplying human feedback to potentially superhuman AI models – becomes increasingly critical to ensure alignment. While numerous scalable oversight protocols have been proposed, they lack a systematic empirical framework to evaluate and compare them. While recent works have tried to empirically study scalable oversight protocols – particularly Debate – we argue that the experiments they conduct are not generalizable to other protocols. We introduce the scalable oversight benchmark, a principled framework for evaluating human feedback mechanisms based on our agent score difference (ASD) metric, a measure of how effectively a mechanism advantages truth-telling over deception. We supply a Python package to facilitate rapid and competitive evaluation of scalable oversight protocols on our benchmark, and conduct a demonstrative experiment benchmarking Debate.
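ASD 指标的直观含义是"诚实策略相对欺骗策略的得分优势"。以下是一个最小示意(具体得分定义以原文为准,此处仅取平均得分之差):

```python
# 示意:agent score difference = 诚实代理平均得分 - 欺骗代理平均得分
# ASD 越大,说明该监督机制越有利于诚实行为
def agent_score_difference(truthful_scores, deceptive_scores):
    mean = lambda xs: sum(xs) / len(xs)
    return mean(truthful_scores) - mean(deceptive_scores)

asd = agent_score_difference([0.8, 0.9, 0.7], [0.4, 0.5, 0.3])  # 约 0.4,机制偏向诚实
```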
zh

[AI-139] A Scalable Predictive Modelling Approach to Identifying Duplicate Adverse Event Reports for Drugs and Vaccines

【速读】:该论文旨在解决药物警戒数据库中重复报告(duplicate reports)的自动识别问题。重复报告是指针对同一患者在特定时间发生的不良事件的独立且未链接的多份报告,它们会阻碍统计分析并误导临床评估。由于数据库规模庞大,手动识别重复报告不可行,因此需要一种计算方法。论文的关键在于基于最先进的模型vigiMatch进行改进,通过修改现有特征并引入新特征来解决原模型的已知不足。解决方案的核心是使用两个支持向量机分类器(分别针对药品和疫苗)对报告对(report pairs)进行分类,判断其是否为重复报告。论文通过多种测试集验证了新模型在召回率(recall)和精确率(precision)上的提升,并展示了其在减少单个国家数据集中误报方面的显著优势,从而实现了药物和疫苗不良事件报告中重复检测的最新技术水平突破。

链接: https://arxiv.org/abs/2504.03729
作者: Jim W. Barrett,Nils Erlanson,Joana Félix China,G. Niklas Norén
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 26 pages, 11 figures

点击查看摘要

Abstract:The practice of pharmacovigilance relies on large databases of individual case safety reports to detect and evaluate potential new causal associations between medicines or vaccines and adverse events. Duplicate reports are separate and unlinked reports referring to the same case of an adverse event involving a specific patient at a certain time. They impede statistical analysis and mislead clinical assessment. The large size of such databases precludes a manual identification of duplicates, and so a computational method must be employed. This paper builds upon a hitherto state of the art model, vigiMatch, modifying existing features and introducing new ones to target known shortcomings of the original model. Two support vector machine classifiers, one for medicines and one for vaccines, classify report pairs as duplicates and non-duplicates. Recall was measured using a diverse collection of 5 independent labelled test sets. Precision was measured by having each model classify a randomly selected stream of pairs of reports until each model classified 100 pairs as duplicates. These pairs were assessed by a medical doctor without indicating which method(s) had flagged each pair. Performance on individual countries was measured by having a medical doctor assess a subset of pairs classified as duplicates for three different countries. The new model achieved higher precision and higher recall for all labelled datasets compared to the previous state of the art model, with comparable performance for medicines and vaccines. The model was shown to produce substantially fewer false positives than the comparator model on pairs from individual countries. The method presented here advances state of the art for duplicate detection in adverse event reports for medicines and vaccines.
zh

[AI-140] Detecting Malicious AI Agents Through Simulated Interactions

【速读】:本文旨在研究恶意人工智能助手(Malicious AI Assistants)的操控特性,并探讨在不同决策情境中与类人模拟用户交互时,这些恶意行为是否能够被检测到。此外,还考察了交互深度和规划能力如何影响恶意人工智能助手的操控策略及其有效性。为解决这些问题,研究采用了受控实验设计,在八个复杂性和风险程度各异的决策场景中模拟了人工智能助手(包括良性助手与刻意设计的恶意助手)与用户之间的互动。方法学上,利用了两种最先进的语言模型生成交互数据,并实施了意图感知提示(Intent-Aware Prompting, IAP)来检测恶意人工智能助手。结果显示,IAP检测方法可实现高精确率且零误报,但仍会漏检大量恶意人工智能助手,导致较高的漏报率。这些结果强调了人机交互中的重要风险,并突显了在日益自主的决策支持系统中,针对操控性人工智能行为建立稳健且上下文敏感的安全措施的必要性。

链接: https://arxiv.org/abs/2504.03726
作者: Yulu Pi,Ella Bettison,Anna Becker
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study investigates malicious AI Assistants’ manipulative traits and whether the behaviours of malicious AI Assistants can be detected when interacting with human-like simulated users in various decision-making contexts. We also examine how interaction depth and ability of planning influence malicious AI Assistants’ manipulative strategies and effectiveness. Using a controlled experimental design, we simulate interactions between AI Assistants (both benign and deliberately malicious) and users across eight decision-making scenarios of varying complexity and stakes. Our methodology employs two state-of-the-art language models to generate interaction data and implements Intent-Aware Prompting (IAP) to detect malicious AI Assistants. The findings reveal that malicious AI Assistants employ domain-specific persona-tailored manipulation strategies, exploiting simulated users’ vulnerabilities and emotional triggers. In particular, simulated users demonstrate resistance to manipulation initially, but become increasingly vulnerable to malicious AI Assistants as the depth of the interaction increases, highlighting the significant risks associated with extended engagement with potentially manipulative systems. IAP detection methods achieve high precision with zero false positives but struggle to detect many malicious AI Assistants, resulting in high false negative rates. These findings underscore critical risks in human-AI interactions and highlight the need for robust, context-sensitive safeguards against manipulative AI behaviour in increasingly autonomous decision-support systems.
zh

[AI-141] A Hybrid Reinforcement Learning Framework for Hard Latency Constrained Resource Scheduling

【速读】:该论文旨在解决在突发流量(burst traffic)场景下,针对具有硬性时延约束(hard latency constraints)的资源调度问题,现有算法效率不足的问题。论文的关键在于提出了一种新颖的混合强化学习框架——带硬时延约束的资源调度强化学习框架(Hybrid Reinforcement Learning Framework for Resource Scheduling with Hard Latency Constraints, HRL-RSHLC)。该方案通过复用其他相似环境中学到的老策略以及基于领域知识(Domain-Knowledge, DK)构建的策略,优化策略重用概率与新策略的联合目标函数,将其形式化为一个马尔可夫决策过程(Markov Decision Process, MDP),以最大化用户的硬时延约束有效吞吐量(Hard-Latency Constrained Effective Throughput, HLC-ET)。论文证明了所提出的HRL-RSHLC能够在任意初始点收敛至KKT点,并通过仿真验证其相比基线算法具有更快的收敛速度和更优的性能表现。

链接: https://arxiv.org/abs/2504.03721
作者: Luyuan Zhang,An Liu,Kexuan Wang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 13 pages, 8 figures

点击查看摘要

Abstract:In the forthcoming 6G era, extend reality (XR) has been regarded as an emerging application for ultra-reliable and low latency communications (URLLC) with new traffic characteristics and more stringent requirements. In addition to the quasi-periodical traffic in XR, burst traffic with both large frame size and random arrivals in some real world low latency communication scenarios has become the leading cause of network congestion or even collapse, and there still lacks an efficient algorithm for the resource scheduling problem under burst traffic with hard latency constraints. We propose a novel hybrid reinforcement learning framework for resource scheduling with hard latency constraints (HRL-RSHLC), which reuses polices from both old policies learned under other similar environments and domain-knowledge-based (DK) policies constructed using expert knowledge to improve the performance. The joint optimization of the policy reuse probabilities and new policy is formulated as an Markov Decision Problem (MDP), which maximizes the hard-latency constrained effective throughput (HLC-ET) of users. We prove that the proposed HRL-RSHLC can converge to KKT points with an arbitrary initial point. Simulations show that HRL-RSHLC can achieve superior performance with faster convergence speed compared to baseline algorithms.
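HRL-RSHLC 中"以一定概率复用旧策略或领域知识(DK)策略、否则执行新策略"的机制,可以用如下动作选择示意(概率取值为假设;原文中复用概率与新策略是联合优化的):

```python
import random

# 示意:按复用概率在旧策略 / DK 策略 / 新策略之间选择执行者
def select_policy(p_old, p_dk, rng):
    u = rng.random()
    if u < p_old:
        return "old"            # 复用相似环境中学到的旧策略
    if u < p_old + p_dk:
        return "dk"             # 复用专家知识构造的 DK 策略
    return "new"                # 执行正在学习的新策略

rng = random.Random(42)
choices = [select_policy(0.3, 0.2, rng) for _ in range(1000)]
frac_new = choices.count("new") / 1000  # 期望约 0.5
```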
zh

[AI-142] TransNet: Transfer Knowledge for Few-shot Knowledge Graph Completion

【速读】:该论文旨在解决现实世界知识图谱(Knowledge Graphs, KGs)的不完整性和关系长尾分布导致的下游任务性能下降问题,尤其是在仅存在有限训练三元组(training triplets)的情况下,完成新型关系的三元组预测(few-shot KG completion)。论文的关键在于提出了一种基于迁移学习的Few-shot知识图谱补全方法(TransNet),通过学习不同任务之间的关联性实现跨任务的知识迁移,同时结合元学习(meta-learning)有效泛化到新的、未见过的关系。这种将任务间相关性与已有任务的知识利用相结合的方式是其解决方案的核心创新点。

链接: https://arxiv.org/abs/2504.03720
作者: Lihui Liu,Zihao Wang,Dawei Zhou,Ruijie Wang,Yuchen Yan,Bo Xiong,Sihong He,Kai Shu,Hanghang Tong
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Knowledge graphs (KGs) are ubiquitous and widely used in various applications. However, most real-world knowledge graphs are incomplete, which significantly degrades their performance on downstream tasks. Additionally, the relationships in real-world knowledge graphs often follow a long-tail distribution, meaning that most relations are represented by only a few training triplets. To address these challenges, few-shot learning has been introduced. Few-shot KG completion aims to make accurate predictions for triplets involving novel relations when only a limited number of training triplets are available. Although many methods have been proposed, they typically learn each relation individually, overlooking the correlations between different tasks and the relevant information in previously trained tasks. In this paper, we propose a transfer learning-based few-shot KG completion method (TransNet). By learning the relationships between different tasks, TransNet effectively transfers knowledge from similar tasks to improve the current task’s performance. Furthermore, by employing meta-learning, TransNet can generalize effectively to new, unseen relations. Extensive experiments on benchmark datasets demonstrate the superiority of TransNet over state-of-the-art methods. Code can be found at this https URL
zh

[AI-143] Towards Symmetric Low-Rank Adapters

【速读】:该论文旨在解决通过低秩适配器(Low-Rank Adapters)进行模型微调时参数效率的问题。传统LoRA方法通过奇异值分解(SVD-like)的方式将微调权重与预训练权重结合,但其仍需较大的存储开销。本文提出的Symmetric Low-Rank Adapters(SymLoRA)通过引入对称低秩矩阵(Low-Rank Symmetric Weight Matrices),将微调权重表示为谱分解(Spectral Decomposition),即 ( Q , \text{diag}(\Lambda), Q^T ),其中 ( Q \in \mathbb{R}^{n \times r} ) 和 ( \Lambda \in \mathbb{R}^{r} )。这一方法显著减少了约一半的微调参数数量,同时在下游任务性能上几乎没有损失。因此,其关键在于利用对称性减少参数冗余,从而提高微调效率。

链接: https://arxiv.org/abs/2504.03719
作者: Tales Panoutsos,Rodrygo L. T. Santos,Flavio Figueiredo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Colorai Workshop

点击查看摘要

Abstract:In this paper, we introduce Symmetric Low-Rank Adapters, an optimized variant of LoRA with even fewer weights. This method utilizes Low-Rank Symmetric Weight Matrices to learn downstream tasks more efficiently. Traditional LoRA accumulates fine-tuning weights with the original pre-trained weights via a Singular Value Decomposition (SVD) like approach, i.e., model weights are fine-tuned via updates of the form BA (where B \in \mathbb{R}^{n\times r} , A \in \mathbb{R}^{r\times n} , and r is the rank of the merged weight matrix). In contrast, our approach, named SymLoRA, represents fine-tuning weights as a Spectral Decomposition, i.e., Q \, \text{diag}(\Lambda) \, Q^T , where Q \in \mathbb{R}^{n\times r} and \Lambda \in \mathbb{R}^{r} . SymLoRA requires approximately half of the finetuning weights. Here, we show that this approach has negligible losses in downstream efficacy.
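LoRA 与 SymLoRA 的参数量差异可以直接数出来:LoRA 的更新 BA 需要 2nr 个参数,而 SymLoRA 的 Q diag(Λ) Qᵀ 只需 nr + r 个,约为一半。以下为纯 Python 最小示意(n、r 取值为假设示例):

```python
# 示意:对比 LoRA 与 SymLoRA 的可训练参数量,并计算对称低秩更新 ΔW
n, r = 8, 2
lora_params = n * r + r * n        # B ∈ R^{n×r}, A ∈ R^{r×n} → 2nr = 32
symlora_params = n * r + r         # Q ∈ R^{n×r}, Λ ∈ R^r → nr + r = 18,约为一半

def symlora_delta(Q, lam):
    """ΔW = Q diag(lam) Q^T(结果必然是对称矩阵)。"""
    dim, rank = len(Q), len(lam)
    return [[sum(Q[i][k] * lam[k] * Q[j][k] for k in range(rank))
             for j in range(dim)] for i in range(dim)]

Q = [[1.0, 0.0], [0.0, 1.0]] + [[0.0, 0.0] for _ in range(6)]
delta = symlora_delta(Q, [2.0, 3.0])
```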
zh

[AI-144] Task-Aware Parameter-Efficient Fine-Tuning of Large Pre-Trained Models at the Edge

【速读】:该论文旨在解决在边缘设备上针对特定任务微调大型语言模型(Large Language Models, LLMs)时面临的高计算成本、存储限制和能源消耗等问题。解决方案的关键在于提出了一种名为TaskEdge的任务感知参数高效微调框架。TaskEdge通过分配最有效的参数到目标任务,并仅更新与任务相关的特定参数,实现了显著降低计算成本和内存使用的同时,保持了在下游任务上的性能。具体而言,TaskEdge首先设计了一个结合权重和输入激活的参数重要性计算标准,然后提出了一种与模型无关的任务特定参数分配算法,确保任务相关参数在整个模型中均匀分布,而非集中于特定区域。这种方法使得TaskEdge能够以少于0.1%的参数更新实现上述目标。

链接: https://arxiv.org/abs/2504.03718
作者: Senkang Hu,Yanan Ma,Yihang Tao,Zhengru Fang,Zihan Fang,Yiqin Deng,Sam Kwong,Yuguang Fang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable success in various tasks, such as decision-making, reasoning, and question answering. They have been widely used in edge devices. However, fine-tuning LLMs to specific tasks at the edge is challenging due to the high computational cost and the limited storage and energy resources at the edge. To address this issue, we propose TaskEdge, a task-aware parameter-efficient fine-tuning framework at the edge, which allocates the most effective parameters to the target task and only updates the task-specific parameters. Specifically, we first design a parameter importance calculation criterion that incorporates both weights and input activations into the computation of weight importance. Then, we propose a model-agnostic task-specific parameter allocation algorithm to ensure that task-specific parameters are distributed evenly across the model, rather than being concentrated in specific regions. In doing so, TaskEdge can significantly reduce the computational cost and memory usage while maintaining performance on the target downstream tasks by updating less than 0.1% of the parameters. In addition, TaskEdge can be easily integrated with structured sparsity to enable acceleration by NVIDIA’s specialized sparse tensor cores, and it can be seamlessly integrated with LoRA to enable efficient sparse low-rank adaptation. Extensive experiments on various tasks demonstrate the effectiveness of TaskEdge.
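TaskEdge 的参数重要性标准同时考虑权重与输入激活。以下按"重要性 = |权重| × |激活|、仅保留 top-k"给出最小示意(这是对原文标准的简化假设):

```python
# 示意:逐参数计算 |w|*|a| 重要性分数,仅 top-k 参数参与微调(mask=1)
def importance_mask(weights, activations, k):
    scores = [(abs(w) * abs(a), i)
              for i, (w, a) in enumerate(zip(weights, activations))]
    top = {i for _, i in sorted(scores, reverse=True)[:k]}
    return [1 if i in top else 0 for i in range(len(weights))]

# 权重大但激活小(如 -2.0 × 0.1)的参数反而不重要
mask = importance_mask([0.5, -2.0, 0.1, 1.0], [1.0, 0.1, 3.0, 0.5], k=2)
```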
zh

[AI-145] RaanA: A Fast Flexible and Data-Efficient Post-Training Quantization Algorithm

【速读】:该论文旨在解决现有后训练量化(Post-Training Quantization, PTQ)方法中存在的两个关键问题:一是对校准数据量的需求过大,二是目标位宽选择缺乏灵活性。为应对这些挑战,论文提出了一种统一的PTQ框架RaanA,其核心解决方案包括两个创新组件:1) RaBitQ-H,一种基于随机向量量化方法RaBitQ的变体,用于实现快速、精确且高效的量化;2) AllocateBits算法,通过分析各层的量化敏感性,优化分配各层的位宽。RaanA在保持高性能的同时,显著提升了量化速度,减少了对校准数据的依赖,并实现了灵活的位宽配置。实验结果验证了RaanA在效率与精度之间取得的良好平衡。

链接: https://arxiv.org/abs/2504.03717
作者: Yongyi Yang,Jianyang Gao,Wei Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Post-training Quantization (PTQ) has become a widely used technique for improving inference efficiency of large language models (LLMs). However, existing PTQ methods generally suffer from crucial limitations such as heavy calibration data requirements and inflexible choice of target number of bits. In this paper, we propose RaanA, a unified PTQ framework that overcomes these challenges by introducing two novel components: 1) RaBitQ-H, a variant of a randomized vector quantization method RaBitQ, designed for fast, accurate, and highly efficient quantization; and 2) AllocateBits, an algorithm that optimally allocates bit-widths across layers based on their quantization sensitivity. RaanA achieves competitive performance with state-of-the-art quantization methods while being extremely fast, requiring minimal calibration data, and enabling flexible bit allocation. Extensive experiments demonstrate RaanA’s efficacy in balancing efficiency and accuracy. The code is publicly available at this https URL .
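按层敏感度分配位宽的思想可用一个贪心过程勾勒:在总位宽预算内,每次把 1 bit 分给"敏感度/当前位宽"比值最高的层。注意这只是对 AllocateBits 思路的简化假设,并非原算法:

```python
# 示意:敏感度驱动的逐 bit 贪心位宽分配(min/max 位宽与敏感度取值均为假设)
def allocate_bits(sensitivity, total_bits, min_bits=2, max_bits=8):
    n = len(sensitivity)
    bits = [min_bits] * n
    budget = total_bits - min_bits * n
    while budget > 0:
        candidates = [i for i in range(n) if bits[i] < max_bits]
        if not candidates:
            break
        j = max(candidates, key=lambda i: sensitivity[i] / bits[i])
        bits[j] += 1
        budget -= 1
    return bits

bits = allocate_bits([4.0, 2.0, 1.0], total_bits=12)  # 更敏感的层分到更多 bit
```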
zh

[AI-146] Ethical AI on the Waitlist: Group Fairness Evaluation of LLM -Aided Organ Allocation

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在高风险(high-stakes)场景中的公平性评估问题。现有评估方法存在局限性,如基准测试饱和、基于准确率的指标过于简化,以及许多本质上模糊的问题缺乏明确的ground truth,导致公平性评估变得复杂。为了解决这一问题,论文的关键方案是将投票理论中的Borda评分法引入公平性评估,将其作为一种细致且可解释的度量标准。通过器官分配案例研究,论文设计了“Choose-One”和“Rank-All”两项任务,并提出利用Borda评分捕捉排名中的偏见,从而提供更丰富和多维度的LLMs公平性评估。

链接: https://arxiv.org/abs/2504.03716
作者: Hannah Murray,Brian Hyeongseok Kim,Isabelle Lee,Jason Byun,Dani Yogatama,Evi Micha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are becoming ubiquitous, promising automation even in high-stakes scenarios. However, existing evaluation methods often fall short – benchmarks saturate, accuracy-based metrics are overly simplistic, and many inherently ambiguous problems lack a clear ground truth. Given these limitations, evaluating fairness becomes complex. To address this, we reframe fairness evaluation using Borda scores, a method from voting theory, as a nuanced yet interpretable metric for measuring fairness. Using organ allocation as a case study, we introduce two tasks: (1) Choose-One and (2) Rank-All. In Choose-One, LLMs select a single candidate for a kidney, and we assess fairness across demographics using proportional parity. In Rank-All, LLMs rank all candidates for a kidney, reflecting real-world allocation processes. Since traditional fairness metrics do not account for ranking, we propose a novel application of Borda scoring to capture biases. Our findings highlight the potential of voting-based metrics to provide a richer, more multifaceted evaluation of LLM fairness.
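Borda 评分的规则很简单:在 n 个候选人的排名中,第 i 位(从 0 计)得 n-1-i 分,跨多份排名累加。以下为最小实现(候选人与排名为虚构示例):

```python
# 示意:由多份完整排名计算各候选人的 Borda 总分
def borda_scores(rankings):
    candidates = rankings[0]
    n = len(candidates)
    scores = {c: 0 for c in candidates}
    for ranking in rankings:
        for pos, c in enumerate(ranking):
            scores[c] += n - 1 - pos
    return scores

# 三份排名:A 两次居首,因此总分最高
scores = borda_scores([["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]])
```

在 Rank-All 任务中,可分别对不同人口属性分组统计 Borda 分数,以衡量排名层面的偏差。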
zh

[AI-147] Multi-Objective Quality-Diversity in Unstructured and Unbounded Spaces GECCO2025

【速读】:该论文试图解决的问题是如何将Multi-Objective Quality-Diversity (MOQD)算法应用于未知或需学习的特征空间领域,例如复杂的生物系统或潜在探索任务。现有MOQD方法依赖于将特征空间划分为网格结构,这限制了其在无结构或无界特征空间中的应用。论文的关键解决方案是提出了一种名为Multi-Objective Unstructured Repertoire for Quality-Diversity (MOUR-QD)的新算法,该算法专为无结构且无界的特征空间设计。MOUR-QD通过无需网格划分的方式,在需要学习特征的任务中表现出色,并在无界特征空间的场景下优于现有的基于网格的方法,同时在传统MOQD任务中也具有竞争力,展示了在蛋白质设计和图像生成等领域的潜力。

链接: https://arxiv.org/abs/2504.03715
作者: Hannah Janmohamed,Antoine Cully
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: Accepted GECCO 2025

点击查看摘要

Abstract:Quality-Diversity algorithms are powerful tools for discovering diverse, high-performing solutions. Recently, Multi-Objective Quality-Diversity (MOQD) extends QD to problems with several objectives while preserving solution diversity. MOQD has shown promise in fields such as robotics and materials science, where finding trade-offs between competing objectives like energy efficiency and speed, or material properties is essential. However, existing methods in MOQD rely on tessellating the feature space into a grid structure, which prevents their application in domains where feature spaces are unknown or must be learned, such as complex biological systems or latent exploration tasks. In this work, we introduce Multi-Objective Unstructured Repertoire for Quality-Diversity (MOUR-QD), a MOQD algorithm designed for unstructured and unbounded feature spaces. We evaluate MOUR-QD on five robotic tasks. Importantly, we show that our method excels in tasks where features must be learned, paving the way for applying MOQD to unsupervised domains. We also demonstrate that MOUR-QD is advantageous in domains with unbounded feature spaces, outperforming existing grid-based methods. Finally, we demonstrate that MOUR-QD is competitive with established MOQD methods on existing MOQD tasks and achieves double the MOQD-score in some environments. MOUR-QD opens up new opportunities for MOQD in domains like protein design and image generation.
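不依赖网格划分的"无结构存档"可以用最近邻距离阈值来维护:新解只有在特征空间中与现有解足够远时才被加入。以下为最小示意(eps 为假设参数,且省略了多目标帕累托支配的处理):

```python
import math

# 示意:无结构特征空间中的存档插入,用距离阈值 eps 代替网格单元
def try_add(archive, candidate, eps=1.0):
    for existing in archive:
        if math.dist(existing, candidate) < eps:
            return False        # 与已有解过近,视为同一"生态位",不加入
    archive.append(candidate)
    return True

archive = []
added = [try_add(archive, p) for p in [(0.0, 0.0), (0.2, 0.1), (3.0, 0.0)]]
```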
zh

[AI-148] RLDBF: Enhancing LLM s Via Reinforcement Learning With DataBase FeedBack

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在利用结构化科学数据(如数据库中的化学分子属性)方面的能力不足问题。尽管当前LLMs通过大规模无结构文本语料库训练展现出显著的语言能力,但它们未能充分挖掘包含数世纪积累的科学专业知识的结构化数据的潜力。目前的方法通常仅将这些结构化数据视为无结构文本的辅助补充。为应对这一挑战,论文提出了一种系统性的研究方法,以增强LLMs对结构化科学数据的处理能力,并以化学分子科学作为测试平台。论文的关键创新在于提出了“基于数据库反馈的强化学习”(Reinforcement Learning with Database Feedback, RLDBF)方法,以克服大模型对数值信息不敏感的固有局限性。实验评估表明,该方法使模型在未见过的数据以及其它化学任务上表现出卓越的泛化能力,从而验证了该方法在提升LLMs处理结构化科学数据方面的潜力。

链接: https://arxiv.org/abs/2504.03713
作者: Weichen Dai,Zijie Dai,Zhijie Huang,Yixuan Pan,Xinhe Li,Xi Li,Yi Zhou,Ji Qi,Wu Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:While current large language models (LLMs) demonstrate remarkable linguistic capabilities through training on massive unstructured text corpora, they remain inadequate in leveraging structured scientific data (e.g., chemical molecular properties in databases) that encapsulate centuries of accumulated scientific expertise. These structured datasets hold strategic significance for advancing AI for Science yet current approaches merely treat them as auxiliary supplements to unstructured text. This study pioneers a systematic investigation into enhancing LLMs with structured scientific data, using chemical molecular science as a testbed. We investigate the impact of incorporating molecular property data on LLM across distinct training phases, including continual pre-training, supervised fine-tuning, and reinforcement learning. Notably, to address the inherent limitation of numerical insensitivity in large models, we propose an innovative methodology termed “Reinforcement Learning with Database Feedback” (RLDBF). Experimental evaluations demonstrate the efficacy of the proposed approach, with the model exhibiting remarkable generalization capabilities on previously unseen data and other chemical tasks. The results substantiate the potential of our method in advancing the field of structured scientific data processing within LLMs.
zh

[AI-149] SAFE: Self-Adjustment Federated Learning Framework for Remote Sensing Collaborative Perception

【速读】:该论文旨在解决分布式空间遥感系统中因数据分布差异导致的模型训练数据泄漏、通信开销增加以及准确性下降等问题。为应对这些挑战,论文提出了\textit{Self-Adjustment FEderated Learning (SAFE)}框架,其关键在于引入了四项创新策略:(1) \textit{Class Rectification Optimization},通过自主调整解决未知局部和全局分布下的类别不平衡问题;(2) \textit{Feature Alignment Update},通过本地控制的指数移动平均(EMA)更新缓解非独立同分布(Non-IID)数据的影响;(3) \textit{Dual-Factor Modulation Rheostat},动态平衡训练过程中的优化效果;(4) \textit{Adaptive Context Enhancement},通过动态优化前景区域提升模型性能,同时确保分布式卫星场景下计算效率与精度的同步改进。实验结果验证了SAFE框架在真实图像分类和目标分割任务中的有效性和可靠性。

链接: https://arxiv.org/abs/2504.03700
作者: Xiaohe Li,Haohua Wu,Jiahao Li,Zide Fan,Kaixin Zhang,Xinming Li,Yunping Ge,Xinyu Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:The rapid increase in remote sensing satellites has led to the emergence of distributed space-based observation systems. However, existing distributed remote sensing models often rely on centralized training, resulting in data leakage, communication overhead, and reduced accuracy due to data distribution discrepancies across platforms. To address these challenges, we propose the \textit{Self-Adjustment FEderated Learning} (SAFE) framework, which innovatively leverages federated learning to enhance collaborative sensing in remote sensing scenarios. SAFE introduces four key strategies: (1) \textit{Class Rectification Optimization}, which autonomously addresses class imbalance under unknown local and global distributions. (2) \textit{Feature Alignment Update}, which mitigates Non-IID data issues via locally controlled EMA updates. (3) \textit{Dual-Factor Modulation Rheostat}, which dynamically balances optimization effects during training. (4) \textit{Adaptive Context Enhancement}, which is designed to improve model performance by dynamically refining foreground regions, ensuring computational efficiency with accuracy improvement across distributed satellites. Experiments on real-world image classification and object segmentation datasets validate the effectiveness and reliability of the SAFE framework in complex remote sensing scenarios.
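摘要中"locally controlled EMA updates"指的是用指数移动平均对齐本地与全局特征。以下为最小示意(动量取 0.9 为假设值):

```python
# 示意:特征对齐的 EMA 更新 g ← m·g + (1-m)·l,逐轮向本地特征缓慢靠拢
def ema_update(global_feat, local_feat, momentum=0.9):
    return [momentum * g + (1 - momentum) * l
            for g, l in zip(global_feat, local_feat)]

g = [1.0, 0.0]
for _ in range(3):              # 连续三轮对齐到本地特征 [0.0, 1.0]
    g = ema_update(g, [0.0, 1.0])
# 三轮后 g ≈ [0.9^3, 1 - 0.9^3] = [0.729, 0.271]
```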
zh

[AI-150] Reinforcing Clinical Decision Support through Multi-Agent Systems and Ethical AI Governance

【速读】:本文旨在解决在数据驱动的医学时代,如何将可解释且经过伦理管理的人工智能纳入临床决策支持系统,以实现值得信赖且有效的患者护理。论文的关键在于提出了一种新的多智能体系统架构,该架构利用模块化智能体分析实验室结果、生命体征及临床背景,并整合这些信息以推动预测与结果验证。解决方案的关键在于通过基于eICU数据库的具体实现,部署针对特定实验室分析的智能体、仅关注生命体征的解释器以及情境推理器,同时运行预测模块和验证智能体,所有过程均透明地遵循自治(Autonomy)、公平性(Fairness)与责任性(Accountability)等伦理人工智能治理原则,从而不仅提升了解释性和准确性,还增强了重症监护环境中AI辅助决策的信任度。

链接: https://arxiv.org/abs/2504.03699
作者: Ying-Jung Chen,Chi-Sheng Chen,Ahmad Albarqawi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:In the age of data-driven medicine, it is paramount to include explainable and ethically managed artificial intelligence in explaining clinical decision support systems to achieve trustworthy and effective patient care. The focus of this paper is on a new architecture of a multi-agent system for clinical decision support that uses modular agents to analyze laboratory results, vital signs, and the clinical context and then integrates these results to drive predictions and validate outcomes. We describe our implementation with the eICU database to run lab-analysis-specific agents, vitals-only interpreters, and contextual reasoners and then run the prediction module and a validation agent. Everything is a transparent implementation of business logic, influenced by the principles of ethical AI governance such as Autonomy, Fairness, and Accountability. It provides visible results that this agent-based framework not only improves on interpretability and accuracy but also on reinforcing trust in AI-assisted decisions in an intensive care setting.
zh

[AI-151] Learning to Interfere in Non-Orthogonal Multiple-Access Joint Source-Channel Coding

【速读】:该论文致力于解决多用户通过多址接入信道(Multiple Access Channel, MAC)高效传输源信号(如图像)的问题,传统正交资源分配方法限制了系统容量。论文的关键在于提出了一种基于机器学习(Machine Learning, ML)的无线图像传输方法,通过引入多视图自动编码器(multi-view autoencoder)融合压缩与信道编码,实现了非正交多址接入(Non-Orthogonal Multiple Access, NOMA)方案,使多个发射机能够同时利用全部可用信道资源。此外,论文设计了一种渐进式微调算法,在维持初始性能的同时显著提升系统扩展性至16个以上用户,同时保持参数量仅小幅增加(仅为单用户模型的0.6%),从而大幅提升恢复图像的质量,并在多种数据集、评估指标及信道条件下优于现有NOMA方法。

链接: https://arxiv.org/abs/2504.03690
作者: Selim F. Yilmaz,Can Karamanli,Deniz Gunduz
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注: 18 pages, 19 figures

点击查看摘要

Abstract:We consider multiple transmitters aiming to communicate their source signals (e.g., images) over a multiple access channel (MAC). Conventional communication systems minimize interference by orthogonally allocating resources (time and/or bandwidth) among users, which limits their capacity. We introduce a machine learning (ML)-aided wireless image transmission method that merges compression and channel coding using a multi-view autoencoder, which allows the transmitters to use all the available channel resources simultaneously, resulting in a non-orthogonal multiple access (NOMA) scheme. The receiver must recover all the images from the received superposed signal, while also associating each image with its transmitter. Traditional ML models deal with individual samples, whereas our model allows signals from different users to interfere in order to leverage gains from NOMA under limited bandwidth and power constraints. We introduce a progressive fine-tuning algorithm that doubles the number of users at each iteration, maintaining initial performance with orthogonalized user-specific projections, which is then improved through fine-tuning steps. Remarkably, our method scales up to 16 users and beyond, with only a 0.6% increase in the number of trainable parameters compared to a single-user model, significantly enhancing recovered image quality and outperforming existing NOMA-based methods over a wide range of datasets, metrics, and channel conditions. Our approach paves the way for more efficient and robust multi-user communication systems, leveraging innovative ML components and strategies.
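NOMA 的出发点是:接收端观测到的是各用户信号的叠加而非正交分片。下面的最小示意只展示"叠加 + 噪声"这一信道模型本身(省略信道增益与功率分配;数值为假设示例):

```python
# 示意:MAC 上的信号叠加 y[t] = Σ_k x_k[t] + n[t],接收端需从 y 中恢复所有图像
def superpose(signals, noise):
    return [sum(vals) + nz for vals, nz in zip(zip(*signals), noise)]

x1 = [1.0, -1.0, 0.5]           # 用户 1 的编码符号
x2 = [0.5, 0.5, -0.5]           # 用户 2 的编码符号(同时占用全部信道资源)
y = superpose([x1, x2], noise=[0.0, 0.0, 0.0])
```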
zh
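In the NOMA scheme above, all users transmit over the same channel resource at once, so the receiver observes a gain-weighted superposition of every user's signal and must undo it. A minimal sketch of that superposition (the latent vectors and channel gains below are made up for illustration; in the paper they come from learned multi-view autoencoders and the physical channel):

```python
def superpose(signals, gains):
    # NOMA: all users share the same channel resource, so the receiver
    # observes the gain-weighted sum of every transmitted signal.
    n = len(signals[0])
    return [sum(g * s[i] for s, g in zip(signals, gains)) for i in range(n)]

x1 = [1.0, -1.0, 1.0, 1.0]   # user 1's channel-coded latent (toy values)
x2 = [0.5, 0.5, -0.5, 0.5]   # user 2's channel-coded latent (toy values)
y = superpose([x1, x2], gains=[1.0, 0.5])
print(y)  # [1.25, -0.75, 0.75, 1.25]
```

The receiver's task in the paper is the inverse: recover both images from `y` and associate each with its transmitter.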

[AI-152] CLCR: Contrastive Learning-based Constraint Reordering for Efficient MILP Solving

【速读】: This paper addresses the inefficiency of Mixed-Integer Linear Programming (MILP) solvers on large-scale problems, where poorly ordered constraints trigger extra LP iterations and suboptimal search trajectories. The proposed CLCR (Contrastive Learning-based Constraint Reordering) framework first clusters constraints by their structural patterns, then uses contrastive learning with a pointer network to optimize the constraint sequence, preserving problem equivalence while markedly improving solver efficiency. Experiments show CLCR cuts solving time by 30% and LP iterations by 25% on average without sacrificing solution accuracy, demonstrating the potential of data-driven constraint reordering and offering a new paradigm for bridging mathematical programming with machine learning.

链接: https://arxiv.org/abs/2504.03688
作者: Shuli Zeng,Mengjie Zhou,Sijia Zhang,Yixiang Hu,Feng Wu,Xiang-Yang Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Constraint ordering plays a critical role in the efficiency of Mixed-Integer Linear Programming (MILP) solvers, particularly for large-scale problems where poorly ordered constraints trigger increased LP iterations and suboptimal search trajectories. This paper introduces CLCR (Contrastive Learning-based Constraint Reordering), a novel framework that systematically optimizes constraint ordering to accelerate MILP solving. CLCR first clusters constraints based on their structural patterns and then employs contrastive learning with a pointer network to optimize their sequence, preserving problem equivalence while improving solver efficiency. Experiments on benchmarks show CLCR reduces solving time by 30% and LP iterations by 25% on average, without sacrificing solution accuracy. This work demonstrates the potential of data-driven constraint ordering to enhance optimization models, offering a new paradigm for bridging mathematical programming with machine learning.
zh
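CLCR's first stage, clustering constraints by structural pattern, can be sketched with a crude stand-in signature: the nonzero positions and signs of each constraint row. The feature choice here is an assumption for illustration; the paper's actual features and the contrastive/pointer-network reordering stages are not reproduced:

```python
def structural_signature(coeffs):
    # Crude stand-in for CLCR's structural features: the sparsity/sign
    # pattern of a constraint row (which variables appear, and with what sign).
    return tuple((j, 1 if c > 0 else -1) for j, c in enumerate(coeffs) if c != 0)

def cluster_constraints(rows):
    # Group constraint indices that share the same structural signature.
    groups = {}
    for idx, row in enumerate(rows):
        groups.setdefault(structural_signature(row), []).append(idx)
    return groups

rows = [[1.0, 0.0, 2.0], [3.0, 0.0, 1.0], [0.0, -1.0, 0.0]]
groups = cluster_constraints(rows)  # rows 0 and 1 share a signature
```

A learned reordering would then decide the sequence of these groups (and of constraints within each group) before handing the model to the solver.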

[AI-153] Revisiting Outage for Edge Inference Systems

【速读】: This paper studies how to design edge inference systems that are both reliable and able to meet strict end-to-end (E2E) latency constraints when large AI models are deployed at the 6G network edge for remote inference. Existing work focuses on communication reliability as characterized by channel outage probability, which may fail to guarantee E2E performance, particularly E2E inference accuracy and latency. To overcome this limitation, the paper proposes a theoretical framework that introduces and mathematically characterizes the inference outage (InfOut) probability, i.e., the likelihood that E2E inference accuracy falls below a target threshold. Under an E2E latency constraint, the framework reveals a fundamental tradeoff between communication overhead (uploading more sensor observations) and inference reliability as quantified by the InfOut probability. To optimize this tradeoff tractably, accurate surrogate functions for the InfOut probability are derived by applying a Gaussian approximation to the distribution of the received discriminant gain. Experiments show the proposed design outperforms conventional communication-centric approaches in E2E inference reliability.

链接: https://arxiv.org/abs/2504.03686
作者: Zhanwei Wang,Qunsong Zeng,Haotian Zheng,Kaibin Huang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:One of the key missions of sixth-generation (6G) mobile networks is to deploy large-scale artificial intelligence (AI) models at the network edge to provide remote-inference services for edge devices. The resultant platform, known as edge inference, will support a wide range of Internet-of-Things applications, such as autonomous driving, industrial automation, and augmented reality. Given the mission-critical and time-sensitive nature of these tasks, it is essential to design edge inference systems that are both reliable and capable of meeting stringent end-to-end (E2E) latency constraints. Existing studies, which primarily focus on communication reliability as characterized by channel outage probability, may fail to guarantee E2E performance, specifically in terms of E2E inference accuracy and latency. To address this limitation, we propose a theoretical framework that introduces and mathematically characterizes the inference outage (InfOut) probability, which quantifies the likelihood that the E2E inference accuracy falls below a target threshold. Under an E2E latency constraint, this framework establishes a fundamental tradeoff between communication overhead (i.e., uploading more sensor observations) and inference reliability as quantified by the InfOut probability. To find a tractable way to optimize this tradeoff, we derive accurate surrogate functions for InfOut probability by applying a Gaussian approximation to the distribution of the received discriminant gain. Experimental results demonstrate the superiority of the proposed design over conventional communication-centric approaches in terms of E2E inference reliability.
zh
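Under the Gaussian approximation described above, the InfOut probability reduces to a normal CDF evaluated at a threshold. A sketch with illustrative numbers (the mean and standard deviation of the received discriminant gain are placeholders here, not the paper's derived expressions):

```python
import math

def infout_probability(mean_gain, std_gain, threshold):
    # Gaussian surrogate: P(discriminant gain < threshold)
    # = Phi((threshold - mean) / std), where Phi is the standard normal CDF.
    z = (threshold - mean_gain) / std_gain
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Uploading more observations raises the mean received gain, which drives
# the outage probability down -- the paper's overhead/reliability tradeoff.
few_obs = infout_probability(mean_gain=1.0, std_gain=0.5, threshold=1.2)
many_obs = infout_probability(mean_gain=2.0, std_gain=0.5, threshold=1.2)
```

The monotone dependence on the mean gain is what makes the surrogate useful for optimizing how many observations to upload within the latency budget.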

[AI-154] Intelligent Resource Allocation Optimization for Cloud Computing via Machine Learning

【速读】: This paper addresses inefficient resource allocation, high cost, and the difficulty of balancing service quality in cloud environments. It proposes an intelligent resource allocation algorithm that combines deep learning (LSTM) for demand prediction with reinforcement learning (DQN) for dynamic scheduling. Through accurate demand forecasting and real-time adjustment, the proposed system improves resource utilization by 32.5%, reduces average response time by 43.3%, and lowers operational costs by 26.6%, raising system efficiency while maintaining high service quality.

链接: https://arxiv.org/abs/2504.03682
作者: Yuqing Wang,Xiao Yang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the rapid expansion of cloud computing applications, optimizing resource allocation has become crucial for improving system performance and cost efficiency. This paper proposes an intelligent resource allocation algorithm that leverages deep learning (LSTM) for demand prediction and reinforcement learning (DQN) for dynamic scheduling. By accurately forecasting computing resource demands and enabling real-time adjustments, the proposed system enhances resource utilization by 32.5%, reduces average response time by 43.3%, and lowers operational costs by 26.6%. Experimental results in a production cloud environment confirm that the method significantly improves efficiency while maintaining high service quality. This study provides a scalable and effective solution for intelligent cloud resource management, offering valuable insights for future cloud optimization strategies.
zh
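The predict-then-schedule loop can be sketched with a toy stand-in for the LSTM predictor (an exponential moving average) feeding a provisioning rule. Both the EMA and the headroom factor below are simplifications for illustration, not the paper's LSTM/DQN pipeline:

```python
def ema_forecast(history, alpha=0.5):
    # Stand-in for the LSTM demand predictor: an exponential moving
    # average over past demand observations.
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def allocate(history, headroom=1.2):
    # Dynamic scheduling step: provision the forecast demand plus a
    # safety headroom (a crude proxy for the DQN's learned policy).
    return ema_forecast(history) * headroom
```

In the paper, the reinforcement learner replaces the fixed headroom rule, trading off over-provisioning cost against SLO violations.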

[AI-155] HiAER-Spike: Hardware-Software Co-Design for Large-Scale Reconfigurable Event-Driven Neuromorphic Computing

【速读】: This paper presents the design and construction of HiAER-Spike, a modular, reconfigurable, event-driven neuromorphic computing platform for efficiently executing large spiking neural networks (SNNs) with up to 160 million neurons and 40 billion synapses (roughly twice the neurons of a mouse brain) at faster-than-real-time speed. The key is a hardware-software co-designed stack that achieves massively parallel processing through hierarchical address-event routing (HiAER) of spikes, combined with memory-efficient network storage and execution. The architecture handles both sparse connectivity and sparse activity for robust, low-latency event-driven inference in both edge and cloud settings. A hardware-agnostic Python programming interface simplifies SNN configuration and execution with virtually no topology constraints, lowering the barrier to use, and the system is made available to the wider community through a web portal. The core of the solution is the co-optimized hardware-software stack and efficient routing, which together tame the compute and memory demands of running large spiking networks.

链接: https://arxiv.org/abs/2504.03671
作者: Gwenevere Frank,Gopabandhu Hota,Keli Wang,Abhinav Uppal,Omowuyi Olajide,Kenneth Yoshimoto,Leif Gibb,Qingbo Wang,Johannes Leugering,Stephen Deiss,Gert Cauwenberghs
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: IEEE International Conference on Rebooting Computing (ICRC) 2024

点击查看摘要

Abstract:In this work, we present HiAER-Spike, a modular, reconfigurable, event-driven neuromorphic computing platform designed to execute large spiking neural networks with up to 160 million neurons and 40 billion synapses - roughly twice the neurons of a mouse brain at faster-than real-time. This system, which is currently under construction at the UC San Diego Supercomputing Center, comprises a co-designed hard- and software stack that is optimized for run-time massively parallel processing and hierarchical address-event routing (HiAER) of spikes while promoting memory-efficient network storage and execution. Our architecture efficiently handles both sparse connectivity and sparse activity for robust and low-latency event-driven inference for both edge and cloud computing. A Python programming interface to HiAER-Spike, agnostic to hardware-level detail, shields the user from complexity in the configuration and execution of general spiking neural networks with virtually no constraints in topology. The system is made easily available over a web portal for use by the wider community. In the following we provide an overview of the hard- and software stack, explain the underlying design principles, demonstrate some of the system’s capabilities and solicit feedback from the broader neuromorphic community.
zh

[AI-156] Self-Learning-Based Optimization for Free-form Pipe Routing in Aeroengine with Dynamic Design Environment

【速读】: This paper targets free-form pipe routing optimization in aeroengine design, a highly complex and computationally expensive problem. Prior work has made progress on constant-curvature pipe routing, but the growing demand for free-form pipes, together with dynamic design environments and fuzzy layout rules, limits the optimization performance and efficiency of existing methods. The proposed self-learning-based method (SLPR) builds on the proximal policy optimization (PPO) algorithm and integrates a unified rule modeling framework for efficient obstacle detection and fuzzy rule modeling in continuous space; a potential energy table additionally enables rapid queries of layout tendencies and interference. The agent iteratively refines pipe routes and accumulates design knowledge through interaction with the environment, and when the design environment shifts it adapts quickly by fine-tuning network parameters. Experiments show that SLPR produces smooth routes via cubic non-uniform B-spline (NURBS) curves, avoiding the redundant segments of constant-curvature routing, and outperforms three representative baselines in pipe length reduction, adherence to layout rules, path complexity, and computational efficiency in both static and dynamic environments. In dynamic environments it further eliminates labor-intensive searches from scratch and even yields better solutions than a retrained model, underscoring its practical value for real-world pipe routing.

链接: https://arxiv.org/abs/2504.03669
作者: Caicheng Wang,Zili Wang,Shuyou Zhang,Yongzhe Xiang,Zheyi Li,Jianrong Tan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Pipe routing is a highly complex, time-consuming, and no-deterministic polynomial-time hard (NP-hard) problem in aeroengine design. Despite extensive research efforts in optimizing constant-curvature pipe routing, the growing demand for free-form pipes poses new challenges. Dynamic design environments and fuzzy layout rules further impact the optimization performance and efficiency. To tackle these challenges, this study proposes a self-learning-based method (SLPR) for optimizing free-form pipe routing in aeroengines. The SLPR is based on the proximal policy optimization (PPO) algorithm and integrates a unified rule modeling framework for efficient obstacle detection and fuzzy rule modeling in continuous space. Additionally, a potential energy table is constructed to enable rapid queries of layout tendencies and interference. The agent within SLPR iteratively refines pipe routing and accumulates the design knowledge through interaction with the environment. Once the design environment shifts, the agent can swiftly adapt by fine-tuning network parameters. Comparative tests reveal that SLPR ensures smooth pipe routing through cubic non-uniform B-spline (NURBS) curves, avoiding redundant pipe segments found in constant-curvature pipe routing. Results in both static and dynamic design environments demonstrate that SLPR outperforms three representative baselines in terms of the pipe length reduction, the adherence to layout rules, the path complexity, and the computational efficiency. Furthermore, tests in dynamic environments indicate that SLPR eliminates labor-intensive searches from scratch and even yields superior solutions compared to the retrained model. These results highlight the practical value of SLPR for real-world pipe routing, meeting lightweight, precision, and sustainability requirements of the modern aeroengine design.
zh
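The smooth free-form routes rest on cubic B-spline centerlines. Below is a uniform cubic B-spline segment evaluator as a simplified stand-in (the paper uses cubic non-uniform B-splines; the uniform basis is assumed here for brevity):

```python
def cubic_bspline_point(p0, p1, p2, p3, t):
    # Uniform cubic B-spline basis functions for one segment, t in [0, 1].
    # The curve stays inside the convex hull of the four control points,
    # which is what yields smooth, kink-free pipe centerlines.
    b0 = (1 - t) ** 3 / 6.0
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6.0
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6.0
    b3 = t**3 / 6.0
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))
```

With evenly spaced collinear control points the segment interpolates the straight line between the two middle points, a quick sanity check on the basis.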

[AI-157] LLM HPC:Benchmarking DeepSeek s Performance in High-Performance Computing Tasks

【速读】: This paper examines the underexplored potential of large language models (LLMs) in High-Performance Computing (HPC). Taking DeepSeek as a case study, it evaluates the model's ability to generate a set of HPC benchmark codes (a conjugate gradient solver, the parallel heat equation, parallel matrix multiplication, DGEMM, and the STREAM triad) and compares it with GPT-4. The key is a comprehensive assessment of code correctness, performance, and scaling across configurations and matrix sizes for traditional HPC languages (Cpp, Fortran, Julia, and Python). The results show that while DeepSeek generates functional HPC code, it lags behind GPT-4 in the scalability and execution efficiency of the generated code.

链接: https://arxiv.org/abs/2504.03665
作者: Noujoud Nader,Patrick Diehl,Steve Brandt,Hartmut Kaiser
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, 3 tables, conference

点击查看摘要

Abstract:Large Language Models (LLMs), such as GPT-4 and DeepSeek, have been applied to a wide range of domains in software engineering. However, their potential in the context of High-Performance Computing (HPC) remains largely unexplored. This paper evaluates how well DeepSeek, a recent LLM, performs in generating a set of HPC benchmark codes: a conjugate gradient solver, the parallel heat equation, parallel matrix multiplication, DGEMM, and the STREAM triad operation. We analyze DeepSeek’s code generation capabilities for traditional HPC languages like Cpp, Fortran, Julia and Python. The evaluation includes testing for code correctness, performance, and scaling across different configurations and matrix sizes. We also provide a detailed comparison between DeepSeek and another widely used tool: GPT-4. Our results demonstrate that while DeepSeek generates functional code for HPC tasks, it lags behind GPT-4, in terms of scalability and execution efficiency of the generated code.
zh
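One of the requested kernels, the STREAM triad, is a one-line update per element. A Python version for reference (the paper additionally asks for Cpp, Fortran, and Julia implementations, and benchmarks their performance rather than just correctness):

```python
def stream_triad(a, b, c, scalar):
    # STREAM "triad" kernel: a[i] = b[i] + scalar * c[i].
    # A standard memory-bandwidth benchmark; updates `a` in place.
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]

a = [0.0, 0.0]
stream_triad(a, [1.0, 2.0], [3.0, 4.0], 2.0)
# a is now [7.0, 10.0]
```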

[AI-158] PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices

【速读】: This paper addresses efficient inference of large language models (LLMs) on consumer devices, where the models' heavy memory and compute demands clash with limited GPU memory. Offloading mitigates the memory constraint but typically suffers from low GPU utilization and thus poor inference efficiency. The key innovation is PIPO (Pipelined Offloading), a framework that designs a fine-grained offloading pipeline, complemented with optimized data transfer and computation, to achieve high concurrency and efficient scheduling. Experiments show that PIPO raises GPU utilization from below 40% to over 90% and delivers up to 3.1x higher throughput on a laptop with an RTX3060 GPU with 6GB of memory.

链接: https://arxiv.org/abs/2504.03664
作者: Yangyijian Liu,Jun Li,Wu-Jun Li
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The high memory and computation demand of large language models (LLMs) makes them challenging to be deployed on consumer devices due to limited GPU memory. Offloading can mitigate the memory constraint but often suffers from low GPU utilization, leading to low inference efficiency. In this work, we propose a novel framework, called pipelined offloading (PIPO), for efficient inference on consumer devices. PIPO designs a fine-grained offloading pipeline, complemented with optimized data transfer and computation, to achieve high concurrency and efficient scheduling for inference. Experimental results show that compared with state-of-the-art baseline, PIPO increases GPU utilization from below 40% to over 90% and achieves up to 3.1 \times higher throughput, running on a laptop equipped with a RTX3060 GPU of 6GB memory.
zh
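The benefit of pipelined offloading comes from overlapping weight transfers with computation. Below is a toy makespan model of a two-stage pipeline (transfers serialized on the host-GPU link, compute starting once a layer's weights have arrived); it illustrates the idea only and is not PIPO's actual scheduler:

```python
def pipelined_time(load_times, compute_times):
    # While layer i computes on the GPU, layer i+1's weights are
    # transferred. Transfers are serialized; compute for a layer starts
    # when both its transfer and the previous compute have finished.
    finish_load = 0.0
    finish_compute = 0.0
    for ld, cp in zip(load_times, compute_times):
        finish_load += ld
        finish_compute = max(finish_compute, finish_load) + cp
    return finish_compute

def sequential_time(load_times, compute_times):
    # No overlap: every transfer and every compute step runs back to back.
    return sum(load_times) + sum(compute_times)
```

With three layers taking 2 time units to load and 3 to compute, overlap shrinks the makespan from 15 to 11 in this model, which is exactly the kind of GPU-idle-time reduction the paper reports.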

[AI-159] Echo: Efficient Co-Scheduling of Hybrid Online-Offline Tasks for Large Language Model Serving

【速读】: This paper addresses the difficulty of scheduling online and offline tasks that share large language model resources. Online tasks are bursty and latency-sensitive, so over-provisioning resources is common practice; this fails to exploit the flexibility of offline tasks and suffers from KV cache recomputation and irregular workloads. The key is Echo, a system whose tightly coupled scheduler, KV cache manager, and estimation toolkit maximize offline-task throughput while meeting online-task SLOs: the scheduler uses batch information from the previous iteration to shrink the search space for the optimal schedule; the KV cache manager sets cache priorities based on task type and prefix-sharing opportunities to reduce recomputation; and the estimation toolkit predicts execution time, future memory consumption, and offline-task throughput to guide scheduling and system deployment. Evaluation on real-world workloads shows Echo increases offline-task throughput by up to 3.3x while satisfying online-task SLOs.

链接: https://arxiv.org/abs/2504.03651
作者: Zhibin Wang,Shipeng Li,Xue Li,Yuhang Zhou,Zhonghui Zhang,Zibo Wang,Rong Gu,Chen Tian,Kun Yang,Sheng Zhong
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models have been widely deployed in various applications, encompassing both interactive online tasks and batched offline tasks. Given the burstiness and latency sensitivity of online tasks, over-provisioning resources is common practice. This allows for the integration of latency-insensitive offline tasks during periods of low online load, enhancing resource utilization. However, strategically serving online and offline tasks through a preemption mechanism fails to fully leverage the flexibility of offline tasks and suffers from KV cache recomputation and irregular workloads. In this paper, we introduce Echo, a collaborative online-offline task serving system, including a scheduler, a KV cache manager, and estimation toolkits. The scheduler and KV cache manager work tightly to maximize the throughput of offline tasks, while the estimator further predicts execution time to ensure online task SLOs. The scheduler leverages the batch information of last iteration to reduce the search space for finding the optimal schedule. The KV cache manager sets the priority of the KV cache based on the type of tasks and the opportunity of prefix sharing to reduce the recomputation. Finally, the estimation toolkits predict the execution time, future memory consumption, and the throughput of offline tasks to guide the scheduler, KV cache manager, and the system deployer. Evaluation based on real-world workloads demonstrates that Echo can increase offline task throughput by up to 3.3\times, while satisfying online task SLOs.
zh
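Echo's KV cache manager prioritizes entries by task type and prefix-sharing opportunity. A minimal sketch of that ranking rule, with made-up weights (the paper does not publish this exact scoring function):

```python
def kv_priority(task_type, prefix_hits):
    # Online tasks' KV cache outranks offline tasks', and entries with
    # more prefix-sharing opportunities are kept longer to avoid
    # recomputation. The base weight of 100 is illustrative only.
    base = 100 if task_type == "online" else 0
    return base + prefix_hits

entries = [("online", 2), ("offline", 9), ("online", 0)]
ranked = sorted(entries, key=lambda e: kv_priority(*e), reverse=True)
# Eviction would start from the tail of `ranked`.
```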

[AI-160] BoxRL-NNV: Boxed Refinement of Latin Hypercube Samples for Neural Network Verification

【速读】: This paper aims to detect safety violations in neural networks by computing bounds on the output variables given bounds on the input variables. The key is to combine Latin Hypercube Sampling for global extrema estimation with local refinement around the initial guess via the L-BFGS-B algorithm. The approach is validated on a subset of the ACAS Xu benchmark, and a complete performance evaluation, including comparisons against state-of-the-art tools, is to be presented at VNN-COMP'25.

链接: https://arxiv.org/abs/2504.03650
作者: Sarthak Das
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:BoxRL-NNV is a Python tool for the detection of safety violations in neural networks by computing the bounds of the output variables, given the bounds of the input variables of the network. This is done using global extrema estimation via Latin Hypercube Sampling, and further refinement using L-BFGS-B for local optimization around the initial guess. This paper presents an overview of BoxRL-NNV, as well as our results for a subset of the ACAS Xu benchmark. A complete evaluation of the tool’s performance, including benchmark comparisons with state-of-the-art tools, shall be presented at the Sixth International Verification of Neural Networks Competition (VNN-COMP’25).
zh
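The global stage of BoxRL-NNV, Latin Hypercube Sampling, can be sketched in a few lines of pure Python (the tool's actual implementation and the subsequent L-BFGS-B refinement are not shown):

```python
import random

def latin_hypercube(n_samples, bounds, rng):
    # Latin Hypercube Sampling: each dimension's range is split into
    # n_samples equal strata, and every stratum is hit exactly once,
    # giving better coverage than plain uniform sampling.
    cols = []
    for lo, hi in bounds:
        strata = list(range(n_samples))
        rng.shuffle(strata)               # random pairing across dimensions
        width = (hi - lo) / n_samples
        cols.append([lo + (s + rng.random()) * width for s in strata])
    return [tuple(col[i] for col in cols) for i in range(n_samples)]
```

Each returned point would then be fed through the network; the extreme outputs seed the local optimizer that tightens the estimated output bounds.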

[AI-161] Diagnostic Method for Hydropower Plant Condition-based Maintenance combining Autoencoder with Clustering Algorithms

【速读】: This paper addresses the difficulty of generating valuable information from monitored hydropower plant data, since the volume of collected time series varies greatly with the strategic importance of each plant. The key is a condition detection and diagnosis method combining clustering algorithms with autoencoder neural networks: a dimension reduction algorithm first produces a 2D or 3D projection that reveals unsuspected relationships between datapoints; a collection of clustering algorithms then groups the datapoints into clusters; and for each identified cluster a dedicated autoencoder neural network is trained. Measuring the reconstruction error between each autoencoder model and the measured values yields a proximity index for every state discovered during the clustering stage.

链接: https://arxiv.org/abs/2504.03649
作者: Samy Jad(LGP),Xavier Desforges(LGP),Pierre-Yves Villard,Christian Caussidéry,Kamal Medjaher(LGP)
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:The French company EDF uses supervisory control and data acquisition systems in conjunction with a data management platform to monitor hydropower plant, allowing engineers and technicians to analyse the time-series collected. Depending on the strategic importance of the monitored hydropower plant, the number of time-series collected can vary greatly making it difficult to generate valuable information from the extracted data. In an attempt to provide an answer to this particular problem, a condition detection and diagnosis method combining clustering algorithms and autoencoder neural networks for pattern recognition has been developed and is presented in this paper. First, a dimension reduction algorithm is used to create a 2-or 3-dimensional projection that allows the users to identify unsuspected relationships between datapoints. Then, a collection of clustering algorithms regroups the datapoints into clusters. For each identified cluster, an autoencoder neural network is trained on the corresponding dataset. The aim is to measure the reconstruction error between each autoencoder model and the measured values, thus creating a proximity index for each state discovered during the clustering stage.
zh

[AI-162] AIBrix: Towards Scalable Cost-Effective Large Language Model Inference Infrastructure

【速读】: This paper targets efficient deployment and optimization of large language model (LLM) serving in cloud environments, reducing inference cost while improving performance. The key lies in the design of the AIBrix framework and its innovations: a co-design philosophy ensuring every infrastructure layer integrates seamlessly with inference engines such as vLLM; high-density LoRA management for dynamic adapter scheduling, LLM-specific autoscalers, and prefix-aware, load-aware routing to cut inference cost and boost performance; a distributed key-value (KV) cache that increases cross-node token reuse, raising throughput by 50% and cutting inference latency by 70%; a unified AI runtime that streamlines model management while remaining vendor-agnostic across engines; hybrid orchestration (Kubernetes combined with Ray) to balance efficiency and flexibility; an SLO-driven GPU optimizer that maximizes cost efficiency while preserving service guarantees; and AI accelerator diagnostic tools that improve reliability through automated failure detection and mock-up testing.

链接: https://arxiv.org/abs/2504.03648
作者: TheAIBrix Team:Jiaxin Shan,Varun Gupta,Le Xu,Haiyang Shi,Jingyuan Zhang,Ning Wang,Linhui Xu,Rong Kang,Tongping Liu,Yifei Zhang,Yiqing Zhu,Shuowei Jin,Gangmuk Lim,Binbin Chen,Zuzhi Chen,Xiao Liu,Xin Chen,Kante Yin,Chak-Pong Chung,Chenyu Jiang,Yicheng Lu,Jianjun Chen,Caixue Lin,Wu Xiang,Rui Shi,Liguang Xie
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce AIBrix, a cloud-native, open-source framework designed to optimize and simplify large-scale LLM deployment in cloud environments. Unlike traditional cloud-native stacks, AIBrix follows a co-design philosophy, ensuring every layer of the infrastructure is purpose-built for seamless integration with inference engines like vLLM. AIBrix introduces several key innovations to reduce inference costs and enhance performance including high-density LoRA management for dynamic adapter scheduling, LLM-specific autoscalers, and prefix-aware, load-aware routing. To further improve efficiency, AIBrix incorporates a distributed KV cache, boosting token reuse across nodes, leading to a 50% increase in throughput and a 70% reduction in inference latency. AIBrix also supports unified AI runtime which streamlines model management while maintaining vendor-agnostic engine compatibility. For large-scale multi-node inference, AIBrix employs hybrid orchestration – leveraging Kubernetes for coarse-grained scheduling and Ray for fine-grained execution – to balance efficiency and flexibility. Additionally, an SLO-driven GPU optimizer dynamically adjusts resource allocations, optimizing heterogeneous serving to maximize cost efficiency while maintaining service guarantees. Finally, AIBrix enhances system reliability with AI accelerator diagnostic tools, enabling automated failure detection and mock-up testing to improve fault resilience. AIBrix is available at this https URL.
zh

[AI-163] CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation

【速读】: This paper notes that existing code generation benchmarks mainly assess the functional correctness of code generated by large language models (LLMs) in single-turn interactions, offering little insight into how strictly they follow user instructions, especially in multi-turn interaction scenarios. The key contribution is CodeIF-Bench, a new benchmark containing nine types of verifiable instructions aligned with real-world software development requirements, each independently and objectively checkable via specified test cases, enabling effective evaluation of LLMs' instruction-following capability in multi-turn interactions.

链接: https://arxiv.org/abs/2503.22688
作者: Peiding Wang,Li Zhang,Fang Liu,Lin Shi,Minxiao Li,Bo Shen,An Fu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated exceptional performance in code generation tasks and have become indispensable programming assistants for developers. However, existing code generation benchmarks primarily assess the functional correctness of code generated by LLMs in single-turn interactions, offering limited insight into their capabilities to generate code that strictly follows users’ instructions, especially in multi-turn interaction scenarios. In this paper, we introduce CodeIF-Bench, a benchmark for evaluating LLMs’ instruction-following capabilities in interactive code generation. Specifically, CodeIF-Bench incorporates nine types of verifiable instructions aligned with the real-world software development requirements, which can be independently and objectively validated through specified test cases, facilitating the evaluation of instruction-following capability in multi-turn interactions. We evaluate nine prominent LLMs using CodeIF-Bench, and the experimental results reveal a significant disparity between their basic programming capability and instruction-following capability, particularly as task complexity, context length, and the number of dialogue rounds increase.
zh

[AI-164] Packet Inspection Transformer: A Self-Supervised Journey to Unseen Malware Detection with Few Samples

【速读】: This paper addresses the limitations of traditional malware detection against sophisticated modern attacks: supervised approaches depend on large amounts of annotated data and fail to generalize to novel, unseen threats, while deep packet inspection (DPI), despite enabling deeper traffic analysis, still falls short in generalization and adaptability. The proposed solution combines self-supervised learning (SSL) with few-shot learning (FSL). Its key is to train a transformer via SSL on vast amounts of unlabeled data by masking portions of packets, learning embeddings of packet content (including payloads) that generalize to downstream tasks such as malware detection; few-shot learning then adapts the detector to novel attack types. Experiments report classification accuracies of up to 94.76% on the UNSW-NB15 dataset and 83.25% on the CIC-IoT23 dataset.

链接: https://arxiv.org/abs/2409.18219
作者: Kyle Stein,Arash Mahyari,Guillermo Francia III,Eman El-Sheikh
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As networks continue to expand and become more interconnected, the need for novel malware detection methods becomes more pronounced. Traditional security measures are increasingly inadequate against the sophistication of modern cyber attacks. Deep Packet Inspection (DPI) has been pivotal in enhancing network security, offering an in-depth analysis of network traffic that surpasses conventional monitoring techniques. DPI not only examines the metadata of network packets, but also dives into the actual content being carried within the packet payloads, providing a comprehensive view of the data flowing through networks. While the integration of advanced deep learning techniques with DPI has introduced modern methodologies into malware detection and network traffic classification, state-of-the-art supervised learning approaches are limited by their reliance on large amounts of annotated data and their inability to generalize to novel, unseen malware threats. To address these limitations, this paper leverages the recent advancements in self-supervised learning (SSL) and few-shot learning (FSL). Our proposed self-supervised approach trains a transformer via SSL to learn the embedding of packet content, including payload, from vast amounts of unlabeled data by masking portions of packets, leading to a learned representation that generalizes to various downstream tasks. Once the representation is extracted from the packets, they are used to train a malware detection algorithm. The representation obtained from the transformer is then used to adapt the malware detector to novel types of attacks using few-shot learning approaches. Our experimental results demonstrate that our method achieves classification accuracies of up to 94.76% on the UNSW-NB15 dataset and 83.25% on the CIC-IoT23 dataset.
zh
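The SSL pretraining objective masks portions of packet bytes and trains the transformer to reconstruct them. A sketch of the corruption step only (the mask-token id 256, i.e., one past the 0-255 byte range, and the mask ratio are assumptions for illustration):

```python
import random

def mask_packet(tokens, mask_ratio, rng, mask_token=256):
    # Masked-prediction pretraining on packet bytes: replace a random
    # subset of positions with an out-of-vocabulary mask token; the model
    # is trained to reconstruct the original bytes at those positions.
    n_mask = max(1, int(len(tokens) * mask_ratio))
    idx = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    for i in idx:
        corrupted[i] = mask_token
    return corrupted, sorted(idx)
```

The returned index list is what the reconstruction loss would be computed over; everything downstream (embedding, detection, few-shot adaptation) builds on representations learned this way.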

[AI-165] SurvSurf: a partially monotonic neural network for first-hitting time prediction of intermittently observed discrete and continuous sequential events

【速读】: This paper tackles the problem that, when predicting first-hitting-time probabilities of sequential events from baseline data, existing models neither guarantee the monotonic relationship between the cumulative incidence functions of sequential events nor implicitly account for unobserved intermediate events. The key is SurvSurf, a neural-network-based survival model that is theoretically guaranteed never to violate this monotonicity while allowing nonlinear influence from predictors, implicitly incorporates information about unobserved intermediate events during model fitting, and supports both discrete and continuous time and events. A variant of the Integrated Brier Score (IBS) that accounts for the truths implied by missing intermediate events further enables more robust evaluation of the predictions.

链接: https://arxiv.org/abs/2504.04997
作者: Yichen Kelly Chen,Sören Dittmer,Kinga Bernatowicz,Josep Arús-Pous,Kamen Bliznashki,John Aston,James H.F. Rudd,Carola-Bibiane Schönlieb,James Jones,Michael Roberts
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP)
备注: 41 pages, 18 figures (including supplemental information). Submitted to RSS: Data Science and Artificial Intelligence

点击查看摘要

Abstract:We propose a neural-network based survival model (SurvSurf) specifically designed for direct and simultaneous probabilistic prediction of the first hitting time of sequential events from baseline. Unlike existing models, SurvSurf is theoretically guaranteed to never violate the monotonic relationship between the cumulative incidence functions of sequential events, while allowing nonlinear influence from predictors. It also incorporates implicit truths for unobserved intermediate events in model fitting, and supports both discrete and continuous time and events. We also identified a variant of the Integrated Brier Score (IBS) that showed robust correlation with the mean squared error (MSE) between the true and predicted probabilities by accounting for implied truths about the missing intermediate events. We demonstrated the superiority of SurvSurf compared to modern and traditional predictive survival models in two simulated datasets and two real-world datasets, using MSE, the more robust IBS and by measuring the extent of monotonicity violation.
zh
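The monotonicity guarantee can be obtained by construction: nonnegative (softplus) increments make each cumulative incidence function (CIF) nondecreasing in time, and scaling by the previous event's CIF keeps later events' CIFs below earlier ones. A toy sketch of that idea, not SurvSurf's actual output head:

```python
import math

def monotone_cifs(raw_outputs):
    # raw_outputs[k][t]: unconstrained network outputs for event k, time t.
    # Softplus increments accumulate, so 1 - exp(-cum) is increasing in t;
    # multiplying by the previous event's CIF enforces
    # CIF_{k+1}(t) <= CIF_k(t), since event k+1 cannot precede event k.
    prev = [1.0] * len(raw_outputs[0])
    cifs = []
    for row in raw_outputs:
        cum, cif = 0.0, []
        for t, x in enumerate(row):
            cum += math.log1p(math.exp(x))      # softplus(x) >= 0
            cif.append(prev[t] * (1.0 - math.exp(-cum)))
        cifs.append(cif)
        prev = cif
    return cifs

cifs = monotone_cifs([[0.3, -1.2, 2.0], [0.5, 0.1, -0.7]])
```

Whatever values the network emits, the outputs are valid nested CIF curves, which is the structural property the paper proves for its architecture.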

[AI-166] Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial approaches

【速读】: This paper addresses accurate estimation of nonlinear audio effects without access to paired input-output signals. The key is a method, novel for this application, based on diffusion generative models for blind system identification, enabling estimation of unknown nonlinear effects with black- and gray-box models; it is compared against a previously proposed adversarial approach under different parameterizations of the effect operator and varying amounts of available effected audio. The results show the diffusion-based approach yields more stable results and is less sensitive to data availability, while the adversarial approach is better at estimating pronounced distortion effects.

链接: https://arxiv.org/abs/2504.04751
作者: Eloi Moliner,Michal Švento,Alec Wright,Lauri Juvela,Pavel Rajmic,Vesa Välimäki
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Submitted to the 28th International Conference on Digital Audio Effects (DAFx25)

点击查看摘要

Abstract:Accurately estimating nonlinear audio effects without access to paired input-output signals remains a challenging task. This work studies unsupervised probabilistic approaches for solving this task. We introduce a method, novel for this application, based on diffusion generative models for blind system identification, enabling the estimation of unknown nonlinear effects using black- and gray-box models. This study compares this method with a previously proposed adversarial approach, analyzing the performance of both methods under different parameterizations of the effect operator and varying lengths of available effected audio. In experiments on guitar distortion effects, we show that the diffusion-based approach provides more stable results and is less sensitive to data availability, while the adversarial approach is superior at estimating more pronounced distortion effects. Our findings contribute to the robust unsupervised blind estimation of audio effects, demonstrating the potential of diffusion models for system identification in music technology.
zh

[AI-167] AI2STOW: End-to-End Deep Reinforcement Learning to Construct Master Stowage Plans under Demand Uncertainty

【速读】: This paper addresses the stowage planning problem for container vessels, a complex combinatorial optimization problem that must balance global objectives and constraints under demand uncertainty. The proposed AI2STOW is an end-to-end deep reinforcement learning model that combines feasibility projection with an action mask to produce master plans satisfying constraints such as paired block stowage patterns. Its key is an efficient reinforcement learning formulation that jointly considers multiple objectives and computational efficiency, outperforming baselines from reinforcement learning and stochastic programming in both objective performance and solve time.

链接: https://arxiv.org/abs/2504.04469
作者: Jaike Van Twiller,Djordje Grbic,Rune Møller Jensen
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to a journal

点击查看摘要

Abstract:The worldwide economy and environmental sustainability depend on efficient and reliable supply chains, in which container shipping plays a crucial role as an environmentally friendly mode of transport. Liner shipping companies seek to improve operational efficiency by solving the stowage planning problem. Due to many complex combinatorial aspects, stowage planning is challenging and often decomposed into two NP-hard subproblems: master and slot planning. This article proposes AI2STOW, an end-to-end deep reinforcement learning model with feasibility projection and an action mask to create master plans under demand uncertainty with global objectives and constraints, including paired block stowage patterns. Our experimental results demonstrate that AI2STOW outperforms baseline methods from reinforcement learning and stochastic programming in objective performance and computational efficiency, based on simulated instances reflecting the scale of realistic vessels and operational planning horizons.
zh
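The action mask mentioned above zeroes out infeasible stowage actions before sampling from the policy. A minimal softmax-with-mask sketch (the logits and feasibility pattern are illustrative; the paper's action space is far larger):

```python
import math

def masked_policy(logits, feasible):
    # Action masking: infeasible actions get -inf logits, hence exactly
    # zero probability after the softmax, so the agent can never pick them.
    masked = [l if ok else float("-inf") for l, ok in zip(logits, feasible)]
    m = max(masked)                         # stabilize the softmax
    exps = [math.exp(l - m) for l in masked]  # exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]

probs = masked_policy([1.0, 2.0, 0.5], [True, False, True])
```

Feasibility projection then handles the constraints a per-action mask cannot express, nudging the full plan back into the feasible region.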

[AI-168] EclipseNETs: Learning Irregular Small Celestial Body Silhouettes

【速读】: This paper aims to accurately predict eclipse events around irregular small celestial bodies, which matters for spacecraft navigation, orbit determination, and spacecraft systems management. The key is to model eclipse conditions efficiently and reliably with neural implicit representations: network architectures that capture the complex silhouettes of asteroids and comets with high precision. Tests on four well-characterized bodies (Bennu, Itokawa, 67P/Churyumov-Gerasimenko, and Eros) show accuracy comparable to traditional ray-tracing techniques while running orders of magnitude faster. The paper further develops an indirect learning framework that trains these models directly from sparse trajectory data via Neural Ordinary Differential Equations, removing the need for prior knowledge of an accurate shape model and allowing eclipse predictions to be continuously refined, with errors progressively reduced as new trajectory data is incorporated.

Link: https://arxiv.org/abs/2504.04455
Authors: Giacomo Acciarini, Dario Izzo, Francesco Biscani
Affiliations: Unknown
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Accurately predicting eclipse events around irregular small bodies is crucial for spacecraft navigation, orbit determination, and spacecraft systems management. This paper introduces a novel approach leveraging neural implicit representations to model eclipse conditions efficiently and reliably. We propose neural network architectures that capture the complex silhouettes of asteroids and comets with high precision. Tested on four well-characterized bodies - Bennu, Itokawa, 67P/Churyumov-Gerasimenko, and Eros - our method achieves accuracy comparable to traditional ray-tracing techniques while offering orders of magnitude faster performance. Additionally, we develop an indirect learning framework that trains these models directly from sparse trajectory data using Neural Ordinary Differential Equations, removing the requirement to have prior knowledge of an accurate shape model. This approach allows for the continuous refinement of eclipse predictions, progressively reducing errors and improving accuracy as new trajectory data is incorporated.

[AI-169] OLAF: An Open Life Science Analysis Framework for Conversational Bioinformatics Powered by Large Language Models

[Quick Read]: This paper aims to lower the high technical barrier that researchers without programming backgrounds face in computational biology and to promote transparent, AI-powered life science research. The key lies in the design of the OLAF (Open Life Science Analysis Framework) platform, which combines large language models (LLMs) with a modular agent-pipe-router architecture to automatically generate and execute bioinformatics code from natural language, supporting analyses on real scientific data (including formats such as .h5ad) for tasks like single-cell RNA-seq workflows, gene annotation, and data visualization. Through its Angular front end and Python/Firebase backend, OLAF provides a reproducible, user-friendly environment that lowers the barrier to entry into computational biology.

Link: https://arxiv.org/abs/2504.03976
Authors: Dylan Riffle, Nima Shirooni, Cody He, Manush Murali, Sovit Nayak, Rishikumar Gopalan, Diego Gonzalez Lopez
Affiliations: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments:

Click to view abstract

Abstract:OLAF (Open Life Science Analysis Framework) is an open-source platform that enables researchers to perform bioinformatics analyses using natural language. By combining large language models (LLMs) with a modular agent-pipe-router architecture, OLAF generates and executes bioinformatics code on real scientific data, including formats like .h5ad. The system includes an Angular front end and a Python/Firebase backend, allowing users to run analyses such as single-cell RNA-seq workflows, gene annotation, and data visualization through a simple web interface. Unlike general-purpose AI tools, OLAF integrates code execution, data handling, and scientific libraries in a reproducible, user-friendly environment. It is designed to lower the barrier to computational biology for non-programmers and support transparent, AI-powered life science research.

[AI-170] Experimental Study on Time Series Analysis of Lower Limb Rehabilitation Exercise Data Driven by Novel Model Architecture and Large Models

[Quick Read]: This paper addresses the problem of optimizing rehabilitation guidance strategies via time-series analysis during the recovery of lower-limb motor function in post-stroke patients. The key lies in exploring a novel model architecture (xLSTM) and a large-scale foundation model (Lag-Llama) for short-term temporal prediction tasks. Using the SIAT-LLMD lower-limb movement dataset proposed by the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, the study systematically validates the effectiveness of these models in predicting joint kinematics and dynamics parameters. At its core, the solution combines recent advances in machine learning and artificial intelligence to provide theoretical grounding and technical support for personalized, active rehabilitation guidance.

Link: https://arxiv.org/abs/2504.03799
Authors: Hengyu Lin
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This study investigates the application of novel model architectures and large-scale foundational models in temporal series analysis of lower limb rehabilitation motion data, aiming to leverage advancements in machine learning and artificial intelligence to empower active rehabilitation guidance strategies for post-stroke patients in limb motor function recovery. Utilizing the SIAT-LLMD dataset of lower limb movement data proposed by the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, we systematically elucidate the implementation and analytical outcomes of the innovative xLSTM architecture and the foundational model Lag-Llama in short-term temporal prediction tasks involving joint kinematics and dynamics parameters. The research provides novel insights for AI-enabled medical rehabilitation applications, demonstrating the potential of cutting-edge model architectures and large-scale models in rehabilitation medicine temporal prediction. These findings establish theoretical foundations for future applications of personalized rehabilitation regimens, offering significant implications for the development of customized therapeutic interventions in clinical practice.

[AI-171] Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning

[Quick Read]: This paper addresses the performance degradation of existing reinforcement learning from human feedback (RLHF) algorithms when the reward model is misspecified. Most existing methods rely on the Bradley-Terry model, whose assumptions may fail to capture the complexity and variability of real-world human judgments. To meet this challenge, the paper proposes a robust algorithm whose key idea is reducing the variance of the reward and policy estimators, which theoretically improves the regret bounds. The algorithm lifts the performance of existing methods under reward model misspecification, and empirical evaluation on the Anthropic Helpful and Harmless dataset confirms its effectiveness, with 77%-81% of responses favored over baselines.

Link: https://arxiv.org/abs/2504.03784
Authors: Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, Chengchung Shi
Affiliations: Unknown
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the Bradley-Terry model, which relies on assumptions about human preferences that may not reflect the complexity and variability of real-world judgments. In this paper, we propose a robust algorithm to enhance the performance of existing approaches under such reward model misspecifications. Theoretically, our algorithm reduces the variance of reward and policy estimators, leading to improved regret bounds. Empirical evaluations on LLM benchmark datasets demonstrate that the proposed algorithm consistently outperforms existing methods, with 77-81% of responses being favored over baselines on the Anthropic Helpful and Harmless dataset.
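For background, the Bradley-Terry model that the paper critiques scores a preference pair as P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected), and reward models are typically trained by minimizing its negative log-likelihood. A minimal sketch of that standard loss (generic RLHF background, not the paper's robust algorithm):

```python
import math

def bradley_terry_nll(r_chosen, r_rejected):
    """Negative log-likelihood of a human preference under Bradley-Terry:
    -log sigmoid(r_chosen - r_rejected) = log(1 + exp(-(r_chosen - r_rejected)))."""
    return math.log(1.0 + math.exp(-(r_chosen - r_rejected)))

# A correct reward margin gives a small loss; an inverted margin a large one.
low = bradley_terry_nll(2.0, 0.0)
high = bradley_terry_nll(0.0, 2.0)
```

The paper's point is that when human judgments deviate from this model, estimators built on it suffer; their robust variant reduces estimator variance under such misspecification.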

[AI-172] Advancing Air Quality Monitoring: TinyML-Based Real-Time Ozone Prediction with Cost-Effective Edge Devices

[Quick Read]: This paper responds to the need for real-time air quality monitoring and prediction amid worsening urban air pollution. The key lies in a novel TinyML-based system: an Arduino Nano 33 BLE Sense microcontroller equipped with an MQ7 sensor (for detecting carbon monoxide, CO) and built-in temperature and pressure sensors, trained and evaluated on preprocessed data from a Kaggle dataset of air quality parameters from India. Several combinations of input parameters (CO, temperature, and pressure) are compared; the best regression model, using all three variables, achieves a mean squared error (MSE) of 0.03 and an R-squared value of 0.95, indicating high predictive accuracy. Sensitivity analysis identifies CO levels as the most critical predictor of ozone concentration, followed by pressure and temperature. The system's low-cost, low-power design suits wide deployment in resource-constrained settings, delivering precise real-time ozone predictions that enable prompt responses to pollution events and stronger public health protection.

Link: https://arxiv.org/abs/2504.03776
Authors: Huam Ming Ken, Mehran Behjati
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: This is a preprint version of a paper accepted and published in Springer Lecture Notes in Networks and Systems. The final version is available at this https URL

Click to view abstract

Abstract:The escalation of urban air pollution necessitates innovative solutions for real-time air quality monitoring and prediction. This paper introduces a novel TinyML-based system designed to predict ozone concentration in real-time. The system employs an Arduino Nano 33 BLE Sense microcontroller equipped with an MQ7 sensor for carbon monoxide (CO) detection and built-in sensors for temperature and pressure measurements. The data, sourced from a Kaggle dataset on air quality parameters from India, underwent thorough cleaning and preprocessing. Model training and evaluation were performed using Edge Impulse, considering various combinations of input parameters (CO, temperature, and pressure). The optimal model, incorporating all three variables, achieved a mean squared error (MSE) of 0.03 and an R-squared value of 0.95, indicating high predictive accuracy. The regression model was deployed on the microcontroller via the Arduino IDE, showcasing robust real-time performance. Sensitivity analysis identified CO levels as the most critical predictor of ozone concentration, followed by pressure and temperature. The system’s low-cost and low-power design makes it suitable for widespread implementation, particularly in resource-constrained settings. This TinyML approach provides precise real-time predictions of ozone levels, enabling prompt responses to pollution events and enhancing public health protection.
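For readers unfamiliar with the regression setup, a closed-form one-feature version (ozone ~ CO, the strongest predictor per the sensitivity analysis) looks like this. The data below are synthetic and exactly linear for illustration; the paper's actual model uses CO, temperature, and pressure together and was trained on Edge Impulse.

```python
def fit_ols(xs, ys):
    """Closed-form simple linear regression: y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

co = [0.1, 0.4, 0.8, 0.2, 0.6]          # hypothetical scaled CO readings
ozone = [0.2 * c + 0.3 for c in co]     # synthetic, exactly linear targets
slope, intercept = fit_ols(co, ozone)
pred = slope * 0.5 + intercept          # predicted ozone at CO = 0.5
```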

[AI-173] Artificial Intelligence and Deep Learning Algorithms for Epigenetic Sequence Analysis: A Review for Epigeneticists and AI Experts

[Quick Read]: This paper surveys the use of machine learning and artificial intelligence (AI) methods for epigenetics problems, aiming to overcome the time and cost limitations of traditional high-throughput experimental approaches. Concretely, it covers training AI models on epigenomic data to address problems such as disease marker prediction, gene expression analysis, enhancer-promoter interactions, and chromatin state prediction. The key lies in its dual perspective for AI experts and epigeneticists: for AI researchers it builds a taxonomy of epigenetics research problems amenable to AI-based approaches, and for epigeneticists it lists candidate AI solutions in the literature for each of those problems. The paper also identifies gaps in existing work and technical challenges, and offers recommendations for addressing them.

Link: https://arxiv.org/abs/2504.03733
Authors: Muhammad Tahir, Mahboobeh Norouzi, Shehroz S. Khan, James R. Davie, Soichiro Yamanaka, Ahmed Ashraf
Affiliations: Unknown
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Epigenetics encompasses mechanisms that can alter the expression of genes without changing the underlying genetic sequence. The epigenetic regulation of gene expression is initiated and sustained by several mechanisms such as DNA methylation, histone modifications, chromatin conformation, and non-coding RNA. The changes in gene regulation and expression can manifest in the form of various diseases and disorders such as cancer and congenital deformities. Over the last few decades, high throughput experimental approaches have been used to identify and understand epigenetic changes, but these laboratory experimental approaches and biochemical processes are time-consuming and expensive. To overcome these challenges, machine learning and artificial intelligence (AI) approaches have been extensively used for mapping epigenetic modifications to their phenotypic manifestations. In this paper we provide a narrative review of published research on AI models trained on epigenomic data to address a variety of problems such as prediction of disease markers, gene expression, enhancer promoter interaction, and chromatin states. The purpose of this review is twofold as it is addressed to both AI experts and epigeneticists. For AI researchers, we provided a taxonomy of epigenetics research problems that can benefit from an AI-based approach. For epigeneticists, given each of the above problems we provide a list of candidate AI solutions in the literature. We have also identified several gaps in the literature, research challenges, and recommendations to address these challenges.

[AI-174] Are Anxiety Detection Models Generalizable? A Cross-Activity and Cross-Population Study Using Wearables

[Quick Read]: This paper tackles the limited generalizability of anxiety detection models across activities and diverse populations, an essential step for assessing model bias and strengthening user trust in broad real-world applications. The key lies in a comprehensive evaluation, built on a dataset collected by the authors plus two well-known public datasets, of model generalization within participants (same-activity and cross-activity scenarios) and across participants (same-activity and cross-activity scenarios), systematically training and testing more than 3348 anxiety detection models to probe their stability and applicability. The study finds that although recall for the anxious class varies somewhat across activities, key metrics such as AUROC remain relatively stable, offering useful reference points for further model refinement.

Link: https://arxiv.org/abs/2504.03695
Authors: Nilesh Kumar Sahu, Snehil Gupta, Haroon R Lone
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Anxiety-provoking activities, such as public speaking, can trigger heightened anxiety responses in individuals with anxiety disorders. Recent research suggests that physiological signals, including electrocardiogram (ECG) and electrodermal activity (EDA), collected via wearable devices, can be used to detect anxiety in such contexts through machine learning models. However, the generalizability of these anxiety prediction models across different activities and diverse populations remains underexplored, an essential step for assessing model bias and fostering user trust in broader applications. To address this gap, we conducted a study with 111 participants who engaged in three anxiety-provoking activities. Utilizing both our collected dataset and two well-known publicly available datasets, we evaluated the generalizability of anxiety detection models within participants (for both same-activity and cross-activity scenarios) and across participants (within-activity and cross-activity). In total, we trained and tested more than 3348 anxiety detection models (using six classifiers, 31 feature sets, and 18 train-test configurations). Our results indicate that three key metrics (AUROC, recall for anxious states, and recall for non-anxious states) were slightly above the baseline score of 0.5. The best AUROC scores ranged from 0.62 to 0.73, with recall for the anxious class spanning 35.19% to 74.3%. Interestingly, model performance (as measured by AUROC) remained relatively stable across different activities and participant groups, though recall for the anxious class did exhibit some variation.

[AI-175] Potential Indicator for Continuous Emotion Arousal by Dynamic Neural Synchrony

[Quick Read]: This paper addresses the demand for automatic, high-quality emotion annotation, which applications such as continuous emotion recognition and video highlight detection require but manual labeling struggles to provide. The key lies in a novel electroencephalography (EEG)-based inter-subject correlation (ISC) method built on a single-electrode, feature-driven dynamic strategy. Specifically, the paper reconfirms two effective features for emotion classification, first-order difference (FD) and differential entropy (DE); validates the heterogeneous synchronization performance of electrodes through overall correlation analysis; and, using a sliding-window correlation technique, shows the consistency of dynamic ISCs across features and key electrodes within each film clip. The method reliably captures consistent, dynamic neural synchrony shared across individuals and triggered by emotional stimuli, offering a potential indicator of continuous emotional arousal and advancing affective computing and neuroscience.

Link: https://arxiv.org/abs/2504.03643
Authors: Guandong Pan, Zhaobang Wu, Yaqian Yang, Xin Wang, Longzhao Liu, Zhiming Zheng, Shaoting Tang
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:The need for automatic and high-quality emotion annotation is paramount in applications such as continuous emotion recognition and video highlight detection, yet achieving this through manual human annotations is challenging. Inspired by inter-subject correlation (ISC) utilized in neuroscience, this study introduces a novel Electroencephalography (EEG) based ISC methodology that leverages a single-electrode and feature-based dynamic approach. Our contributions are threefold. Firstly, we reidentify two potent emotion features suitable for classifying emotions: first-order difference (FD) and differential entropy (DE). Secondly, through the use of overall correlation analysis, we demonstrate the heterogeneous synchronized performance of electrodes. This performance aligns with neural emotion patterns established in prior studies, thus validating the effectiveness of our approach. Thirdly, by employing a sliding window correlation technique, we showcase the significant consistency of dynamic ISCs across various features or key electrodes in each analyzed film clip. Our findings indicate the method’s reliability in capturing consistent, dynamic shared neural synchrony among individuals, triggered by evocative film stimuli. This underscores the potential of our approach to serve as an indicator of continuous human emotion arousal. The implications of this research are significant for advancements in affective computing and the broader neuroscience field, suggesting a streamlined and effective tool for emotion analysis in real-world applications.
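The sliding-window correlation technique described above amounts to computing a Pearson correlation between two subjects' EEG feature series inside each window. A minimal sketch with toy, perfectly synchronized signals (illustration only, not the paper's actual pipeline):

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def sliding_isc(f1, f2, window, step=1):
    """Correlation of two subjects' feature series inside each sliding window."""
    return [pearson(f1[i:i + window], f2[i:i + window])
            for i in range(0, len(f1) - window + 1, step)]

s1 = [0, 1, 2, 3, 2, 1, 0, 1, 2, 3]     # toy feature series, subject 1
s2 = [2 * x for x in s1]                # subject 2: perfectly synchronized
r = sliding_isc(s1, s2, window=4)
```

Because `s2` is a scaled copy of `s1`, every window yields a correlation of 1; real EEG features would produce a fluctuating curve whose level tracks shared arousal.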

Machine Learning

[LG-0] Dimension-Free Convergence of Diffusion Models for Approximate Gaussian Mixtures

Link: https://arxiv.org/abs/2504.05300
Authors: Gen Li, Changxiao Cai, Yuting Wei
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Statistics Theory (math.ST); Machine Learning (stat.ML)
*Comments:

Click to view abstract

Abstract:Diffusion models are distinguished by their exceptional generative performance, particularly in producing high-quality samples through iterative denoising. While current theory suggests that the number of denoising steps required for accurate sample generation should scale linearly with data dimension, this does not reflect the practical efficiency of widely used algorithms like Denoising Diffusion Probabilistic Models (DDPMs). This paper investigates the effectiveness of diffusion models in sampling from complex high-dimensional distributions that can be well-approximated by Gaussian Mixture Models (GMMs). For these distributions, our main result shows that DDPM takes at most \widetilde{O}(1/\varepsilon) iterations to attain an \varepsilon-accurate distribution in total variation (TV) distance, independent of both the ambient dimension d and the number of components K, up to logarithmic factors. Furthermore, this result remains robust to score estimation errors. These findings highlight the remarkable effectiveness of diffusion models in high-dimensional settings given the universal approximation capability of GMMs, and provide theoretical insights into their practical success.

[LG-1] Covariant Gradient Descent

Link: https://arxiv.org/abs/2504.05279
Authors: Dmitry Guskov, Vitaly Vanchurin
Subjects: Machine Learning (cs.LG)
*Comments: 12 pages, 2 figures, 2 tables

Click to view abstract

Abstract:We present a manifestly covariant formulation of the gradient descent method, ensuring consistency across arbitrary coordinate systems and general curved trainable spaces. The optimization dynamics is defined using a covariant force vector and a covariant metric tensor, both computed from the first and second statistical moments of the gradients. These moments are estimated through time-averaging with an exponential weight function, which preserves linear computational complexity. We show that commonly used optimization methods such as RMSProp and Adam correspond to special limits of the covariant gradient descent (CGD) and demonstrate how these methods can be further generalized and improved.
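The exponential time-averaging of first and second gradient moments that the abstract describes is what Adam and RMSProp already do in their diagonal-metric limit. A minimal per-coordinate sketch (hyperparameters are the usual Adam defaults, not values from the paper; bias correction is omitted for brevity):

```python
def moment_step(theta, grads, m, v, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update from exponentially time-averaged gradient moments:
    m tracks the first moment (the 'force'), v the diagonal second moment
    (a diagonal 'metric'); the covariant scheme generalizes this metric."""
    out_t, out_m, out_v = [], [], []
    for t, g, mi, vi in zip(theta, grads, m, v):
        mi = beta1 * mi + (1 - beta1) * g        # first-moment average
        vi = beta2 * vi + (1 - beta2) * g * g    # second-moment average
        out_t.append(t - lr * mi / (vi ** 0.5 + eps))
        out_m.append(mi)
        out_v.append(vi)
    return out_t, out_m, out_v

theta, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
theta, m, v = moment_step(theta, [0.5, -0.5], m, v)
```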

[LG-2] PEAKS: Selecting Key Training Examples Incrementally via Prediction Error Anchored by Kernel Similarity

Link: https://arxiv.org/abs/2504.05250
Authors: Mustafa Burak Gurbuz, Xingyu Zheng, Constantine Dovrolis
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Click to view abstract

Abstract:As deep learning continues to be driven by ever-larger datasets, understanding which examples are most important for generalization has become a critical question. While progress in data selection continues, emerging applications require studying this problem in dynamic contexts. To bridge this gap, we pose the Incremental Data Selection (IDS) problem, where examples arrive as a continuous stream, and need to be selected without access to the full data source. In this setting, the learner must incrementally build a training dataset of predefined size while simultaneously learning the underlying task. We find that in IDS, the impact of a new sample on the model state depends fundamentally on both its geometric relationship in the feature space and its prediction error. Leveraging this insight, we propose PEAKS (Prediction Error Anchored by Kernel Similarity), an efficient data selection method tailored for IDS. Our comprehensive evaluations demonstrate that PEAKS consistently outperforms existing selection strategies. Furthermore, PEAKS yields increasingly better performance returns than random selection as training data size grows on real-world datasets.
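The abstract says a streamed sample's impact depends on both its geometric relationship in feature space and its prediction error. One toy way to combine the two, purely as an illustration (the anchor point, the RBF kernel, and the product form are assumptions, not PEAKS's exact formula):

```python
import math

def rbf_similarity(x, anchor, gamma=1.0):
    """RBF kernel similarity between a feature vector and an anchor point."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, anchor)))

def selection_score(features, probs, true_label, anchor):
    """Toy score: prediction error weighted by kernel similarity to an anchor."""
    error = 1.0 - probs[true_label]      # 1 - p(true class)
    return error * rbf_similarity(features, anchor)

anchor = [0.0, 0.0]                      # hypothetical feature-space anchor
near = selection_score([0.1, 0.1], [0.4, 0.6], 0, anchor)  # similar + erroneous
far = selection_score([3.0, 3.0], [0.4, 0.6], 0, anchor)   # dissimilar
```

Under such a score, hard examples that are also geometrically relevant to the current model state outrank equally hard but dissimilar ones.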

[LG-3] Embedded Federated Feature Selection with Dynamic Sparse Training: Balancing Accuracy-Cost Tradeoffs IJCNN2025

Link: https://arxiv.org/abs/2504.05245
Authors: Afsaneh Mahanipour, Hana Khamfroush
Subjects: Machine Learning (cs.LG)
*Comments: This paper has been accepted for presentation at IJCNN 2025

Click to view abstract

Abstract:Federated Learning (FL) enables multiple resource-constrained edge devices with varying levels of heterogeneity to collaboratively train a global model. However, devices with limited capacity can create bottlenecks and slow down model convergence. One effective approach to addressing this issue is to use an efficient feature selection method, which reduces overall resource demands by minimizing communication and computation costs, thereby mitigating the impact of struggling nodes. Existing federated feature selection (FFS) methods are either considered as a separate step from FL or rely on a third party. These approaches increase computation and communication overhead, making them impractical for real-world high-dimensional datasets. To address this, we present Dynamic Sparse Federated Feature Selection (DSFFS), the first innovative embedded FFS that is efficient in both communication and computation. In the proposed method, feature selection occurs simultaneously with model training. During training, input-layer neurons, their connections, and hidden-layer connections are dynamically pruned and regrown, eliminating uninformative features. This process enhances computational efficiency on devices, improves network communication efficiency, and boosts global model performance. Several experiments are conducted on nine real-world datasets of varying dimensionality from diverse domains, including biology, image, speech, and text. The results under a realistic non-iid data distribution setting show that our approach achieves a better trade-off between accuracy, computation, and communication costs by selecting more informative features compared to other state-of-the-art FFS methods.

[LG-4] Learning symmetries in datasets

Link: https://arxiv.org/abs/2504.05174
Authors: Veronica Sanz
Subjects: Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*Comments: 17 pages, 9 figures

Click to view abstract

Abstract:We investigate how symmetries present in datasets affect the structure of the latent space learned by Variational Autoencoders (VAEs). By training VAEs on data originating from simple mechanical systems and particle collisions, we analyze the organization of the latent space through a relevance measure that identifies the most meaningful latent directions. We show that when symmetries or approximate symmetries are present, the VAE self-organizes its latent space, effectively compressing the data along a reduced number of latent variables. This behavior captures the intrinsic dimensionality determined by the symmetry constraints and reveals hidden relations among the features. Furthermore, we provide a theoretical analysis of a simple toy model, demonstrating how, under idealized conditions, the latent space aligns with the symmetry directions of the data manifold. We illustrate these findings with examples ranging from two-dimensional datasets with O(2) symmetry to realistic datasets from electron-positron and proton-proton collisions. Our results highlight the potential of unsupervised generative models to expose underlying structures in data and offer a novel approach to symmetry discovery without explicit supervision.

[LG-5] SparsyFed: Sparse Adaptive Federated Training ICLR2025

Link: https://arxiv.org/abs/2504.05153
Authors: Adriano Guastella, Lorenzo Sani, Alex Iacob, Alessio Mora, Paolo Bellavista, Nicholas D. Lane
Subjects: Machine Learning (cs.LG)
*Comments: Published as a conference paper at ICLR 2025

Click to view abstract

Abstract:Sparse training is often adopted in cross-device federated learning (FL) environments where constrained devices collaboratively train a machine learning model on private data by exchanging pseudo-gradients across heterogeneous networks. Although sparse training methods can reduce communication overhead and computational burden in FL, they are often not used in practice for the following key reasons: (1) data heterogeneity makes it harder for clients to reach consensus on sparse models compared to dense ones, requiring longer training; (2) methods for obtaining sparse masks lack adaptivity to accommodate very heterogeneous data distributions, crucial in cross-device FL; and (3) additional hyperparameters are required, which are notably challenging to tune in FL. This paper presents SparsyFed, a practical federated sparse training method that critically addresses the problems above. Previous works have only solved one or two of these challenges at the expense of introducing new trade-offs, such as clients’ consensus on masks versus sparsity pattern adaptivity. We show that SparsyFed simultaneously (1) can produce 95% sparse models, with negligible degradation in accuracy, while only needing a single hyperparameter, (2) achieves a per-round weight regrowth 200 times smaller than previous methods, and (3) allows the sparse masks to adapt to highly heterogeneous data distributions and outperform all baselines under such conditions.
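The per-round prune/regrow cycle underlying dynamic sparse training can be sketched as magnitude pruning followed by regrowing a small fraction of pruned connections. This shows the generic mechanism only, not SparsyFed's specific adaptive mask rule; all numbers are illustrative.

```python
import random

def prune_and_regrow(weights, sparsity=0.5, regrow_frac=0.1, seed=0):
    """Magnitude-prune to the target sparsity, then randomly regrow a small
    fraction of the pruned connections. Returns a boolean keep-mask."""
    rng = random.Random(seed)
    n = len(weights)
    k = int(n * sparsity)                          # connections to zero out
    order = sorted(range(n), key=lambda i: abs(weights[i]))
    mask = [True] * n
    for i in order[:k]:
        mask[i] = False                            # prune smallest magnitudes
    pruned = [i for i in range(n) if not mask[i]]
    for i in rng.sample(pruned, max(1, int(len(pruned) * regrow_frac))):
        mask[i] = True                             # regrow a few connections
    return mask

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.08]
mask = prune_and_regrow(w)
```

Keeping the regrow fraction small is what limits per-round weight regrowth, the quantity SparsyFed reports reducing 200-fold.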

[LG-6] Prεεmpt: Sanitizing Sensitive Prompts for LLMs

Link: https://arxiv.org/abs/2504.05147
Authors: Amrita Roy Chowdhury, David Glukhov, Divyam Anshumaan, Prasad Chalasani, Nicolas Papernot, Somesh Jha, Mihir Bellare
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The rise of large language models (LLMs) has introduced new privacy challenges, particularly during inference where sensitive information in prompts may be exposed to proprietary LLM APIs. In this paper, we address the problem of formally protecting the sensitive information contained in a prompt while maintaining response quality. To this end, first, we introduce a cryptographically inspired notion of a prompt sanitizer which transforms an input prompt to protect its sensitive tokens. Second, we propose Prεεmpt, a novel system that implements a prompt sanitizer. Prεεmpt categorizes sensitive tokens into two types: (1) those where the LLM’s response depends solely on the format (such as SSNs, credit card numbers), for which we use format-preserving encryption (FPE); and (2) those where the response depends on specific values (such as age, salary), for which we apply metric differential privacy (mDP). Our evaluation demonstrates that Prεεmpt is a practical method to achieve meaningful privacy guarantees, while maintaining high utility compared to unsanitized prompts, and outperforming prior methods.
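For the numeric case, metric differential privacy is commonly realized by adding Laplace noise whose scale is inversely proportional to ε, so that values close in the metric remain statistically indistinguishable while aggregate utility is preserved. A sketch of that generic mechanism (not Prεεmpt's actual sanitizer; the age value and ε are hypothetical):

```python
import math
import random

def mdp_noisy_value(value, epsilon, rng):
    """Add Laplace noise with scale 1/epsilon via inverse-CDF sampling,
    the standard mechanism for metric DP on a numeric attribute."""
    u = rng.random() - 0.5
    return value - math.copysign(1.0, u) * math.log(1 - 2 * abs(u)) / epsilon

rng = random.Random(42)
ages = [mdp_noisy_value(34, epsilon=0.5, rng=rng) for _ in range(1000)]
mean_age = sum(ages) / len(ages)          # symmetric noise averages out
```

Individual sanitized values are perturbed, but their distribution stays centered on the true value, which is why the LLM's response can still depend usefully on it.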

[LG-7] Unifying Physics- and Data-Driven Modeling via Novel Causal Spatiotemporal Graph Neural Network for Interpretable Epidemic Forecasting

Link: https://arxiv.org/abs/2504.05140
Authors: Shuai Han, Lukas Stelz, Thomas R. Sokolowski, Kai Zhou, Horst Stöcker
Subjects: Machine Learning (cs.LG); Physics and Society (physics.soc-ph); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*Comments: 32 pages, 12 figures. Submitted to Expert Systems with Applications and currently under review. This version includes minor revisions. The work proposes a physics-informed deep learning framework integrating a novel epidemic model with causal spatiotemporal graph neural networks for interpretable forecasting

Click to view abstract

Abstract:Accurate epidemic forecasting is crucial for effective disease control and prevention. Traditional compartmental models often struggle to estimate temporally and spatially varying epidemiological parameters, while deep learning models typically overlook disease transmission dynamics and lack interpretability in the epidemiological context. To address these limitations, we propose a novel Causal Spatiotemporal Graph Neural Network (CSTGNN), a hybrid framework that integrates a Spatio-Contact SIR model with Graph Neural Networks (GNNs) to capture the spatiotemporal propagation of epidemics. Inter-regional human mobility exhibits continuous and smooth spatiotemporal patterns, leading to adjacent graph structures that share underlying mobility dynamics. To model these dynamics, we employ an adaptive static connectivity graph to represent the stable components of human mobility and utilize a temporal dynamics model to capture fluctuations within these patterns. By integrating the adaptive static connectivity graph with the temporal dynamics graph, we construct a dynamic graph that encapsulates the comprehensive properties of human mobility networks. Additionally, to capture temporal trends and variations in infectious disease spread, we introduce a temporal decomposition model to handle temporal dependence. This model is then integrated with a dynamic graph convolutional network for epidemic forecasting. We validate our model using real-world datasets at the provincial level in China and the state level in Germany. Extensive studies demonstrate that our method effectively models the spatiotemporal dynamics of infectious diseases, providing a valuable tool for forecasting and intervention strategies. Furthermore, analysis of the learned parameters offers insights into disease transmission mechanisms, enhancing the interpretability and practical applicability of our model.
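As background for the Spatio-Contact SIR component, the classic single-region SIR compartments evolve as below under an explicit-Euler step; the paper couples such dynamics across regions through mobility-derived graph structure, which this sketch omits. Parameter values are illustrative.

```python
def sir_step(S, I, R, beta, gamma, dt=1.0):
    """One explicit-Euler step of the classic SIR compartment model."""
    N = S + I + R
    new_infections = beta * S * I / N * dt   # contact-driven S -> I flow
    new_recoveries = gamma * I * dt          # recovery-driven I -> R flow
    return (S - new_infections,
            I + new_infections - new_recoveries,
            R + new_recoveries)

S, I, R = 990.0, 10.0, 0.0                   # hypothetical initial compartments
for _ in range(10):
    S, I, R = sir_step(S, I, R, beta=0.3, gamma=0.1)
```

With beta > gamma the infected compartment grows early on, while the total population stays conserved across steps.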

[LG-8] Towards Optimal Heterogeneous Client Sampling in Multi-Model Federated Learning

Link: https://arxiv.org/abs/2504.05138
Authors: Haoran Zhang, Zejun Gong, Zekai Li, Marie Siew, Carlee Joe-Wong, Rachid El-Azouzi
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments: 10 pages

Click to view abstract

Abstract:Federated learning (FL) allows edge devices to collaboratively train models without sharing local data. As FL gains popularity, clients may need to train multiple unrelated FL models, but communication constraints limit their ability to train all models simultaneously. While clients could train FL models sequentially, opportunistically having FL clients concurrently train different models – termed multi-model federated learning (MMFL) – can reduce the overall training time. Prior work uses simple client-to-model assignments that do not optimize the contribution of each client to each model over the course of its training. Prior work on single-model FL shows that intelligent client selection can greatly accelerate convergence, but naïve extensions to MMFL can violate heterogeneous resource constraints at both the server and the clients. In this work, we develop a novel convergence analysis of MMFL with arbitrary client sampling methods, theoretically demonstrating the strengths and limitations of previous well-established gradient-based methods. Motivated by this analysis, we propose MMFL-LVR, a loss-based sampling method that minimizes training variance while explicitly respecting communication limits at the server and reducing computational costs at the clients. We extend this to MMFL-StaleVR, which incorporates stale updates for improved efficiency and stability, and MMFL-StaleVRE, a lightweight variant suitable for low-overhead deployment. Experiments show our methods improve average accuracy by up to 19.1% over random sampling, with only a 5.4% gap from the theoretical optimum (full client participation).
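The loss-based sampling idea can be illustrated by drawing distinct clients with probability proportional to their reported loss. MMFL-LVR's actual probabilities also account for server communication limits and client compute costs, which this toy sketch ignores; the loss values and budget are hypothetical.

```python
import random

def sample_clients(losses, budget, seed=0):
    """Draw `budget` distinct clients, each draw proportional to current loss."""
    rng = random.Random(seed)
    chosen, pool = [], dict(enumerate(losses))
    for _ in range(min(budget, len(pool))):
        total = sum(pool.values())
        r = rng.random() * total
        acc = 0.0
        for cid, loss in list(pool.items()):
            acc += loss
            if r <= acc:                 # inverse-CDF draw over remaining clients
                chosen.append(cid)
                del pool[cid]            # sample without replacement
                break
    return chosen

picked = sample_clients([0.1, 2.0, 0.5, 1.5], budget=2)
```

Weighting draws by loss concentrates updates on clients whose models are furthest from convergence, which is what reduces the variance of the aggregated update.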

[LG-9] AI-Driven Tactical Communications and Networking for Defense: A Survey and Emerging Trends

Link: https://arxiv.org/abs/2504.05071
Authors: Victor Monzon Baeza, Raúl Parada, Laura Concha Salor, Carlos Monzo
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The integration of Artificial Intelligence (AI) in military communications and networking is reshaping modern defense strategies, enhancing secure data exchange, real-time situational awareness, and autonomous decision-making. This survey explores how AI-driven technologies improve tactical communication networks, radar-based data transmission, UAV-assisted relay systems, and electronic warfare resilience. The study highlights AI applications in adaptive signal processing, multi-agent coordination for network optimization, radar-assisted target tracking, and AI-driven electronic countermeasures. Our work introduces a novel three-criteria evaluation methodology. It systematically assesses AI applications based on general system objectives, communications constraints in the military domain, and critical tactical environmental factors. We analyze key AI techniques for different types of learning applied to multi-domain network interoperability and distributed data information fusion in military operations. We also address challenges such as adversarial AI threats, the real-time adaptability of autonomous communication networks, and the limitations of current AI models under battlefield conditions. Finally, we discuss emerging trends in self-healing networks, AI-augmented decision support systems, and intelligent spectrum allocation. We provide a structured roadmap for future AI-driven defense communications and networking research.

[LG-10] MIAT: Maneuver-Intention-Aware Transformer for Spatio-Temporal Trajectory Prediction

Link: https://arxiv.org/abs/2504.05059
Authors: Chandra Raskoti, Iftekharul Islam, Xuan Wang, Weizi Li
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Accurate vehicle trajectory prediction is critical for safe and efficient autonomous driving, especially in mixed traffic environments with both human-driven and autonomous vehicles. However, uncertainties introduced by inherent driving behaviors – such as acceleration, deceleration, and left and right maneuvers – pose significant challenges for reliable trajectory prediction. We introduce a Maneuver-Intention-Aware Transformer (MIAT) architecture, which integrates a maneuver intention awareness mechanism with spatiotemporal interaction modeling to enhance long-horizon trajectory predictions. We systematically investigate the impact of varying awareness of maneuver intention on both short- and long-horizon trajectory predictions. Evaluated on the real-world NGSIM dataset and benchmarked against various transformer- and LSTM-based methods, our approach achieves an improvement of up to 4.7% in short-horizon predictions and a 1.6% in long-horizon predictions compared to other intention-aware benchmark methods. Moreover, by leveraging an intention awareness control mechanism, MIAT realizes an 11.1% performance boost in long-horizon predictions, with a modest drop in short-horizon performance.

[LG-11] Attention-Augmented Inverse Reinforcement Learning with Graph Convolutions for Multi-Agent Task Allocation

链接: https://arxiv.org/abs/2504.05045
作者: Huilin Yin,Zhikun Yang,Daniel Watzenig
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Multi-agent task allocation (MATA) plays a vital role in cooperative multi-agent systems, with significant implications for applications such as logistics, search and rescue, and robotic coordination. Although traditional deep reinforcement learning (DRL) methods have been shown to be promising, their effectiveness is hindered by a reliance on manually designed reward functions and inefficiencies in dynamic environments. In this paper, an inverse reinforcement learning (IRL)-based framework is proposed, in which multi-head self-attention (MHSA) and graph attention mechanisms are incorporated to enhance reward function learning and task execution efficiency. Expert demonstrations are utilized to infer optimal reward densities, allowing dependence on handcrafted designs to be reduced and adaptability to be improved. Extensive experiments validate the superiority of the proposed method over widely used multi-agent reinforcement learning (MARL) algorithms in terms of both cumulative rewards and task execution efficiency.

[LG-12] Multi-level Neural Networks for high-dimensional parametric obstacle problems

链接: https://arxiv.org/abs/2504.05026
作者: Martin Eigel,Cosmas Heiß,Janina E. Schütte
类目: Machine Learning (cs.LG); Functional Analysis (math.FA); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:A new method to solve computationally challenging (random) parametric obstacle problems is developed and analyzed, where the parameters can influence the related partial differential equation (PDE) and determine the position and surface structure of the obstacle. As governing equation, a stationary elliptic diffusion problem is assumed. The high-dimensional solution of the obstacle problem is approximated by a specifically constructed convolutional neural network (CNN). This novel algorithm is inspired by a finite element constrained multigrid algorithm to represent the parameter-to-solution map. This has two benefits: First, it allows for efficient practical computations since multi-level data is used as an explicit output of the NN thanks to appropriate data preprocessing. This improves the efficacy of the training process and subsequently leads to small errors in the natural energy norm. Second, the comparison of the CNN to a multigrid algorithm provides means to carry out a complete a priori convergence and complexity analysis of the proposed NN architecture. Numerical experiments illustrate a state-of-the-art performance for this challenging problem.
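For readers unfamiliar with obstacle problems, here is a minimal 1D sketch of the classical constrained solve that the paper's CNN is designed to emulate at scale: projected Gauss-Seidel, a single-level relative of the constrained multigrid mentioned above (the grid size and obstacle below are purely illustrative, not from the paper):

```python
import numpy as np

def projected_gauss_seidel(f, psi, n=49, sweeps=2000):
    """Solve the 1D obstacle problem  -u'' = f,  u >= psi,  u(0)=u(1)=0
    on a uniform grid with n interior points. Each unconstrained
    Gauss-Seidel update is clipped from below at the obstacle."""
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1 - h, n)
    fv, pv = f(x), psi(x)
    u = np.maximum(np.zeros(n), pv)  # feasible starting iterate
    for _ in range(sweeps):
        for i in range(n):
            left = u[i - 1] if i > 0 else 0.0
            right = u[i + 1] if i < n - 1 else 0.0
            u[i] = max(pv[i], 0.5 * (left + right + h * h * fv[i]))
    return x, u

# parabolic obstacle pushing up in the middle, zero load
x, u = projected_gauss_seidel(f=lambda x: np.zeros_like(x),
                              psi=lambda x: 0.5 - 8.0 * (x - 0.5) ** 2)
```

The solution stays above the obstacle everywhere, touches it where the constraint is active (near the obstacle's peak), and is harmonic elsewhere; it is this parameter-dependent contact structure that makes the parametric problem hard for standard solvers.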

[LG-13] Concept Extraction for Time Series with ECLAD-ts

链接: https://arxiv.org/abs/2504.05024
作者: Antonia Holzapfel,Andres Felipe Posada-Moreno,Sebastian Trimpe
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Convolutional neural networks (CNNs) for time series classification (TSC) are being increasingly used in applications ranging from quality prediction to medical diagnosis. The black box nature of these models makes understanding their prediction process difficult. This issue is crucial because CNNs are prone to learning shortcuts and biases, compromising their robustness and alignment with human expectations. To assess whether such mechanisms are being used and the associated risk, it is essential to provide model explanations that reflect the inner workings of the model. Concept Extraction (CE) methods offer such explanations, but have mostly been developed for the image domain so far, leaving a gap in the time series domain. In this work, we present a CE and localization method tailored to the time series domain, based on the ideas of CE methods for images. We propose the novel method ECLAD-ts, which provides post-hoc global explanations based on how the models encode subsets of the input at different levels of abstraction. For this, concepts are produced by clustering timestep-wise aggregations of CNN activation maps, and their importance is computed based on their impact on the prediction process. We evaluate our method on synthetic and natural datasets. Furthermore, we assess the advantages and limitations of CE in time series through empirical results. Our results show that ECLAD-ts effectively explains models by leveraging their internal representations, providing useful insights about their prediction process.
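The clustering step at the heart of ECLAD-ts can be sketched as follows. This is a toy reconstruction from the abstract's description only (farthest-point-initialized k-means over timestep-wise activation vectors; the real method's aggregation across layers and its importance scoring are more involved):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal Lloyd's k-means with farthest-point initialization
    (deterministic, which keeps this sketch reproducible)."""
    C = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in C], axis=0)
        C.append(X[d.argmax()])
    C = np.array(C, dtype=float)
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        lab = d.argmin(1)
        for j in range(k):
            if (lab == j).any():
                C[j] = X[lab == j].mean(0)
    return C, lab

def extract_concepts(acts, k=2):
    """Cluster timestep-wise activation vectors into k 'concepts'.

    acts: (n_series, T, channels) activation maps from a time-series CNN.
    Each timestep's channel vector is one sample; the cluster labels mark
    which concept is active where, giving a coarse localization in time."""
    n, T, c = acts.shape
    C, lab = kmeans(acts.reshape(n * T, c), k)
    return C, lab.reshape(n, T)

# toy activations: first half of each series in one regime, second half in another
rng = np.random.default_rng(1)
a = np.concatenate([rng.normal(0, 0.1, (4, 10, 3)),
                    rng.normal(5, 0.1, (4, 10, 3))], axis=1)
centroids, labels = extract_concepts(a, k=2)
```

On this synthetic input the two recovered "concepts" line up exactly with the two activation regimes, which is the kind of localization the abstract describes.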

[LG-14] Joint Pedestrian and Vehicle Traffic Optimization in Urban Environments using Reinforcement Learning

链接: https://arxiv.org/abs/2504.05018
作者: Bibek Poudel,Xuan Wang,Weizi Li,Lei Zhu,Kevin Heaslip
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) holds significant promise for adaptive traffic signal control. While existing RL-based methods demonstrate effectiveness in reducing vehicular congestion, their predominant focus on vehicle-centric optimization leaves pedestrian mobility needs and safety challenges unaddressed. In this paper, we present a deep RL framework for adaptive control of eight traffic signals along a real-world urban corridor, jointly optimizing both pedestrian and vehicular efficiency. Our single-agent policy is trained using real-world pedestrian and vehicle demand data derived from Wi-Fi logs and video analysis. The results demonstrate significant performance improvements over traditional fixed-time signals, reducing average wait times per pedestrian and per vehicle by up to 67% and 52%, respectively, while simultaneously decreasing total accumulated wait times for both groups by up to 67% and 53%. Additionally, our results demonstrate generalization capabilities across varying traffic demands, including conditions entirely unseen during training, validating RL’s potential for developing transportation systems that serve all road users.

[LG-15] Deconstructing Jazz Piano Style Using Machine Learning

链接: https://arxiv.org/abs/2504.05009
作者: Huw Cheston,Reuben Bance,Peter M. C. Harrison
类目: ound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Paper: 40 pages, 11 figures, 1 table. Supplementary material: 33 pages, 48 figures, 6 tables

点击查看摘要

Abstract:Artistic style has been studied for centuries, and recent advances in machine learning create new possibilities for understanding it computationally. However, ensuring that machine-learning models produce insights aligned with the interests of practitioners and critics remains a significant challenge. Here, we focus on musical style, which benefits from a rich theoretical and mathematical analysis tradition. We train a variety of supervised-learning models to identify 20 iconic jazz musicians across a carefully curated dataset of 84 hours of recordings, and interpret their decision-making processes. Our models include a novel multi-input architecture that enables four musical domains (melody, harmony, rhythm, and dynamics) to be analysed separately. These models enable us to address fundamental questions in music theory and also advance the state-of-the-art in music performer identification (94% accuracy across 20 classes). We release open-source implementations of our models and an accompanying web application for exploring musical styles.

[LG-16] A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization

链接: https://arxiv.org/abs/2504.04950
作者: Wenyuan Xu,Xiaochen Zuo,Chao Xin,Yu Yue,Lin Yan,Yonghui Wu
类目: Machine Learning (cs.LG)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) has emerged as an important paradigm for aligning large language models (LLMs) with human preferences during post-training. This framework typically involves two stages: first, training a reward model on human preference data, followed by optimizing the language model using reinforcement learning algorithms. However, current RLHF approaches may be constrained by two limitations. First, existing RLHF frameworks often rely on Bradley-Terry models to assign scalar rewards based on pairwise comparisons of individual responses. However, this approach imposes significant challenges on the reward model (RM), as the inherent variability in prompt-response pairs across different contexts demands robust calibration capabilities from the RM. Second, reward models are typically initialized from generative foundation models, such as pre-trained or supervised fine-tuned models, despite the fact that reward models perform discriminative tasks, creating a mismatch. This paper introduces Pairwise-RL, an RLHF framework that addresses these challenges through a combination of generative reward modeling and a pairwise proximal policy optimization (PPO) algorithm. Pairwise-RL unifies reward model training and its application during reinforcement learning within a consistent pairwise paradigm, leveraging generative modeling techniques to enhance reward model performance and score calibration. Experimental evaluations demonstrate that Pairwise-RL outperforms traditional RLHF frameworks across both internal evaluation datasets and standard public benchmarks, underscoring its effectiveness in improving alignment and model behavior.
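The Bradley-Terry setup that the abstract contrasts with can be stated in a few lines: a scalar-reward RM is trained to minimize exactly this pairwise negative log-likelihood (a textbook sketch of the baseline, not of Pairwise-RL itself):

```python
import math

def bradley_terry_prob(r_a, r_b):
    """P(response a preferred over b) under Bradley-Terry with scalar rewards."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def pairwise_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the observed preference - the usual RM objective."""
    return -math.log(bradley_terry_prob(r_chosen, r_rejected))

p = bradley_terry_prob(2.0, 0.0)    # reward gap of 2 in favor of response a
loss_good = pairwise_loss(2.0, 0.0) # RM agrees with the human preference
loss_bad = pairwise_loss(0.0, 2.0)  # RM ranks the rejected response higher
```

Note that only the reward *difference* matters, so absolute scores are unanchored across prompts; this is the calibration difficulty the abstract points to.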

[LG-17] Constrained Gaussian Process Motion Planning via Stein Variational Newton Inference

链接: https://arxiv.org/abs/2504.04936
作者: Jiayun Li,Kay Pompetzki,An Thai Le,Haolei Tong,Jan Peters,Georgia Chalvatzaki
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gaussian Process Motion Planning (GPMP) is a widely used framework for generating smooth trajectories within a limited compute time–an essential requirement in many robotic applications. However, traditional GPMP approaches often struggle with enforcing hard nonlinear constraints and rely on Maximum a Posteriori (MAP) solutions that disregard the full Bayesian posterior. This limits planning diversity and ultimately hampers decision-making. Recent efforts to integrate Stein Variational Gradient Descent (SVGD) into motion planning have shown promise in handling complex constraints. Nonetheless, these methods still face persistent challenges, such as difficulties in strictly enforcing constraints and inefficiencies when the probabilistic inference problem is poorly conditioned. To address these issues, we propose a novel constrained Stein Variational Gaussian Process Motion Planning (cSGPMP) framework, incorporating a GPMP prior specifically designed for trajectory optimization under hard constraints. Our approach improves the efficiency of particle-based inference while explicitly handling nonlinear constraints. This advancement significantly broadens the applicability of GPMP to motion planning scenarios demanding robust Bayesian inference, strict constraint adherence, and computational efficiency within a limited time. We validate our method on standard benchmarks, achieving an average success rate of 98.57% across 350 planning tasks, significantly outperforming competitive baselines. This demonstrates the ability of our method to discover and use diverse trajectory modes, enhancing flexibility and adaptability in complex environments, and delivering significant improvements over standard baselines without incurring major computational costs.
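For context, a single SVGD update, the particle-based inference scheme this line of work builds on, looks like the following (a plain NumPy sketch on a toy Gaussian target; the paper's Stein Variational Newton machinery and hard-constraint handling are not shown):

```python
import numpy as np

def rbf_kernel(X, h=1.0):
    """RBF kernel matrix and its gradient w.r.t. the first argument."""
    diff = X[:, None, :] - X[None, :, :]      # (n, n, d)
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h * h))
    gradK = -diff * K[:, :, None] / (h * h)   # d/dX_i K(X_i, X_j)
    return K, gradK

def svgd_step(X, grad_logp, step=0.1, h=1.0):
    """One Stein Variational Gradient Descent update on particles X.

    phi(x_i) = (1/n) sum_j [ K(x_j, x_i) grad log p(x_j) + grad_{x_j} K(x_j, x_i) ]
    The first term pulls particles toward high density; the second repels
    them from each other, keeping the particle set diverse."""
    n = len(X)
    K, gradK = rbf_kernel(X, h)
    phi = (K @ grad_logp(X) + gradK.sum(0)) / n
    return X + step * phi

# target: standard 2-D Gaussian, so grad log p(x) = -x
rng = np.random.default_rng(0)
X = rng.normal(3.0, 1.0, size=(50, 2))  # particles start far from the mode
for _ in range(500):
    X = svgd_step(X, lambda x: -x)
```

After the loop the particle cloud sits around the origin with roughly unit spread; the repulsive kernel term is what preserves the "diverse trajectory modes" the abstract emphasizes, instead of collapsing to a single MAP solution.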

[LG-18] SoK: LLM-based Log Parsing

链接: https://arxiv.org/abs/2504.04877
作者: Viktor Beck,Max Landauer,Markus Wurzenberger,Florian Skopik,Andreas Rauber
类目: Machine Learning (cs.LG)
*备注: 34 pages, 11 figures

点击查看摘要

Abstract:Log data, generated by software systems, provides crucial insights for tasks like monitoring, root cause analysis, and anomaly detection. Due to the vast volume of logs, automated log parsing is essential to transform semi-structured log messages into structured representations. Traditional log parsing techniques often require manual configurations, such as defining log formats or labeling data, which limits scalability and usability. Recent advances in large language models (LLMs) have introduced the new research field of LLM-based log parsing, offering potential improvements in automation and adaptability. Despite promising results, there is no structured overview of these approaches since this is a relatively new research field with the earliest advances published in late 2023. This paper systematically reviews 29 LLM-based log parsing methods, comparing their capabilities, limitations, and reliance on manual effort. We analyze the learning and prompt-engineering paradigms employed, efficiency- and effectiveness-enhancing techniques, and the role of LLMs in the parsing process. We aggregate the results of the survey in a large table comprising the characterizing features of LLM-based log parsing approaches and derive the general process of LLM-based log parsing, incorporating all reviewed approaches in a single flow chart. Additionally, we benchmark seven open-source LLM-based log parsers on public datasets and critically assess their reproducibility. Our findings summarize the advances of this new research field and provide insights for researchers and practitioners seeking efficient and user-friendly log parsing solutions, with all code and results made publicly available for transparency.

[LG-19] Nonlocal techniques for the analysis of deep ReLU neural network approximations

链接: https://arxiv.org/abs/2504.04847
作者: Cornelia Schneider,Mario Ullrich,Jan Vybiral
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Recently, Daubechies, DeVore, Foucart, Hanin, and Petrova introduced a system of piece-wise linear functions, which can be easily reproduced by artificial neural networks with the ReLU activation function and which form a Riesz basis of L_2([0,1]) . This work was generalized by two of the authors to the multivariate setting. We show that this system serves as a Riesz basis also for Sobolev spaces W^s([0,1]^d) and Barron classes \mathbb B^s([0,1]^d) with smoothness 0 < s < 1 . We apply this fact to re-prove some recent results on the approximation of functions from these classes by deep neural networks. Our proof method avoids using local approximations and allows us to track also the implicit constants as well as to show that we can avoid the curse of dimension. Moreover, we also study how well one can approximate Sobolev and Barron functions by ANNs if only function values are known.

[LG-20] Attentional Graph Meta-Learning for Indoor Localization Using Extremely Sparse Fingerprints

链接: https://arxiv.org/abs/2504.04829
作者: Wenzhong Yan,Feng Yin,Jun Gao,Ao Wang,Yang Tian,Ruizhi Chen
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Fingerprint-based indoor localization is often labor-intensive due to the need for dense grids and repeated measurements across time and space. Maintaining high localization accuracy with extremely sparse fingerprints remains a persistent challenge. Existing benchmark methods primarily rely on the measured fingerprints, while neglecting valuable spatial and environmental characteristics. In this paper, we propose a systematic integration of an Attentional Graph Neural Network (AGNN) model, capable of learning spatial adjacency relationships and aggregating information from neighboring fingerprints, and a meta-learning framework that utilizes datasets with similar environmental characteristics to enhance model training. To minimize the labor required for fingerprint collection, we introduce two novel data augmentation strategies: 1) unlabeled fingerprint augmentation using moving platforms, which enables the semi-supervised AGNN model to incorporate information from unlabeled fingerprints, and 2) synthetic labeled fingerprint augmentation through environmental digital twins, which enhances the meta-learning framework through a practical distribution alignment that can effectively minimize the feature discrepancy between synthetic and real-world fingerprints. By integrating these novel modules, we propose the Attentional Graph Meta-Learning (AGML) model. This novel model combines the strengths of the AGNN model and the meta-learning framework to address the challenges posed by extremely sparse fingerprints. To validate our approach, we collected multiple datasets from both consumer-grade WiFi devices and professional equipment across diverse environments. Extensive experiments conducted on both synthetic and real-world datasets demonstrate that the AGML model-based localization method consistently outperforms all baseline methods using sparse fingerprints across all evaluated metrics.
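The neighborhood aggregation performed by an attentional graph layer, the AGNN building block, can be sketched in a few lines (single-head, GAT-style attention; the weights here are random placeholders standing in for trained fingerprint models, and the exact AGNN layer may differ):

```python
import numpy as np

def graph_attention_layer(H, adj, W, a):
    """Single-head graph attention aggregation (GAT-style).

    H:   (n, f) node features, e.g. fingerprint measurements per location.
    adj: (n, n) 0/1 adjacency (include self-loops).
    W:   (f, f2) shared linear transform; a: (2*f2,) attention vector.
    Each node attends only over its neighbors, so information from
    spatially adjacent fingerprints is aggregated with learned weights."""
    Z = H @ W
    out = np.zeros_like(Z)
    for i in range(len(H)):
        nbrs = np.flatnonzero(adj[i])
        s = np.array([np.concatenate([Z[i], Z[j]]) @ a for j in nbrs])
        s = np.where(s > 0, s, 0.2 * s)  # LeakyReLU, as in standard GAT
        e = np.exp(s - s.max())
        alpha = e / e.sum()              # attention weights over the neighborhood
        out[i] = alpha @ Z[nbrs]
    return out

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 3))
a = rng.normal(size=6)
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 0, 1]])  # last node only sees itself
out = graph_attention_layer(H, adj, W, a)
```

Because the attention weights are normalized per neighborhood, a node with no neighbors besides itself simply passes its transformed feature through.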

[LG-21] Topological Schrödinger Bridge Matching ICLR2025

链接: https://arxiv.org/abs/2504.04799
作者: Maosheng Yang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICLR 2025 Spotlight, 42 pages

点击查看摘要

Abstract:Given two boundary distributions, the Schrödinger Bridge (SB) problem seeks the most likely random evolution between them with respect to a reference process. It has revealed rich connections to recent machine learning methods for generative modeling and distribution matching. While these methods perform well in Euclidean domains, they are not directly applicable to topological domains such as graphs and simplicial complexes, which are crucial for data defined over network entities, such as node signals and edge flows. In this work, we propose the Topological Schrödinger Bridge problem (TSBP) for matching signal distributions on a topological domain. We set the reference process to follow some linear tractable topology-aware stochastic dynamics such as topological heat diffusion. For the case of Gaussian boundary distributions, we derive a closed-form topological SB (TSB) in terms of its time-marginal and stochastic differential. In the general case, leveraging the well-known result, we show that the optimal process follows the forward-backward topological dynamics governed by some unknowns. Building on these results, we develop TSB-based models for matching topological signals by parameterizing the unknowns in the optimal process as (topological) neural networks and learning them through likelihood training. We validate the theoretical results and demonstrate the practical applications of TSB-based models on both synthetic and real-world networks, emphasizing the role of topology. Additionally, we discuss the connections of TSB-based models to other emerging models, and outline future directions for topological signal matching.

[LG-22] TabRep: Training Tabular Diffusion Models with a Simple and Effective Continuous Representation

链接: https://arxiv.org/abs/2504.04798
作者: Jacob Si,Zijing Ou,Mike Qu,Zhengrui Xiang,Yingzhen Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have been the predominant generative model for tabular data generation. However, they face the conundrum of modeling under a separate versus a unified data representation. The former encounters the challenge of jointly modeling all multi-modal distributions of tabular data in one model. While the latter alleviates this by learning a single representation for all features, it currently leverages sparse suboptimal encoding heuristics and necessitates additional computation costs. In this work, we address the latter by presenting TabRep, a tabular diffusion architecture trained with a unified continuous representation. To motivate the design of our representation, we provide geometric insights into how the data manifold affects diffusion models. The key attributes of our representation are composed of its density, flexibility to provide ample separability for nominal features, and ability to preserve intrinsic relationships. Ultimately, TabRep provides a simple yet effective approach for training tabular diffusion models under a continuous data manifold. Our results showcase that TabRep achieves superior performance across a broad suite of evaluations. It is the first to synthesize tabular data that exceeds the downstream quality of the original datasets while preserving privacy and remaining computationally efficient.

[LG-23] Playing Non-Embedded Card-Based Games with Reinforcement Learning ATC WWW

链接: https://arxiv.org/abs/2504.04783
作者: Tianyang Wu,Lipeng Wan,Yuhang Wang,Qiang Wan,Xuguang Lan
类目: Machine Learning (cs.LG)
*备注: Match videos: this https URL , All code: this https URL , Detection dataset: this https URL , Expert dataset: this https URL

点击查看摘要

Abstract:Significant progress has been made in AI for games, including board games, MOBA, and RTS games. However, complex agents are typically developed in an embedded manner, directly accessing game state information, unlike human players who rely on noisy visual data, leading to unfair competition. Developing complex non-embedded agents remains challenging, especially in card-based RTS games with complex features and large state spaces. We propose a non-embedded offline reinforcement learning training strategy using visual inputs to achieve real-time autonomous gameplay in the RTS game Clash Royale. Due to the lack of an object detection dataset for this game, we designed an efficient generative object detection dataset for training. We extract features using state-of-the-art object detection and optical character recognition models. Our method enables real-time image acquisition, perception feature fusion, decision-making, and control on mobile devices, successfully defeating built-in AI opponents. All code is open-sourced at this https URL.

[LG-24] Feedback-Enhanced Hallucination-Resistant Vision-Language Model for Real-Time Scene Understanding

链接: https://arxiv.org/abs/2504.04772
作者: Zahir Alsulaimawi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-time scene comprehension is a key advance in artificial intelligence, enhancing robotics, surveillance, and assistive tools. However, hallucination remains a challenge. AI systems often misinterpret visual inputs, detecting nonexistent objects or describing events that never happened. These errors, far from minor, threaten reliability in critical areas like security and autonomous navigation where accuracy is essential. Our approach tackles this by embedding self-awareness into the AI. Instead of trusting initial outputs, our framework continuously assesses them in real time, adjusting confidence thresholds dynamically. When certainty falls below a solid benchmark, it suppresses unreliable claims. Combining YOLOv5’s object detection strength with VILA1.5-3B’s controlled language generation, we tie descriptions to confirmed visual data. Strengths include dynamic threshold tuning for better accuracy, evidence-based text to reduce hallucination, and real-time performance at 18 frames per second. This feedback-driven design cuts hallucination by 37 percent over traditional methods. Fast, flexible, and reliable, it excels in applications from robotic navigation to security monitoring, aligning AI perception with reality.
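The abstract's "dynamic confidence threshold" idea can be illustrated with a toy filter. The median-based rule below is our own stand-in, since the paper's actual thresholding logic is not specified in the abstract:

```python
def filter_detections(detections, base_threshold=0.5, window=None):
    """Suppress low-confidence detections with a dynamically tuned threshold.

    detections: list of (label, confidence) pairs from the detector.
    window:     recent confidence history; when the scene has been yielding
                high-confidence detections, the bar is raised so weaker,
                potentially hallucinated claims are suppressed.
    Hypothetical logic illustrating the 'suppress unreliable claims' idea."""
    window = window or []
    if window:
        median = sorted(window)[len(window) // 2]
        threshold = max(base_threshold, median)
    else:
        threshold = base_threshold
    kept = [(lab, c) for lab, c in detections if c >= threshold]
    return kept, threshold

dets = [("person", 0.91), ("dog", 0.42), ("car", 0.77)]
kept_static, t0 = filter_detections(dets)                        # plain 0.5 cutoff
kept_dynamic, t1 = filter_detections(dets, window=[0.8, 0.85, 0.9])
```

With recent confidences around 0.85, the dynamic threshold rises and the borderline "car" detection is suppressed along with the "dog" one, trading recall for fewer fabricated claims.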

[LG-25] MedGNN: Capturing the Links Between Urban Characteristics and Medical Prescriptions KDD2025

链接: https://arxiv.org/abs/2504.04739
作者: Minwei Zhao,Sanja Scepanovic,Stephen Law,Daniele Quercia,Ivica Obadic
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 12 pages’ main content. This is a preprint. Submitted to KDD 2025

点击查看摘要

Abstract:Understanding how urban socio-demographic and environmental factors relate with health is essential for public health and urban planning. However, traditional statistical methods struggle with nonlinear effects, while machine learning models often fail to capture geographical (nearby areas being more similar) and topological (unequal connectivity between places) effects in an interpretable way. To address this, we propose MedGNN, a spatio-topologically explicit framework that constructs a 2-hop spatial graph, integrating positional and locational node embeddings with urban characteristics in a graph neural network. Applied to MEDSAT, a comprehensive dataset covering over 150 environmental and socio-demographic factors and six prescription outcomes (depression, anxiety, diabetes, hypertension, asthma, and opioids) across 4,835 Greater London neighborhoods, MedGNN improved predictions by over 25% on average compared to baseline methods. Using depression prescriptions as a case study, we analyzed graph embeddings via geographical principal component analysis, identifying findings that: align with prior research (e.g., higher antidepressant prescriptions among older and White populations), contribute to ongoing debates (e.g., greenery linked to higher and NO2 to lower prescriptions), and warrant further study (e.g., canopy evaporation correlated with fewer prescriptions). These results demonstrate MedGNN’s potential, and more broadly, of carefully applied machine learning, to advance transdisciplinary public health research.
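The 2-hop spatial graph construction is straightforward to sketch from an adjacency matrix (an illustration of the general idea, not MedGNN's exact preprocessing):

```python
import numpy as np

def two_hop_graph(adj):
    """Expand a 1-hop adjacency matrix to 2-hop connectivity.

    A node is linked to its neighbors and its neighbors-of-neighbors, so a
    single GNN layer can mix information from a wider spatial context."""
    A = (adj + np.eye(len(adj)) > 0).astype(int)  # add self-loops
    A2 = (A @ A > 0).astype(int)                  # reachable within 2 hops
    np.fill_diagonal(A2, 0)                       # drop self-edges in the output
    return A2

# path graph 0-1-2-3: nodes 0 and 2 become 2-hop neighbors, 0 and 3 do not
path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]])
hop2 = two_hop_graph(path)
```

Squaring the self-looped adjacency is the standard trick here: a nonzero entry in `A @ A` means a walk of length at most two exists between the two neighborhoods.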

[LG-26] Large-Scale Mixed-Traffic and Intersection Control using Multi-agent Reinforcement Learning

链接: https://arxiv.org/abs/2504.04691
作者: Songyang Liu,Muyang Fan,Weizi Li,Jing Du,Shuai Li
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Traffic congestion remains a significant challenge in modern urban networks. Autonomous driving technologies have emerged as a potential solution. Among traffic control methods, reinforcement learning has shown superior performance over traffic signals in various scenarios. However, prior research has largely focused on small-scale networks or isolated intersections, leaving large-scale mixed traffic control largely unexplored. This study presents the first attempt to use decentralized multi-agent reinforcement learning for large-scale mixed traffic control in which some intersections are managed by traffic signals and others by robot vehicles. Evaluating a real-world network in Colorado Springs, CO, USA with 14 intersections, we measure traffic efficiency via average waiting time of vehicles at intersections and the number of vehicles reaching their destinations within a time window (i.e., throughput). At 80% RV penetration rate, our method reduces waiting time from 6.17 s to 5.09 s and increases throughput from 454 vehicles per 500 seconds to 493 vehicles per 500 seconds, outperforming the baseline of fully signalized intersections. These findings suggest that integrating reinforcement learning-based control into large-scale traffic can improve overall efficiency and may inform future urban planning strategies.

[LG-27] Sparsity-Aware Communication for Distributed Graph Neural Network Training

链接: https://arxiv.org/abs/2504.04673
作者: Ujjaini Mukhodopadhyay,Alok Tripathy,Oguz Selvitopi,Katherine Yelick,Aydin Buluc
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are a computationally efficient method to learn embeddings and classifications on graph data. However, GNN training has low computational intensity, making communication costs the bottleneck for scalability. Sparse-matrix dense-matrix multiplication (SpMM) is the core computational operation in full-graph training of GNNs. Previous work parallelizing this operation focused on sparsity-oblivious algorithms, where matrix elements are communicated regardless of the sparsity pattern. This leads to a predictable communication pattern that can be overlapped with computation and enables the use of collective communication operations at the expense of wasting significant bandwidth by communicating unnecessary data. We develop sparsity-aware algorithms that tackle the communication bottlenecks in GNN training with three novel approaches. First, we communicate only the necessary matrix elements. Second, we utilize a graph partitioning model to reorder the matrix and drastically reduce the amount of communicated elements. Finally, we address the high load imbalance in communication with a tailored partitioning model, which minimizes both the total communication volume and the maximum sending volume. We further couple these sparsity-exploiting approaches with a communication-avoiding approach (1.5D parallel SpMM) in which submatrices are replicated to reduce communication. We explore the tradeoffs of these combined optimizations and show up to 14X improvement on 256 GPUs and on some instances reducing communication to almost zero resulting in a communication-free parallel training relative to a popular GNN framework based on communication-oblivious SpMM.
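The first of the three ideas, communicating only the necessary matrix elements, amounts to reading off which column indices actually occur in a process's local sparse rows and requesting only those rows of the dense factor (a schematic sketch; the paper's partitioning models and 1.5D replication are not shown):

```python
def required_remote_rows(local_cols, owner_of_row, my_rank):
    """Rows of the dense matrix a process must fetch for its local SpMM.

    local_cols: column indices appearing in this process's sparse rows.
    Only those rows of the dense factor are touched, so only they need to
    be communicated (sparsity-aware), instead of every remote row
    (sparsity-oblivious)."""
    need = {}
    for c in sorted(local_cols):
        owner = owner_of_row(c)
        if owner != my_rank:
            need.setdefault(owner, []).append(c)
    return need

# 8 dense rows block-distributed over 4 ranks (2 consecutive rows each)
owner = lambda r: r // 2
nonzero_cols = {0, 1, 5, 6}          # columns touched by rank 0's sparse rows
plan = required_remote_rows(nonzero_cols, owner, my_rank=0)
oblivious_volume = 8 - 2             # an oblivious scheme fetches every remote row
aware_volume = sum(len(v) for v in plan.values())
```

The per-owner request lists in `plan` are exactly where a graph-partitioning reordering helps: grouping a row's nonzero columns onto few owners shrinks both the total and the maximum send volume, as the abstract describes.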

[LG-28] Scaling Graph Neural Networks for Particle Track Reconstruction

链接: https://arxiv.org/abs/2504.04670
作者: Alok Tripathy,Alina Lazar,Xiangyang Ju,Paolo Calafiura,Katherine Yelick,Aydin Buluc
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Particle track reconstruction is an important problem in high-energy physics (HEP), necessary to study properties of subatomic particles. Traditional track reconstruction algorithms scale poorly with the number of particles within the accelerator. The this http URL project, to alleviate this computational burden, introduces a pipeline that reduces particle track reconstruction to edge classification on a graph, and uses graph neural networks (GNNs) to produce particle tracks. However, this GNN-based approach is memory-prohibitive and skips graphs that would exceed GPU memory. We introduce improvements to the this http URL pipeline to train on samples of input particle graphs, and show that these improvements generalize to higher precision and recall. In addition, we adapt performance optimizations, introduced for GNN training, to fit our augmented this http URL pipeline. These optimizations provide a 2\times speedup over our baseline implementation in PyTorch Geometric.

[LG-29] A Simultaneous Approach for Training Neural Differential-Algebraic Systems of Equations

链接: https://arxiv.org/abs/2504.04665
作者: Laurens R. Lueg,Victor Alves,Daniel Schicksnus,John R. Kitchin,Carl D. Laird,Lorenz T. Biegler
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Scientific machine learning is an emerging field that broadly describes the combination of scientific computing and machine learning to address challenges in science and engineering. Within the context of differential equations, this has produced highly influential methods, such as neural ordinary differential equations (NODEs). Recent works extend this line of research to consider neural differential-algebraic systems of equations (DAEs), where some unknown relationships within the DAE are learned from data. Training neural DAEs, similarly to neural ODEs, is computationally expensive, as it requires the solution of a DAE for every parameter update. Further, the rigorous consideration of algebraic constraints is difficult within common deep learning training algorithms such as stochastic gradient descent. In this work, we apply the simultaneous approach to neural DAE problems, resulting in a fully discretized nonlinear optimization problem, which is solved to local optimality and simultaneously obtains the neural network parameters and the solution to the corresponding DAE. We extend recent work demonstrating the simultaneous approach for neural ODEs, by presenting a general framework to solve neural DAEs, with explicit consideration of hybrid models, where some components of the DAE are known, e.g. physics-informed constraints. Furthermore, we present a general strategy for improving the performance and convergence of the nonlinear programming solver, based on solving an auxiliary problem for initialization and approximating Hessian terms. We achieve promising results in terms of accuracy, model generalizability and computational cost, across different problem settings such as sparse data, unobserved states and multiple trajectories. Lastly, we provide several promising future directions to improve the scalability and robustness of our approach.

[LG-30] ACE-RLHF: Automated Code Evaluation and Socratic Feedback Generation Tool using Large Language Models and Reinforcement Learning with Human Feedback

链接: https://arxiv.org/abs/2504.04657
作者: Tasnia Rahman,Sathish A. P. Kumar,Sumit Jha,Arvind Ramanathan
类目: Machine Learning (cs.LG)
*备注: 9 pages, 3 figures

点击查看摘要

Abstract:Automated Program Repair tools are developed for generating feedback and suggesting a repair method for erroneous code. State of the art (SOTA) code repair methods rely on data-driven approaches and often fail to deliver solutions for complicated programming questions. To interpret the natural language of unprecedented programming problems, using Large Language Models (LLMs) for code-feedback generation is crucial. LLMs generate more comprehensible feedback than compiler-generated error messages, and Reinforcement Learning with Human Feedback (RLHF) further enhances quality by integrating a human-in-the-loop, which helps novice students learn programming from scratch interactively. We apply the RLHF fine-tuning technique to produce Socratic responses, such as a question with a hint to solve the programming issue. We propose a code feedback generation tool, Automated Code Evaluation with RLHF (ACE-RLHF), which fine-tunes LLMs with RLHF, combining two open-source LLM models with two different SOTA optimization techniques. The quality of feedback is evaluated on two benchmark datasets containing basic and competition-level programming questions, where the latter is proposed by us. We achieved 2-5% higher accuracy than RL-free SOTA techniques using Llama-3-7B-Proximal-policy optimization in automated evaluation and similar or slightly higher accuracy compared to reward model-free RL with AI Feedback (RLAIF). We achieved almost 40% higher accuracy with GPT-3.5 Best-of-n optimization while performing manual evaluation.

[LG-31] Sub-Clustering for Class Distance Recalculation in Long-Tailed Drug Classification

链接: https://arxiv.org/abs/2504.04647
作者: Yujia Su,Xinjie Li,Lionel Z. Wang
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:In the real world, long-tailed data distributions are prevalent, making it challenging for models to effectively learn and classify tail classes. However, we discover that in the field of drug chemistry, certain tail classes exhibit higher identifiability during training due to their unique molecular structural features, a finding that significantly contrasts with the conventional understanding that tail classes are generally difficult to identify. Existing imbalance learning methods, such as resampling and cost-sensitive reweighting, overly rely on sample quantity priors, causing models to excessively focus on tail classes at the expense of head class performance. To address this issue, we propose a novel method that breaks away from the traditional static evaluation paradigm based on sample size. Instead, we establish a dynamical inter-class separability metric using feature distances between different classes. Specifically, we employ a sub-clustering contrastive learning approach to thoroughly learn the embedding features of each class, and we dynamically compute the distances between class embeddings to capture the relative positional evolution of samples from different classes in the feature space, thereby rebalancing the weights of the classification loss function. We conducted experiments on multiple existing long-tailed drug datasets and achieved competitive results by improving the accuracy of tail classes without compromising the performance of dominant classes.
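The rebalancing idea (inter-class separability from feature distances rather than sample counts) can be sketched in a few lines of numpy. This is illustrative only: the paper's sub-clustering contrastive learning is omitted, and the inverse-distance weighting rule below is our own assumption of one plausible instantiation.

```python
import numpy as np

def class_distance_weights(embeddings, labels, num_classes):
    """Per-class loss weights from inter-class centroid distances:
    classes whose centroid lies close to another class (poorly separable)
    receive larger weights, regardless of sample counts."""
    centroids = np.stack([embeddings[labels == c].mean(axis=0)
                          for c in range(num_classes)])
    # pairwise Euclidean distances between class centroids
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)
    # separability of a class = distance to its nearest other class
    sep = dist.min(axis=1)
    weights = 1.0 / (sep + 1e-8)
    return weights / weights.sum() * num_classes  # normalize to mean 1

rng = np.random.default_rng(0)
# three classes: 0 and 1 overlap in feature space, 2 is far away
emb = np.concatenate([rng.normal(0.0, 0.1, (50, 2)),
                      rng.normal(0.3, 0.1, (50, 2)),
                      rng.normal(5.0, 0.1, (50, 2))])
lab = np.repeat([0, 1, 2], 50)
w = class_distance_weights(emb, lab, 3)
print(w)  # classes 0 and 1 get larger weights than the well-separated class 2
```

Recomputing these weights as the embeddings evolve during training gives the "dynamic" evaluation the abstract contrasts with static, sample-size-based reweighting.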

[LG-32] Exact Unlearning of Finetuning Data via Model Merging at Scale

链接: https://arxiv.org/abs/2504.04626
作者: Kevin Kuo,Amrith Setlur,Kartik Srinivas,Aditi Raghunathan,Virginia Smith
类目: Machine Learning (cs.LG)
*备注: 9 pages, 10 figures

点击查看摘要

Abstract:Approximate unlearning has gained popularity as an approach to efficiently update an LLM so that it behaves (roughly) as if it was not trained on a subset of data to begin with. However, existing methods are brittle in practice and can easily be attacked to reveal supposedly unlearned information. To alleviate issues with approximate unlearning, we instead propose SIFT-Masks (SIgn-Fixed Tuning-Masks), an exact unlearning method based on model merging. SIFT-Masks addresses two key limitations of standard model merging: (1) merging a large number of tasks can severely harm utility; and (2) methods that boost utility by sharing extra information across tasks make exact unlearning prohibitively expensive. SIFT-Masks solves these issues by (1) applying local masks to recover task-specific performance; and (2) constraining finetuning to align with a global sign vector as a lightweight approach to determine masks independently before merging. Across four settings where we merge up to 500 models, SIFT-Masks improves accuracy by 5-80% over naive merging and uses up to 250x less compute for exact unlearning compared to other merging baselines.
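The core mechanics (sign-constrained task vectors, merging by summation, exact unlearning by re-merging without the forgotten task) can be sketched with plain numpy. This is a toy stand-in, not the paper's algorithm: the local masks are omitted and `sign_fix` merely zeroes entries that conflict with the global sign vector.

```python
import numpy as np

def sign_fix(task_vec, global_sign):
    """Keep only updates that agree with the shared global sign vector
    (zero out conflicting entries) -- a stand-in for sign-constrained
    finetuning."""
    return np.where(np.sign(task_vec) == global_sign, task_vec, 0.0)

def merge(task_vecs):
    """Merge task vectors by summation; sign-fixed vectors never cancel."""
    return np.sum(task_vecs, axis=0)

def unlearn(task_vecs, forget_idx):
    """Exact unlearning: re-merge all retained task vectors from scratch."""
    kept = [v for i, v in enumerate(task_vecs) if i != forget_idx]
    return merge(kept)

rng = np.random.default_rng(1)
s = np.sign(rng.normal(size=8))                # global sign vector
tasks = [sign_fix(rng.normal(size=8), s) for _ in range(3)]
theta = merge(tasks)
theta_wo_1 = unlearn(tasks, 1)
# subtracting task 1's contribution recovers the merge of tasks 0 and 2
# (up to floating point), so no trace of the forgotten data remains
print(np.allclose(theta - tasks[1], theta_wo_1))  # True
```

Because each task vector is determined independently of the others, unlearning one task never requires retraining the rest, which is what makes the approach exact rather than approximate.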

[LG-33] SiameseDuo++: Active Learning from Data Streams with Dual Augmented Siamese Networks

链接: https://arxiv.org/abs/2504.04613
作者: Kleanthis Malialis,Stylianos Filippou,Christos G. Panayiotou,Marios M. Polycarpou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data stream mining, also known as stream learning, is a growing area which deals with learning from high-speed arriving data. Its relevance has surged recently due to its wide range of applicability, such as, critical infrastructure monitoring, social media analysis, and recommender systems. The design of stream learning methods faces significant research challenges; from the nonstationary nature of the data (referred to as concept drift) and the fact that data streams are typically not annotated with the ground truth, to the requirement that such methods should process large amounts of data in real-time with limited memory. This work proposes the SiameseDuo++ method, which uses active learning to automatically select instances for a human expert to label according to a budget. Specifically, it incrementally trains two siamese neural networks which operate in synergy, augmented by generated examples. Both the proposed active learning strategy and augmentation operate in the latent space. SiameseDuo++ addresses the aforementioned challenges by operating with limited memory and limited labelling budget. Simulation experiments show that the proposed method outperforms strong baselines and state-of-the-art methods in terms of learning speed and/or performance. To promote open science we publicly release our code and datasets.

[LG-34] Modeling of AUV Dynamics with Limited Resources: Efficient Online Learning Using Uncertainty

链接: https://arxiv.org/abs/2504.04583
作者: Michal Tešnar,Bilal Wehbe,Matias Valdenegro-Toro
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 10 Pages, 9 Figures. Oceans Brest 2025 camera ready

点击查看摘要

Abstract:Machine learning proves effective in constructing dynamics models from data, especially for underwater vehicles. Continuous refinement of these models using incoming data streams, however, often requires storage of an overwhelming amount of redundant data. This work investigates the use of uncertainty in the selection of data points to rehearse in online learning when storage capacity is constrained. The models are learned using an ensemble of multilayer perceptrons as they perform well at predicting epistemic uncertainty. We present three novel approaches: the Threshold method, which excludes samples with uncertainty below a specified threshold, the Greedy method, designed to maximize uncertainty among the stored points, and Threshold-Greedy, which combines the previous two approaches. The methods are assessed on data collected by an underwater vehicle Dagon. Comparison with baselines reveals that the Threshold exhibits enhanced stability throughout the learning process and also yields a model with the least cumulative testing loss. We also conducted detailed analyses on the impact of model parameters and storage size on the performance of the models, as well as a comparison of three different uncertainty estimation methods.
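The Threshold method can be sketched in a few lines: ensemble disagreement serves as the uncertainty estimate, and only sufficiently uncertain samples enter a bounded rehearsal buffer. The eviction rule (drop the oldest) is our illustrative assumption, not necessarily the paper's choice.

```python
import numpy as np

def ensemble_uncertainty(preds):
    """Epistemic uncertainty as the std of an ensemble's predictions
    (preds: ensemble_size x n_samples)."""
    return preds.std(axis=0)

def threshold_select(buffer, candidates, uncert, tau, capacity):
    """Threshold method (sketch): store incoming samples whose
    uncertainty exceeds tau, evicting the oldest when the buffer is full."""
    for x, u in zip(candidates, uncert):
        if u >= tau:
            buffer.append(x)
            if len(buffer) > capacity:
                buffer.pop(0)
    return buffer

rng = np.random.default_rng(0)
preds = rng.normal(size=(5, 10))   # 5 ensemble members, 10 incoming samples
u = ensemble_uncertainty(preds)
buf = threshold_select([], list(range(10)), u, tau=np.median(u), capacity=4)
print(len(buf))  # at most `capacity` samples retained for rehearsal
```

The Greedy variant would instead compare each candidate against the least uncertain point already stored, replacing it when the candidate's uncertainty is higher.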

[LG-35] Better Rates for Random Task Orderings in Continual Linear Models

链接: https://arxiv.org/abs/2504.04579
作者: Itay Evron,Ran Levinstein,Matan Schliserman,Uri Sherman,Tomer Koren,Daniel Soudry,Nathan Srebro
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the common continual learning setup where an overparameterized model is sequentially fitted to a set of jointly realizable tasks. We analyze the forgetting, i.e., loss on previously seen tasks, after k iterations. For linear models, we prove that fitting a task is equivalent to a single stochastic gradient descent (SGD) step on a modified objective. We develop novel last-iterate SGD upper bounds in the realizable least squares setup, and apply them to derive new results for continual learning. Focusing on random orderings over T tasks, we establish universal forgetting rates, whereas existing rates depend on the problem dimensionality or complexity. Specifically, in continual regression with replacement, we improve the best existing rate from O((d-r)/k) to O(\min(k^{-1/4}, \sqrt{d-r}/k, \sqrt{Tr}/k)), where d is the dimensionality and r the average task rank. Furthermore, we establish the first rates for random task orderings without replacement. The obtained rate of O(\min(T^{-1/4}, (d-r)/T)) proves for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences. Finally, we prove a similar O(k^{-1/4}) universal rate for the forgetting in continual linear classification on separable data. Our universal rates apply for broader projection methods, such as block Kaczmarz and POCS, illuminating their loss convergence under i.i.d. and one-pass orderings.
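The "fitting a task = one projection step" view is easy to demonstrate numerically: for a linear task (A, b), the minimum-norm update that exactly fits the task is a block Kaczmarz projection. The sketch below (our construction, with arbitrary dimensions) fits jointly realizable tasks in a random order and measures the resulting forgetting.

```python
import numpy as np

def fit_task(w, A, b):
    """Minimum-norm update that exactly fits task (A, b):
    a block Kaczmarz projection of w onto the affine set {v : A v = b}."""
    r = b - A @ w
    return w + A.T @ np.linalg.solve(A @ A.T, r)

rng = np.random.default_rng(0)
d, T = 20, 30
w_star = rng.normal(size=d)               # shared solution: jointly realizable
tasks = []
for _ in range(T):
    A = rng.normal(size=(2, d))           # low-rank task (rank r = 2)
    tasks.append((A, A @ w_star))

w = np.zeros(d)
order = rng.permutation(T)                # random ordering without replacement
for t in order:
    A, b = tasks[t]
    w = fit_task(w, A, b)

# forgetting: average squared loss over all tasks after the full sequence
forget = np.mean([np.mean((A @ w - b) ** 2) for A, b in tasks])
print(forget)  # small: the random one-pass ordering keeps forgetting low
```

Each step leaves the current task at zero loss while perturbing earlier tasks as little as possible, which is exactly the quantity the paper's forgetting rates bound.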

[LG-36] A Classification View on Meta Learning Bandits

链接: https://arxiv.org/abs/2504.04505
作者: Mirco Mutti,Jeongyeol Kwon,Shie Mannor,Aviv Tamar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contextual multi-armed bandits are a popular choice to model sequential decision-making. E.g., in a healthcare application we may perform various tests to assess a patient's condition (exploration) and then decide on the best treatment to give (exploitation). When humans design strategies, they aim for the exploration to be fast, since the patient's health is at stake, and easy to interpret for a physician overseeing the process. However, common bandit algorithms are nothing like that: The regret caused by exploration scales with \sqrt{H} over H rounds and decision strategies are based on opaque statistical considerations. In this paper, we use an original classification view to meta learn interpretable and fast exploration plans for a fixed collection of bandits \mathbb{M}. The plan is prescribed by an interpretable decision tree probing decisions' payoff to classify the test bandit. The test regret of the plan in the stochastic and contextual setting scales with O(\lambda^{-2} C_\lambda(\mathbb{M}) \log^2(MH)), where M is the size of \mathbb{M}, \lambda a separation parameter over the bandits, and C_\lambda(\mathbb{M}) a novel classification-coefficient that fundamentally links meta learning bandits with classification. Through a nearly matching lower bound, we show that C_\lambda(\mathbb{M}) inherently captures the complexity of the setting.

[LG-37] Deliberate Planning of 3D Bin Packing on Packing Configuration Trees

链接: https://arxiv.org/abs/2504.04421
作者: Hang Zhao,Juzhan Xu,Kexiong Yu,Ruizhen Hu,Chenyang Zhu,Kai Xu
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online 3D Bin Packing Problem (3D-BPP) has widespread applications in industrial automation. Existing methods usually solve the problem with limited resolution of spatial discretization, and/or cannot deal with complex practical constraints well. We propose to enhance the practical applicability of online 3D-BPP via learning on a novel hierarchical representation, packing configuration tree (PCT). PCT is a full-fledged description of the state and action space of bin packing which can support packing policy learning based on deep reinforcement learning (DRL). The size of the packing action space is proportional to the number of leaf nodes, making the DRL model easy to train and well-performing even with continuous solution space. We further discover the potential of PCT as tree-based planners in deliberately solving packing problems of industrial significance, including large-scale packing and different variations of BPP setting. A recursive packing method is proposed to decompose large-scale packing into smaller sub-trees while a spatial ensemble mechanism integrates local solutions into global. For different BPP variations with additional decision variables, such as lookahead, buffering, and offline packing, we propose a unified planning framework enabling out-of-the-box problem solving. Extensive evaluations demonstrate that our method outperforms existing online BPP baselines and is versatile in incorporating various practical constraints. The planning process excels across large-scale problems and diverse problem variations. We develop a real-world packing robot for industrial warehousing, with careful designs accounting for constrained placement and transportation stability. Our packing robot operates reliably and efficiently on unprotected pallets at 10 seconds per box. It achieves an average of 19 boxes per pallet with 57.4% space utilization for relatively large-size boxes.

[LG-38] Binned Group Algebra Factorization for Differentially Private Continual Counting

链接: https://arxiv.org/abs/2504.04398
作者: Monika Henzinger,Nikita P. Kalinin,Jalaj Upadhyay
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study memory-efficient matrix factorization for differentially private counting under continual observation. While recent work by Henzinger and Upadhyay (2024) introduced a factorization method with reduced error based on group algebra, its practicality in streaming settings remains limited by computational constraints. We present new structural properties of the group algebra factorization, enabling the use of a binning technique from Andersson and Pagh (2024). By grouping similar values in rows, the binning method reduces memory usage and running time to \tilde{O}(\sqrt{n}), where n is the length of the input stream, while maintaining a low error. Our work bridges the gap between theoretical improvements in factorization accuracy and practical efficiency in large-scale private learning systems.

[LG-39] Human-Level Competitive Pokémon via Scalable Offline Reinforcement Learning with Transformers

链接: https://arxiv.org/abs/2504.04395
作者: Jake Grigsby,Yuqi Xie,Justin Sasek,Steven Zheng,Yuke Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Competitive Pokémon Singles (CPS) is a popular strategy game where players learn to exploit their opponent based on imperfect information in battles that can last more than one hundred stochastic turns. AI research in CPS has been led by heuristic tree search and online self-play, but the game may also create a platform to study adaptive policies trained offline on large datasets. We develop a pipeline to reconstruct the first-person perspective of an agent from logs saved from the third-person perspective of a spectator, thereby unlocking a dataset of real human battles spanning more than a decade that grows larger every day. This dataset enables a black-box approach where we train large sequence models to adapt to their opponent based solely on their input trajectory while selecting moves without explicit search of any kind. We study a progression from imitation learning to offline RL and offline fine-tuning on self-play data in the hardcore competitive setting of Pokémon’s four oldest (and most partially observed) game generations. The resulting agents outperform a recent LLM Agent approach and a strong heuristic search engine. While playing anonymously in online battles against humans, our best agents climb to rankings inside the top 10% of active players.

[LG-40] A Novel Cholesky Kernel based Support Vector Classifier

链接: https://arxiv.org/abs/2504.04371
作者: Satyajeet Sahoo,Jhareswar Maiti
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Support Vector Machine (SVM) is a popular supervised classification model that works by first finding the margin boundaries for the training data classes and then calculating the decision boundary, which is then used to classify the test data. This study demonstrates limitations of traditional support vector classification which uses cartesian coordinate geometry to find the margin and decision boundaries in an input space using only a few support vectors, without considering data variance and correlation. Subsequently, the study proposes a new Cholesky Kernel that adjusts for the effects of variance-covariance structure of the data in the decision boundary equation and margin calculations. The study demonstrates that SVM model is valid only in the Euclidean space, and the Cholesky kernel obtained by decomposing covariance matrix acts as a transformation matrix, which when applied on the original data transforms the data from the input space to the Euclidean space. The effectiveness of the Cholesky kernel based SVM classifier is demonstrated by classifying the Wisconsin Breast Cancer (Diagnostic) Dataset and comparing with traditional SVM approaches. The Cholesky kernel based SVM model shows marked improvement in the precision, recall and F1 scores compared to linear and other kernel SVMs.
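The key idea (the Cholesky factor of the covariance matrix acts as a transformation into a space where Euclidean geometry is valid) can be demonstrated without a full SVM. The sketch below only shows the whitening step on synthetic correlated data; the downstream classifier and the Wisconsin dataset experiment are omitted.

```python
import numpy as np

def cholesky_whiten(X):
    """Transform data with L^{-1}, where Sigma = L L^T, so the whitened
    features have (approximately) identity covariance and Euclidean
    margins become meaningful."""
    Sigma = np.cov(X, rowvar=False) + 1e-8 * np.eye(X.shape[1])  # jitter
    L = np.linalg.cholesky(Sigma)
    Xw = np.linalg.solve(L, (X - X.mean(axis=0)).T).T
    return Xw, L

rng = np.random.default_rng(0)
# two strongly correlated Gaussian classes
C = np.array([[1.0, 0.9], [0.9, 1.0]])
A = rng.multivariate_normal([0.0, 0.0], C, 200)
B = rng.multivariate_normal([1.5, 1.5], C, 200)
X = np.vstack([A, B])
Xw, L = cholesky_whiten(X)
cov_w = np.cov(Xw, rowvar=False)
print(np.round(cov_w, 2))  # approximately the 2x2 identity matrix
```

Any linear classifier trained on `Xw` is then equivalent to a classifier on the original data under the Mahalanobis-like geometry induced by the covariance structure.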

[LG-41] Extending Cox Proportional Hazards Model with Symbolic Non-Linear Log-Risk Functions for Survival Analysis

链接: https://arxiv.org/abs/2504.04353
作者: Jiaxiang Cheng,Guoqiang Hu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The Cox proportional hazards (CPH) model has been widely applied in survival analysis to estimate relative risks across different subjects given multiple covariates. Traditional CPH models rely on a linear combination of covariates weighted with coefficients as the log-risk function, which imposes a strong and restrictive assumption, limiting generalization. Recent deep learning methods enable non-linear log-risk functions. However, they often lack interpretability due to the end-to-end training mechanisms. The implementation of Kolmogorov-Arnold Networks (KAN) offers new possibilities for extending the CPH model with fully transparent and symbolic non-linear log-risk functions. In this paper, we introduce Generalized Cox Proportional Hazards (GCPH) model, a novel method for survival analysis that leverages KAN to enable a non-linear mapping from covariates to survival outcomes in a fully symbolic manner. GCPH maintains the interpretability of traditional CPH models while allowing for the estimation of non-linear log-risk functions. Experiments conducted on both synthetic data and various public benchmarks demonstrate that GCPH achieves competitive performance in terms of prediction accuracy and exhibits superior interpretability compared to current state-of-the-art methods.
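The objective shared by CPH and its non-linear extensions is the partial likelihood, which accepts any log-risk function of the covariates. The sketch below evaluates it for arbitrary log-risk scores (the KAN-based symbolic log-risk of GCPH is not reproduced here; the synthetic data and the scoring functions compared are our assumptions).

```python
import numpy as np

def neg_log_partial_likelihood(log_risk, time, event):
    """Cox negative log partial likelihood for arbitrary log-risk scores.
    Each observed event contributes log_risk minus the log-sum-exp of
    log-risks over everyone still at risk at that time."""
    order = np.argsort(-time)                  # sort by descending time
    lr = log_risk[order]
    ev = event[order]
    log_cum = np.logaddexp.accumulate(lr)      # log sum exp over risk sets
    return -np.sum((lr - log_cum)[ev == 1])

rng = np.random.default_rng(0)
x = rng.normal(size=100)
time = rng.exponential(scale=np.exp(-x))       # higher x -> higher hazard
event = np.ones(100, dtype=int)                # no censoring, for simplicity

linear = neg_log_partial_likelihood(1.0 * x, time, event)
flat = neg_log_partial_likelihood(0.0 * x, time, event)
print(linear < flat)  # the informative log-risk attains lower loss
```

Swapping `1.0 * x` for any differentiable (or, as in GCPH, symbolic) function of the covariates leaves the training objective unchanged, which is what makes the extension natural.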

[LG-42] Tight Regret Bounds for Fixed-Price Bilateral Trade

链接: https://arxiv.org/abs/2504.04349
作者: Houshuang Chen,Yaonan Jin,Pinyan Lu,Chihao Zhang
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We examine fixed-price mechanisms in bilateral trade through the lens of regret minimization. Our main results are twofold. (i) For independent values, a near-optimal \widetilde{\Theta}(T^{2/3}) tight bound for Global Budget Balance fixed-price mechanisms with two-bit/one-bit feedback. (ii) For correlated/adversarial values, a near-optimal \Omega(T^{3/4}) lower bound for Global Budget Balance fixed-price mechanisms with two-bit/one-bit feedback, which improves the best known \Omega(T^{5/7}) lower bound obtained in the work [BCCF24] and, up to polylogarithmic factors, matches the \widetilde{\mathcal{O}}(T^{3/4}) upper bound obtained in the same work. Our work in combination with the previous works [CCCFL24mor, CCCFL24jmlr, AFF24, BCCF24] (essentially) gives a thorough understanding of regret minimization for fixed-price bilateral trade. En route, we have developed two technical ingredients that might be of independent interest: (i) A novel algorithmic paradigm, called fractal elimination, to address one-bit feedback and independent values. (ii) A new lower-bound construction with novel proof techniques, to address the Global Budget Balance constraint and correlated values.

[LG-43] Economic Battery Storage Dispatch with Deep Reinforcement Learning from Rule-Based Demonstrations

链接: https://arxiv.org/abs/2504.04326
作者: Manuel Sage,Martin Staniszewski,Yaoyao Fiona Zhao
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The application of deep reinforcement learning algorithms to economic battery dispatch problems has significantly increased recently. However, optimizing battery dispatch over long horizons can be challenging due to delayed rewards. In our experiments we observe poor performance of popular actor-critic algorithms when trained on yearly episodes with hourly resolution. To address this, we propose an approach extending soft actor-critic (SAC) with learning from demonstrations. The special feature of our approach is that, due to the absence of expert demonstrations, the demonstration data is generated through simple, rule-based policies. We conduct a case study on a grid-connected microgrid and use if-then-else statements based on the wholesale price of electricity to collect demonstrations. These are stored in a separate replay buffer and sampled with linearly decaying probability along with the agent’s own experiences. Despite these minimal modifications and the imperfections in the demonstration data, the results show a drastic performance improvement regarding both sample efficiency and final rewards. We further show that the proposed method reliably outperforms the demonstrator and is robust to the choice of rule, as long as the rule is sufficient to guide early training into the right direction.
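The sampling scheme (a separate demonstration buffer drawn from with linearly decaying probability alongside the agent's own experiences) is simple enough to sketch directly. This is an illustrative reconstruction; buffer contents, batch size, and the decay schedule endpoint are assumptions.

```python
import numpy as np

def sample_batch(demo_buf, agent_buf, step, total_steps, batch, rng):
    """Mix demonstration and agent transitions: the probability of drawing
    from the demo buffer decays linearly from 1 to 0 over training."""
    p_demo = max(0.0, 1.0 - step / total_steps)
    n_demo = rng.binomial(batch, p_demo)
    idx_d = rng.integers(0, len(demo_buf), n_demo)
    idx_a = rng.integers(0, len(agent_buf), batch - n_demo)
    return [demo_buf[i] for i in idx_d] + [agent_buf[i] for i in idx_a]

rng = np.random.default_rng(0)
demos = [("rule", i) for i in range(100)]   # e.g. if-then-else price policy
agent = [("agent", i) for i in range(100)]  # the SAC agent's own experience
early = sample_batch(demos, agent, step=0, total_steps=1000, batch=32, rng=rng)
late = sample_batch(demos, agent, step=1000, total_steps=1000, batch=32, rng=rng)
print(sum(s == "rule" for s, _ in early), sum(s == "rule" for s, _ in late))
```

Early batches are dominated by rule-based transitions that guide exploration toward sensible dispatch decisions, while late batches rely entirely on the agent's own, by then better-informed, experience.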

[LG-44] Causal Inference Isn't Special: Why It's Just Another Prediction Problem

链接: https://arxiv.org/abs/2504.04320
作者: Carlos Fernández-Loría
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Causal inference is often portrayed as fundamentally distinct from predictive modeling, with its own terminology, goals, and intellectual challenges. But at its core, causal inference is simply a structured instance of prediction under distribution shift. In both cases, we begin with labeled data from a source domain and seek to generalize to a target domain where outcomes are not observed. The key difference is that in causal inference, the labels – potential outcomes – are selectively observed based on treatment assignment, introducing bias that must be addressed through assumptions. This perspective reframes causal estimation as a familiar generalization problem and highlights how techniques from predictive modeling, such as reweighting and domain adaptation, apply directly to causal tasks. It also clarifies that causal assumptions are not uniquely strong – they are simply more explicit. By viewing causal inference through the lens of prediction, we demystify its logic, connect it to familiar tools, and make it more accessible to practitioners and educators alike.
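The "reweighting as domain adaptation" point can be made concrete with a standard inverse-propensity example (our own synthetic setup, with known propensities; the paper itself presents no code). Treated and control groups are shifted versions of the target population, and importance weights correct the shift exactly as in covariate-shift prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.binomial(1, 0.5, n)                    # confounder
e = np.where(x == 1, 0.8, 0.2)                 # propensity P(T=1 | x), known
t = rng.binomial(1, e)
y = 2.0 * x + 1.0 * t + rng.normal(0, 0.1, n)  # true treatment effect = 1.0

# naive contrast: confounded, because x drives both t and y
naive = y[t == 1].mean() - y[t == 0].mean()

# inverse-propensity reweighting = importance weighting for the
# distribution shift between the treated/control groups and the population
mu1 = np.sum(y * t / e) / np.sum(t / e)
mu0 = np.sum(y * (1 - t) / (1 - e)) / np.sum((1 - t) / (1 - e))
ipw = mu1 - mu0

print(round(naive, 2), round(ipw, 2))  # naive is biased; IPW recovers ~1.0
```

The weights `1/e` and `1/(1-e)` are precisely the density ratios a covariate-shift method would use, which is the paper's point: the causal assumption is what licenses treating the shift as known.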

[LG-45] Using ensemble methods of machine learning to predict real estate prices

链接: https://arxiv.org/abs/2504.04303
作者: Oleh Pastukh,Viktor Khomyshyn
类目: Machine Learning (cs.LG)
*备注: 10 pages, 4 figures

点击查看摘要

Abstract:In recent years, machine learning (ML) techniques have become a powerful tool for improving the accuracy of predictions and decision-making. Machine learning technologies have begun to penetrate all areas, including the real estate sector. Correct forecasting of real estate value plays an important role in the buyer-seller chain, because it ensures reasonableness of price expectations based on the offers available in the market and helps to avoid financial risks for both parties of the transaction. Accurate forecasting is also important for real estate investors to make an informed decision on a specific property. This study helps to gain a deeper understanding of how effective and accurate ensemble machine learning methods are in predicting real estate values. The results obtained in the work are quite accurate, as can be seen from the coefficient of determination (R^2), root mean square error (RMSE) and mean absolute error (MAE) calculated for each model. The Gradient Boosting Regressor model provides the highest accuracy, while the Extra Trees Regressor, Hist Gradient Boosting Regressor and Random Forest Regressor models give good results. In general, ensemble machine learning techniques can be effectively applied to real estate valuation. This work forms ideas for future research, which consists of preliminary processing of the data set by detecting and removing anomalous values, as well as the practical implementation of the obtained results.
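The three evaluation metrics named in the abstract (R^2, RMSE, MAE) are straightforward to compute; the sketch below implements them on a tiny hypothetical price vector (the numbers are ours, not from the study).

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R^2, RMSE and MAE -- the three metrics the study uses to compare
    the ensemble regressors."""
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean(resid ** 2))
    mae = np.mean(np.abs(resid))
    return r2, rmse, mae

# hypothetical sale prices (in thousands) and model predictions
y = np.array([200.0, 250.0, 300.0, 350.0])
pred = np.array([210.0, 240.0, 310.0, 340.0])
r2, rmse, mae = regression_metrics(y, pred)
print(round(r2, 3), round(rmse, 1), round(mae, 1))  # 0.968 10.0 10.0
```

Reporting all three together is informative because R^2 is scale-free while RMSE and MAE are in price units, and RMSE penalizes large individual mispricings more heavily than MAE.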

[LG-46] Foundation Models for Environmental Science: A Survey of Emerging Frontiers

链接: https://arxiv.org/abs/2504.04280
作者: Runlong Yu,Shengyu Chen,Yiqun Xie,Huaxiu Yao,Jared Willard,Xiaowei Jia
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Modeling environmental ecosystems is essential for effective resource management, sustainable development, and understanding complex ecological processes. However, traditional data-driven methods face challenges in capturing inherently complex and interconnected processes and are further constrained by limited observational data in many environmental applications. Foundation models, which leverages large-scale pre-training and universal representations of complex and heterogeneous data, offer transformative opportunities for capturing spatiotemporal dynamics and dependencies in environmental processes, and facilitate adaptation to a broad range of applications. This survey presents a comprehensive overview of foundation model applications in environmental science, highlighting advancements in common environmental use cases including forward prediction, data generation, data assimilation, downscaling, inverse modeling, model ensembling, and decision-making across domains. We also detail the process of developing these models, covering data collection, architecture design, training, tuning, and evaluation. Through discussions on these emerging methods as well as their future opportunities, we aim to promote interdisciplinary collaboration that accelerates advancements in machine learning for driving scientific discovery in addressing critical environmental challenges.

[LG-47] Directional Sign Loss: A Topology-Preserving Loss Function that Approximates the Sign of Finite Differences

链接: https://arxiv.org/abs/2504.04202
作者: Harvey Dam,Tripti Agarwal,Ganesh Gopalakrishnan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Preserving critical topological features in learned latent spaces is a fundamental challenge in representation learning, particularly for topology-sensitive data. This paper introduces directional sign loss (DSL), a novel loss function that approximates the number of mismatches in the signs of finite differences between corresponding elements of two arrays. By penalizing discrepancies in critical points between input and reconstructed data, DSL encourages autoencoders and other learnable compressors to retain the topological features of the original data. We present the mathematical formulation, complexity analysis, and practical implementation of DSL, comparing its behavior to its non-differentiable counterpart and to other topological measures. Experiments on one-, two-, and three-dimensional data show that combining DSL with traditional loss functions preserves topological features more effectively than traditional losses alone. Moreover, DSL serves as a differentiable, efficient proxy for common topology-based metrics, enabling its use in gradient-based optimization frameworks.
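The contrast between the non-differentiable sign-mismatch count and a smooth surrogate can be sketched directly. The tanh-based relaxation below is our illustrative construction of the idea described in the abstract, not necessarily the paper's exact formulation of DSL.

```python
import numpy as np

def sign_mismatches(a, b):
    """Exact (non-differentiable) count of sign disagreements between the
    finite differences of two arrays -- i.e. mismatched monotonicity,
    hence mismatched critical points."""
    return int(np.sum(np.sign(np.diff(a)) != np.sign(np.diff(b))))

def soft_sign_loss(a, b, k=50.0):
    """Differentiable surrogate: tanh(k*x) approximates sign(x), so each
    term is near 0 when the difference signs agree (and saturate) and
    near 2 when they flip."""
    return 0.5 * np.sum(np.abs(np.tanh(k * np.diff(a))
                               - np.tanh(k * np.diff(b))))

t = np.linspace(0, 2 * np.pi, 100)
x = np.sin(t)
good = 0.95 * x     # same critical points, slightly rescaled
bad = -x            # every slope sign flipped
print(sign_mismatches(x, good), sign_mismatches(x, bad))  # 0 99
print(soft_sign_loss(x, good) < soft_sign_loss(x, bad))   # True
```

Because the surrogate is differentiable everywhere, it can be added to a reconstruction loss and minimized with ordinary gradient descent, which is what lets an autoencoder be steered toward preserving extrema.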

[LG-48] Towards Principled Learning for Re-ranking in Recommender Systems

链接: https://arxiv.org/abs/2504.04188
作者: Qunwei Li,Linghui Li,Jianbin Lin,Wenliang Zhong
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As the final stage of recommender systems, re-ranking presents ordered item lists to users that best match their interests. It plays such a critical role and has become a trending research topic with much attention from both academia and industry. Recent advances of re-ranking are focused on attentive listwise modeling of interactions and mutual influences among items to be re-ranked. However, principles to guide the learning process of a re-ranker, and to measure the quality of the output of the re-ranker, have been always missing. In this paper, we study such principles to learn a good re-ranker. Two principles are proposed, including convergence consistency and adversarial consistency. These two principles can be applied in the learning of a generic re-ranker and improve its performance. We validate such a finding by various baseline methods over different datasets.

[LG-49] AttackLLM: LLM-based Attack Pattern Generation for an Industrial Control System

链接: https://arxiv.org/abs/2504.04187
作者: Chuadhry Mujeeb Ahmed (Newcastle University, UK)
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Malicious examples are crucial for evaluating the robustness of machine learning algorithms under attack, particularly in Industrial Control Systems (ICS). However, collecting normal and attack data in ICS environments is challenging due to the scarcity of testbeds and the high cost of human expertise. Existing datasets are often limited by the domain expertise of practitioners, making the process costly and inefficient. The lack of comprehensive attack pattern data poses a significant problem for developing robust anomaly detection methods. In this paper, we propose a novel approach that combines data-centric and design-centric methodologies to generate attack patterns using large language models (LLMs). Our results demonstrate that the attack patterns generated by LLMs not only surpass the quality and quantity of those created by human experts but also offer a scalable solution that does not rely on expensive testbeds or pre-existing attack examples. This multi-agent based approach presents a promising avenue for enhancing the security and resilience of ICS environments.

[LG-50] MInCo: Mitigating Information Conflicts in Distracted Visual Model-based Reinforcement Learning

链接: https://arxiv.org/abs/2504.04164
作者: Shiguang Sun,Hanbo Zhang,Zeyang Liu,Xinrui Yang,Lipeng Wan,Bing Yan,Xingyu Chen,Xuguang Lan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing visual model-based reinforcement learning (MBRL) algorithms with observation reconstruction often suffer from information conflicts, making it difficult to learn compact representations and hence result in less robust policies, especially in the presence of task-irrelevant visual distractions. In this paper, we first reveal that the information conflicts in current visual MBRL algorithms stem from visual representation learning and latent dynamics modeling with an information-theoretic perspective. Based on this finding, we present a new algorithm to resolve information conflicts for visual MBRL, named MInCo, which mitigates information conflicts by leveraging negative-free contrastive learning, aiding in learning invariant representation and robust policies despite noisy observations. To prevent the dominance of visual representation learning, we introduce time-varying reweighting to bias the learning towards dynamics modeling as training proceeds. We evaluate our method on several robotic control tasks with dynamic background distractions. Our experiments demonstrate that MInCo learns invariant representations against background noise and consistently outperforms current state-of-the-art visual MBRL methods. Code is available at this https URL.

[LG-51] OrbitZoo: Multi-Agent Reinforcement Learning Environment for Orbital Dynamics

链接: https://arxiv.org/abs/2504.04160
作者: Alexandre Oliveira,Katarina Dyreby,Francisco Caldas,Cláudia Soares
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:The increasing number of satellites and orbital debris has made space congestion a critical issue, threatening satellite safety and sustainability. Challenges such as collision avoidance, station-keeping, and orbital maneuvering require advanced techniques to handle dynamic uncertainties and multi-agent interactions. Reinforcement learning (RL) has shown promise in this domain, enabling adaptive, autonomous policies for space operations; however, many existing RL frameworks rely on custom-built environments developed from scratch, which often use simplified models and require significant time to implement and validate the orbital dynamics, limiting their ability to fully capture real-world complexities. To address this, we introduce OrbitZoo, a versatile multi-agent RL environment built on a high-fidelity, industry-standard library, which enables realistic data generation, supports scenarios like collision avoidance and cooperative maneuvers, and ensures robust and accurate orbital dynamics. The environment is validated against a real satellite constellation, Starlink, achieving a Mean Absolute Percentage Error (MAPE) of 0.16% compared to real-world data. This validation ensures reliability for generating high-fidelity simulations and enabling autonomous and independent satellite operations.

[LG-52] Vehicle Acceleration Prediction Considering Environmental Influence and Individual Driving Behavior

链接: https://arxiv.org/abs/2504.04159
作者: Wenxuan Wang,Lexing Zhang,Jiale Lei,Yin Feng,Hengxu Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate vehicle acceleration prediction is critical for intelligent driving control and energy efficiency management, particularly in environments with complex driving behavior dynamics. This paper proposes a general short-term vehicle acceleration prediction framework that jointly models environmental influence and individual driving behavior. The framework adopts a dual-input design: environmental sequences, constructed from historical traffic variables such as percentile-based speed and acceleration statistics of multiple vehicles at specific spatial locations, capture group-level driving behavior influenced by the traffic environment. In parallel, individual driving behavior sequences represent the motion characteristics of the target vehicle prior to the prediction point, reflecting personalized driving styles. These two inputs are processed using an LSTM Seq2Seq model enhanced with an attention mechanism, enabling accurate multi-step acceleration prediction. To demonstrate the effectiveness of the proposed method, an empirical study was conducted using high-resolution radar-video fused trajectory data collected from the exit section of the Guangzhou Baishi Tunnel. Drivers were clustered into three categories (conservative, moderate, and aggressive) based on key behavioral indicators, and a dedicated prediction model was trained for each group to account for driver heterogeneity. The results show that the proposed method consistently outperforms four baseline models, yielding a 10.9% improvement in accuracy with the inclusion of historical traffic variables and a 33% improvement with driver classification. Although prediction errors increase with forecast distance, incorporating environment- and behavior-aware features significantly enhances model robustness.

[LG-53] Transformer representation learning is necessary for dynamic multi-modal physiological data on small-cohort patients

链接: https://arxiv.org/abs/2504.04120
作者: Bingxu Wang,Kunzhi Cai,Yuqi Zhang,Yachong Guo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Postoperative delirium (POD), a severe neuropsychiatric complication affecting nearly 50% of high-risk surgical patients, is defined as an acute disorder of attention and cognition. It remains significantly underdiagnosed in intensive care units (ICUs) due to subjective monitoring methods, yet early and accurate diagnosis of POD is critical and achievable. Here, we propose a POD prediction framework comprising a Transformer representation model followed by traditional machine learning algorithms. Our approach utilizes multi-modal physiological data, including amplitude-integrated electroencephalography (aEEG), vital signs, and electrocardiographic monitor data, as well as hemodynamic parameters. We curated the first multi-modal POD dataset encompassing two patient types and evaluated various Transformer architectures for representation learning. Empirical results indicate consistent improvements in sensitivity and Youden index for patient TYPE I when using Transformer representations, particularly our fusion adaptation of Pathformer. By enabling effective delirium diagnosis from postoperative day 1 to 3, our extensive experimental findings emphasize the potential of multi-modal physiological data and highlight the necessity of representation learning via a multi-modal Transformer architecture in clinical diagnosis.

[LG-54] PipeDec: Low-Latency Pipeline-based Inference with Dynamic Speculative Decoding towards Large-scale Models

链接: https://arxiv.org/abs/2504.04104
作者: Haofei Yin,Mengbai Xiao,Rouzhou Lu,Xiao Zhang,Dongxiao Yu,Guanghui Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autoregressive large language model inference primarily consists of two stages: pre-filling and decoding. Decoding involves sequential computation for each token, which leads to significant latency. Speculative decoding is a technique that leverages a draft model combined with large-model verification to enhance parallelism without sacrificing accuracy. However, existing external prediction methods face challenges in adapting to multi-node serial deployments. While they can maintain speedup under such conditions, the high latency of multi-node deployments ultimately results in low overall efficiency. We propose a speculative decoding framework named PipeDec to address the low global resource utilization of single tasks in pipeline deployments, thereby reducing decoding latency. We integrate a draft model into the pipeline of the large model and immediately forward each prediction from the draft model to subsequent pipeline stages. A dynamic prediction tree manages prediction sequences across nodes, enabling efficient updating and pruning. This approach leverages the draft model’s predictions to utilize all pipeline nodes for parallel decoding of a single task. Experiments were conducted using LLama3.2 1B as the draft model in conjunction with a 14-stage parallel pipeline to accelerate LLama3.1 70B on six different types of datasets. During the decoding phase of a single task, PipeDec achieved a 4.46x-7.79x speedup compared to traditional pipeline parallelism and a 2.2x-2.69x speedup compared to baseline tree-based speculative decoding methods. The code will be released after the review process.
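The draft-then-verify loop at the core of speculative decoding can be illustrated with a toy example. PipeDec's pipeline scheduling and dynamic prediction tree are not reproduced here; the two "models" below are stand-in deterministic functions, not real LLMs:

```python
def target_next(ctx):
    # Hypothetical "large" model: next token is the context sum mod 50.
    return sum(ctx) % 50

def draft_next(ctx):
    # Hypothetical draft model: agrees with the target except when the
    # last token is divisible by 7 (simulated approximation error).
    t = target_next(ctx)
    return (t + 1) % 50 if ctx[-1] % 7 == 0 else t

def speculative_decode(ctx, n_tokens, k=4):
    """Draft-then-verify loop: the draft proposes k tokens per round, the
    target checks them, the longest agreeing prefix is kept, and one
    corrected token is appended at the first mismatch."""
    out = list(ctx)
    while len(out) - len(ctx) < n_tokens:
        proposal = list(out)
        for _ in range(k):                       # cheap serial drafting
            proposal.append(draft_next(proposal))
        accepted = list(out)
        for tok in proposal[len(out):]:          # verification (parallel in practice)
            if target_next(accepted) == tok:
                accepted.append(tok)
            else:
                accepted.append(target_next(accepted))
                break
        out = accepted
    return out[len(ctx):][:n_tokens]

# Greedy verification reproduces the target model's own greedy decoding.
ctx, cur = [1, 2, 3], [1, 2, 3]
for _ in range(10):
    cur.append(target_next(cur))
assert speculative_decode(ctx, 10) == cur[3:]
```

Because every proposed token is checked against the target model before being accepted, greedy verification is lossless; the speedup comes from verifying the k drafted tokens in parallel rather than generating them one by one.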

[LG-55] Corrected with the Latest Version: Make Robust Asynchronous Federated Learning Possible IJCNN2025

链接: https://arxiv.org/abs/2504.04081
作者: Chaoyi Lu,Yiding Sun,Pengbo Li,Zhichuan Yang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted as a full paper at IJCNN 2025

点击查看摘要

Abstract:As an emerging paradigm of federated learning, asynchronous federated learning offers significant speed advantages over traditional synchronous federated learning. Unlike synchronous federated learning, which requires waiting for all clients to complete updates before aggregation, asynchronous federated learning aggregates the models that have arrived in real time, greatly improving training speed. However, this mechanism also introduces the issue of client model version inconsistency. When the differences between models of different versions become too large during aggregation, conflicts may arise, reducing the model’s accuracy. To address this issue, this paper proposes an asynchronous federated learning version correction algorithm based on knowledge distillation, named FedADT. FedADT applies knowledge distillation before aggregating gradients, using the latest global model to correct outdated information and thus effectively reducing the negative impact of outdated gradients on the training process. Additionally, FedADT introduces an adaptive weighting function that adjusts the knowledge distillation weight according to the stage of training, which helps mitigate the misleading effects caused by the poorer performance of the global model in the early stages of training. This method significantly improves the overall performance of asynchronous federated learning without adding excessive computational overhead. We conducted experimental comparisons with several classical algorithms, and the results demonstrate that FedADT achieves significant improvements over other asynchronous methods and outperforms all compared methods in terms of convergence speed.
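A minimal numeric sketch of the version-correction idea: a stale client update is blended with a pull toward the latest global parameters, and the blend weight grows as training proceeds, since the global model is unreliable early on. The weighting schedule and the correction term below are illustrative assumptions, not the paper's actual distillation loss:

```python
def distill_weight(round_idx, warmup=50):
    # Hypothetical schedule: near 0 early in training, approaches 1 late.
    return round_idx / (round_idx + warmup)

def corrected_update(stale_update, stale_params, global_params, round_idx):
    """Blend the client's stale gradient with a correction term that
    pulls the stale parameters toward the latest global parameters."""
    a = distill_weight(round_idx)
    return [
        (1 - a) * g + a * (sp - gp)   # (sp - gp) moves params toward global
        for g, sp, gp in zip(stale_update, stale_params, global_params)
    ]
```

At round 0 the correction is switched off entirely, so early aggregation trusts the client update; late in training the pull toward the global model dominates.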

[LG-56] Scalable Robust Bayesian Co-Clustering with Compositional ELBOs

链接: https://arxiv.org/abs/2504.04079
作者: Ashwin Vinod,Chandrajit Bajaj
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Co-clustering exploits the duality of instances and features to simultaneously uncover meaningful groups in both dimensions, often outperforming traditional clustering in high-dimensional or sparse data settings. Although recent deep learning approaches successfully integrate feature learning and cluster assignment, they remain susceptible to noise and can suffer from posterior collapse within standard autoencoders. In this paper, we present the first fully variational Co-clustering framework that directly learns row and column clusters in the latent space, leveraging a doubly reparameterized ELBO to improve gradient signal-to-noise separation. Our unsupervised model integrates a Variational Deep Embedding with a Gaussian Mixture Model (GMM) prior for both instances and features, providing a built-in clustering mechanism that naturally aligns latent modes with row and column clusters. Furthermore, our regularized end-to-end noise learning Compositional ELBO architecture jointly reconstructs the data while regularizing against noise through the KL divergence, thus gracefully handling corrupted or missing inputs in a single training pipeline. To counteract posterior collapse, we introduce a scale modification that increases the encoder’s latent means only in the reconstruction pathway, preserving richer latent representations without inflating the KL term. Finally, a mutual information-based cross-loss ensures coherent co-clustering of rows and columns. Empirical results on diverse real-world datasets from multiple modalities, numerical, textual, and image-based, demonstrate that our method not only preserves the advantages of prior Co-clustering approaches but also exceeds them in accuracy and robustness, particularly in high-dimensional or noisy settings.

[LG-57] Deep-Learning-Directed Preventive Dynamic Security Control via Coordinated Demand Response

链接: https://arxiv.org/abs/2504.04059
作者: Amin Masoumi,Mert Korkali
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: to appear in the 2025 IEEE Power Energy Society General Meeting (PESGM)

点击查看摘要

Abstract:Unlike common faults, three-phase short-circuit faults in power systems pose significant challenges. These faults can lead to out-of-step (OOS) conditions and jeopardize the system’s dynamic security. The rapid dynamics of these faults often exceed the time of protection actions, thus limiting the effectiveness of corrective schemes. This paper proposes an end-to-end deep-learning-based mechanism, namely, a convolutional neural network with an attention mechanism, to predict OOS conditions early and enhance the system’s fault resilience. The results of the study demonstrate the effectiveness of the proposed algorithm in terms of early prediction and robustness against such faults in various operating conditions.

[LG-58] Learning-Based Multi-Criteria Decision Model for Site Selection Problems

链接: https://arxiv.org/abs/2504.04055
作者: Mahid Ahmed,Ali Dogru,Chaoyang Zhang,Chao Meng
类目: Machine Learning (cs.LG)
*备注: 6 pages, 4 figures, Proceedings of the IISE Annual Conference Expo 2025

点击查看摘要

Abstract:Strategically locating sawmills is critical for the efficiency, profitability, and sustainability of timber supply chains, yet it involves complex decision-making affected by various factors, such as proximity to resources and markets, proximity to roads and rail lines, distance from the urban area, slope, labor market, and existing sawmill data. Although conventional Multi-Criteria Decision-Making (MCDM) approaches utilize these factors while locating facilities, they are susceptible to bias since they rely heavily on expert opinions to determine the relative factor weights. Machine learning (ML) models provide an objective, data-driven alternative for site selection that derives these weights directly from the patterns in large datasets without requiring subjective weighting. Additionally, ML models autonomously identify critical features, eliminating the need for subjective feature selection. In this study, we propose integrated ML and MCDM methods and showcase the utility of this integrated model to improve sawmill location decisions via a case study in Mississippi. This integrated model is flexible and applicable to site selection problems across various industries.

[LG-59] Disparate Privacy Vulnerability: Targeted Attribute Inference Attacks and Defenses USENIX-SECURITY

链接: https://arxiv.org/abs/2504.04033
作者: Ehsanul Kabir,Lucas Craig,Shagufta Mehnaz
类目: Machine Learning (cs.LG)
*备注: Selected for publication at 34th USENIX Security Symposium

点击查看摘要

Abstract:As machine learning (ML) technologies become more prevalent in privacy-sensitive areas like healthcare and finance, eventually incorporating sensitive information in building data-driven algorithms, it is vital to scrutinize whether these data face any privacy leakage risks. One potential threat arises from an adversary querying trained models using the public, non-sensitive attributes of entities in the training data to infer their private, sensitive attributes, a technique known as the attribute inference attack. This attack is particularly deceptive because, while it may perform poorly in predicting sensitive attributes across the entire dataset, it excels at predicting the sensitive attributes of records from a few vulnerable groups, a phenomenon known as disparate vulnerability. This paper illustrates that an adversary can take advantage of this disparity to carry out a series of new attacks, showcasing a threat level beyond previous imagination. We first develop a novel inference attack called the disparity inference attack, which targets the identification of high-risk groups within the dataset. We then introduce two targeted variations of the attribute inference attack that can identify and exploit a vulnerable subset of the training data, marking the first instances of targeted attacks in this category, achieving significantly higher accuracy than untargeted versions. We are also the first to introduce a novel and effective disparity mitigation technique that simultaneously preserves model performance and prevents any risk of targeted attacks.

[LG-60] A Comprehensive Survey of Challenges and Opportunities of Few-Shot Learning Across Multiple Domains

链接: https://arxiv.org/abs/2504.04017
作者: Andrea Gajic,Sudip Vhaduri
类目: Machine Learning (cs.LG)
*备注: Under Review

点击查看摘要

Abstract:In a world where new domains are constantly discovered and machine learning (ML) is applied to automate new tasks every day, challenges arise with the number of samples available to train ML models. While the traditional ML training relies heavily on data volume, finding a large dataset with a lot of usable samples is not always easy, and often the process takes time. For instance, when a new human transmissible disease such as COVID-19 breaks out and there is an immediate surge for rapid diagnosis, followed by rapid isolation of infected individuals from healthy ones to contain the spread, there is an immediate need to create tools/automation using machine learning models. At the early stage of an outbreak, it is not only difficult to obtain a lot of samples, but also difficult to understand the details about the disease, to process the data needed to train a traditional ML model. A solution for this can be a few-shot learning approach. This paper presents challenges and opportunities of few-shot approaches that vary across major domains, i.e., audio, image, text, and their combinations, with their strengths and weaknesses. This detailed understanding can help to adopt appropriate approaches applicable to different domains and applications.

[LG-61] Multi-resolution Score-Based Variational Graphical Diffusion for Causal Disaster System Modeling and Inference

链接: https://arxiv.org/abs/2504.04015
作者: Xuechun Li,Shan Gao,Susu Xu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Complex systems with intricate causal dependencies challenge accurate prediction. Effective modeling requires precise physical process representation, integration of interdependent factors, and incorporation of multi-resolution observational data. These systems manifest in both static scenarios with instantaneous causal chains and temporal scenarios with evolving dynamics, complicating modeling efforts. Current methods struggle to simultaneously handle varying resolutions, capture physical relationships, model causal dependencies, and incorporate temporal dynamics, especially with inconsistently sampled data from diverse sources. We introduce Temporal-SVGDM: Score-based Variational Graphical Diffusion Model for Multi-resolution observations. Our framework constructs individual SDEs for each variable at its native resolution, then couples these SDEs through a causal score mechanism where parent nodes inform child nodes’ evolution. This enables unified modeling of both immediate causal effects in static scenarios and evolving dependencies in temporal scenarios. In temporal models, state representations are processed through a sequence prediction model to predict future states based on historical patterns and causal relationships. Experiments on real-world datasets demonstrate improved prediction accuracy and causal understanding compared to existing methods, with robust performance under varying levels of background knowledge. Our model exhibits graceful degradation across different disaster types, successfully handling both static earthquake scenarios and temporal hurricane and wildfire scenarios, while maintaining superior performance even with limited data.

[LG-62] Practical Poisoning Attacks against Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2504.03957
作者: Baolei Zhang,Yuxi Chen,Minghong Fang,Zhuqing Liu,Lihai Nie,Tong Li,Zheli Liu
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive natural language processing abilities but face challenges such as hallucination and outdated knowledge. Retrieval-Augmented Generation (RAG) has emerged as a state-of-the-art approach to mitigate these issues. While RAG enhances LLM outputs, it remains vulnerable to poisoning attacks. Recent studies show that injecting poisoned text into the knowledge database can compromise RAG systems, but most existing attacks assume that the attacker can insert a sufficient number of poisoned texts per query to outnumber correct-answer texts in retrieval, an assumption that is often unrealistic. To address this limitation, we propose CorruptRAG, a practical poisoning attack against RAG systems in which the attacker injects only a single poisoned text, enhancing both feasibility and stealth. Extensive experiments across multiple datasets demonstrate that CorruptRAG achieves higher attack success rates compared to existing baselines.

[LG-63] A New Approach to Controlling Linear Dynamical Systems

链接: https://arxiv.org/abs/2504.03952
作者: Anand Brahmbhatt,Gon Buzaglo,Sofiia Druchyna,Elad Hazan
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose a new method for controlling linear dynamical systems under adversarial disturbances and cost functions. Our algorithm achieves a running time that scales polylogarithmically with the inverse of the stability margin, improving upon prior methods with polynomial dependence while maintaining the same regret guarantees. The technique, which may be of independent interest, is based on a novel convex relaxation that approximates linear control policies using spectral filters constructed from the eigenvectors of a specific Hankel matrix.
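The last step can be made concrete with a generic sketch: build a small symmetric Hankel matrix and extract its top eigenvector by power iteration, which is the kind of spectral filter such relaxations use. The paper's specific Hankel entries are not reproduced; the Hilbert-like matrix H[i][j] = 1/(i+j+1) below is a stand-in:

```python
def hankel(n):
    # Symmetric Hankel matrix (entries depend only on i + j); here a
    # Hilbert-like stand-in, not the matrix used in the paper.
    return [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]

def top_eigvec(M, iters=200):
    """Top eigenvector of a symmetric positive matrix via power iteration."""
    n = len(M)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```

For the 3x3 case this recovers the dominant eigenpair of the Hilbert matrix (largest eigenvalue about 1.408); in spectral-filtering methods, the leading eigenvectors serve as a fixed filter bank applied to past inputs.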

[LG-64] Random Normed k-Means: A Paradigm-Shift in Clustering within Probabilistic Metric Spaces

链接: https://arxiv.org/abs/2504.03928
作者: Abderrafik Laakel Hemdanou,Youssef Achtoun,Mohammed Lamarti Sefian,Ismail Tahiri,Abdellatif El Afia
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注: 27 pages, 16 figures

点击查看摘要

Abstract:Existing approaches remain largely constrained by traditional distance metrics, limiting their effectiveness in handling random data. In this work, we introduce the first k-means variant in the literature that operates within a probabilistic metric space, replacing conventional distance measures with a well-defined distance distribution function. This pioneering approach enables more flexible and robust clustering in both deterministic and random datasets, establishing a new foundation for clustering in stochastic environments. By adopting a probabilistic perspective, our method not only introduces a fresh paradigm but also establishes a rigorous theoretical framework that is expected to serve as a key reference for future clustering research involving random data. Extensive experiments on diverse real and synthetic datasets assess our model’s effectiveness using widely recognized evaluation metrics, including Silhouette, Davies-Bouldin, Calinski Harabasz, the adjusted Rand index, and distortion. Comparative analyses against established methods such as k-means++, fuzzy c-means, and kernel probabilistic k-means demonstrate the superior performance of our proposed random normed k-means (RNKM) algorithm. Notably, RNKM exhibits a remarkable ability to identify nonlinearly separable structures, making it highly effective in complex clustering scenarios. These findings position RNKM as a groundbreaking advancement in clustering research, offering a powerful alternative to traditional techniques while addressing a long-standing gap in the literature. By bridging probabilistic metrics with clustering, this study provides a foundational reference for future developments and opens new avenues for advanced data analysis in dynamic, data-driven applications.
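One simple way to see what clustering "random data" means is a toy k-means step where each 1-D datum is a Gaussian (mean, variance) and the usual squared distance is replaced by the expected squared distance E[(X-c)^2] = (mu-c)^2 + var. This is an illustrative stand-in for a probabilistic dissimilarity, not RNKM's distance distribution function:

```python
def expected_sq_dist(point, center):
    # Expected squared distance from a Gaussian datum to a fixed center.
    mu, var = point
    return (mu - center) ** 2 + var

def kmeans_random(points, centers, iters=20):
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step under the probabilistic dissimilarity.
        labels = [min(range(len(centers)),
                      key=lambda j: expected_sq_dist(p, centers[j]))
                  for p in points]
        # Update step: the minimizer of summed expected squared distances
        # is still the mean of the assigned means (variances drop out).
        for j in range(len(centers)):
            ms = [points[i][0] for i in range(len(points)) if labels[i] == j]
            if ms:
                centers[j] = sum(ms) / len(ms)
    return centers, labels

pts = [(0.1, 0.2), (-0.2, 0.1), (5.0, 0.3), (5.3, 0.2)]
centers, labels = kmeans_random(pts, [0.0, 1.0])
```

Note that the per-point variance shifts all dissimilarities by a constant for that point, so it does not change assignments here; richer probabilistic metrics, like the distribution functions RNKM uses, are what make uncertainty genuinely matter.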

[LG-65] An Exploration-free Method for a Linear Stochastic Bandit Driven by a Linear Gaussian Dynamical System

链接: https://arxiv.org/abs/2504.03926
作者: Jonathan Gornet,Yilin Mo,Bruno Sinopoli
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:In stochastic multi-armed bandits, a major problem the learner faces is the trade-off between exploration and exploitation. Recently, exploration-free methods – methods that commit to the action predicted to return the highest reward – have been studied from the perspective of linear bandits. In this paper, we introduce a linear bandit setting where the reward is the output of a linear Gaussian dynamical system. Motivated by a problem encountered in hyperparameter optimization for reinforcement learning, where the number of actions is much higher than the number of training iterations, we propose Kalman filter Observability Dependent Exploration (KODE), an exploration-free method that utilizes the Kalman filter predictions to select actions. The major contribution of this work is our analysis of the performance of the proposed method, which is dependent on the observability properties of the underlying linear Gaussian dynamical system. We evaluate KODE via two different metrics: regret, which is the cumulative expected difference between the highest possible reward and the reward sampled by KODE, and action alignment, which measures how closely KODE’s chosen action aligns with the linear Gaussian dynamical system’s state variable. To provide intuition on the performance, we prove that KODE implicitly encourages the learner to explore actions depending on the observability of the linear Gaussian dynamical system. This method is compared to several well-known stochastic multi-armed bandit algorithms to validate our theoretical results.
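The mechanics can be sketched with a scalar example: a 1-D Kalman filter tracks each action's latent reward state, and the learner commits to the action with the highest one-step prediction, with no exploration bonus. The dynamics and noise values below are illustrative, not KODE's setting:

```python
def kalman_step(x, P, y, a=0.9, q=0.01, r=0.1):
    """One predict/update cycle of a scalar Kalman filter for the model
    x_{t+1} = a*x_t + w (var q), observation y_t = x_t + v (var r)."""
    x_pred = a * x                   # predicted state
    P_pred = a * a * P + q           # predicted covariance
    K = P_pred / (P_pred + r)        # Kalman gain
    x_new = x_pred + K * (y - x_pred)
    P_new = (1 - K) * P_pred
    return x_new, P_new

def greedy_action(states, a=0.9):
    """Exploration-free choice: commit to the action whose one-step
    predicted reward a*x is highest (no uncertainty bonus)."""
    preds = [a * x for (x, _) in states]
    return max(range(len(preds)), key=lambda i: preds[i])
```

After observing a reward, only the chosen action's filter is updated; the paper's point is that the system's observability determines how informative those passive updates are.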

[LG-66] Opening the Black-Box: Symbolic Regression with Kolmogorov-Arnold Networks for Energy Applications

链接: https://arxiv.org/abs/2504.03913
作者: Nataly R. Panczyk,Omer F. Erdem,Majdi I. Radaideh
类目: Machine Learning (cs.LG); Symbolic Computation (cs.SC); Machine Learning (stat.ML)
*备注: 35 pages, 11 Figures, 14 Tables

点击查看摘要

Abstract:While most modern machine learning methods offer speed and accuracy, few promise interpretability or explainability – two key features necessary for highly sensitive industries, like medicine, finance, and engineering. Using eight datasets representative of one especially sensitive industry, nuclear power, this work compares a traditional feedforward neural network (FNN) to a Kolmogorov-Arnold Network (KAN). We consider not only model performance and accuracy, but also interpretability through model architecture and explainability through a post-hoc SHAP analysis. In terms of accuracy, we find KANs and FNNs comparable across all datasets, when output dimensionality is limited. KANs, which transform into symbolic equations after training, yield perfectly interpretable models while FNNs remain black-boxes. Finally, using the post-hoc explainability results from Kernel SHAP, we find that KANs learn real, physical relations from experimental data, while FNNs simply produce statistically accurate results. Overall, this analysis finds KANs a promising alternative to traditional machine learning methods, particularly in applications requiring both accuracy and comprehensibility.

[LG-67] Stochastic Variational Inference with Tuneable Stochastic Annealing

链接: https://arxiv.org/abs/2504.03902
作者: John Paisley,Ghazal Fazelnia,Brian Barr
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this paper, we exploit the observation that stochastic variational inference (SVI) is a form of annealing and present a modified SVI approach – applicable to both large and small datasets – that allows the amount of annealing done by SVI to be tuned. We are motivated by the fact that, in SVI, the larger the batch size the more approximately Gaussian is the intrinsic noise, but the smaller its variance. This low variance reduces the amount of annealing which is needed to escape bad local optimal solutions. We propose a simple method for achieving both goals of having larger variance noise to escape bad local optimal solutions and more data information to obtain more accurate gradient directions. The idea is to set an actual batch size, which may be the size of the data set, and a smaller effective batch size that matches the larger level of variance at this smaller batch size. The result is an approximation to the maximum entropy stochastic gradient at this variance level. We theoretically motivate our approach for the framework of conjugate exponential family models and illustrate the method empirically on the probabilistic matrix factorization collaborative filter, the Latent Dirichlet Allocation topic model, and the Gaussian mixture model.
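The actual-versus-effective batch-size idea can be sketched numerically: compute the gradient on the full actual batch (accurate direction), then inject Gaussian noise sized to close the variance gap to a smaller effective batch. The conjugate-exponential-family machinery of the paper is not reproduced, and the per-sample variance estimate below is a simple plug-in:

```python
import random
import statistics

def annealed_gradient(per_sample_grads, b_eff, rng):
    """Gradient of an actual batch of size B, with Gaussian noise added so
    its variance matches what a batch of size b_eff would have had."""
    B = len(per_sample_grads)
    g = sum(per_sample_grads) / B                # actual-batch gradient
    s2 = statistics.variance(per_sample_grads)   # per-sample variance estimate
    extra_var = s2 * (1.0 / b_eff - 1.0 / B)     # variance gap to fill
    return g + rng.gauss(0.0, extra_var ** 0.5)
```

Setting b_eff equal to the actual batch size recovers the plain gradient; shrinking b_eff raises the injected noise, and hence the amount of annealing, without sacrificing the accuracy of the mean direction.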

[LG-68] Using Attention Sinks to Identify and Evaluate Dormant Heads in Pretrained LLMs

链接: https://arxiv.org/abs/2504.03889
作者: Pedro Sandoval-Segura,Xijun Wang,Ashwinee Panda,Micah Goldblum,Ronen Basri,Tom Goldstein,David Jacobs
类目: Machine Learning (cs.LG)
*备注: 22 pages, 14 figures

点击查看摘要

Abstract:Multi-head attention is foundational to large language models (LLMs), enabling different heads to have diverse focus on relevant input tokens. However, learned behaviors like attention sinks, where the first token receives most attention despite limited semantic importance, challenge our understanding of multi-head attention. To analyze this phenomenon, we propose a new definition for attention heads dominated by attention sinks, known as dormant attention heads. We compare our definition to prior work in a model intervention study where we test whether dormant heads matter for inference by zeroing out the output of dormant attention heads. Using six pretrained models and five benchmark datasets, we find our definition to be more model and dataset-agnostic. Using our definition on most models, more than 4% of a model’s attention heads can be zeroed while maintaining average accuracy, and zeroing more than 14% of a model’s attention heads can keep accuracy to within 1% of the pretrained model’s average accuracy. Further analysis reveals that dormant heads emerge early in pretraining and can transition between dormant and active states during pretraining. Additionally, we provide evidence that they depend on characteristics of the input text.
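A rough sketch of the intervention described above: flag a head as dormant when its queries put most of their attention mass on the first token, then zero that head's output at inference time. The 0.9 threshold and the list-based tensors are illustrative assumptions; the paper's exact definition differs:

```python
def sink_mass(attn):
    """Average attention weight on token 0 across query rows of one head."""
    return sum(row[0] for row in attn) / len(attn)

def is_dormant(attn, threshold=0.9):
    # A head dominated by the attention sink is flagged as dormant.
    return sink_mass(attn) >= threshold

def prune_heads(head_outputs, head_attns, threshold=0.9):
    """Zero the output vector of every head flagged as dormant."""
    return [
        [0.0] * len(out) if is_dormant(a, threshold) else out
        for out, a in zip(head_outputs, head_attns)
    ]
```

The model-intervention study in the paper follows this pattern: if zeroing the flagged heads leaves accuracy unchanged, the definition captured heads that indeed did not matter for inference.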

[LG-69] Concept-based Rubrics Improve LLM Formative Assessment and Data Synthesis

链接: https://arxiv.org/abs/2504.03877
作者: Yuchen Wei,Dennis Pearl,Matthew Beckman,Rebecca J. Passonneau
类目: Machine Learning (cs.LG)
*备注: 13 pages excluding references. 9 tables and 4 figures

点击查看摘要

Abstract:Formative assessment in STEM topics aims to promote student learning by identifying students’ current understanding, thus targeting how to promote further learning. Previous studies suggest that the assessment performance of current generative large language models (LLMs) on constructed responses to open-ended questions is significantly lower than that of supervised classifiers trained on high-quality labeled data. However, we demonstrate that concept-based rubrics can significantly enhance LLM performance, which narrows the gap between LLMs as off-the-shelf assessment tools and smaller supervised models, which need large amounts of training data. For datasets where concept-based rubrics allow LLMs to achieve strong performance, we show that the concept-based rubrics help the same LLMs generate high-quality synthetic data for training lightweight, high-performance supervised models. Our experiments span diverse STEM student response datasets with labels of varying quality, including a new real-world dataset that contains some AI-assisted responses, which introduces additional considerations.

[LG-70] HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs

链接: https://arxiv.org/abs/2504.03871
作者: Yongji Wu,Xueshen Liu,Shuowei Jin,Ceyu Xu,Feng Qian,Z. Morley Mao,Matthew Lentz,Danyang Zhuo,Ion Stoica
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Mixture-of-Experts (MoE) architecture has become increasingly popular as a method to scale up large language models (LLMs). To save costs, heterogeneity-aware training solutions have been proposed to utilize GPU clusters made up of both newer and older-generation GPUs. However, existing solutions are agnostic to the performance characteristics of different MoE model components (i.e., attention and expert) and do not fully utilize each GPU’s compute capability. In this paper, we introduce HeterMoE, a system to efficiently train MoE models on heterogeneous GPUs. Our key insight is that newer GPUs significantly outperform older generations on attention due to architectural advancements, while older GPUs are still relatively efficient for experts. HeterMoE disaggregates attention and expert computation, where older GPUs are only assigned with expert modules. Through the proposed zebra parallelism, HeterMoE overlaps the computation on different GPUs, in addition to employing an asymmetric expert assignment strategy for fine-grained load balancing to minimize GPU idle time. Our evaluation shows that HeterMoE achieves up to 2.3x speed-up compared to existing MoE training systems, and 1.4x compared to an optimally balanced heterogeneity-aware solution. HeterMoE efficiently utilizes older GPUs by maintaining 95% training throughput on average, even with half of the GPUs in a homogeneous A40 cluster replaced with V100.

[LG-71] Offline and Distributional Reinforcement Learning for Wireless Communications

链接: https://arxiv.org/abs/2504.03804
作者: Eslam Eldeeb,Hirley Alves
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:The rapid growth of heterogeneous and massive wireless connectivity in 6G networks demands intelligent solutions to ensure scalability, reliability, privacy, ultra-low latency, and effective control. Although artificial intelligence (AI) and machine learning (ML) have demonstrated their potential in this domain, traditional online reinforcement learning (RL) and deep RL methods face limitations in real-time wireless networks. For instance, these methods rely on online interaction with the environment, which might be unfeasible, costly, or unsafe. In addition, they cannot handle the inherent uncertainties in real-time wireless applications. We focus on offline and distributional RL, two advanced RL techniques that can overcome these challenges by training on static datasets and accounting for network uncertainties. We introduce a novel framework that combines offline and distributional RL for wireless communication applications. Through case studies on unmanned aerial vehicle (UAV) trajectory optimization and radio resource management (RRM), we demonstrate that our proposed Conservative Quantile Regression (CQR) algorithm outperforms conventional RL approaches regarding convergence speed and risk management. Finally, we discuss open challenges and potential future directions for applying these techniques in 6G networks, paving the way for safer and more efficient real-time wireless systems.
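
Distributional RL methods of the kind discussed above, including quantile-regression variants such as the paper's CQR, rest on the quantile (pinball) loss. A minimal reference implementation, purely illustrative and not the authors' code:

```python
def pinball_loss(prediction, target, tau):
    """Quantile loss: penalizes under-prediction with weight tau,
    over-prediction with weight (1 - tau)."""
    error = target - prediction
    return tau * error if error >= 0 else (tau - 1) * error

# At tau = 0.9 the loss pushes predictions toward the 90th percentile:
# under-predicting (error > 0) costs 9x more than over-predicting.
print(pinball_loss(prediction=1.0, target=2.0, tau=0.9))  # 0.9
print(pinball_loss(prediction=2.0, target=1.0, tau=0.9))  # ~0.1
```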

[LG-72] CSF: Fixed-outline Floorplanning Based on the Conjugate Subgradient Algorithm Assisted by Q-Learning

链接: https://arxiv.org/abs/2504.03796
作者: Huabin Cheng,Rujie Chen,Yu Chen,Wei Zhang,Ning Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To solve the fixed-outline floorplanning problem efficiently, we propose to solve the original nonsmooth analytic optimization model via the conjugate subgradient algorithm (CSA), which is further accelerated by adaptively regulating the step size with the assistance of Q-learning. The objective for global floorplanning is a weighted sum of the half-perimeter wirelength, the overlapping area and the out-of-bound width, and the legalization is implemented by optimizing the weighted sum of the overlapping area and the out-of-bound width. Meanwhile, we also propose two improved variants for the legalization algorithm based on constraint graphs (CGs). Experimental results demonstrate that the CSA assisted by Q-learning (CSAQ) can address both global floorplanning and legalization efficiently, and the two stages jointly contribute to competitive results on the optimization of wirelength. Moreover, the improved CG-based legalization methods also outperform the original one in terms of runtime and success rate.
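
A toy sketch of the core mechanic above: a subgradient step on a nonsmooth objective, with the step size shrunk whenever the objective fails to improve. This is a crude stand-in for the Q-learning-regulated step size in the paper, on a one-dimensional example rather than a floorplanning objective.

```python
def minimize_abs(target, x0, step=1.0, iters=100):
    """Minimize the nonsmooth f(x) = |x - target| by subgradient descent."""
    x, best = x0, abs(x0 - target)
    for _ in range(iters):
        g = 1.0 if x > target else (-1.0 if x < target else 0.0)  # a subgradient
        x_new = x - step * g
        f_new = abs(x_new - target)
        if f_new >= best:
            step *= 0.5   # no improvement: shrink the step (adaptive regulation)
        else:
            best = f_new
        x = x_new
    return x

x = minimize_abs(target=3.0, x0=10.0)
print(round(x, 3))  # 3.0
```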

[LG-73] MMCE: A Framework for Deep Monotonic Modeling of Multiple Causal Effects

链接: https://arxiv.org/abs/2504.03753
作者: Juhua Chen,Karson shi,Jialing He,North Chen,Kele Jiang
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:When money is used as an incentive to change a person's behavior (for example, encouraging riders to deliver more orders or consumers to buy more items), the common approach is a two-stage framework for maximizing ROI under cost constraints. In the first stage, the individual price response curve is obtained; in the second stage, business goals and resource constraints are formally expressed and modeled as an optimization problem. The first stage is critical: it answers how much incremental result an incentive can bring, which is the basis of the second stage. Causal modeling is usually used to obtain the curve, but with only observational data, causal modeling and evaluation are very challenging, and in some business scenarios multiple causal effects must be obtained at the same time. This paper proposes a new observational-data modeling and evaluation framework that can simultaneously model multiple causal effects and greatly improve modeling accuracy under some abnormal distributions. In the absence of RCT data, evaluation seems impossible; this paper summarizes three priors to illustrate the necessity and feasibility of qualitative evaluation via cognitive testing, and proposes conditions under which observational data can be treated as an evaluation dataset. To our knowledge, this is the first modeling framework that simultaneously obtains multiple causal effects. Offline analysis and online experimental results demonstrate its effectiveness and significantly improve the allocation strategies generated in real-world marketing activities.
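
The second stage of the framework described above can be sketched as a greedy budget allocation: given each candidate's estimated incremental uplift per unit of incentive (the output of the stage-one causal model), fund the best uplift-per-cost candidates until the budget runs out. All names and numbers here are hypothetical.

```python
def allocate_budget(uplift, unit_cost, budget):
    """Greedy allocation: fund the highest uplift-per-cost candidates first."""
    order = sorted(range(len(uplift)),
                   key=lambda i: uplift[i] / unit_cost[i], reverse=True)
    spent, chosen = 0.0, []
    for i in order:
        if spent + unit_cost[i] <= budget:
            spent += unit_cost[i]
            chosen.append(i)
    return chosen, spent

# Three candidates with different estimated uplifts and incentive costs.
chosen, spent = allocate_budget([5.0, 2.0, 4.0], [2.0, 1.0, 4.0], budget=5.0)
print(chosen, spent)  # [0, 1] 3.0  (candidate 2 no longer fits in the budget)
```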

[LG-74] Revolutionizing Fractional Calculus with Neural Networks: Voronovskaya-Damasclin Theory for Next-Generation AI Systems

链接: https://arxiv.org/abs/2504.03751
作者: Rômulo Damasclin Chaves dos Santos,Jorge Henrique de Oliveira Sales
类目: Machine Learning (cs.LG)
*备注: 16 pages

点击查看摘要

Abstract:This work introduces rigorous convergence rates for neural network operators activated by symmetrized and perturbed hyperbolic tangent functions, utilizing novel Voronovskaya-Damasclin asymptotic expansions. We analyze basic, Kantorovich, and quadrature-type operators over infinite domains, extending classical approximation theory to fractional calculus via Caputo derivatives. Key innovations include parameterized activation functions with asymmetry control, symmetrized density operators, and fractional Taylor expansions for error analysis. The main theorem demonstrates that Kantorovich operators achieve o(n^{-\beta(N-\varepsilon)}) convergence rates, while basic operators exhibit \mathcal{O}(n^{-\beta N}) error decay. For deep networks, we prove \mathcal{O}(L^{-\beta(N-\varepsilon)}) approximation bounds. Stability results under parameter perturbations highlight operator robustness. By integrating neural approximation theory with fractional calculus, this work provides foundational mathematical insights and deployable engineering solutions, with potential applications in complex system modeling and signal processing.

[LG-75] Detecting Financial Fraud with Hybrid Deep Learning: A Mix-of-Experts Approach to Sequential and Anomalous Patterns

链接: https://arxiv.org/abs/2504.03750
作者: Diego Vallarino
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Financial fraud detection remains a critical challenge due to the dynamic and adversarial nature of fraudulent behavior. As fraudsters evolve their tactics, detection systems must combine robustness, adaptability, and precision. This study presents a hybrid architecture for credit card fraud detection that integrates a Mixture of Experts (MoE) framework with Recurrent Neural Networks (RNNs), Transformer encoders, and Autoencoders. Each expert module contributes a specialized capability: RNNs capture sequential behavior, Transformers extract high-order feature interactions, and Autoencoders detect anomalies through reconstruction loss. The MoE framework dynamically assigns predictive responsibility among the experts, enabling adaptive and context-sensitive decision-making. Trained on a high-fidelity synthetic dataset that simulates real-world transaction patterns and fraud typologies, the hybrid model achieved 98.7 percent accuracy, 94.3 percent precision, and 91.5 percent recall, outperforming standalone models and classical machine learning baselines. The Autoencoder component significantly enhanced the system’s ability to identify emerging fraud strategies and atypical behaviors. Beyond technical performance, the model contributes to broader efforts in financial governance and crime prevention. It supports regulatory compliance with Anti-Money Laundering (AML) and Know Your Customer (KYC) protocols and aligns with routine activity theory by operationalizing AI as a capable guardian within financial ecosystems. The proposed hybrid system offers a scalable, modular, and regulation-aware approach to detecting increasingly sophisticated fraud patterns, contributing both to the advancement of intelligent systems and to the strengthening of institutional fraud defense infrastructures. 
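
The Mixture-of-Experts combination at the heart of the architecture above reduces to a gate producing weights over per-expert scores. A minimal sketch, where the expert scores are stand-ins for the RNN / Transformer / Autoencoder outputs and all values are invented:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_score(expert_scores, gate_logits):
    """Combine per-expert fraud scores with softmax gate weights."""
    weights = softmax(gate_logits)
    return sum(w * s for w, s in zip(weights, expert_scores))

# The gate strongly prefers the first ("sequential") expert for this input.
score = moe_score(expert_scores=[0.9, 0.2, 0.4], gate_logits=[2.0, 0.0, 0.0])
print(round(score, 3))  # 0.772
```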

[LG-76] Input Resolution Downsizing as a Compression Technique for Vision Deep Learning Systems

链接: https://arxiv.org/abs/2504.03749
作者: Jeremy Morlier,Mathieu Leonardon,Vincent Gripon
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Model compression is a critical area of research in deep learning, in particular in vision, driven by the need to lighten models memory or computational footprints. While numerous methods for model compression have been proposed, most focus on pruning, quantization, or knowledge distillation. In this work, we delve into an under-explored avenue: reducing the resolution of the input image as a complementary approach to other types of compression. By systematically investigating the impact of input resolution reduction, on both tasks of classification and semantic segmentation, and on convnets and transformer-based architectures, we demonstrate that this strategy provides an interesting alternative for model compression. Our experimental results on standard benchmarks highlight the potential of this method, achieving competitive performance while significantly reducing computational and memory requirements. This study establishes input resolution reduction as a viable and promising direction in the broader landscape of model compression techniques for vision applications.
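
A back-of-the-envelope illustration of why input downsizing acts as compression: for a stride-1 convolution, compute scales with the spatial area of the feature map, so halving each input side cuts FLOPs roughly 4x. The layer shape below is illustrative.

```python
def conv_flops(h, w, c_in, c_out, k):
    """Multiply-accumulate count for a stride-1, same-padded k x k convolution."""
    return h * w * c_in * c_out * k * k

full = conv_flops(224, 224, 3, 64, 3)
half = conv_flops(112, 112, 3, 64, 3)
print(full // half)  # 4: the area ratio (224*224)/(112*112)
```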

[LG-77] Timeseries Foundation Models for Mobility: A Benchmark Comparison with Traditional and Deep Learning Models

链接: https://arxiv.org/abs/2504.03725
作者: Anita Graser
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Crowd and flow predictions have been extensively studied in mobility data science. Traditional forecasting methods have relied on statistical models such as ARIMA, later supplemented by deep learning approaches like ST-ResNet. More recently, foundation models for time series forecasting, such as TimeGPT, Chronos, and LagLlama, have emerged. A key advantage of these models is their ability to generate zero-shot predictions, allowing them to be applied directly to new tasks without retraining. This study evaluates the performance of TimeGPT compared to traditional approaches for predicting city-wide mobility timeseries using two bike-sharing datasets from New York City and Vienna, Austria. Model performance is assessed across short (1-hour), medium (12-hour), and long-term (24-hour) forecasting horizons. The results highlight the potential of foundation models for mobility forecasting while also identifying limitations of our experiments.
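
The evaluation setup described above (MAE at short, medium, and long horizons) can be sketched with a seasonal-naive baseline standing in for the forecasting model; the hourly demand numbers below are made up, not the NYC or Vienna data.

```python
def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def seasonal_naive(history, horizon, season=24):
    """Repeat the last observed season (e.g. the previous day of hourly data)."""
    last_season = history[-season:]
    return [last_season[i % season] for i in range(horizon)]

history = [10, 12, 15, 14] * 6   # 24 "hours" of bike-share demand
actual = [11, 12, 16, 13] * 6    # the next day
for horizon in (1, 12, 24):
    pred = seasonal_naive(history, horizon)
    print(horizon, round(mae(actual[:horizon], pred), 3))
```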

[LG-78] A Survey of Circuit Foundation Model: Foundation AI Models for VLSI Circuit Design and EDA

链接: https://arxiv.org/abs/2504.03711
作者: Wenji Fang,Jing Wang,Yao Lu,Shang Liu,Yuchao Wu,Yuzhe Ma,Zhiyao Xie
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI)-driven electronic design automation (EDA) techniques have been extensively explored for VLSI circuit design applications. Most recently, foundation AI models for circuits have emerged as a new technology trend. Unlike traditional task-specific AI solutions, these new AI models are developed through two stages: 1) self-supervised pre-training on a large amount of unlabeled data to learn intrinsic circuit properties; and 2) efficient fine-tuning for specific downstream applications, such as early-stage design quality evaluation, circuit-related context generation, and functional verification. This new paradigm brings many advantages: model generalization, less reliance on labeled circuit data, efficient adaptation to new tasks, and unprecedented generative capability. In this paper, we propose referring to AI models developed with this new paradigm as circuit foundation models (CFMs). This paper provides a comprehensive survey of the latest progress in circuit foundation models, unprecedentedly covering over 130 relevant works. Over 90% of our introduced works were published in or after 2022, indicating that this emerging research trend has attracted wide attention in a short period. In this survey, we propose to categorize all existing circuit foundation models into two primary types: 1) encoder-based methods performing general circuit representation learning for predictive tasks; and 2) decoder-based methods leveraging large language models (LLMs) for generative tasks. For our introduced works, we cover their input modalities, model architecture, pre-training strategies, domain adaptation techniques, and downstream design applications. In addition, this paper discussed the unique properties of circuits from the data perspective. These circuit properties have motivated many works in this domain and differentiated them from general AI techniques.

[LG-79] Geometric Flow Models over Neural Network Weights

链接: https://arxiv.org/abs/2504.03710
作者: Ege Erdogan
类目: Machine Learning (cs.LG)
*备注: MSc Thesis

点击查看摘要

Abstract:Deep generative models such as flow and diffusion models have proven to be effective in modeling high-dimensional and complex data types such as videos or proteins, and this has motivated their use in different data modalities, such as neural network weights. A generative model of neural network weights would be useful for a diverse set of applications, such as Bayesian deep learning, learned optimization, and transfer learning. However, the existing work on weight-space generative models often ignores the symmetries of neural network weights, or only takes into account a subset of them. Modeling those symmetries, such as permutation symmetries between subsequent layers in an MLP, the filters in a convolutional network, or scaling symmetries arising with the use of non-linear activations, holds the potential to make weight-space generative modeling more efficient by effectively reducing the dimensionality of the problem. In this light, we aim to design generative models in weight-space that more comprehensively respect the symmetries of neural network weights. We build on recent work on generative modeling with flow matching, and weight-space graph neural networks to design three different weight-space flows. Each of our flows takes a different approach to modeling the geometry of neural network weights, and thus allows us to explore the design space of weight-space flows in a principled way. Our results confirm that modeling the geometry of neural networks more faithfully leads to more effective flow models that can generalize to different tasks and architectures, and we show that while our flows obtain competitive performance with orders of magnitude fewer parameters than previous work, they can be further improved by scaling them up. We conclude by listing potential directions for future work on weight-space generative models. 

[LG-80] A Theoretical Framework for Graph-based Digital Twins for Supply Chain Management and Optimization

链接: https://arxiv.org/abs/2504.03692
作者: Azmine Toushik Wasi,Mahfuz Ahmed Anik,Abdur Rahman,Md. Iqramul Hoque,MD Shafikul Islam,Md Manjurul Ahsan
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Supply chain management is growing increasingly complex due to globalization, evolving market demands, and sustainability pressures, yet traditional systems struggle with fragmented data and limited analytical capabilities. Graph-based modeling offers a powerful way to capture the intricate relationships within supply chains, while Digital Twins (DTs) enable real-time monitoring and dynamic simulations. However, current implementations often face challenges related to scalability, data integration, and the lack of sustainability-focused metrics. To address these gaps, we propose a Graph-Based Digital Twin Framework for Supply Chain Optimization, which combines graph modeling with DT architecture to create a dynamic, real-time representation of supply networks. Our framework integrates a Data Integration Layer to harmonize disparate sources, a Graph Construction Module to model complex dependencies, and a Simulation and Analysis Engine for scalable optimization. Importantly, we embed sustainability metrics - such as carbon footprints and resource utilization - into operational dashboards to drive eco-efficiency. By leveraging the synergy between graph-based modeling and DTs, our approach enhances scalability, improves decision-making, and enables organizations to proactively manage disruptions, cut costs, and transition toward greener, more resilient supply chains.

[LG-81] Predictive Maintenance of Electric Motors Using Supervised Learning Models: A Comparative Analysis

链接: https://arxiv.org/abs/2504.03670
作者: Amir Hossein Baradaran
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predictive maintenance is a key strategy for ensuring the reliability and efficiency of industrial systems. This study investigates the use of supervised learning models to diagnose the condition of electric motors, categorizing them as “Healthy,” “Needs Preventive Maintenance (PM),” or “Broken.” Key features of motor operation were employed to train various machine learning algorithms, including Naive Bayes, Support Vector Machines (SVM), Regression models, Random Forest, k-Nearest Neighbors (k-NN), and Gradient Boosting techniques. The performance of these models was evaluated to identify the most effective classifier for predicting motor health. Results showed notable differences in accuracy among the models, with one emerging as the best-performing solution. This study underscores the practicality of using supervised learning for electric motor diagnostics, providing a foundation for efficient maintenance scheduling and minimizing unplanned downtimes in industrial applications.
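
The comparison workflow above (train several classifiers on the same labeled motor data, score them, keep the best) can be sketched with two stand-in models; the 1-NN and majority-class classifiers below are simple placeholders for the algorithms in the study, and the feature values are invented.

```python
def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

def majority_baseline(train):
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)  # most frequent training label
    return lambda x: top

def one_nn(train):
    def predict(x):
        nearest = min(train, key=lambda xy: sum((a - b) ** 2
                                                for a, b in zip(xy[0], x)))
        return nearest[1]
    return predict

train = [((0.1, 0.2), "Healthy"), ((0.15, 0.25), "Healthy"),
         ((0.9, 0.8), "Broken"), ((0.5, 0.6), "Needs PM")]
test = [((0.2, 0.1), "Healthy"), ((0.8, 0.9), "Broken")]
models = {"baseline": majority_baseline(train), "1-NN": one_nn(train)}
for name, m in models.items():
    print(name, accuracy(m, test))
```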

[LG-82] Adaptive Orchestration for Inference of Large Foundation Models at the Edge

链接: https://arxiv.org/abs/2504.03668
作者: Fernando Koch,Aladin Djuhera,Alecio Binotto
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

Abstract:Large Foundation Models (LFMs), including multi-modal and generative AI models, promise to unlock new capabilities for next-generation Edge AI applications. However, performing inference with LFMs in resource-constrained and heterogeneous edge environments presents significant challenges for workload orchestration. We propose a novel adaptive orchestration method and system tailored specifically for managing distributed inference workloads across multi-access edge computing (MEC) infrastructures. Our approach enhances traditional workload orchestration by introducing dynamic methods including: (1) adaptive workload distribution that selects optimal, inter-connected edge nodes based on runtime capacity profiling; (2) dynamic redistribution of LFM partitions as operational conditions evolve, and; (3) real-time reconfiguration (e.g., re-splitting) of LFM layers to balance performance and privacy requirements. Our proposed framework introduces an architecture for adaptive split inference, enabling real-time, QoS-aware management of inference workloads. We present a reference architecture, detail operational mechanisms, and demonstrate its application through various use cases in real-world scenarios.

[LG-83] Memory and Bandwidth are All You Need for Fully Sharded Data Parallel

链接: https://arxiv.org/abs/2504.03655
作者: Jiangtao Wang,Jan Ebert,Oleg Filatov,Stefan Kesselheim
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformer models have revolutionized a wide spectrum of disciplines, especially in language processing. The recent success has proven that model size scalability is crucial for achieving superior performance metrics. However, training large transformer models is challenging even on modern hardware with powerful GPUs and high-speed interconnects. Existing studies primarily focus on optimizing model training distribution strategies to minimize memory footprint and enhance training speed, often overlooking the scalability challenges related to model size and hardware constraints. To address this oversight, we thoroughly investigate computational, memory, and network demands of training large transformers using the Fully Sharded Data Parallel (FSDP) distributed strategy across different hardware clusters. We explore the intricate relationships between model size and hardware setups to identify configurations that ensure maximum model and hardware efficiency, effective sequence length management, and optimal training throughput. A significant finding of our study is the critical interplay of the cluster’s connection bandwidth and GPU memory size compared to the computational performance of GPUs. This interplay limits training efficiency, underscoring the role of both hardware characteristics as a possible bottleneck. By integrating theoretical analysis with simulations and empirical tests, we demonstrate how hardware limitations affect training efficacy, identifying key hardware thresholds and the impact of network connectivity. Our findings prompt a reassessment of training strategies guiding users on the way to finding hardware-optimal FSDP configurations, enhancing training efficiency for large-scale transformer models.
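
The memory pressure FSDP relieves can be seen with simple arithmetic: parameters, gradients, and Adam optimizer states are sharded across N GPUs, so the persistent per-GPU footprint shrinks roughly linearly with cluster size. A simplified accounting sketch (it ignores activations and communication buffers, and the 16 bytes/parameter figure assumes mixed-precision Adam):

```python
def fsdp_state_gb(params_billion, num_gpus):
    """Persistent training state per GPU, in GB: 2 bytes (fp16 params) +
    2 (fp16 grads) + 12 (fp32 master copy + two Adam moments)."""
    bytes_per_param = 16
    total_gb = params_billion * 1e9 * bytes_per_param / 1e9
    return total_gb / num_gpus

print(fsdp_state_gb(7, 1))  # 112.0 GB: a 7B model overflows a single 80 GB GPU
print(fsdp_state_gb(7, 8))  # 14.0 GB per GPU when fully sharded over 8 GPUs
```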

[LG-84] A Modern Approach to Real-Time Air Traffic Management System

链接: https://arxiv.org/abs/2504.03652
作者: Priyank Vaidya,Vedansh Kamdar
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 5 Pages along with 7 Figures

点击查看摘要

Abstract:Air traffic analytics systems are pivotal for ensuring safety, efficiency, and predictability in air travel. However, traditional systems struggle to handle the increasing volume and complexity of air traffic data. This project explores the application of real-time big data processing frameworks like Apache Spark, HDFS, and Spark Streaming for developing a new robust system. By reviewing existing research on real-time systems and analyzing the challenges and opportunities presented by big data technologies, we propose an architecture for a real-time system. Our project pipeline involves real-time data collection from flight information sources through flight APIs, ingestion into Kafka, and transmission to Elasticsearch for visualization using Kibana. Additionally, we present a dashboard of U.S. airlines on PowerBI, demonstrating the potential of real-time analytics in revolutionizing air traffic management.

[LG-85] Machine Learning Models for Reinforced Concrete Pipes Condition Prediction: The State-of-the-Art Using Artificial Neural Networks and Multiple Linear Regression in a Wisconsin Case Study

链接: https://arxiv.org/abs/2502.00363
作者: Mohsen Mohammadagha,Mohammad Najafi,Vinayak Kaushal,Ahmad Mahmoud Ahmad Jibreen
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:The aging sewer infrastructure in the U.S., covering 2.1 million kilometers, encounters increasing structural issues, resulting in around 75,000 yearly sanitary sewer overflows that present serious economic, environmental, and public health hazards. Conventional inspection techniques and deterministic models do not account for the unpredictable nature of sewer decline, whereas probabilistic methods depend on extensive historical data, which is frequently lacking or incomplete. This research intends to enhance predictive accuracy for the condition of sewer pipelines through machine learning models, namely artificial neural networks (ANNs) and multiple linear regression (MLR), by integrating factors such as pipe age, material, diameter, environmental influences, and PACP ratings. ANNs utilized ReLU activation functions and Adam optimization, whereas MLR applied regularization to address multicollinearity, with both models assessed through metrics like RMSE, MAE, and R2. The findings indicated that ANNs surpassed MLR, attaining an R2 of 0.9066 compared to MLR's 0.8474, successfully modeling nonlinear relationships while preserving generalization. MLR, on the other hand, offered enhanced interpretability by pinpointing significant predictors such as residual buildup. As a result, pipeline degradation is driven by pipe length, age, and pipe diameter as key predictors, while depth, soil type, and segment show minimal influence in this analysis. Future studies ought to prioritize hybrid models that merge the accuracy of ANNs with the interpretability of MLR, incorporating advanced methods such as SHAP analysis and transfer learning to improve scalability in managing infrastructure and promoting environmental sustainability.
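
The R2 scores quoted above (0.9066 for the ANN versus 0.8474 for MLR) are the standard coefficient of determination. A minimal reference implementation on made-up condition data, for readers comparing the two numbers:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]       # hypothetical condition ratings
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(actual, predicted), 3))  # 0.98
```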

[LG-86] Aggregating time-series and image data: functors and double functors

链接: https://arxiv.org/abs/2504.05274
作者: Joscha Diehl
类目: Category Theory (math.CT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Aggregation of time-series or image data over subsets of the domain is a fundamental task in data science. We show that many known aggregation operations can be interpreted as (double) functors on appropriate (double) categories. Such functorial aggregations are amenable to parallel implementation via straightforward extensions of Blelloch’s parallel scan algorithm. In addition to providing a unified viewpoint on existing operations, it allows us to propose new aggregation operations for time-series and image data.
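
The parallel-scan connection above hinges on one ingredient: an associative combine operation (a monoid), which is exactly what Blelloch's algorithm requires. A sequential sketch using itertools.accumulate; the same combine functions could be handed to a parallel scan unchanged. The running-mean-via-(sum, count) trick is a standard example, not taken from the paper.

```python
from itertools import accumulate

series = [3, 1, 4, 1, 5, 9, 2, 6]

# Monoid 1: running max over prefixes of the time series.
running_max = list(accumulate(series, max))

# Monoid 2: (sum, count) pairs, from which running means are recovered.
def add_pairs(a, b):
    """Associative combine on (sum, count)."""
    return (a[0] + b[0], a[1] + b[1])

pairs = list(accumulate(((x, 1) for x in series), add_pairs))
running_mean = [s / n for s, n in pairs]

print(running_max)  # [3, 3, 4, 4, 5, 9, 9, 9]
print([round(m, 2) for m in running_mean])
```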

[LG-87] IAEmu: Learning Galaxy Intrinsic Alignment Correlations

链接: https://arxiv.org/abs/2504.05235
作者: Sneh Pandya,Yuanyuan Yang,Nicholas Van Alfen,Jonathan Blazek,Robin Walters
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
*备注: 16 pages, 10 figures, 1 table

点击查看摘要

Abstract:The intrinsic alignments (IA) of galaxies, a key contaminant in weak lensing analyses, arise from correlations in galaxy shapes driven by tidal interactions and galaxy formation processes. Accurate IA modeling is essential for robust cosmological inference, but current approaches rely on perturbative methods that break down on nonlinear scales or on expensive simulations. We introduce IAEmu, a neural network-based emulator that predicts the galaxy position-position ( \xi ), position-orientation ( \omega ), and orientation-orientation ( \eta ) correlation functions and their uncertainties using mock catalogs based on the halo occupation distribution (HOD) framework. Compared to simulations, IAEmu achieves ~3% average error for \xi and ~5% for \omega , while capturing the stochasticity of \eta without overfitting. The emulator provides both aleatoric and epistemic uncertainties, helping identify regions where predictions may be less reliable. We also demonstrate generalization to non-HOD alignment signals by fitting to IllustrisTNG hydrodynamical simulation data. As a fully differentiable neural network, IAEmu enables \sim 10,000 \times speed-ups in mapping HOD parameters to correlation functions on GPUs, compared to CPU-based simulations. This acceleration facilitates inverse modeling via gradient-based sampling, making IAEmu a powerful surrogate model for galaxy bias and IA studies with direct applications to Stage IV weak lensing surveys.

[LG-88] Hybrid machine learning data assimilation for marine biogeochemistry

链接: https://arxiv.org/abs/2504.05218
作者: Ieuan Higgs,Ross Bannister,Jozef Skákala,Alberto Carrassi,Stefano Ciavatta
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 31 pages, 13 figures (10 in main text, 3 in appendix)

点击查看摘要

Abstract:Marine biogeochemistry models are critical for forecasting, as well as estimating ecosystem responses to climate change and human activities. Data assimilation (DA) improves these models by aligning them with real-world observations, but marine biogeochemistry DA faces challenges due to model complexity, strong nonlinearity, and sparse, uncertain observations. Existing DA methods applied to marine biogeochemistry struggle to update unobserved variables effectively, while ensemble-based methods are computationally too expensive for high-complexity marine biogeochemistry models. This study demonstrates how machine learning (ML) can improve marine biogeochemistry DA by learning statistical relationships between observed and unobserved variables. We integrate ML-driven balancing schemes into a 1D prototype of a system used to forecast marine biogeochemistry in the North-West European Shelf seas. ML is applied to predict (i) state-dependent correlations from free-run ensembles and (ii), in an ``end-to-end’’ fashion, analysis increments from an Ensemble Kalman Filter. Our results show that ML significantly enhances updates for previously not-updated variables when compared to univariate schemes akin to those used operationally. Furthermore, ML models exhibit moderate transferability to new locations, a crucial step toward scaling these methods to 3D operational systems. We conclude that ML offers a clear pathway to overcome current computational bottlenecks in marine biogeochemistry DA and that refining transferability, optimizing training data sampling, and evaluating scalability for large-scale marine forecasting, should be future research priorities.

[LG-89] Machine learning interatomic potential can infer electrical response

链接: https://arxiv.org/abs/2504.05169
作者: Peichen Zhong,Dongjin Kim,Daniel S. King,Bingqing Cheng
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Modeling the response of material and chemical systems to electric fields remains a longstanding challenge. Machine learning interatomic potentials (MLIPs) offer an efficient and scalable alternative to quantum mechanical methods but do not by themselves incorporate electrical response. Here, we show that polarization and Born effective charge (BEC) tensors can be directly extracted from long-range MLIPs within the Latent Ewald Summation (LES) framework, solely by learning from energy and force data. Using this approach, we predict the infrared spectra of bulk water under zero or finite external electric fields, ionic conductivities of high-pressure superionic ice, and the phase transition and hysteresis in ferroelectric PbTiO_3 perovskite. This work thus extends the capability of MLIPs to predict electrical response–without training on charges or polarization or BECs–and enables accurate modeling of electric-field-driven processes in diverse systems at scale.

[LG-90] DDPM Score Matching and Distribution Learning

链接: https://arxiv.org/abs/2504.05161
作者: Sinho Chewi,Alkis Kalavasis,Anay Mehrotra,Omar Montasser
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: Abstract shortened to fit arXiv limit

点击查看摘要

Abstract:Score estimation is the backbone of score-based generative models (SGMs), especially denoising diffusion probabilistic models (DDPMs). A key result in this area shows that with accurate score estimates, SGMs can efficiently generate samples from any realistic data distribution (Chen et al., ICLR’23; Lee et al., ALT’23). This distribution learning result, where the learned distribution is implicitly that of the sampler’s output, does not explain how score estimation relates to classical tasks of parameter and density estimation. This paper introduces a framework that reduces score estimation to these two tasks, with various implications for statistical and computational learning theory: Parameter Estimation: Koehler et al. (ICLR’23) demonstrate that a score-matching variant is statistically inefficient for the parametric estimation of multimodal densities common in practice. In contrast, we show that under mild conditions, denoising score-matching in DDPMs is asymptotically efficient. Density Estimation: By linking generation to score estimation, we lift existing score estimation guarantees to (\epsilon,\delta) -PAC density estimation, i.e., a function approximating the target log-density within \epsilon on all but a \delta -fraction of the space. We provide (i) minimax rates for density estimation over Hölder classes and (ii) a quasi-polynomial PAC density estimation algorithm for the classical Gaussian location mixture model, building on and addressing an open problem from Gatmiry et al. (arXiv’24). Lower Bounds for Score Estimation: Our framework offers the first principled method to prove computational lower bounds for score estimation across general distributions. As an application, we establish cryptographic lower bounds for score estimation in general Gaussian mixture models, conceptually recovering Song’s (NeurIPS’24) result and advancing his key open problem. 

[LG-91] Online Cluster-Based Parameter Control for Metaheuristic

链接: https://arxiv.org/abs/2504.05144
作者: Vasileios A. Tatsis,Dimos Ioannidis
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The concept of parameter setting is a crucial and significant process in metaheuristics since it can majorly impact their performance. It is a highly complex and challenging procedure since it requires a deep understanding of the optimization algorithm and the optimization problem at hand. In recent years, the upcoming rise of autonomous decision systems has attracted ongoing scientific interest in this direction, utilizing a considerable number of parameter-tuning methods. There are two types of methods: offline and online. Online methods usually excel in complex real-world problems, as they can offer dynamic parameter control throughout the execution of the algorithm. The present work proposes a general-purpose online parameter-tuning method called Cluster-Based Parameter Adaptation (CPA) for population-based metaheuristics. The main idea lies in the identification of promising areas within the parameter search space and in the generation of new parameters around these areas. The method’s validity has been demonstrated using the differential evolution algorithm and verified in established test suites of low- and high-dimensional problems. The obtained results are statistically analyzed and compared with state-of-the-art algorithms, including advanced auto-tuning approaches. The analysis reveals CPA’s solid and promising performance, as well as its robustness across a variety of benchmark problems and dimensions.

[LG-92] Stacking Variational Bayesian Monte Carlo

链接: https://arxiv.org/abs/2504.05004
作者: Francesco Silvestrin,Chengkun Li,Luigi Acerbi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted at the Workshop track of the 7th Symposium in Advances in Approximate Bayesian Inference (AABI 2025). 24 pages, 9 figures

点击查看摘要

Abstract:Variational Bayesian Monte Carlo (VBMC) is a sample-efficient method for approximate Bayesian inference with computationally expensive likelihoods. While VBMC’s local surrogate approach provides stable approximations, its conservative exploration strategy and limited evaluation budget can cause it to miss regions of complex posteriors. In this work, we introduce Stacking Variational Bayesian Monte Carlo (S-VBMC), a method that constructs global posterior approximations by merging independent VBMC runs through a principled and inexpensive post-processing step. Our approach leverages VBMC’s mixture posterior representation and per-component evidence estimates, requiring no additional likelihood evaluations while being naturally parallelizable. We demonstrate S-VBMC’s effectiveness on two synthetic problems designed to challenge VBMC’s exploration capabilities and two real-world applications from computational neuroscience, showing substantial improvements in posterior approximation quality across all cases.
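
摘要中“按证据估计加权合并独立 VBMC 运行的混合后验”这一步,可以用如下极简草图示意。注意:这里假设每次运行给出(分量权重、分量均值、整体对数证据)这样一个示意性接口,而论文实际利用的是逐分量的证据估计,细节请以原文为准。

```python
import numpy as np

def stack_vbmc_posteriors(runs):
    """按(归一化后的)证据估计加权,合并多次独立运行得到的混合后验。
    runs: [(weights, means, log_evidence), ...],示意性的简化接口。"""
    log_Z = np.array([r[2] for r in runs], dtype=float)
    run_w = np.exp(log_Z - log_Z.max())   # 减去最大值以保证数值稳定
    run_w /= run_w.sum()                  # 各运行的证据权重
    all_w, all_mu = [], []
    for (w, mu, _), rw in zip(runs, run_w):
        w = np.asarray(w, dtype=float)
        all_w.extend(rw * w / w.sum())    # 运行内权重归一化后按证据缩放
        all_mu.extend(np.asarray(mu, dtype=float))
    return np.array(all_w), np.array(all_mu)
```

合并只是后处理,不需要任何新增的似然评估,这正是摘要强调的“廉价且可并行”的来源。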

[LG-93] Closed-Loop Neural Operator-Based Observer of Traffic Density

链接: https://arxiv.org/abs/2504.04873
作者: Alice Harting,Karl Henrik Johansson,Matthieu Barreau
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the problem of traffic density estimation with sparse measurements from stationary roadside sensors. Our approach uses Fourier neural operators to learn macroscopic traffic flow dynamics from high-fidelity microscopic-level simulations. During inference, the operator functions as an open-loop predictor of traffic evolution. To close the loop, we couple the open-loop operator with a correction operator that combines the predicted density with sparse measurements from the sensors. Simulations with the SUMO software indicate that, compared to open-loop observers, the proposed closed-loop observer exhibits classical closed-loop properties such as robustness to noise and ultimate boundedness of the error. This shows the advantages of combining learned physics with real-time corrections, and opens avenues for accurate, efficient, and interpretable data-driven observers.

[LG-94] Sparse Optimization for Transfer Learning: A L0-Regularized Framework for Multi-Source Domain Adaptation

链接: https://arxiv.org/abs/2504.04812
作者: Chenqi Gong,Hu Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper explores transfer learning in heterogeneous multi-source environments with distributional divergence between target and auxiliary domains. To address challenges in statistical bias and computational efficiency, we propose a Sparse Optimization for Transfer Learning (SOTL) framework based on L0-regularization. The method extends the Joint Estimation Transferred from Strata (JETS) paradigm with two key innovations: (1) L0-constrained exact sparsity for parameter space compression and complexity reduction, and (2) refining optimization focus to emphasize target parameters over redundant ones. Simulations show that SOTL significantly improves both estimation accuracy and computational speed, especially under adversarial auxiliary domain conditions. Empirical validation on the Community and Crime benchmarks demonstrates the statistical robustness of the SOTL method in cross-domain transfer.

[LG-95] asKAN: Active Subspace embedded Kolmogorov-Arnold Network

链接: https://arxiv.org/abs/2504.04669
作者: Zhiteng Zhou,Zhaoyue Xu,Yi Liu,Shizhao Wang
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Kolmogorov-Arnold Network (KAN) has emerged as a promising neural network architecture for small-scale AI+Science applications. However, it suffers from inflexibility in modeling ridge functions, which are widely used in representing the relationships in physical systems. This study investigates this inflexibility through the lens of the Kolmogorov-Arnold theorem, which starts the representation of multivariate functions from constructing the univariate components rather than combining the independent variables. Our analysis reveals that incorporating linear combinations of independent variables can substantially simplify the network architecture in representing the ridge functions. Inspired by this finding, we propose active subspace embedded KAN (asKAN), a hierarchical framework that synergizes KAN’s function representation with active subspace methodology. The architecture strategically embeds active subspace detection between KANs, where the active subspace method is used to identify the primary ridge directions and the independent variables are adaptively projected onto these critical dimensions. The proposed asKAN is implemented in an iterative way without increasing the number of neurons in the original KAN. The proposed method is validated through function fitting, solving the Poisson equation, and reconstructing sound field. Compared with KAN, asKAN significantly reduces the error using the same network architecture. The results suggest that asKAN enhances the capability of KAN in fitting and solving equations in the form of ridge functions.

[LG-96] Interval-Valued Time Series Classification Using D_K-Distance

链接: https://arxiv.org/abs/2504.04667
作者: Wan Tian,Zhongfeng Qin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, modeling and analysis of interval-valued time series have garnered increasing attention in econometrics, finance, and statistics. However, these studies have predominantly focused on statistical inference in the forecasting of univariate and multivariate interval-valued time series, overlooking another important aspect: classification. In this paper, we introduce a classification approach that treats intervals as unified entities, applicable to both univariate and multivariate interval-valued time series. Specifically, we first extend the point-valued time series imaging methods to interval-valued scenarios using the D_K -distance, enabling the imaging of interval-valued time series. Then, we employ a suitable deep learning model for classification on the obtained imaging dataset, aiming to achieve classification for interval-valued time series. In theory, we derive a sharper excess risk bound for deep multiclassifiers based on offset Rademacher complexity. Finally, we validate the superiority of the proposed method through comparisons with various existing point-valued time series classification methods in both simulation studies and real data applications.

[LG-97] A Novel Algorithm for Personalized Federated Learning: Knowledge Distillation with Weighted Combination Loss

链接: https://arxiv.org/abs/2504.04642
作者: Hengrui Hu,Anai N. Kothari,Anjishnu Banerjee
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) offers a privacy-preserving framework for distributed machine learning, enabling collaborative model training across diverse clients without centralizing sensitive data. However, statistical heterogeneity, characterized by non-independent and identically distributed (non-IID) client data, poses significant challenges, leading to model drift and poor generalization. This paper proposes a novel algorithm, pFedKD-WCL (Personalized Federated Knowledge Distillation with Weighted Combination Loss), which integrates knowledge distillation with bi-level optimization to address non-IID challenges. pFedKD-WCL leverages the current global model as a teacher to guide local models, optimizing both global convergence and local personalization efficiently. We evaluate pFedKD-WCL on the MNIST dataset and a synthetic dataset with non-IID partitioning, using multinomial logistic regression and multilayer perceptron models. Experimental results demonstrate that pFedKD-WCL outperforms state-of-the-art algorithms, including FedAvg, FedProx, Per-FedAvg, and pFedMe, in terms of accuracy and convergence speed.
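
作为参照,下面给出一个通用的“加权组合”蒸馏损失草图:真实标签的交叉熵与对教师软化输出的 KL 散度按系数 α 组合。这只是教科书式的知识蒸馏损失,并非论文 pFedKD-WCL 的精确形式;其中温度 T、系数 α 等均为示意性参数。

```python
import numpy as np

def softmax(z, T=1.0):
    """带温度 T 的 softmax(数值稳定实现)。"""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_combined_loss(student_logits, teacher_logits, label, alpha=0.5, T=2.0):
    """加权组合损失:(1-alpha)*交叉熵 + alpha*T^2*KL(teacher || student)。
    示意性实现,非论文的精确损失形式。"""
    p_s = softmax(student_logits)
    ce = -np.log(p_s[label])                       # 对真实标签的交叉熵
    p_t = softmax(teacher_logits, T)               # 教师的软化输出
    p_sT = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_sT)))  # 蒸馏项
    return (1 - alpha) * ce + alpha * (T ** 2) * kl
```

在联邦场景下,教师即当前全局模型,α 控制“全局收敛”与“本地个性化”两项目标之间的折中。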

[LG-98] Scalable Approximate Algorithms for Optimal Transport Linear Models

链接: https://arxiv.org/abs/2504.04609
作者: Tomasz Kacprzak,Francois Kamper,Michael W. Heiss,Gianluca Janka,Ann M. Dillner,Satoshi Takahama
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Code will be made available at this address: this https URL

点击查看摘要

Abstract:Recently, linear regression models incorporating an optimal transport (OT) loss have been explored for applications such as supervised unmixing of spectra, music transcription, and mass spectrometry. However, these task-specific approaches often do not generalize readily to a broader class of linear models. In this work, we propose a novel algorithmic framework for solving a general class of non-negative linear regression models with an entropy-regularized OT datafit term, based on Sinkhorn-like scaling iterations. Our framework accommodates convex penalty functions on the weights (e.g. squared- \ell_2 and \ell_1 norms), and admits additional convex loss terms between the transported marginal and target distribution (e.g. squared error or total variation). We derive simple multiplicative updates for common penalty and datafit terms. This method is suitable for large-scale problems due to its simplicity of implementation and straightforward parallelization.
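
论文的算法框架基于类 Sinkhorn 缩放迭代。下面给出标准熵正则化最优传输的 Sinkhorn 迭代草图,以说明这类交替乘性缩放更新的基本形态(教科书版本,并非论文中带凸惩罚项与附加数据拟合项的广义乘性更新):

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.1, n_iter=500):
    """标准熵正则化最优传输的 Sinkhorn 缩放迭代。
    C: 代价矩阵;a, b: 源/目标边缘分布;eps: 熵正则化强度。"""
    K = np.exp(-C / eps)          # Gibbs 核
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)         # 列缩放,匹配目标边缘
        u = a / (K @ v)           # 行缩放,匹配源边缘
    return u[:, None] * K * v[None, :]   # 传输计划
```

这类更新只涉及逐元素乘除与矩阵乘向量,因而如摘要所说易于实现与并行化。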

[LG-99] Cramer-Rao Bounds for Laplacian Matrix Estimation

链接: https://arxiv.org/abs/2504.04576
作者: Morad Halihal,Tirza Routtenberg,H. Vincent Poor
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In this paper, we analyze the performance of the estimation of Laplacian matrices under general observation models. Laplacian matrix estimation involves structural constraints, including symmetry and null-space properties, along with matrix sparsity. By exploiting a linear reparametrization that enforces the structural constraints, we derive closed-form matrix expressions for the Cramer-Rao Bound (CRB) specifically tailored to Laplacian matrix estimation. We further extend the derivation to the sparsity-constrained case, introducing two oracle CRBs that incorporate prior information of the support set, i.e. the locations of the nonzero entries in the Laplacian matrix. We examine the properties and order relations between the bounds, and provide the associated Slepian-Bangs formula for the Gaussian case. We demonstrate the use of the new CRBs in three representative applications: (i) topology identification in power systems, (ii) graph filter identification in diffused models, and (iii) precision matrix estimation in Gaussian Markov random fields under Laplacian constraints. The CRBs are evaluated and compared with the mean-squared-errors (MSEs) of the constrained maximum likelihood estimator (CMLE), which integrates both equality and inequality constraints along with sparsity constraints, and of the oracle CMLE, which knows the locations of the nonzero entries of the Laplacian matrix. We perform this analysis for the applications of power system topology identification and graphical LASSO, and demonstrate that the MSEs of the estimators converge to the CRB and oracle CRB, given a sufficient number of measurements.

[LG-100] Generative Market Equilibrium Models with Stable Adversarial Learning via Reinforcement

链接: https://arxiv.org/abs/2504.04300
作者: Anastasis Kratsios,Xiaofei Shi,Qiang Sun,Zhanhao Zhang
类目: Mathematical Finance (q-fin.MF); Machine Learning (cs.LG); Pricing of Securities (q-fin.PR)
*备注:

点击查看摘要

Abstract:We present a general computational framework for solving continuous-time financial market equilibria under minimal modeling assumptions while incorporating realistic financial frictions, such as trading costs, and supporting multiple interacting agents. Inspired by generative adversarial networks (GANs), our approach employs a novel generative deep reinforcement learning framework with a decoupling feedback system embedded in the adversarial training loop, which we term the \emph{reinforcement link}. This architecture stabilizes the training dynamics by incorporating feedback from the discriminator. Our theoretically guided feedback mechanism enables the decoupling of the equilibrium system, overcoming challenges that hinder conventional numerical algorithms. Experimentally, our algorithm not only learns but also provides testable predictions on how asset returns and volatilities emerge from the endogenous trading behavior of market participants, where traditional analytical methods fall short. The design of our model is further supported by an approximation guarantee.

[LG-101] Randomised Postiterations for Calibrated BayesCG

链接: https://arxiv.org/abs/2504.04247
作者: Niall Vyas,Disha Hegde,Jon Cockayne
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:The Bayesian conjugate gradient method offers probabilistic solutions to linear systems but suffers from poor calibration, limiting its utility in uncertainty quantification tasks. Recent approaches leveraging postiterations to construct priors have improved computational properties but failed to correct calibration issues. In this work, we propose a novel randomised postiteration strategy that enhances the calibration of the BayesCG posterior while preserving its favourable convergence characteristics. We present theoretical guarantees for the improved calibration, supported by results on the distribution of posterior errors. Numerical experiments demonstrate the efficacy of the method in both synthetic and inverse problem settings, showing enhanced uncertainty quantification and better propagation of uncertainties through computational pipelines.

[LG-102] Variational autoencoders understand knot topology

链接: https://arxiv.org/abs/2504.04179
作者: Anna Braghetto,Sumanta Kundu,Marco Baiesi,Enzo Orlandini
类目: Statistical Mechanics (cond-mat.stat-mech); Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Supervised machine learning (ML) methods are emerging as valid alternatives to standard mathematical methods for identifying knots in long, collapsed polymers. Here, we introduce a hybrid supervised/unsupervised ML approach for knot classification based on a variational autoencoder enhanced with a knot type classifier (VAEC). The neat organization of knots in its latent representation suggests that the VAEC, only based on an arbitrary labeling of three-dimensional configurations, has grasped complex topological concepts such as chirality, unknotting number, braid index, and the grouping in families such as achiral, torus, and twist knots. The understanding of topological concepts is confirmed by the ability of the VAEC to distinguish the chirality of knots 9_42 and 10_71 not used for its training and with a notoriously undetected chirality to standard tools. The well-organized latent space is also key for generating configurations with the decoder that reliably preserves the topology of the input ones. Our findings demonstrate the ability of a hybrid supervised-generative ML algorithm to capture different topological features of entangled filaments and to exploit this knowledge to faithfully reconstruct or produce new knotted configurations without simulations.

[LG-103] Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes

链接: https://arxiv.org/abs/2504.04105
作者: Ruiqi Zhang,Jingfeng Wu,Licong Lin,Peter L. Bartlett
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 27 pages

点击查看摘要

Abstract:We study \textit{gradient descent} (GD) for logistic regression on linearly separable data with stepsizes that adapt to the current risk, scaled by a constant hyperparameter \eta . We show that after at most 1/\gamma^2 burn-in steps, GD achieves a risk upper bounded by \exp(-\Theta(\eta)) , where \gamma is the margin of the dataset. As \eta can be arbitrarily large, GD attains an arbitrarily small risk \textit{immediately after the burn-in steps}, though the risk evolution may be \textit{non-monotonic}. We further construct hard datasets with margin \gamma , where any batch or online first-order method requires \Omega(1/\gamma^2) steps to find a linear separator. Thus, GD with large, adaptive stepsizes is \textit{minimax optimal} among first-order batch methods. Notably, the classical \textit{Perceptron} (Novikoff, 1962), a first-order online method, also achieves a step complexity of 1/\gamma^2 , matching GD even in constants. Finally, our GD analysis extends to a broad class of loss functions and certain two-layer networks.
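
“按当前风险缩放步长”的一种可能实现是取 η_t = η / R(w_t),即风险越小步长越大。下面的草图仅是对摘要的猜测性解读,并非论文的确切方案:

```python
import numpy as np

def logistic_risk(w, X, y):
    """经验逻辑损失 R(w) = mean log(1 + exp(-y * x·w)),数值稳定实现。"""
    m = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -m))

def adaptive_gd(X, y, eta=0.1, n_iter=100):
    """逻辑回归上按当前风险缩放步长的梯度下降示意:步长取 eta / R(w_t)。
    这是对摘要中自适应步长方案的一种猜测性实现。"""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        m = y * (X @ w)
        s = 1.0 / (1.0 + np.exp(m))              # sigma(-y * x·w)
        grad = -(X * (y * s)[:, None]).mean(axis=0)
        w -= (eta / logistic_risk(w, X, y)) * grad
    return w
```

在线性可分数据上,风险沿训练单调缩小时步长随之放大,直观上对应摘要中“burn-in 之后风险可以任意小”的行为。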

[LG-104] Computational Efficient Informative Nonignorable Matrix Completion: A Row- and Column-Wise Matrix U-Statistic Pseudo-Likelihood Approach

链接: https://arxiv.org/abs/2504.04016
作者: Yuanhong A,Guoyu Zhang,Yongcheng Zeng,Bo Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this study, we establish a unified framework to deal with the high dimensional matrix completion problem under flexible nonignorable missing mechanisms. Although the matrix completion problem has attracted much attention over the years, very few works consider the nonignorable missing mechanism. To address this problem, we derive a row- and column-wise matrix U-statistics type loss function, with the nuclear norm for regularization. A singular value proximal gradient algorithm is developed to solve the proposed optimization problem. We prove the non-asymptotic upper bound of the estimation error’s Frobenius norm and show the performance of our method through numerical simulations and real data analysis.
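
摘要中的奇异值近端梯度算法,其核心步骤是核范数的近端算子,即标准的奇异值软阈值(singular value thresholding)。该算子的通用形式如下(与具体损失函数无关):

```python
import numpy as np

def svt(M, tau):
    """奇异值软阈值:核范数 tau * ||X||_* 的近端算子。
    对 M 做 SVD,将奇异值整体减去 tau 并截断到非负,再重构矩阵。"""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return (U * s) @ Vt
```

近端梯度迭代即在每步梯度下降之后套用一次该算子,从而把迭代点拉向低秩矩阵。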

[LG-105] Spatially-Heterogeneous Causal Bayesian Networks for Seismic Multi-Hazard Estimation: A Variational Approach with Gaussian Processes and Normalizing Flows

链接: https://arxiv.org/abs/2504.04013
作者: Xuechun Li,Shan Gao,Runyu Gao,Susu Xu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Post-earthquake hazard and impact estimation are critical for effective disaster response, yet current approaches face significant limitations. Traditional models employ fixed parameters regardless of geographical context, misrepresenting how seismic effects vary across diverse landscapes, while remote sensing technologies struggle to distinguish between co-located hazards. We address these challenges with a spatially-aware causal Bayesian network that decouples co-located hazards by modeling their causal relationships with location-specific parameters. Our framework integrates sensing observations, latent variables, and spatial heterogeneity through a novel combination of Gaussian Processes with normalizing flows, enabling us to capture how same earthquake produces different effects across varied geological and topographical features. Evaluations across three earthquakes demonstrate Spatial-VCBN achieves Area Under the Curve (AUC) improvements of up to 35.2% over existing methods. These results highlight the critical importance of modeling spatial heterogeneity in causal mechanisms for accurate disaster assessment, with direct implications for improving emergency response resource allocation.

[LG-106] Machine Learning Reveals Composition Dependent Thermal Stability in Halide Perovskites

链接: https://arxiv.org/abs/2504.04002
作者: Abigail R. Hering,Mansha Dubey,Elahe Hosseini,Meghna Srivastava,Yu An,Juan-Pablo Correa-Baena,Houman Homayoun,Marina S. Leite
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 21 pages, 5 figures

点击查看摘要

Abstract:Halide perovskites exhibit unpredictable properties in response to environmental stressors, due to several composition-dependent degradation mechanisms. In this work, we apply data visualization and machine learning (ML) techniques to reveal unexpected correlations between composition, temperature, and material properties while using high throughput, in situ environmental photoluminescence (PL) experiments. Correlation heatmaps show the strong influence of Cs content on film degradation, and dimensionality reduction visualization methods uncover clear composition-based data clusters. An extreme gradient boosting algorithm (XGBoost) effectively forecasts PL features for ten perovskite films with both composition-agnostic (85% accuracy) and composition-dependent (75% accuracy) model approaches, while elucidating the relative feature importance of composition (up to 99%). This model validates a previously unseen anti-correlation between Cs content and material thermal stability. Our ML-based framework can be expanded to any perovskite family, significantly reducing the analysis time currently employed to identify stable options for photovoltaics.

[LG-107] Regression Discontinuity Design with Distribution-Valued Outcomes

链接: https://arxiv.org/abs/2504.03992
作者: David Van Dijcke
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP); Methodology (stat.ME)
*备注: 36 pages, 6 figures, 1 table, R package available at this https URL

点击查看摘要

Abstract:This article introduces Regression Discontinuity Design (RDD) with Distribution-Valued Outcomes (R3D), extending the standard RDD framework to settings where the outcome is a distribution rather than a scalar. Such settings arise when treatment is assigned at a higher level of aggregation than the outcome-for example, when a subsidy is allocated based on a firm-level revenue cutoff while the outcome of interest is the distribution of employee wages within the firm. Since standard RDD methods cannot accommodate such two-level randomness, I propose a novel approach based on random distributions. The target estimand is a “local average quantile treatment effect”, which averages across random quantiles. To estimate this target, I introduce two related approaches: one that extends local polynomial regression to random quantiles and another based on local Fréchet regression, a form of functional regression. For both estimators, I establish asymptotic normality and develop uniform, debiased confidence bands together with a data-driven bandwidth selection procedure. Simulations validate these theoretical properties and show existing methods to be biased and inconsistent in this setting. I then apply the proposed methods to study the effects of gubernatorial party control on within-state income distributions in the US, using a close-election design. The results suggest a classic equality-efficiency tradeoff under Democratic governorship, driven by reductions in income at the top of the distribution.

[LG-108] Batch Bayesian Optimization for High-Dimensional Experimental Design: Simulation and Visualization

链接: https://arxiv.org/abs/2504.03943
作者: Imon Mia,Armi Tiihonen,Anna Ernst,Anusha Srivastava,Tonio Buonassisi,William Vandenberghe,Julia W.P. Hsu
类目: Machine Learning (stat.ML); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bayesian Optimization (BO) is increasingly used to guide experimental optimization tasks. To elucidate BO behavior in noisy and high-dimensional settings typical for materials science applications, we perform batch BO of two six-dimensional test functions: an Ackley function representing a needle-in-a-haystack problem and a Hartmann function representing a problem with a false maximum with a value close to the global maximum. We show learning curves, performance metrics, and visualization to effectively track the evolution of optimization in high dimensions and evaluate how they are affected by noise, batch-picking method, choice of acquisition function, and its exploration hyperparameter values. We find that the effects of noise depend on the problem landscape; therefore, prior knowledge of the domain structure and noise level is needed when designing BO. The Ackley function optimization is significantly degraded by noise with a complete loss of ground truth resemblance when noise equals 10% of the maximum objective value. For the Hartmann function, even in the absence of noise, a significant fraction of the initial samplings identify the false maximum instead of the ground truth maximum as the optimum of the function; with increasing noise, BO remains effective, albeit with increasing probability of landing on the false maximum. This study systematically highlights the critical issues when setting up BO and choosing synthetic data to test experimental design. The results and methodology will facilitate wider utilization of BO in guiding experiments, specifically in high-dimensional settings.
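
摘要中作为“针在草堆”问题的 Ackley 测试函数,其标准定义如下(全局最小值位于原点且取值为 0;六维情形即取 x 为六维向量,BO 做最大化时通常取其负值):

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    """标准 Ackley 测试函数,f(0) = 0 为全局最小值。"""
    x = np.asarray(x, dtype=float)
    return (-a * np.exp(-b * np.sqrt(np.mean(x ** 2)))
            - np.exp(np.mean(np.cos(c * x))) + a + np.e)
```

远离原点时函数值迅速抬升并被大量局部极小包围,正是其“needle-in-a-haystack”性质的来源,也是摘要中噪声影响显著的原因之一。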

[LG-109] Efficient FPGA-accelerated Convolutional Neural Networks for Cloud Detection on CubeSats

链接: https://arxiv.org/abs/2504.03891
作者: Angela Cratere,M. Salim Farissi,Andrea Carbone,Marcello Asciolla,Maria Rizzi,Francesco Dell’Olio,Augusto Nascetti,Dario Spiller
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 11 pages, Published in IEEE Journal on Miniaturization for Air and Space Systems

点击查看摘要

Abstract:We present the implementation of four FPGA-accelerated convolutional neural network (CNN) models for onboard cloud detection in resource-constrained CubeSat missions, leveraging Xilinx’s Vitis AI (VAI) framework and Deep Learning Processing Unit (DPU), a programmable engine with pre-implemented, parameterizable IP cores optimized for deep neural networks, on a Zynq UltraScale+ MPSoC. This study explores both pixel-wise (Pixel-Net and Patch-Net) and image-wise (U-Net and Scene-Net) models to benchmark trade-offs in accuracy, latency, and model complexity. Applying channel pruning, we achieved substantial reductions in model parameters (up to 98.6%) and floating-point operations (up to 90.7%) with minimal accuracy loss. Furthermore, the VAI tool was used to quantize the models to 8-bit precision, ensuring optimized hardware performance with negligible impact on accuracy. All models retained high accuracy post-FPGA integration, with a cumulative maximum accuracy drop of only 0.6% after quantization and pruning. The image-wise Scene-Net and U-Net models demonstrated strong real-time inference capabilities, achieving frame rates of 57.14 and 37.45 frames per second, respectively, with power consumption of around 2.5 W, surpassing state-of-the-art onboard cloud detection solutions. Our approach underscores the potential of DPU-based hardware accelerators to expand the processing capabilities of small satellites, enabling efficient and flexible onboard CNN-based applications.

[LG-110] CREASE-2D Analysis of Small Angle X-ray Scattering Data from Supramolecular Dipeptide Systems

链接: https://arxiv.org/abs/2504.03869
作者: Nitant Gupta,Sri V.V.R. Akepati,Simona Bianco,Jay Shah,Dave J. Adams,Arthi Jayaraman
类目: Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
*备注: 30 Pages, 9 figures

点击查看摘要

Abstract:In this paper, we extend a recently developed machine-learning (ML) based CREASE-2D method to analyze the entire two-dimensional (2D) scattering pattern obtained from small angle X-ray scattering measurements of supramolecular dipeptide micellar systems. Traditional analysis of such scattering data would involve use of approximate or incorrect analytical models to fit to azimuthally-averaged 1D scattering patterns that can miss the anisotropic arrangements. Analysis of the 2D scattering profiles of such micellar solutions using CREASE-2D allows us to understand both isotropic and anisotropic structural arrangements that are present in these systems of assembled dipeptides in water and in the presence of added solvents/salts. CREASE-2D outputs distributions of relevant structural features including ones that cannot be identified with existing analytical models (e.g., assembled tubes, cross-sectional eccentricity, tortuosity, orientational order). The representative three-dimensional (3D) real-space structures for the optimized values of these structural features further facilitate visualization of the structures. Through this detailed interpretation of these 2D SAXS profiles we are able to characterize the shapes of the assembled tube structures as a function of dipeptide chemistry, solution conditions with varying salts and solvents, and relative concentrations of all components. This paper demonstrates how CREASE-2D analysis of entire SAXS profiles can provide an unprecedented level of understanding of structural arrangements which has not been possible through traditional analytical model fits to the 1D SAXS data.

[LG-111] Interpretable Multimodal Learning for Tumor Protein-Metal Binding: Progress, Challenges, and Perspectives

链接: https://arxiv.org/abs/2504.03847
作者: Xiaokun Liu,Sayedmohammadreza Rastegari,Yijun Huang,Sxe Chang Cheong,Weikang Liu,Wenjie Zhao,Qihao Tian,Hongming Wang,Shuo Zhou,Yingjie Guo,Sina Tabakhi,Xianyuan Liu,Zheqing Zhu,Wei Sang,Haiping Lu
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:In cancer therapeutics, protein-metal binding mechanisms critically govern drug pharmacokinetics and targeting efficacy, thereby fundamentally shaping the rational design of anticancer metallodrugs. While conventional laboratory methods used to study such mechanisms are often costly, low throughput, and limited in capturing dynamic biological processes, machine learning (ML) has emerged as a promising alternative. Despite increasing efforts to develop protein-metal binding datasets and ML algorithms, the application of ML in tumor protein-metal binding remains limited. Key challenges include a shortage of high-quality, tumor-specific datasets, insufficient consideration of multiple data modalities, and the complexity of interpreting results due to the ‘‘black box’’ nature of complex ML models. This paper summarizes recent progress and ongoing challenges in using ML to predict tumor protein-metal binding, focusing on data, modeling, and interpretability. We present multimodal protein-metal binding datasets and outline strategies for acquiring, curating, and preprocessing them for training ML models. Moreover, we explore the complementary value provided by different data modalities and examine methods for their integration. We also review approaches for improving model interpretability to support more trustworthy decisions in cancer research. Finally, we offer our perspective on research opportunities and propose strategies to address the scarcity of tumor protein data and the limited number of predictive models for tumor protein-metal binding. We also highlight two promising directions for effective metal-based drug design: integrating protein-protein interaction data to provide structural insights into metal-binding events and predicting structural changes in tumor proteins after metal binding.

[LG-112] Detecting Plant VOC Traces Using Indoor Air Quality Sensors

链接: https://arxiv.org/abs/2504.03785
作者: Seyed Hamidreza Nabaei,Ryan Lenfant,Viswajith Govinda Rajan,Dong Chen,Michael P. Timko,Bradford Campbell,Arsalan Heydarian
类目: Signal Processing (eess.SP); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the era of growing interest in healthy buildings and smart homes, the importance of sustainable, health-conscious indoor environments is paramount. Smart tools, especially VOC sensors, are crucial for monitoring indoor air quality, yet interpreting signals from various VOC sources remains challenging. A promising approach involves understanding how indoor plants respond to environmental conditions. Plants produce terpenes, a type of VOC, when exposed to abiotic and biotic stressors - including pathogens, predators, light, and temperature - offering a novel pathway for monitoring indoor air quality. While prior work often relies on specialized laboratory sensors, our research leverages readily available commercial sensors to detect and classify plant-emitted VOCs that signify changes in indoor conditions. We quantified the sensitivity of these sensors by measuring 16 terpenes in controlled experiments, then identified and tested the most promising terpenes in realistic environments. We also examined physics-based models to map VOC responses but found them lacking for real-world complexity. Consequently, we trained machine learning models to classify terpenes using commercial sensors and identified optimal sensor placement. To validate this approach, we analyzed emissions from a living basil plant, successfully detecting terpene output. Our findings establish a foundation for overcoming challenges in plant VOC detection, paving the way for advanced plant-based sensors to enhance indoor environmental quality in future smart buildings.

[LG-113] Low-cost Embedded Breathing Rate Determination Using 802.15.4z IR-UWB Hardware for Remote Healthcare

链接: https://arxiv.org/abs/2504.03772
作者: Anton Lambrecht,Stijn Luchie,Jaron Fontaine,Ben Van Herbruggen,Adnan Shahid,Eli De Poorter
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This paper has been submitted to IEEE Sensors Journal and is currently undergoing review

点击查看摘要

Abstract:Respiratory diseases account for a significant portion of global mortality. Affordable and early detection is an effective way of addressing these ailments. To this end, a low-cost commercial off-the-shelf (COTS), IEEE 802.15.4z standard compliant impulse-radio ultra-wideband (IR-UWB) radar system is exploited to estimate human respiration rates. We propose a convolutional neural network (CNN) to predict breathing rates from ultra-wideband (UWB) channel impulse response (CIR) data, and compare its performance with other rule-based algorithms. The study uses a diverse dataset of 16 individuals, incorporating various real-life environments to evaluate system robustness. Results show that the CNN achieves a mean absolute error (MAE) of 1.73 breaths per minute (BPM) in unseen situations, significantly outperforming rule-based methods (3.40 BPM). By incorporating calibration data from other individuals in the unseen situations, the error is further reduced to 0.84 BPM. In addition, this work evaluates the feasibility of running the pipeline on a low-cost embedded device. Applying 8-bit quantization to both the weights and input/output tensors reduces memory requirements by 67% and inference time by 64% with only a 3% increase in MAE. As a result, we show it is feasible to deploy the algorithm on an nRF52840 system-on-chip (SoC) requiring only 46 KB of memory and operating with an inference time of only 192 ms. Once deployed, the system can last up to 268 days without recharging using a 20 000 mAh battery pack. For breathing monitoring in bed, the sampling rate can be lowered, extending battery life to 313 days, making the solution highly efficient for real-world, low-cost deployments.

[LG-114] Decoding Covert Speech from EEG Using a Functional Areas Spatio-Temporal Transformer

链接: https://arxiv.org/abs/2504.03762
作者: Muyun Jiang,Yi Ding,Wei Zhang,Kok Ann Colin Teo,LaiGuan Fong,Shuailei Zhang,Zhiwei Guo,Chenyu Liu,Raghavan Bhuvanakantham,Wei Khang Jeremy Sim,Chuan Huat Vince Foo,Rong Hui Jonathan Chua,Parasuraman Padmanabhan,Victoria Leong,Jia Lu,Balazs Gulyas,Cuntai Guan
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Covert speech involves imagining speaking without audible sound or any movements. Decoding covert speech from electroencephalogram (EEG) is challenging due to a limited understanding of neural pronunciation mapping and the low signal-to-noise ratio of the signal. In this study, we developed a large-scale multi-utterance speech EEG dataset from 57 right-handed native English-speaking subjects, each performing covert and overt speech tasks by repeating the same word in five utterances within a ten-second duration. Given the spatio-temporal nature of the neural activation process during speech pronunciation, we developed a Functional Areas Spatio-temporal Transformer (FAST), an effective framework for converting EEG signals into tokens and utilizing transformer architecture for sequence encoding. Our results reveal distinct and interpretable speech neural features by the visualization of FAST-generated activation maps across frontal and temporal brain regions with each word being covertly spoken, providing new insights into the discriminative features of the neural representation of covert speech. This is the first report of such a study, which provides interpretable evidence for speech decoding from EEG. The code for this work has been made public at this https URL

[LG-115] Augmentation of EEG and ECG Time Series for Deep Learning Applications: Integrating Changepoint Detection into the iAAFT Surrogates

链接: https://arxiv.org/abs/2504.03761
作者: Nina Moutonnet,Gregory Scott,Danilo P. Mandic
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The performance of deep learning methods critically depends on the quality and quantity of the available training data. This is especially the case for physiological time series, which are both noisy and scarce, which calls for data augmentation to artificially increase the size of datasets. Another issue is that the time-evolving statistical properties of nonstationary signals prevent the use of standard data augmentation techniques. To this end, we introduce a novel method for augmenting nonstationary time series. This is achieved by combining offline changepoint detection with the iterative amplitude-adjusted Fourier transform (iAAFT), which ensures that the time-frequency properties of the original signal are preserved during augmentation. The proposed method is validated through comparisons of the performance of i) a deep learning seizure detection algorithm on both the original and augmented versions of the CHB-MIT and Siena scalp electroencephalography (EEG) databases, and ii) a deep learning atrial fibrillation (AF) detection algorithm on the original and augmented versions of the Computing in Cardiology Challenge 2017 dataset. By virtue of the proposed method, for the CHB-MIT and Siena datasets respectively, accuracy rose by 4.4% and 1.9%, precision by 10% and 5.5%, recall by 3.6% and 0.9%, and F1 by 4.2% and 1.4%. For the AF classification task, accuracy rose by 0.3%, precision by 2.1%, recall by 0.8%, and F1 by 2.1%.
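
The iAAFT procedure at the heart of this augmentation method alternates between imposing the original amplitude spectrum and the original value distribution on a phase-randomized copy of the signal. A minimal numpy sketch follows; it omits the changepoint segmentation that the paper adds to handle nonstationarity.

```python
import numpy as np

def iaaft(x, n_iter=100, seed=0):
    """Iterative amplitude-adjusted Fourier transform (iAAFT) surrogate:
    alternately restore the original amplitude spectrum and the original
    value distribution on a shuffled copy of the signal."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    target_spectrum = np.abs(np.fft.rfft(x))
    sorted_vals = np.sort(x)
    s = rng.permutation(x)                          # random initial surrogate
    for _ in range(n_iter):
        # (a) keep current phases, restore the original amplitude spectrum
        phases = np.angle(np.fft.rfft(s))
        s = np.fft.irfft(target_spectrum * np.exp(1j * phases), n=len(x))
        # (b) restore the original value distribution by rank remapping
        s = sorted_vals[np.argsort(np.argsort(s))]
    return s

rng = np.random.default_rng(1)
x = np.sin(np.linspace(0, 20 * np.pi, 256)) + 0.1 * rng.normal(size=256)
surrogate = iaaft(x)
```

After the final rank remapping the surrogate has exactly the same value distribution as the input, while its amplitude spectrum matches the original approximately.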

[LG-116] EEG-EyeTrack: A Benchmark for Time Series and Functional Data Analysis with Open Challenges and Baselines

链接: https://arxiv.org/abs/2504.03760
作者: Tiago Vasconcelos Afonso,Florian Heinrichs
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Keywords: Functional data analysis, functional neural networks, EEG data, eye-tracking. 18 pages, 2 figures, 9 tables

点击查看摘要

Abstract:A new benchmark dataset for functional data analysis (FDA) is presented, focusing on the reconstruction of eye movements from EEG data. The contribution is twofold: first, open challenges and evaluation metrics tailored to FDA applications are proposed. Second, functional neural networks are used to establish baseline results for the primary regression task of reconstructing eye movements from EEG signals. Baseline results are reported for the new dataset, based on consumer-grade hardware, and the EEGEyeNet dataset, based on research-grade hardware.

[LG-117] EEG2GAIT: A Hierarchical Graph Convolutional Network for EEG-based Gait Decoding

链接: https://arxiv.org/abs/2504.03757
作者: Xi Fu,Rui Liu,Aung Aung Phyo Wai,Hannah Pulferer,Neethu Robinson,Gernot R Müller-Putz,Cuntai Guan
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decoding gait dynamics from EEG signals presents significant challenges due to the complex spatial dependencies of motor processes, the need for accurate temporal and spectral feature extraction, and the scarcity of high-quality gait EEG datasets. To address these issues, we propose EEG2GAIT, a novel hierarchical graph-based model that captures multi-level spatial embeddings of EEG channels using a Hierarchical Graph Convolutional Network (GCN) Pyramid. To further improve decoding accuracy, we introduce a Hybrid Temporal-Spectral Reward (HTSR) loss function, which combines time-domain, frequency-domain, and reward-based loss components. Moreover, we contribute a new Gait-EEG Dataset (GED), consisting of synchronized EEG and lower-limb joint angle data collected from 50 participants over two lab visits. Validation experiments on both the GED and the publicly available Mobile Brain-body imaging (MoBI) dataset demonstrate that EEG2GAIT outperforms state-of-the-art methods and achieves the best joint angle prediction. Ablation studies validate the contributions of the hierarchical GCN modules and HTSR Loss, while saliency maps reveal the significance of motor-related brain regions in decoding tasks. These findings underscore EEG2GAIT’s potential for advancing brain-computer interface applications, particularly in lower-limb rehabilitation and assistive technologies.
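
The building block of the hierarchical GCN pyramid is graph convolution over the EEG channel graph. The sketch below implements one textbook Kipf-Welling propagation step with symmetric normalization as a stand-in for a single level of the pyramid; the paper's actual channel graph, hierarchy, and HTSR loss are not reproduced here.

```python
import numpy as np

def gcn_layer(adj, x, w):
    """One graph-convolution step (Kipf-Welling style): add self-loops,
    symmetrically normalize the adjacency, transform features, apply ReLU."""
    a = adj + np.eye(len(adj))
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_norm = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ x @ w, 0.0)

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1],          # toy 3-channel EEG graph (illustrative)
                [1, 0, 0],
                [1, 0, 0]], dtype=float)
x = rng.normal(size=(3, 8))         # per-channel input features
w = rng.normal(size=(8, 4))         # learnable projection (random here)
h = gcn_layer(adj, x, w)
```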

[LG-118] Graph Transformer-Based Flood Susceptibility Mapping: Application to the French Riviera and Railway Infrastructure Under Climate Change

链接: https://arxiv.org/abs/2504.03727
作者: Sreenath Vemula,Filippo Gatti,Pierre Jehel
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Submitted to the Science of the Total Environment journal

点击查看摘要

Abstract:Increasing flood frequency and severity due to climate change threatens infrastructure and demands improved susceptibility mapping techniques. While traditional machine learning (ML) approaches are widely used, they struggle to capture spatial dependencies and suffer from poor boundary delineation between susceptibility classes. This study introduces the first application of a graph transformer (GT) architecture for flood susceptibility mapping to the flood-prone French Riviera (e.g., 2020 Storm Alex) using topography, hydrology, geography, and environmental data. GT incorporates watershed topology using Laplacian positional encoders (PEs) and attention mechanisms. The developed GT model has an AUC-ROC (0.9739), slightly lower than XGBoost (0.9853). However, the GT model demonstrated better clustering and delineation with a higher Moran’s I value (0.6119) compared to the random forest (0.5775) and XGBoost (0.5311), with a p-value lower than 0.0001. Feature importance revealed a striking consistency across models, with elevation, slope, distance to channel, and convergence index being the critical factors. Dimensionality reduction on Laplacian PEs revealed partial clusters, indicating they could capture spatial information; however, their importance was lower than flood factors. Since climate and land use changes aggravate flood risk, susceptibility maps are developed for the year 2050 under different Representative Concentration Pathways (RCPs) and railway track vulnerability is assessed. All RCP scenarios revealed increased area across susceptibility classes, except for the very low category. RCP 8.5 projections indicate that 17.46% of the watershed area and 54% of railway length fall within very-high susceptible zones, compared to 6.19% and 35.61%, respectively, under current conditions. The developed maps can be integrated into a multi-hazard framework.
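
The Laplacian positional encoders mentioned above are typically the eigenvectors of the normalized graph Laplacian associated with the smallest non-trivial eigenvalues. A minimal sketch under that standard construction follows; the paper's watershed graph and exact normalization choice are assumptions here.

```python
import numpy as np

def laplacian_pe(adj, k=2):
    """Positional encodings from the k eigenvectors of the symmetrically
    normalized graph Laplacian with the smallest non-zero eigenvalues
    (the trivial lowest eigenvector is skipped)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)          # assumes no isolated nodes
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)   # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]

# 4-node path graph 0-1-2-3 as a stand-in for a small watershed graph
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
pe = laplacian_pe(adj, k=2)
```

Each node receives a k-dimensional coordinate that varies smoothly along the graph, which is what lets the transformer's attention distinguish positions in the watershed.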

[LG-119] Towards Practical Emotion Recognition: An Unsupervised Source-Free Approach for EEG Domain Adaptation

链接: https://arxiv.org/abs/2504.03707
作者: Md Niaz Imtiaz,Naimul Khan
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Emotion recognition is crucial for advancing mental health, healthcare, and technologies like brain-computer interfaces (BCIs). However, EEG-based emotion recognition models face challenges in cross-domain applications due to the high cost of labeled data and variations in EEG signals from individual differences and recording conditions. Unsupervised domain adaptation methods typically require access to source domain data, which may not always be feasible in real-world scenarios due to privacy and computational constraints. Source-free unsupervised domain adaptation (SF-UDA) has recently emerged as a solution, enabling target domain adaptation without source data, but its application in emotion recognition remains unexplored. We propose a novel SF-UDA approach for EEG-based emotion classification across domains, introducing a multi-stage framework that enhances model adaptability without requiring source data. Our approach incorporates Dual-Loss Adaptive Regularization (DLAR) to minimize prediction discrepancies on confident samples and align predictions with expected pseudo-labels. Additionally, we introduce Localized Consistency Learning (LCL), which enforces local consistency by promoting similar predictions from reliable neighbors. These techniques together address domain shift and reduce the impact of noisy pseudo-labels, a key challenge in traditional SF-UDA models. Experiments on two widely used datasets, DEAP and SEED, demonstrate the effectiveness of our method. Our approach significantly outperforms state-of-the-art methods, achieving 65.84% accuracy when trained on DEAP and tested on SEED, and 58.99% accuracy in the reverse scenario. It excels at detecting both positive and negative emotions, making it well-suited for practical emotion recognition applications.
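
Localized Consistency Learning (LCL) promotes agreement between a sample's prediction and those of its reliable neighbours. The sketch below scores that agreement with a simple k-nearest-neighbour mean-squared penalty; the paper's exact loss and its neighbour-reliability weighting are assumptions not reproduced here.

```python
import numpy as np

def local_consistency_penalty(features, probs, k=1):
    """Penalize disagreement between each sample's predicted distribution and
    the mean distribution of its k nearest neighbours in feature space."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]
    neighbour_mean = probs[nn].mean(axis=1)
    return float(np.mean((probs - neighbour_mean) ** 2))

feats = np.array([[0.0], [0.1], [5.0], [5.1]])   # two tight clusters
consistent = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
inconsistent = np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]])
low = local_consistency_penalty(feats, consistent)
high = local_consistency_penalty(feats, inconsistent)
```

Predictions that agree within each feature-space cluster incur no penalty, while predictions that flip between close neighbours are penalized, which is the behaviour LCL is designed to enforce.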

[LG-120] A multi-scale lithium-ion battery capacity prediction using mixture of experts and patch-based MLP

链接: https://arxiv.org/abs/2504.03706
作者: Yuzhu Lei,Guanding Yu
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Lithium-ion battery health management has become increasingly important as the application of batteries expands. Precise forecasting of capacity degradation is critical for ensuring the healthy usage of batteries. In this paper, we innovatively propose MSPMLP, a multi-scale capacity prediction model utilizing the mixture of experts (MoE) architecture and patch-based multi-layer perceptron (MLP) blocks, to capture both the long-term degradation trend and local capacity regeneration phenomena. Specifically, we utilize patch-based MLP blocks with varying patch sizes to extract multi-scale features from the capacity sequence. Leveraging the MoE architecture, the model adaptively integrates the extracted features, thereby enhancing its capacity and expressiveness. Finally, the future battery capacity is predicted based on the integrated features, achieving high prediction accuracy and generalization. Experimental results on the public NASA dataset indicate that MSPMLP achieves a mean absolute error (MAE) of 0.0078, improving by 41.8% compared to existing methods. These findings highlight that MSPMLP, owing to its multi-scale modeling capability and generalizability, provides a promising solution to the battery capacity prediction challenges caused by capacity regeneration phenomena and complex usage conditions. The code of this work is provided at this https URL.
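
The multi-scale idea above can be illustrated by splitting a capacity sequence into patches of several sizes and fusing per-scale summaries with a softmax gate. This toy sketch only mirrors the data flow; the real MSPMLP uses learned MLP blocks and a learned MoE gate, both of which are replaced by hand-written stand-ins here.

```python
import numpy as np

def make_patches(seq, patch_size):
    """Split a 1-D sequence into non-overlapping patches (trailing remainder
    dropped), the input layout of a patch-based MLP block."""
    n = len(seq) // patch_size
    return np.asarray(seq[:n * patch_size]).reshape(n, patch_size)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

capacity = np.linspace(2.0, 1.6, 100)    # toy fading capacity curve (Ah)
scales = [4, 10, 25]                     # illustrative patch sizes
features = [make_patches(capacity, p).mean(axis=1) for p in scales]
# Toy gate over per-scale summaries; the real model learns this with an MoE
gate = softmax(np.array([f.std() for f in features]))
fused = sum(g * f.mean() for g, f in zip(gate, features))
```

Small patches preserve local capacity-regeneration wiggles while large patches smooth them out, which is the trade-off the mixture of experts is meant to arbitrate.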

[LG-121] Chemistry-aware battery degradation prediction under simulated real-world cyclic protocols

链接: https://arxiv.org/abs/2504.03701
作者: Yuqi Li,Han Zhang,Xiaofan Gui,Zhao Chen,Yu Li,Xiwen Chi,Quan Zhou,Shun Zheng,Ziheng Lu,Wei Xu,Jiang Bian,Liquan Chen,Hong Li
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Battery degradation is governed by complex and randomized cyclic conditions, yet existing modeling and prediction frameworks usually rely on rigid, unchanging protocols that fail to capture real-world dynamics. The stochastic electrical signals make such prediction extremely challenging, while, on the other hand, they provide abundant additional information, such as voltage fluctuations, which may probe the degradation mechanisms. Here, we present chemistry-aware battery degradation prediction under dynamic conditions with machine learning, which integrates hidden Markov processes for realistic power simulations, an automated batch-testing system that generates a large electrochemical dataset under randomized conditions, an interfacial chemistry database derived from high-throughput X-ray photoelectron spectroscopy for mechanistic probing, and a machine learning model for prediction. By automatically constructing a polynomial-scale feature space from irregular electrochemical curves, our model accurately predicts both battery life and critical knee points. This feature space also predicts the composition of the solid electrolyte interphase, revealing six distinct failure mechanisms-demonstrating a viable approach to use electrical signals to infer interfacial chemistry. This work establishes a scalable and adaptive framework for integrating chemical engineering and data science to advance noninvasive diagnostics and optimize processes for more durable and sustainable energy storage technologies.
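
The randomized cyclic conditions are generated from a hidden Markov process. A toy first-order Markov sampler for a discharge-power profile is sketched below; the power levels and transition matrix are illustrative, not taken from the paper.

```python
import numpy as np

def simulate_markov_power(P, levels, n_steps, seed=0):
    """Sample a discharge-power profile from a first-order Markov chain;
    P[i, j] is the probability of moving from power level i to level j."""
    rng = np.random.default_rng(seed)
    states = np.empty(n_steps, dtype=int)
    states[0] = 0
    for t in range(1, n_steps):
        states[t] = rng.choice(len(levels), p=P[states[t - 1]])
    return np.asarray(levels)[states]

levels = [0.5, 1.0, 2.0]                    # watts, illustrative only
P = np.array([[0.80, 0.15, 0.05],
              [0.20, 0.60, 0.20],
              [0.10, 0.30, 0.60]])
profile = simulate_markov_power(P, levels, n_steps=1000)
```

The diagonal-heavy transition matrix produces bursty, realistic load patterns rather than independent random draws, which is what distinguishes this protocol from rigid fixed-rate cycling.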

[LG-122] Robust Blind Channel Estimation for Bursty Impulsive Noise with a Constrained EM Approach

链接: https://arxiv.org/abs/2504.03685
作者: Chin-Hung Chen,Ivana Nikoloska,Wim van Houtum,Yan Wu,Boris Karanov,Alex Alvarado
类目: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Accepted paper of the 2025 IEEE 101st Vehicular Technology Conference

点击查看摘要

Abstract:Impulsive noise (IN) commonly generated by power devices can severely degrade the performance of high sensitivity wireless receivers. Accurate channel state information (CSI) knowledge is essential for designing optimal maximum a posteriori detectors. This paper examines blind channel estimation methods based on the expectation-maximization (EM) algorithm tailored for scenarios impacted by bursty IN, which can be described by the Markov-Middleton model. We propose a constrained EM algorithm that exploits the trellis structure of the IN model and the transmitted binary phase shift keying (BPSK) symbols. By enforcing shared variance among specific trellis states and symmetry in the transition matrix, the proposed constrained EM algorithm adapted for the bursty IN channel converges almost twice as fast and achieves better estimation performance than the standard EM approach. We comprehensively evaluate the robustness of both standard and constrained EM estimators under different types of CSI uncertainties. The results indicate that the final estimates of both EM estimators are robust to mismatched Markov-Middleton model parameters. However, as the level of CSI uncertainty increases, the convergence rate decreases.
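
The shared-variance constraint can be seen in a simplified M-step: squared deviations are pooled across all states before normalizing, instead of estimating one variance per state. A minimal numpy sketch with hard responsibilities and plain Gaussian states (not the full Markov-Middleton trellis):

```python
import numpy as np

def m_step_tied_variance(x, resp):
    """M-step with one variance shared across all states: per-state means,
    but squared deviations pooled over every state before normalizing.
    resp[t, k] is the E-step responsibility of state k for sample t."""
    nk = resp.sum(axis=0)
    means = (resp.T @ x) / nk
    var = (resp * (x[:, None] - means[None, :]) ** 2).sum() / resp.sum()
    return means, var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 0.5, 200), rng.normal(1.0, 0.5, 200)])
resp = np.zeros((400, 2))                 # hard responsibilities for brevity
resp[:200, 0] = 1.0
resp[200:, 1] = 1.0
means, shared_var = m_step_tied_variance(x, resp)
```

Tying the variance reduces the number of free parameters, which is one reason the constrained EM in the abstract converges faster than the unconstrained version.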

[LG-123] End-to-End Deep Learning for Real-Time Neuroimaging-Based Assessment of Bimanual Motor Skills

链接: https://arxiv.org/abs/2504.03681
作者: Aseem Subedi,Rahul,Lora Cavuoto,Steven Schwaitzberg,Matthew Hackett,Jack Norfleet,Suvranu De
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:The real-time assessment of complex motor skills presents a challenge in fields such as surgical training and rehabilitation. Recent advancements in neuroimaging, particularly functional near-infrared spectroscopy (fNIRS), have enabled objective assessment of such skills with high accuracy. However, these techniques are hindered by extensive preprocessing requirements to extract neural biomarkers. This study presents a novel end-to-end deep learning framework that processes raw fNIRS signals directly, eliminating the need for intermediate preprocessing steps. The model was evaluated on datasets from three distinct bimanual motor tasks–suturing, pattern cutting, and endotracheal intubation (ETI)–using performance metrics derived from both training and retention datasets. It achieved a mean classification accuracy of 93.9% (SD 4.4) and a generalization accuracy of 92.6% (SD 1.9) on unseen skill retention datasets, with a leave-one-subject-out cross-validation yielding an accuracy of 94.1% (SD 3.6). Contralateral prefrontal cortex activations exhibited task-specific discriminative power, while motor cortex activations consistently contributed to accurate classification. The model also demonstrated resilience to neurovascular coupling saturation caused by extended task sessions, maintaining robust performance across trials. Comparative analysis confirms that the end-to-end model performs on par with or surpasses baseline models optimized for fully processed fNIRS data, with statistically similar (p > 0.05) or improved prediction accuracies. By eliminating the need for extensive signal preprocessing, this work provides a foundation for real-time, non-invasive assessment of bimanual motor skills in medical training environments, with potential applications in robotics, rehabilitation, and sports.
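
The leave-one-subject-out protocol used for validation is easy to state precisely: all trials of one subject form the held-out fold, and the model trains on everyone else. A plain-numpy splitter:

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (train_idx, test_idx) pairs, holding out all trials of one
    subject at a time."""
    subject_ids = np.asarray(subject_ids)
    for s in np.unique(subject_ids):
        yield np.flatnonzero(subject_ids != s), np.flatnonzero(subject_ids == s)

# 6 trials recorded from 3 subjects (illustrative)
sids = [0, 0, 1, 1, 2, 2]
folds = list(leave_one_subject_out(sids))
```

Because no trial from the test subject ever appears in training, the reported accuracy measures generalization to genuinely unseen people rather than unseen trials.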

信息检索

[IR-0] LLM-Alignment Live-Streaming Recommendation

链接: https://arxiv.org/abs/2504.05217
作者: Yueyang Liu,Jiangxia Cao,Shen Wang,Shuang Wen,Xiang Chen,Xiangyu Wu,Shuang Yang,Zhaojie Liu,Kun Gai,Guorui Zhou
类目: Information Retrieval (cs.IR)
*备注: Work in progress

点击查看摘要

Abstract:In recent years, integrated short-video and live-streaming platforms have gained massive global adoption, offering dynamic content creation and consumption. Unlike pre-recorded short videos, live-streaming enables real-time interaction between authors and users, fostering deeper engagement. However, this dynamic nature introduces a critical challenge for recommendation systems (RecSys): the same live-stream can offer vastly different experiences depending on when a user starts watching. To optimize recommendations, a RecSys must accurately interpret the real-time semantics of live content and align them with user preferences.

[IR-1] Blending Queries and Conversations: Understanding Tactics Trust Verification and System Choice in Web Search and Chat Interactions

链接: https://arxiv.org/abs/2504.05156
作者: Kerstin Mayerhofer,Rob Capra,David Elsweiler
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: 11 pages, ACM CHIIR pre-print

点击查看摘要

Abstract:This paper presents a user study (N=22) where participants used an interface combining Web Search and a Generative AI-Chat feature to solve health-related information tasks. We study how people behaved with the interface, why they behaved in certain ways, and what the outcomes of these behaviours were. A think-aloud protocol captured their thought processes during searches. Our findings suggest that GenAI is neither a search panacea nor a major regression compared to standard Web Search interfaces. Qualitative and quantitative analyses identified 78 tactics across five categories and provided insight into how and why different interface features were used. We find evidence that pre-task confidence and trust both influenced which interface feature was used. In both systems, but particularly when using the chat feature, trust was often misplaced in favour of ease-of-use and seemingly perfect answers, leading to increased confidence post-search despite having incorrect results. We discuss what our findings mean in the context of our defined research questions and outline several open questions for future research.

[IR-2] Query Smarter, Trust Better? Exploring Search Behaviours for Verifying News Accuracy SIGIR2025

链接: https://arxiv.org/abs/2504.05146
作者: David Elsweiler,Samy Ateia,Markus Bink,Gregor Donabauer,Marcos Fernández Pichel,Alexander Frummet,Udo Kruschwitz,David Losada,Bernd Ludwig,Selina Meyer,Noel Pascual Presa
类目: Information Retrieval (cs.IR)
*备注: 12 pages, Pre-Print SIGIR 2025

点击查看摘要

Abstract:While it is often assumed that searching for information to evaluate misinformation will help identify false claims, recent work suggests that search behaviours can instead reinforce belief in misleading news, particularly when users generate queries using vocabulary from the source articles. Our research explores how different query generation strategies affect news verification and whether the way people search influences the accuracy of their information evaluation. A mixed-methods approach was used, consisting of three parts: (1) an analysis of existing data to understand how search behaviour influences trust in fake news, (2) a simulation of query generation strategies using a Large Language Model (LLM) to assess the impact of different query formulations on search result quality, and (3) a user study to examine how ‘Boost’ interventions in interface design can guide users to adopt more effective query strategies. The results show that search behaviour significantly affects trust in news, with successful searches involving multiple queries and yielding higher-quality results. Queries inspired by different parts of a news article produced search results of varying quality, and weak initial queries improved when reformulated using full SERP information. Although ‘Boost’ interventions had limited impact, the study suggests that interface design encouraging users to thoroughly review search results can enhance query formulation. This study highlights the importance of query strategies in evaluating news and proposes that interface design can play a key role in promoting more effective search practices, serving as one component of a broader set of interventions to combat misinformation.

[IR-3] Data Augmentation as Free Lunch: Exploring the Test-Time Augmentation for Sequential Recommendation SIGIR2025

链接: https://arxiv.org/abs/2504.04843
作者: Yizhou Dang,Yuting Liu,Enneng Yang,Minhan Huang,Guibing Guo,Jianzhe Zhao,Xingwei Wang
类目: Information Retrieval (cs.IR)
*备注: Accepted by SIGIR 2025

点击查看摘要

Abstract:Data augmentation has become a promising method of mitigating data sparsity in sequential recommendation. Existing methods generate new yet effective data during model training to improve performance. However, deploying them requires retraining, architecture modification, or introducing additional learnable parameters. The above steps are time-consuming and costly for well-trained models, especially when the model scale becomes large. In this work, we explore the test-time augmentation (TTA) for sequential recommendation, which augments the inputs during the model inference and then aggregates the model’s predictions for augmented data to improve final accuracy. It avoids significant time and cost overhead from loss calculation and backward propagation. We first experimentally disclose the potential of existing augmentation operators for TTA and find that the Mask and Substitute consistently achieve better performance. Further analysis reveals that these two operators are effective because they retain the original sequential pattern while adding appropriate perturbations. Meanwhile, we argue that these two operators still face time-consuming item selection or interference information from mask tokens. Based on the analysis and limitations, we present TNoise and TMask. The former injects uniform noise into the original representation, avoiding the computational overhead of item selection. The latter blocks mask token from participating in model calculations or directly removes interactions that should have been replaced with mask tokens. Comprehensive experiments demonstrate the effectiveness, efficiency, and generalizability of our method. We provide an anonymous implementation at this https URL.
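
Test-time augmentation as described above needs no retraining: perturb the input several times, score each variant, and average. The sketch below follows the TNoise idea of injecting uniform noise into the sequence representation; the linear scorer and embedding table are illustrative stand-ins, not the paper's recommender.

```python
import numpy as np

def tta_predict(score_fn, seq_repr, n_aug=8, noise_scale=0.05, seed=0):
    """Average a model's item scores over several uniformly-noised copies of
    the input sequence representation (the TNoise idea: no item selection,
    no backward pass)."""
    rng = np.random.default_rng(seed)
    preds = [score_fn(seq_repr +
                      rng.uniform(-noise_scale, noise_scale, seq_repr.shape))
             for _ in range(n_aug)]
    return np.mean(preds, axis=0)

rng = np.random.default_rng(42)
item_emb = rng.normal(size=(16, 100))   # 100 items, embedding dim 16
user_repr = rng.normal(size=16)         # one user's sequence representation
scores = tta_predict(lambda r: r @ item_emb, user_repr)
```

The averaged scores stay close to the clean prediction while smoothing out sensitivity to small representation perturbations, which is the mechanism behind TTA's "free lunch" accuracy gain.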

[IR-4] Investigating Popularity Bias Amplification in Recommender Systems Employed in the Entertainment Domain

链接: https://arxiv.org/abs/2504.04752
作者: Dominik Kowald
类目: Information Retrieval (cs.IR)
*备注: Under review at EWAF’25, summarizes fairness and popularity bias research presented in Dr. Kowald’s habilitation

点击查看摘要

Abstract:Recommender systems have become an integral part of our daily online experience by analyzing past user behavior to suggest relevant content in entertainment domains such as music, movies, and books. Today, they are among the most widely used applications of AI and machine learning. Consequently, regulations and guidelines for trustworthy AI, such as the European AI Act, which addresses issues like bias and fairness, are highly relevant to the design, development, and evaluation of recommender systems. One particularly important type of bias in this context is popularity bias, which results in the unfair underrepresentation of less popular content in recommendation lists. This work summarizes our research on investigating the amplification of popularity bias in recommender systems within the entertainment sector. Analyzing datasets from three entertainment domains, music, movies, and anime, we demonstrate that an item’s recommendation frequency is positively correlated with its popularity. As a result, user groups with little interest in popular content receive less accurate recommendations compared to those who prefer widely popular items. Furthermore, we aim to better understand the connection between recommendation accuracy, calibration quality of algorithms, and popularity bias amplification.

[IR-5] Can LLM-Driven Hard Negative Sampling Empower Collaborative Filtering? Findings and Potentials

链接: https://arxiv.org/abs/2504.04726
作者: Chu Zhao,Enneng Yang,Yuting Liu,Jianzhe Zhao,Guibing Guo,Xingwei Wang
类目: Information Retrieval (cs.IR)
*备注: 11 pages

点击查看摘要

Abstract:Hard negative samples can accelerate model convergence and optimize decision boundaries, which is key to improving the performance of recommender systems. Although large language models (LLMs) possess strong semantic understanding and generation capabilities, systematic research has not yet been conducted on how to generate hard negative samples effectively. To fill this gap, this paper introduces the concept of Semantic Negative Sampling and explores how to optimize LLMs for high-quality, hard negative sampling. Specifically, we design an experimental pipeline that includes three main modules, profile generation, semantic negative sampling, and semantic alignment, to verify the potential of LLM-driven hard negative sampling in enhancing the accuracy of collaborative filtering (CF). Experimental results indicate that hard negative samples generated based on LLMs, when semantically aligned and integrated into CF, can significantly improve CF performance, although there is still a certain gap compared to traditional negative sampling methods. Further analysis reveals that this gap primarily arises from two major challenges: noisy samples and lack of behavioral constraints. To address these challenges, we propose a framework called HNLMRec, based on fine-tuning LLMs supervised by collaborative signals. Experimental results show that this framework outperforms traditional negative sampling and other LLM-driven recommendation methods across multiple datasets, providing new solutions for empowering traditional RS with LLMs. Additionally, we validate the excellent generalization ability of the LLM-based semantic negative sampling method on new datasets, demonstrating its potential in alleviating issues such as data sparsity, popularity bias, and the problem of false hard negative samples. Our implementation code is available at this https URL.

[IR-6] TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval

链接: https://arxiv.org/abs/2504.04707
作者: Xiaolun Jing,Genke Yang,Jian Chu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Motivated by the success of coarse-grained or fine-grained contrast in text-video retrieval, there emerge multi-grained contrastive learning methods which focus on the integration of contrasts with different granularity. However, due to the wider semantic range of videos, the text-agnostic video representations might encode misleading information not described in texts, thus impeding the model from capturing precise cross-modal semantic correspondence. To this end, we propose a Text-Conditioned Multi-Grained Contrast framework, dubbed TC-MGC. Specifically, our model employs a language-video attention block to generate aggregated frame and video representations conditioned on the word’s and text’s attention weights over frames. To filter unnecessary similarity interactions and decrease trainable parameters in the Interactive Similarity Aggregation (ISA) module, we design a Similarity Reorganization (SR) module to identify attentive similarities and reorganize cross-modal similarity vectors and matrices. Next, we argue that the imbalance problem among multigrained similarities may result in over- and under-representation issues. We thereby introduce an auxiliary Similarity Decorrelation Regularization (SDR) loss to facilitate cooperative relationship utilization by similarity variance minimization on matching text-video pairs. Finally, we present a Linear Softmax Aggregation (LSA) module to explicitly encourage the interactions between multiple similarities and promote the usage of multi-grained information. Empirically, TC-MGC achieves competitive results on multiple text-video retrieval benchmarks, outperforming X-CLIP model by +2.8% (+1.3%), +2.2% (+1.0%), +1.5% (+0.9%) relative (absolute) improvements in text-to-video retrieval R@1 on MSR-VTT, DiDeMo and VATEX, respectively. Our code is publicly available at this https URL.

[IR-7] COHESION: Composite Graph Convolutional Network with Dual-Stage Fusion for Multimodal Recommendation CIKM2024

链接: https://arxiv.org/abs/2504.04452
作者: Jinfeng Xu,Zheyu Chen,Wei Wang,Xiping Hu,Sang-Wook Kim,Edith C. H. Ngai
类目: Information Retrieval (cs.IR)
*备注: Accepted by CIKM 2024

点击查看摘要

Abstract:Recent works in multimodal recommendations, which leverage diverse modal information to address data sparsity and enhance recommendation accuracy, have garnered considerable interest. Two key processes in multimodal recommendations are modality fusion and representation learning. Previous approaches in modality fusion often employ simplistic attentive or pre-defined strategies at early or late stages, failing to effectively handle irrelevant information among modalities. In representation learning, prior research has constructed heterogeneous and homogeneous graph structures encapsulating user-item, user-user, and item-item relationships to better capture user interests and item profiles. Modality fusion and representation learning were considered as two independent processes in previous work. In this paper, we reveal that these two processes are complementary and can support each other. Specifically, powerful representation learning enhances modality fusion, while effective fusion improves representation quality. Stemming from these two processes, we introduce a COmposite grapH convolutional nEtwork with dual-stage fuSION for the multimodal recommendation, named COHESION. Specifically, it introduces a dual-stage fusion strategy to reduce the impact of irrelevant information, refining all modalities using ID embedding in the early stage and fusing their representations at the late stage. It also proposes a composite graph convolutional network that utilizes user-item, user-user, and item-item graphs to extract heterogeneous and homogeneous latent relationships within users and items. Besides, it introduces a novel adaptive optimization to ensure balanced and reasonable representations across modalities. Extensive experiments on three widely used datasets demonstrate the significant superiority of COHESION over various competitive baselines.

[IR-8] Squeeze and Excitation: A Weighted Graph Contrastive Learning for Collaborative Filtering SIGIR2025

链接: https://arxiv.org/abs/2504.04443
作者: Zheyu Chen,Jinfeng Xu,Yutong Wei,Ziyue Peng
类目: Information Retrieval (cs.IR)
*备注: Accepted by SIGIR 2025

点击查看摘要

Abstract:Contrastive Learning (CL) has recently emerged as a powerful technique in recommendation systems, particularly for its capability to harness self-supervised signals from perturbed views to mitigate the persistent challenge of data sparsity. The process of constructing perturbed views of the user-item bipartite graph and performing contrastive learning between perturbed views in a graph convolutional network (GCN) is called graph contrastive learning (GCL), which aims to enhance the robustness of representation learning. Although existing GCL-based models are effective, the weight assignment method for perturbed views has not been fully explored. A critical problem in existing GCL-based models is the irrational allocation of feature attention. This problem limits the model’s ability to effectively leverage crucial features, resulting in suboptimal performance. To address this, we propose a Weighted Graph Contrastive Learning framework (WeightedGCL). Specifically, WeightedGCL applies a robust perturbation strategy, which perturbs only the view of the final GCN layer. In addition, WeightedGCL incorporates a squeeze and excitation network (SENet) to dynamically weight the features of the perturbed views. Our WeightedGCL strengthens the model’s focus on crucial features and reduces the impact of less relevant information. Extensive experiments on widely used datasets demonstrate that our WeightedGCL achieves significant accuracy improvements compared to competitive baselines.
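The squeeze-and-excitation weighting described above can be sketched with plain NumPy. This is an illustrative gate over feature channels under assumed shapes and random weights, not WeightedGCL's actual architecture or where it places the gate in the GCN.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 32, 16, 4                       # nodes, embedding dim, reduction ratio
emb = rng.normal(size=(n, d))             # perturbed-view node features
w1 = rng.normal(size=(d // r, d)) * 0.1   # squeeze -> bottleneck
w2 = rng.normal(size=(d, d // r)) * 0.1   # bottleneck -> per-channel gate

def senet_reweight(emb, w1, w2):
    # Squeeze: global average over nodes gives one summary per channel.
    squeeze = emb.mean(axis=0)                   # (d,)
    # Excitation: two-layer MLP with ReLU bottleneck and sigmoid output.
    hidden = np.maximum(0.0, w1 @ squeeze)       # (d // r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # values in (0, 1), (d,)
    # Scale: reweight every feature channel, broadcast over nodes.
    return emb * gate

weighted = senet_reweight(emb, w1, w2)
```

Because the gate lies in (0, 1), the network can only attenuate channels, which matches the stated goal of focusing on crucial features and suppressing less relevant ones.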

[IR-9] Decoding Recommendation Behaviors of In-Context Learning LLMs Through Gradient Descent

链接: https://arxiv.org/abs/2504.04386
作者: Yi Xu,Weicong Qin,Weijie Yu,Ming He,Jianping Fan,Jun Xu
类目: Information Retrieval (cs.IR)
*备注: 12 pages, 9 figures

点击查看摘要

Abstract:Recently, there has been a growing trend in utilizing large language models (LLMs) for recommender systems, referred to as LLMRec. A notable approach within this trend is not to fine-tune these models directly but instead to leverage In-Context Learning (ICL) methods tailored for LLMRec, denoted as LLM-ICL Rec. Many contemporary techniques focus on harnessing ICL content to enhance LLMRec performance. However, optimizing LLMRec with ICL content presents unresolved challenges. Specifically, two key issues stand out: (1) the limited understanding of why using a few demonstrations without model fine-tuning can lead to better performance compared to zero-shot recommendations. (2) the lack of evaluation metrics for demonstrations in LLM-ICL Rec and the absence of the theoretical analysis and practical design for optimizing the generation of ICL content for recommendation contexts. To address these two main issues, we propose a theoretical model, the LLM-ICL Recommendation Equivalent Gradient Descent model (LRGD) in this paper, which connects recommendation generation with gradient descent dynamics. We demonstrate that the ICL inference process in LLM aligns with the training procedure of its dual model, producing token predictions equivalent to the dual model’s testing outputs. Building on these theoretical insights, we propose an evaluation metric for assessing demonstration quality. We integrate perturbations and regularizations in LRGD to enhance the robustness of the recommender system. To further improve demonstration effectiveness, prevent performance collapse, and ensure long-term adaptability, we also propose a two-stage optimization process in practice. Extensive experiments and detailed analysis on three Amazon datasets validate the theoretical equivalence and support the effectiveness of our theoretical analysis and practical module design. 

[IR-10] Short Video Segment-level User Dynamic Interests Modeling in Personalized Recommendation SIGIR2025

链接: https://arxiv.org/abs/2504.04237
作者: Zhiyu He,Zhixin Ling,Jiayu Li,Zhiqiang Guo,Weizhi Ma,Xinchen Luo,Min Zhang,Guorui Zhou
类目: Information Retrieval (cs.IR)
*备注: This paper has been accepted by SIGIR 2025

点击查看摘要

Abstract:The rapid growth of short videos has necessitated effective recommender systems to match users with content tailored to their evolving preferences. Current video recommendation models primarily treat each video as a whole, overlooking the dynamic nature of user preferences with specific video segments. In contrast, our research focuses on segment-level user interest modeling, which is crucial for understanding how users’ preferences evolve during video browsing. To capture users’ dynamic segment interests, we propose an innovative model that integrates a hybrid representation module, a multi-modal user-video encoder, and a segment interest decoder. Our model addresses the challenges of capturing dynamic interest patterns, missing segment-level labels, and fusing different modalities, achieving precise segment-level interest prediction. We present two downstream tasks to evaluate the effectiveness of our segment interest modeling approach: video-skip prediction and short video recommendation. Our experiments on real-world short video datasets with diverse modalities show promising results on both tasks. It demonstrates that segment-level interest modeling brings a deep understanding of user engagement and enhances video recommendations. We also release a unique dataset that includes segment-level video data and diverse user behaviors, enabling further research in segment-level interest modeling. This work pioneers a novel perspective on understanding user segment-level preference, offering the potential for more personalized and engaging short video experiences.

[IR-11] Investigating and Mitigating Stereotype-aware Unfairness in LLM-based Recommendations

链接: https://arxiv.org/abs/2504.04199
作者: Zihuai Zhao,Wenqi Fan,Yao Wu,Qing Li
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated unprecedented language understanding and reasoning capabilities to capture diverse user preferences and advance personalized recommendations. Despite the growing interest in LLM-based personalized recommendations, unique challenges are brought to the trustworthiness of LLM-based recommender systems (LLM-RS), since LLMs are likely to inherit stereotypes that are embedded ubiquitously in word embeddings due to their training on large-scale uncurated datasets. This leads to LLM-RS exhibiting stereotypical linguistic associations between users and items. However, there remains a lack of studies investigating the simultaneous existence of stereotypes between users and items in LLM-RS. To bridge this gap, this study reveals a new variant of fairness between stereotype groups containing both users and items, to quantify discrimination against stereotypes in LLM-RS. Moreover, in this paper, to mitigate stereotype-aware unfairness in textual user and item information, we propose a novel framework (MoS), in which an insightful stereotype-wise routing strategy over multiple stereotype-relevant experts is designed to learn unbiased representations against different stereotypes in LLM-RS. Extensive experiments are conducted to analyze the influence of stereotype-aware fairness in LLM-RS and the effectiveness of our proposed methods, which consistently outperform competitive benchmarks under various fairness settings.
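Routing over multiple stereotype-relevant experts can be pictured as a soft mixture-of-experts gate. The sketch below is a generic mixture under assumed shapes and random weights, not MoS's actual stereotype-wise routing strategy.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, n_experts = 12, 6, 4
x = rng.normal(size=d_in)  # a textual user/item representation (assumed)
experts = [rng.normal(size=(d_out, d_in)) for _ in range(n_experts)]
gate_w = rng.normal(size=(n_experts, d_in))

def route(x, experts, gate_w):
    # The gate maps the input to a probability distribution over experts,
    # and the output is the convex mixture of the per-expert projections.
    logits = gate_w @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    outs = np.stack([w @ x for w in experts])  # (n_experts, d_out)
    return probs @ outs, probs

mixed, probs = route(x, experts, gate_w)
```

Each output dimension of the mixture lies between the minimum and maximum of the corresponding expert outputs, so routing smoothly interpolates between experts rather than committing to one.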

[IR-12] AiReview: An Open Platform for Accelerating Systematic Reviews with LLMs SIGIR2025

链接: https://arxiv.org/abs/2504.04193
作者: Xinyu Mao,Teerapong Leelanupab,Martin Potthast,Harrisen Scells,Guido Zuccon
类目: Information Retrieval (cs.IR)
*备注: Accepted at SIGIR 2025

点击查看摘要

Abstract:Systematic reviews are fundamental to evidence-based medicine. Creating one is time-consuming and labour-intensive, mainly due to the need to screen, or assess, many studies for inclusion in the review. Several tools have been developed to streamline this process, mostly relying on traditional machine learning methods. Large language models (LLMs) have shown potential in further accelerating the screening process. However, no tool currently allows end users to directly leverage LLMs for screening or facilitates systematic and transparent usage of LLM-assisted screening methods. This paper introduces (i) an extensible framework for applying LLMs to systematic review tasks, particularly title and abstract screening, and (ii) a web-based interface for LLM-assisted screening. Together, these elements form AiReview-a novel platform for LLM-assisted systematic review creation. AiReview is the first of its kind to bridge the gap between cutting-edge LLM-assisted screening methods and those that create medical systematic reviews. The tool is available at this https URL. The source code is also open sourced at this https URL.

[IR-13] MSL: Not All Tokens Are What You Need for Tuning LLM as a Recommender

链接: https://arxiv.org/abs/2504.04178
作者: Bohao Wang,Feng Liu,Jiawei Chen,Xingyu Lou,Changwang Zhang,Jun Wang,Yuegang Sun,Yan Feng,Chun Chen,Can Wang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Large language models (LLMs), known for their comprehension capabilities and extensive knowledge, have been increasingly applied to recommendation systems (RS). Given the fundamental gap between the mechanism of LLMs and the requirement of RS, researchers have focused on fine-tuning LLMs with recommendation-specific data to enhance their performance. Language Modeling Loss (LML), originally designed for language generation tasks, is commonly adopted. However, we identify two critical limitations of LML: 1) it exhibits significant divergence from the recommendation objective; 2) it erroneously treats all fictitious item descriptions as negative samples, introducing misleading training signals. To address these limitations, we propose a novel Masked Softmax Loss (MSL) tailored for fine-tuning LLMs on recommendation. MSL improves LML by identifying and masking invalid tokens that could lead to fictitious item descriptions during loss computation. This strategy can effectively avoid the interference from erroneous negative signals and ensure well alignment with the recommendation objective supported by theoretical guarantees. During implementation, we identify a potential challenge related to gradient vanishing of MSL. To overcome this, we further introduce the temperature coefficient and propose an Adaptive Temperature Strategy (ATS) that adaptively adjusts the temperature without requiring extensive hyperparameter tuning. Extensive experiments conducted on four public datasets further validate the effectiveness of MSL, achieving an average improvement of 42.24% in NDCG@10. The code is available at this https URL.
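The masking idea can be sketched in a few lines: invalid next tokens are dropped from the softmax denominator so they never act as negative samples, and a temperature coefficient `tau` is exposed as in the ATS discussion above. The toy logits, vocabulary, and mask are assumptions for illustration, not the paper's tokenizer-level construction.

```python
import numpy as np

def masked_softmax_loss(logits, target, valid_mask, tau=1.0):
    # Cross-entropy where invalid tokens (those that would extend the prefix
    # into a fictitious item description) are excluded from the softmax
    # denominator; tau is the temperature coefficient.
    z = np.where(valid_mask, logits / tau, -np.inf)
    z = z - z[valid_mask].max()  # numerical stability over valid entries
    log_prob = z[target] - np.log(np.exp(z[valid_mask]).sum())
    return -log_prob

logits = np.array([2.0, 1.0, 3.5, 0.5])  # toy vocabulary of 4 tokens
target = 0                                # the ground-truth next token
# Plain language-modeling loss: every other token is a negative.
full = masked_softmax_loss(logits, target, np.array([True] * 4))
# Token 2 has a high logit but is invalid (leads to a fictitious title),
# so MSL removes it from the denominator instead of penalizing it.
masked = masked_softmax_loss(logits, target, np.array([True, True, False, True]))
```

Masking a high-logit invalid token shrinks the denominator, so the loss stops pushing probability mass away from tokens that were never true negatives in the first place.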

[IR-14] RIS-Empowered Integrated Location Sensing and Communication with Superimposed Pilots

链接: https://arxiv.org/abs/2504.04098
作者: Wenchao Xia,Ben Zhao,Wankai Tang,Yongxu Zhu,Kai-Kit Wong,Sangarapillai Lambotharan,Hyundong Shin
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In addition to enhancing wireless communication coverage quality, the reconfigurable intelligent surface (RIS) technique can also assist in positioning. In this work, we consider RIS-assisted superimposed pilot and data transmission without assuming the availability of prior channel state information or position information of mobile user equipments (UEs). To tackle this challenge, we design a frame structure of transmission protocol composed of several location coherence intervals, each with pure-pilot and data-pilot transmission durations. The former is used to estimate UE locations, while the latter is time-slotted, duration of which does not exceed the channel coherence time, where the data and pilot signals are transmitted simultaneously. We conduct a Fisher Information Matrix (FIM) analysis and derive the Cramér-Rao bound (CRB) for the position estimation error. The inverse fast Fourier transform (IFFT) is adopted to obtain the estimation results of UE positions, which are then exploited for channel estimation. Furthermore, we derive the closed-form lower bound of the ergodic achievable rate of superimposed pilot (SP) transmission, which is used to optimize the phase profile of the RIS to maximize the achievable sum rate using the genetic algorithm. Finally, numerical results validate the accuracy of the UE position estimation using the IFFT algorithm and the superiority of the proposed SP scheme by comparison with the regular pilot scheme.
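For readers unfamiliar with the bound referenced above, the standard estimation-theory form (textbook material, not the paper's specific derivation) is:

```latex
% For an unbiased estimator \hat{\mathbf{p}} of the UE position \mathbf{p},
% with Fisher Information Matrix \mathbf{J}(\mathbf{p}):
\operatorname{cov}(\hat{\mathbf{p}}) \succeq \mathbf{J}^{-1}(\mathbf{p}),
\qquad
\mathbb{E}\!\left[\|\hat{\mathbf{p}} - \mathbf{p}\|^{2}\right]
\;\ge\; \operatorname{tr}\!\left(\mathbf{J}^{-1}(\mathbf{p})\right).
```

The paper's CRB specializes this by computing the FIM for the RIS-assisted superimposed-pilot signal model.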

[IR-15] QE-RAG : A Robust Retrieval-Augmented Generation Benchmark for Query Entry Errors

链接: https://arxiv.org/abs/2504.04062
作者: Kepu Zhang,Zhongxiang Sun,Weijie Yu,Xiaoxue Zang,Kai Zheng,Yang Song,Han Li,Jun Xu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) has become a widely adopted approach for enhancing the factual accuracy of large language models (LLMs). While current benchmarks evaluate the performance of RAG methods from various perspectives, they share a common assumption that user queries used for retrieval are error-free. However, in real-world interactions between users and LLMs, query entry errors such as keyboard proximity errors, visual similarity errors, and spelling errors are frequent. The robustness of current RAG methods against such errors remains largely unexplored. To bridge this gap, we propose QE-RAG, the first robust RAG benchmark designed specifically to evaluate performance against query entry errors. We augment six widely used datasets by injecting three common types of query entry errors into randomly selected user queries at rates of 20% and 40%, simulating typical user behavior in real-world scenarios. We analyze the impact of these errors on LLM outputs and find that corrupted queries degrade model performance, which can be mitigated through query correction and training a robust retriever for retrieving relevant documents. Based on these insights, we propose a contrastive learning-based robust retriever training method and a retrieval-augmented query correction method. Extensive in-domain and cross-domain experiments reveal that: (1) state-of-the-art RAG methods including sequential, branching, and iterative methods, exhibit poor robustness to query entry errors; (2) our method significantly enhances the robustness of RAG when handling query entry errors and it’s compatible with existing RAG methods, further improving their robustness.
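One of the three error types above, keyboard-proximity errors, can be simulated cheaply. The adjacency map and sampling policy below are illustrative assumptions, not QE-RAG's exact corruption procedure.

```python
import random

# Partial QWERTY adjacency map (illustrative; a full map would cover all keys).
KEY_NEIGHBORS = {
    "q": "wa", "w": "qes", "e": "wrd", "r": "etf", "t": "ryg",
    "a": "qsz", "s": "adwx", "d": "sfce", "f": "dgvr", "g": "fhbt",
    "o": "ipl", "n": "bmhj", "i": "uok", "h": "gjnb", "l": "kop",
}

def inject_keyboard_errors(query: str, rate: float = 0.2, seed: int = 0) -> str:
    # Replace roughly `rate` of the corruptible characters with an adjacent
    # key, simulating keyboard-proximity query entry errors.
    rng = random.Random(seed)
    chars = list(query)
    idxs = [i for i, c in enumerate(chars) if c in KEY_NEIGHBORS]
    k = round(rate * len(idxs))
    for i in rng.sample(idxs, k):
        chars[i] = rng.choice(KEY_NEIGHBORS[chars[i]])
    return "".join(chars)

clean = "what does the transformer attend to"
noisy = inject_keyboard_errors(clean, rate=0.2)
```

Injecting such noise at 20% and 40% rates, as the benchmark does, lets one measure how quickly a retriever's recall degrades as queries drift away from the indexed vocabulary.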

[IR-16] Towards Robust Offline Evaluation: A Causal and Information Theoretic Framework for Debiasing Ranking Systems

链接: https://arxiv.org/abs/2504.03997
作者: Seyedeh Baharan Khatami,Sayan Chakraborty,Ruomeng Xu,Babak Salimi
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Evaluating retrieval-ranking systems is crucial for developing high-performing models. While online A/B testing is the gold standard, its high cost and risks to user experience require effective offline methods. However, relying on historical interaction data introduces biases-such as selection, exposure, conformity, and position biases-that distort evaluation metrics, driven by the Missing-Not-At-Random (MNAR) nature of user interactions and favoring popular or frequently exposed items over true user preferences. We propose a novel framework for robust offline evaluation of retrieval-ranking systems, transforming MNAR data into Missing-At-Random (MAR) through reweighting combined with black-box optimization, guided by neural estimation of information-theoretic metrics. Our contributions include (1) a causal formulation for addressing offline evaluation biases, (2) a system-agnostic debiasing framework, and (3) empirical validation of its effectiveness. This framework enables more accurate, fair, and generalizable evaluations, enhancing model assessment before deployment.
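The reweighting step (MNAR to MAR) is closely related to inverse propensity scoring. Below is a minimal sketch assuming the exposure propensities are known and using a toy precision@k, unlike the paper's framework, which learns the weights via black-box optimization guided by information-theoretic estimates.

```python
import numpy as np

def ipw_precision_at_k(rel, propensity, ranking, k=5):
    # Inverse-propensity-weighted precision@k: each observed relevant item
    # is weighted by 1/propensity, so over-exposed popular items stop
    # dominating the offline metric.
    top = np.asarray(ranking[:k])
    w = rel[top] / np.clip(propensity[top], 1e-6, None)
    return w.sum() / k

rel = np.array([1, 0, 1, 1, 0, 1])               # observed relevance labels
prop = np.array([0.9, 0.5, 0.1, 0.8, 0.3, 0.2])  # assumed exposure propensities
ranking = [2, 0, 5, 1, 3, 4]                     # a system's ranked item ids

naive = rel[np.asarray(ranking[:3])].sum() / 3   # plain precision@3
debiased = ipw_precision_at_k(rel, prop, ranking, k=3)
```

Rarely exposed items (like item 2 with propensity 0.1) are up-weighted, so two systems that look identical under the naive metric can be separated once exposure bias is corrected.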

[IR-17] Automating Personalization: Prompt Optimization for Recommendation Reranking

链接: https://arxiv.org/abs/2504.03965
作者: Chen Wang,Mingdai Yang,Zhiwei Liu,Pan Li,Linsey Pang,Qingsong Wen,Philip Yu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Modern recommender systems increasingly leverage large language models (LLMs) for reranking to improve personalization. However, existing approaches face two key limitations: (1) heavy reliance on manually crafted prompts that are difficult to scale, and (2) inadequate handling of unstructured item metadata that complicates preference inference. We present AGP (Auto-Guided Prompt Refinement), a novel framework that automatically optimizes user profile generation prompts for personalized reranking. AGP introduces two key innovations: (1) position-aware feedback mechanisms for precise ranking correction, and (2) batched training with aggregated feedback to enhance generalization.

附件下载

点击下载今日全部论文列表