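arXiv Daily Papers | 2025-02-26

This post lists the latest papers retrieved from arXiv.org on 2025-02-26. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and updated automatically every day at around 12:00.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.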

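[Quick Read]: This paper addresses the significant inference-time computational challenges that arise when large language models (LLMs) are fine-tuned as dense retrievers, including high encoding costs for large-scale corpora and increased query latency, which limit practical deployment. Smaller retrievers are more efficient but often fail to generalize effectively with limited supervised fine-tuning data. The key solution is DRAMA, a training framework that leverages LLMs to train smaller, generalizable dense retrievers. Specifically, it uses pruned LLMs as the backbone and trains on diverse LLM-augmented data in a single-stage contrastive learning setup.

Link: https://arxiv.org/abs/2502.18460
Authors: Xueguang Ma, Xi Victoria Lin, Barlas Oguz, Jimmy Lin, Wen-tau Yih, Xilun Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: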

Abstract:Large language models (LLMs) have demonstrated strong effectiveness and robustness while fine-tuned as dense retrievers. However, their large parameter size brings significant inference time computational challenges, including high encoding costs for large-scale corpora and increased query latency, limiting their practical deployment. While smaller retrievers offer better efficiency, they often fail to generalize effectively with limited supervised fine-tuning data. In this work, we introduce DRAMA, a training framework that leverages LLMs to train smaller generalizable dense retrievers. In particular, we adopt pruned LLMs as the backbone and train on diverse LLM-augmented data in a single-stage contrastive learning setup. Experiments show that DRAMA offers better multilingual and long-context capabilities than traditional encoder-based retrievers, and achieves strong performance across multiple tasks and languages. These highlight the potential of connecting the training of smaller retrievers with the growing advancements in LLMs, bridging the gap between efficiency and generalization.
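
As a rough illustration of what a contrastive learning objective for a dense retriever can look like, here is a minimal sketch. The abstract only says "single-stage contrastive learning", so the InfoNCE-style loss, the dot-product scoring, the `temperature` value, and the names `dot` and `info_nce_loss` are all assumptions made for illustration, not the authors' implementation.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Negative log-softmax score of the positive passage among all passages.

    Assumed form: dot-product similarity scaled by a temperature.
    """
    scores = [dot(query, positive)] + [dot(query, n) for n in negatives]
    scaled = [s / temperature for s in scores]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(scaled)
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_denominator - scaled[0]

query = [1.0, 0.0]
# The positive passage points the same way as the query: low loss.
aligned = info_nce_loss(query, positive=[1.0, 0.0], negatives=[[0.0, 1.0], [-1.0, 0.0]])
# A negative passage matches the query better than the positive: high loss.
misaligned = info_nce_loss(query, positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
print(aligned, misaligned)
```

The loss shrinks as the query embedding moves closer to its positive passage than to the negatives, which is the behavior a contrastive retriever objective is meant to encourage.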

[NLP-1] FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response

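[Quick Read]: This paper addresses the weak physical common sense reasoning of smaller language models in complex disaster scenarios. The key solution is a pipeline for building FRIDA (Field Ready Instruction Decoding Agent) models: domain experts and linguists jointly create high-quality seed data, which is then used to generate synthetic data for fine-tuning, improving the model's reasoning in specific disaster domains. The study finds that training only on physical state and object function common sense knowledge is enough to significantly improve model performance.

Link: https://arxiv.org/abs/2502.18452
Authors: Mollie Shichman, Claire Bonial, Austin Blodgett, Taylor Hudson, Francis Ferraro, Rachel Rudinger
Affiliations: University of Maryland, College Park; Army Research Lab; Oak Ridge Associated Universities; University of Maryland, Baltimore County
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures, 5 tables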

Abstract:Large Language Models (LLMs) have the potential for substantial common sense reasoning. However, these capabilities are often emergent in larger models. This means smaller models that can be run locally are less helpful and capable with respect to certain reasoning tasks. To meet our problem space requirements, we fine-tune smaller LLMs to disaster domains, as these domains involve complex and low-frequency physical common sense knowledge. We introduce a pipeline to create Field Ready Instruction Decoding Agent (FRIDA) models, where domain experts and linguists combine their knowledge to make high-quality seed data that is used to generate synthetic data for fine-tuning. We create a set of 130 seed instructions for synthetic generation, a synthetic dataset of 25000 instructions, and 119 evaluation instructions relating to both general and earthquake-specific object affordances. We fine-tune several LLaMa and Mistral instruction-tuned models and find that FRIDA models outperform their base models at a variety of sizes. We then run an ablation study to understand which kinds of synthetic data most affect performance and find that training physical state and object function common sense knowledge alone improves over FRIDA models trained on all data. We conclude that the FRIDA pipeline is capable of instilling general common sense, but needs to be augmented with information retrieval for specific domain knowledge.

[NLP-2] SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

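[Quick Read]: This paper addresses the limited reasoning ability of large language models (LLMs) on real-world software engineering tasks. The key solution is SWE-RL, which uses a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions) so that LLMs can autonomously recover a developer's reasoning processes and solutions by learning from large-scale open-source software evolution data. The resulting model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified, demonstrating strong performance in software engineering, and also shows generalized reasoning ability on unseen tasks.

Link: https://arxiv.org/abs/2502.18449
Authors: Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, Sida I. Wang
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: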

Abstract:The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer’s reasoning processes and solutions by learning from extensive open-source software evolution data – the record of a software’s entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified – a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
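
To make the idea of a rule-based similarity reward concrete, here is a minimal sketch. The abstract does not say how the similarity score is computed, so the use of Python's `difflib.SequenceMatcher`, the function name `similarity_reward`, and the example patches are assumptions made for illustration only.

```python
from difflib import SequenceMatcher

def similarity_reward(predicted_patch: str, oracle_patch: str) -> float:
    """Return a reward in [0, 1]; 1.0 means the two patches are identical.

    Assumption: character-level sequence similarity stands in for the
    paper's similarity score, which the abstract does not define.
    """
    return SequenceMatcher(None, predicted_patch, oracle_patch).ratio()

# Hypothetical ground-truth patch and two model outputs.
oracle = "-    return a - b\n+    return a + b\n"
exact = "-    return a - b\n+    return a + b\n"
close = "-    return a - b\n+    return a * b\n"

print(similarity_reward(exact, oracle))
print(similarity_reward(close, oracle))
```

A reward of this kind needs no test execution or learned judge, which is what makes it lightweight: it only compares text against a known solution.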

[NLP-3] Disambiguate First, Parse Later: Generating Interpretations for Ambiguity Resolution in Semantic Parsing

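[Quick Read]: This paper addresses ambiguity and underspecification in natural language interfaces, particularly in text-to-SQL semantic parsing. The key solution is a modular approach that first uses large language models (LLMs) to generate an initial set of preferred disambiguations for an ambiguous utterance, then applies a specialized infilling model to identify and generate the missing interpretations. The infilling model is trained with an annotation method that uses SQL execution to validate different meanings. The approach improves interpretation coverage and generalizes across datasets with different annotation styles, database structures, and ambiguity types.

Link: https://arxiv.org/abs/2502.18448
Authors: Irina Saparina, Mirella Lapata
Affiliations: Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: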

Abstract:Handling ambiguity and underspecification is an important challenge in natural language interfaces, particularly for tasks like text-to-SQL semantic parsing. We propose a modular approach that resolves ambiguity using natural language interpretations before mapping these to logical forms (e.g., SQL queries). Although LLMs excel at parsing unambiguous utterances, they show strong biases for ambiguous ones, typically predicting only preferred interpretations. We constructively exploit this bias to generate an initial set of preferred disambiguations and then apply a specialized infilling model to identify and generate missing interpretations. To train the infilling model, we introduce an annotation method that uses SQL execution to validate different meanings. Our approach improves interpretation coverage and generalizes across datasets with different annotation styles, database structures, and ambiguity types.

[NLP-4] olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

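[Quick Read]: This paper addresses the problem of extracting and faithfully representing structured content from diverse PDF documents for language model training. The key solution is olmOCR, an open-source Python toolkit that processes PDFs into clean, linearized plain text in natural reading order while preserving structured content such as sections, tables, lists, and equations. olmOCR runs a fine-tuned 7B vision language model (VLM) trained on a sample of diverse PDF pages, including graphics, handwritten text, and poor-quality scans.

Link: https://arxiv.org/abs/2502.18443
Authors: Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, Luca Soldaini
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: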

Abstract:PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when attempting to extract and faithfully represent the underlying content for language model use. We present olmOCR, an open-source Python toolkit for processing PDFs into clean, linearized plain text in natural reading order while preserving structured content like sections, tables, lists, equations, and more. Our toolkit runs a fine-tuned 7B vision language model (VLM) trained on a sample of 260,000 pages from over 100,000 crawled PDFs with diverse properties, including graphics, handwritten text and poor quality scans. olmOCR is optimized for large-scale batch processing, able to scale flexibly to different hardware setups and convert a million PDF pages for only 190 USD. We release all components of olmOCR including VLM weights, data and training code, as well as inference code built on serving frameworks including vLLM and SGLang.

[NLP-5] Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions

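[Quick Read]: This paper investigates whether alternative factorizations of the text distribution can outperform the conventional left-to-right (L2R) autoregressive factorization on some tasks. It focuses on right-to-left (R2L) training as a compelling alternative, using multiple-choice questions (MCQs) as a test bed for knowledge extraction and reasoning. Across various model sizes (2B-8B parameters) and training datasets, R2L models significantly outperform L2R models on several MCQ benchmarks, including logical reasoning, commonsense understanding, and truthfulness assessment. The analysis suggests that this performance difference may be linked to multiple factors, including calibration, computability, and directional conditional entropy, whose impact is examined further through controlled simulation studies using arithmetic tasks. The key takeaway is that exploring alternative factorizations of the text distribution can improve LLM capabilities; the paper also offers theoretical insights into the optimal factorization for approximating the human language distribution and into when each reasoning order might be more advantageous.

Link: https://arxiv.org/abs/2502.18435
Authors: Yizhe Zhang, Richard Bai, Zijin Gu, Ruixiang Zhang, Jiatao Gu, Emmanuel Abbe, Samy Bengio, Navdeep Jaitly
Affiliations: Apple
Subjects: Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: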

Abstract:Language models usually use left-to-right (L2R) autoregressive factorization. However, L2R factorization may not always be the best inductive bias. Therefore, we investigate whether alternative factorizations of the text distribution could be beneficial in some tasks. We investigate right-to-left (R2L) training as a compelling alternative, focusing on multiple-choice questions (MCQs) as a test bed for knowledge extraction and reasoning. Through extensive experiments across various model sizes (2B-8B parameters) and training datasets, we find that R2L models can significantly outperform L2R models on several MCQ benchmarks, including logical reasoning, commonsense understanding, and truthfulness assessment tasks. Our analysis reveals that this performance difference may be fundamentally linked to multiple factors including calibration, computability and directional conditional entropy. We ablate the impact of these factors through controlled simulation studies using arithmetic tasks, where the impacting factors can be better disentangled. Our work demonstrates that exploring alternative factorizations of the text distribution can lead to improvements in LLM capabilities and provides theoretical insights into optimal factorization towards approximating human language distribution, and when each reasoning order might be more advantageous.

[NLP-6] Exploring Gender Disparities in Automatic Speech Recognition Technology ISCA

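[Quick Read]: This paper investigates factors that influence the fairness and performance of Automatic Speech Recognition (ASR) systems across genders, beyond the conventional examination of demographics. Using the LibriSpeech dataset and the Whisper small model, it analyzes how different gender ratios in the training data affect ASR performance. The key finding is that optimal fairness occurs at specific gender distributions rather than at a simple 50-50 split. The study also finds that factors such as pitch variability can significantly affect ASR accuracy. The key to the solution is therefore carefully curating training data to mitigate gender bias.

Link: https://arxiv.org/abs/2502.18434
Authors: Hend ElGhazaly, Bahman Mirheidari, Nafise Sadat Moosavi, Heidi Christensen
Affiliations: Computer Science, University of Sheffield, Sheffield, United Kingdom
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: ISCA/ITG Workshop on Diversity in Large Speech and Language Models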

Abstract:This study investigates factors influencing Automatic Speech Recognition (ASR) systems’ fairness and performance across genders, beyond the conventional examination of demographics. Using the LibriSpeech dataset and the Whisper small model, we analyze how performance varies across different gender representations in training data. Our findings suggest a complex interplay between the gender ratio in training data and ASR performance. Optimal fairness occurs at specific gender distributions rather than a simple 50-50 split. Furthermore, our findings suggest that factors like pitch variability can significantly affect ASR accuracy. This research contributes to a deeper understanding of biases in ASR systems, highlighting the importance of carefully curated training data in mitigating gender bias.

[NLP-7] TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning

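[Quick Read]: This paper evaluates the complex problem-solving abilities of large language models (LLMs) by introducing TextGames, a benchmark of text-based games that require advanced skills in pattern recognition, spatial awareness, arithmetic, and logical reasoning. The key to the approach is designing these game tasks to probe LLM performance in both single-turn and multi-turn reasoning and to examine how well models use feedback to correct themselves. The results show that although LLMs handle most easy and medium-level problems well, they face significant challenges on harder tasks. In addition, models optimized for reasoning outperform pre-trained LLMs that prioritize instruction following, which underscores the importance of reasoning skills for highly complex problems.

Link: https://arxiv.org/abs/2502.18431
Authors: Frederikus Hudi, Genta Indra Winata, Ruochen Zhang, Alham Fikri Aji
Affiliations: NAIST; Capital One; Brown University; MBZUAI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: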

Abstract:Reasoning is a fundamental capability of large language models (LLMs), enabling them to comprehend, analyze, and solve complex problems. In this paper, we introduce TextGames, an innovative benchmark specifically crafted to assess LLMs through demanding text-based games that require advanced skills in pattern recognition, spatial awareness, arithmetic, and logical reasoning. Our analysis probes LLMs’ performance in both single-turn and multi-turn reasoning, and their abilities in leveraging feedback to correct subsequent answers through self-reflection. Our findings reveal that, although LLMs exhibit proficiency in addressing most easy and medium-level problems, they face significant challenges with more difficult tasks. In contrast, humans are capable of solving all tasks when given sufficient time. Moreover, we observe that LLMs show improved performance in multi-turn predictions through self-reflection, yet they still struggle with sequencing, counting, and following complex rules consistently. Additionally, models optimized for reasoning outperform pre-trained LLMs that prioritize instruction following, highlighting the crucial role of reasoning skills in addressing highly complex problems.

[NLP-8] Compressing Language Models for Specialized Domains

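[Quick Read]: This paper addresses the performance drop that general-purpose language model compression methods suffer in specialized domains (e.g., biomedical or legal). Existing methods improve on this but require expensive full-parameter fine-tuning. The key solution is cross-calibration, a training-free approach with no additional computational overhead that uses Hessian-based sensitivity to identify weights that are influential for both in-domain and general performance, substantially improving domain-specific task performance without compromising general performance.

Link: https://arxiv.org/abs/2502.18424
Authors: Miles Williams, George Chrysostomou, Vitor Jeronymo, Nikolaos Aletras
Affiliations: University of Sheffield; Enterprise AI Services, AstraZeneca
Subjects: Computation and Language (cs.CL)
Comments: Work in progress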

Abstract:Compression techniques such as pruning and quantization offer a solution for more efficient deployment of language models (LMs), albeit with small performance drops in benchmark performance. However, general-purpose LM compression methods can negatively affect performance in specialized domains (e.g. biomedical or legal). Recent work has sought to address this, yet requires computationally expensive full-parameter fine-tuning. To this end, we propose cross-calibration, a novel training-free approach for improving the domain performance of compressed LMs. Our approach effectively leverages Hessian-based sensitivity to identify weights that are influential for both in-domain and general performance. Through extensive experimentation, we demonstrate that cross-calibration substantially outperforms existing approaches on domain-specific tasks, without compromising general performance. Notably, these gains come without additional computational overhead, displaying remarkable potential towards extracting domain-specialized compressed models from general-purpose LMs.

[NLP-9] Rank1: Test-Time Compute for Reranking in Information Retrieval

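[Quick Read]: This paper studies how a reranker in a retrieval system can take advantage of test-time compute. The key is Rank1, a reranking model trained by distilling from a reasoning language model (e.g., OpenAI's o1, Deepseek's R1) in order to rapidly improve the performance of a smaller model. Rank1 models achieve state-of-the-art performance on advanced reasoning and instruction-following datasets and work well out of distribution because they can respond to user-input prompts. Quantized versions of the models retain strong performance while using less compute and memory.

Link: https://arxiv.org/abs/2502.18418
Authors: Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates, Dawn Lawrie, Benjamin Van Durme
Affiliations: Johns Hopkins University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: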

Abstract:We introduce Rank1, the first reranking model trained to take advantage of test-time compute. Rank1 demonstrates the applicability within retrieval of using a reasoning language model (i.e. OpenAI’s o1, Deepseek’s R1, etc.) for distillation in order to rapidly improve the performance of a smaller model. We gather and open-source a dataset of more than 600,000 examples of R1 reasoning traces from queries and passages in MS MARCO. Models trained on this dataset show: (1) state-of-the-art performance on advanced reasoning and instruction following datasets; (2) work remarkably well out of distribution due to the ability to respond to user-input prompts; and (3) have explainable reasoning chains that can be given to users or RAG-based systems. Further, we demonstrate that quantized versions of these models retain strong performance while using less compute/memory. Overall, Rank1 shows that test-time compute allows for a fundamentally new type of explainable and performant reranker model for search.

[NLP-10] GLEAN: Generalized Category Discovery with Diverse and Quality-Enhanced LLM Feedback

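[Quick Read]: This paper addresses two problems that the lack of supervision causes in Generalized Category Discovery (GCD): difficulty in rectifying errors, and an inability to effectively uncover and leverage the semantic meanings of discovered clusters. To address them, the paper proposes GLEAN, a unified framework that actively learns from diverse and quality-enhanced large language model (LLM) feedback to improve instance-level contrastive features, generate category descriptions, and align uncertain instances with LLM-selected category descriptions. The key is using different types of LLM feedback to improve GCD performance.

Link: https://arxiv.org/abs/2502.18414
Authors: Henry Peng Zou, Siffi Singh, Yi Nian, Jianfeng He, Jason Cai, Saab Mansour, Hang Su
Affiliations: University of Illinois Chicago; AWS AI Labs
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: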

Abstract:Generalized Category Discovery (GCD) is a practical and challenging open-world task that aims to recognize both known and novel categories in unlabeled data using limited labeled data from known categories. Due to the lack of supervision, previous GCD methods face significant challenges, such as difficulty in rectifying errors for confusing instances, and inability to effectively uncover and leverage the semantic meanings of discovered clusters. Therefore, additional annotations are usually required for real-world applicability. However, human annotation is extremely costly and inefficient. To address these issues, we propose GLEAN, a unified framework for generalized category discovery that actively learns from diverse and quality-enhanced LLM feedback. Our approach leverages three different types of LLM feedback to: (1) improve instance-level contrastive features, (2) generate category descriptions, and (3) align uncertain instances with LLM-selected category descriptions. Extensive experiments demonstrate the superior performance of GLEAN over state-of-the-art models across diverse datasets, metrics, and supervision settings. Our code is available at this https URL.

[NLP-11] AgentRM: Enhancing Agent Generalization with Reward Modeling

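[Quick Read]: This paper addresses the poor generalization of existing large language model (LLM)-based agents to unseen tasks. The key to the solution is fine-tuning a reward model to guide the policy model rather than fine-tuning the policy model directly. The paper proposes AgentRM, a generalizable reward model that guides the policy model for effective test-time search. Experiments show that AgentRM substantially improves the base policy model across a variety of tasks and demonstrates weak-to-strong generalization.

Link: https://arxiv.org/abs/2502.18407
Authors: Yu Xia, Jingru Fan, Weize Chen, Siyu Yan, Xin Cong, Zhong Zhang, Yaxi Lu, Yankai Lin, Zhiyuan Liu, Maosong Sun
Affiliations: Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: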

Abstract:Existing LLM-based agents have achieved strong performance on held-in tasks, but their generalizability to unseen tasks remains poor. Hence, some recent work focuses on fine-tuning the policy model with more diverse tasks to improve the generalizability. In this work, we find that finetuning a reward model to guide the policy model is more robust than directly finetuning the policy model. Based on this finding, we propose AgentRM, a generalizable reward model, to guide the policy model for effective test-time search. We comprehensively investigate three approaches to construct the reward model, including explicit reward modeling, implicit reward modeling and LLM-as-a-judge. We then use AgentRM to guide the answer generation with Best-of-N sampling and step-level beam search. On four types of nine agent tasks, AgentRM enhances the base policy model by 8.8 points on average, surpassing the top general agent by 4.0. Moreover, it demonstrates weak-to-strong generalization, yielding greater improvement of 12.6 on LLaMA-3-70B policy model. As for the specializability, AgentRM can also boost a finetuned policy model and outperform the top specialized agent by 11.4 on three held-in tasks. Further analysis verifies its effectiveness in test-time scaling. Code will be released to facilitate the research in this area.
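
Best-of-N sampling guided by a reward model can be sketched in a few lines. This is only an illustration of the general technique: `best_of_n` and `toy_reward` are hypothetical names, and the toy scoring rule stands in for a learned reward model such as AgentRM, whose actual interface is not described in the abstract.

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate that the reward function scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)

def toy_reward(answer: str) -> float:
    """Hypothetical stand-in for a learned reward model.

    Rewards answers that reach "done" and lightly penalizes length.
    """
    return (1.0 if "done" in answer else 0.0) - 0.01 * len(answer)

# In practice these would be N samples drawn from the policy model.
samples = ["open drawer", "open drawer, take key, done", "done"]
print(best_of_n(samples, toy_reward))
```

The policy model stays frozen here; only the selection step uses the reward model, which is why this kind of search is done at test time rather than during training.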

[NLP-12] KiRAG: Knowledge-Driven Iterative Retriever for Enhancing Retrieval-Augmented Generation

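[Quick Read]: This paper addresses two key challenges that iterative retrieval-augmented generation (iRAG) models face in multi-hop question answering (QA): (1) the retrieval process can be disrupted by irrelevant documents or factually inaccurate chains of thought; (2) existing retrievers cannot dynamically adapt to the evolving information needs of multi-step reasoning, which makes it difficult to identify and retrieve the missing information required at each iteration. To address these, the paper proposes KiRAG, whose key idea is a knowledge-driven iterative retriever model that enhances the retrieval process of iRAG. Specifically, KiRAG decomposes documents into knowledge triples and performs iterative retrieval over these triples, enabling a factually reliable retrieval process. KiRAG also integrates reasoning into retrieval to dynamically identify and retrieve knowledge that bridges information gaps, effectively adapting to the evolving information needs.

Link: https://arxiv.org/abs/2502.18397
Authors: Jinyuan Fang, Zaiqiao Meng, Craig Macdonald
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: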

Abstract:Iterative retrieval-augmented generation (iRAG) models offer an effective approach for multi-hop question answering (QA). However, their retrieval process faces two key challenges: (1) it can be disrupted by irrelevant documents or factually inaccurate chain-of-thoughts; (2) their retrievers are not designed to dynamically adapt to the evolving information needs in multi-step reasoning, making it difficult to identify and retrieve the missing information required at each iterative step. Therefore, we propose KiRAG, which uses a knowledge-driven iterative retriever model to enhance the retrieval process of iRAG. Specifically, KiRAG decomposes documents into knowledge triples and performs iterative retrieval with these triples to enable a factually reliable retrieval process. Moreover, KiRAG integrates reasoning into the retrieval process to dynamically identify and retrieve knowledge that bridges information gaps, effectively adapting to the evolving information needs. Empirical results show that KiRAG significantly outperforms existing iRAG models, with an average improvement of 9.40% in R@3 and 5.14% in F1 on multi-hop QA.

[NLP-13] Monte Carlo Temperature: a robust sampling strategy for LLM's uncertainty quantification methods

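[Quick Read]: This paper addresses how the choice of temperature parameter affects the quality of uncertainty estimates in Uncertainty Quantification (UQ) for Large Language Models (LLMs). Current methods typically query the model multiple times with non-zero temperature sampling to generate diverse outputs for uncertainty estimation, but identifying a good temperature requires expensive Hyperparameter Optimization (HPO). The key solution is Monte Carlo Temperature (MCT), a robust sampling strategy that provides more robust uncertainty estimates without temperature calibration and performs on par with oracle temperatures, the ideal but computationally expensive outcome of well-tuned HPO.

Link: https://arxiv.org/abs/2502.18389
Authors: Nicola Cecere, Andrea Bacciu, Ignacio Fernández Tobías, Amin Mantrach
Affiliations: Amazon
Subjects: Computation and Language (cs.CL)
Comments: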

Abstract:Uncertainty quantification (UQ) in Large Language Models (LLMs) is essential for their safe and reliable deployment, particularly in critical applications where incorrect outputs can have serious consequences. Current UQ methods typically rely on querying the model multiple times using non-zero temperature sampling to generate diverse outputs for uncertainty estimation. However, the impact of selecting a given temperature parameter is understudied, and our analysis reveals that temperature plays a fundamental role in the quality of uncertainty estimates. The conventional approach of identifying optimal temperature values requires expensive hyperparameter optimization (HPO) that must be repeated for each new model-dataset combination. We propose Monte Carlo Temperature (MCT), a robust sampling strategy that eliminates the need for temperature calibration. Our analysis reveals that: 1) MCT provides more robust uncertainty estimates across a wide range of temperatures, 2) MCT improves the performance of UQ methods by replacing fixed-temperature strategies that do not rely on HPO, and 3) MCT achieves statistical parity with oracle temperatures, which represent the ideal outcome of a well-tuned but computationally expensive HPO process. These findings demonstrate that effective UQ can be achieved without the computational burden of temperature parameter calibration.
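
One way to read "eliminating temperature calibration" is to draw a fresh temperature for every sample instead of fixing one. The sketch below follows that reading on a toy three-answer model. The abstract does not specify the temperature distribution or the uncertainty measure, so the uniform range `[t_min, t_max]`, the entropy-based estimate, and all function names are assumptions for illustration.

```python
import math
import random
from collections import Counter

def softmax(logits, temperature):
    """Temperature-scaled softmax with max subtraction for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_with_random_temperature(logits, rng, t_min=0.1, t_max=2.0):
    """Draw a temperature uniformly (an assumption), then sample one answer index."""
    temperature = rng.uniform(t_min, t_max)
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

def empirical_entropy(samples):
    """Entropy of the sampled answers, used here as the uncertainty estimate."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

rng = random.Random(0)
confident_logits = [8.0, 0.0, 0.0]   # toy model strongly prefers answer 0
uncertain_logits = [1.0, 0.9, 1.1]   # toy model is nearly indifferent

h_confident = empirical_entropy(
    [sample_with_random_temperature(confident_logits, rng) for _ in range(200)])
h_uncertain = empirical_entropy(
    [sample_with_random_temperature(uncertain_logits, rng) for _ in range(200)])
print(h_confident, h_uncertain)
```

Because each sample uses a different temperature, no single value has to be tuned per model and dataset; the spread of sampled answers still separates the confident case from the uncertain one.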

[NLP-14] DBR: Divergence-Based Regularization for Debiasing Natural Language Understanding Models KDD

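[Quick Read]: This paper addresses the tendency of Pre-trained Language Models (PLMs) to rely on superficial features and shortcuts rather than genuine language understanding in Natural Language Understanding (NLU) tasks. This reliance leads to poor generalization on out-of-domain data. The key solution is Divergence Based Regularization (DBR), which measures the divergence between the output distributions for original examples and for examples in which shortcut tokens have been masked, reducing the model's over-reliance on shortcut features or biases and thereby improving the generalization of large pre-trained language models.

Link: https://arxiv.org/abs/2502.18353
Authors: Zihao Li, Ruixiang Tang, Lu Cheng, Shuaiqiang Wang, Dawei Yin, Mengnan Du
Affiliations: New Jersey Institute of Technology; Rutgers University; University of Illinois Chicago; Baidu
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by SIGKDD Explorations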

Abstract:Pre-trained language models (PLMs) have achieved impressive results on various natural language processing tasks. However, recent research has revealed that these models often rely on superficial features and shortcuts instead of developing a genuine understanding of language, especially for natural language understanding (NLU) tasks. Consequently, the models struggle to generalize to out-of-domain data. In this work, we propose Divergence Based Regularization (DBR) to mitigate this shortcut learning behavior. Our method measures the divergence between the output distributions for original examples and examples where shortcut tokens have been masked. This process prevents the model’s predictions from being overly influenced by shortcut features or biases. We evaluate our model on three NLU tasks and find that it improves out-of-domain performance with little loss of in-domain accuracy. Our results demonstrate that reducing the reliance on shortcuts and superficial features can enhance the generalization ability of large pre-trained language models.
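
The regularizer described above can be sketched as a divergence term added to the task loss. The abstract does not name the divergence or how it is weighted, so the KL divergence, the `weight` parameter, and the function names below are assumptions chosen for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same labels."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def regularized_loss(task_loss, p_original, p_masked, weight=1.0):
    """Task loss plus a penalty when masking shortcut tokens changes the prediction.

    p_original: model output distribution on the original example.
    p_masked:   model output distribution with shortcut tokens masked.
    """
    return task_loss + weight * kl_divergence(p_original, p_masked)

# Hypothetical outputs: the prediction collapses once the shortcut token is masked,
# which suggests the model was leaning on that token.
shortcut_reliant = regularized_loss(0.3, [0.9, 0.1], [0.5, 0.5])
# The prediction is unchanged by masking, so no penalty is added.
robust = regularized_loss(0.3, [0.9, 0.1], [0.9, 0.1])
print(shortcut_reliant, robust)
```

Minimizing the penalty pushes the model toward predictions that do not change when shortcut tokens are hidden.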

[NLP-15] BRIDO: Bringing Democratic Order to Abstractive Summarization AAAI-25

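[Quick Read]: This paper addresses hallucination caused by exposure bias in abstractive text summarization with large language models (LLMs), that is, the generation of inaccurate, irrelevant, or inconsistent text. The key solution uses contrastive learning to incentivize candidate summaries with high inter-candidate ROUGE scores, thereby reducing the share of candidates that contain hallucinated content. Experiments show that the proposed method improves the consistency G-Eval score by 6.25% on XSum and 3.82% on CNN/DM over the existing BRIO model.

Link: https://arxiv.org/abs/2502.18342
Authors: Junhyun Lee, Harshith Goka, Hyeonmok Ko
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages, 1 figure; AAAI-25 Workshop on PDLM camera ready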

Abstract:Hallucination refers to the inaccurate, irrelevant, and inconsistent text generated from large language models (LLMs). While the LLMs have shown great promise in a variety of tasks, the issue of hallucination still remains a major challenge for many practical uses. In this paper, we tackle the issue of hallucination in abstract text summarization by mitigating exposure bias. Existing models targeted for exposure bias mitigation, namely BRIO, aim for better summarization quality in the ROUGE score. We propose a model that uses a similar exposure bias mitigation strategy but with a goal that is aligned with less hallucination. We conjecture that among a group of candidate outputs, ones with hallucinations will comprise the minority of the whole group. That is, candidates with less similarity with others will have a higher chance of containing hallucinated content. Our method uses this aspect and utilizes contrastive learning, incentivizing candidates with high inter-candidate ROUGE scores. We performed experiments on the XSum and CNN/DM summarization datasets, and our method showed 6.25% and 3.82% improvement, respectively, on the consistency G-Eval score over BRIO.
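
The conjecture that hallucinated candidates are the ones least similar to the rest can be illustrated by scoring each candidate against its peers. The sketch below uses a simplified unigram-overlap F1 on whitespace tokens as a stand-in for ROUGE; it is not the official ROUGE implementation, and the function names and example summaries are hypothetical.

```python
from collections import Counter

def unigram_f1(a: str, b: str) -> float:
    """Simplified ROUGE-1-style F1 on lowercased whitespace tokens (an assumption)."""
    tokens_a, tokens_b = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((tokens_a & tokens_b).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(tokens_a.values())
    recall = overlap / sum(tokens_b.values())
    return 2 * precision * recall / (precision + recall)

def inter_candidate_scores(candidates):
    """Mean similarity of each candidate to all the other candidates."""
    scores = []
    for i, cand in enumerate(candidates):
        others = [unigram_f1(cand, o) for j, o in enumerate(candidates) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores

candidates = [
    "the council approved the new budget on monday",
    "the council approved the budget on monday",
    "the mayor resigned after the budget scandal",  # the outlier
]
scores = inter_candidate_scores(candidates)
# Rank candidates from most to least agreed-upon.
ranking = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

A ranking like this could supply the ordering that a contrastive objective then learns to reproduce, with the outlier candidate placed last.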

[NLP-16] Moderation Matters: Measuring Conversational Moderation Impact in English as a Second Language Group Discussion

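[Quick Read]: This paper addresses the difficulty that English as a Second Language (ESL) speakers have in engaging in group discussions because of language barriers. The key contribution is a dataset of 17 sessions from an online ESL conversation club, together with an approach that integrates automatic ESL dialogue assessment with a framework that categorizes moderation strategies. The study finds that the most effective moderation strategy is active acknowledgement and encouragement, while excessive information and opinion sharing by moderators can have a negative impact.

Link: https://arxiv.org/abs/2502.18341
Authors: Rena Gao, Ming-Bin Chen, Lea Frermann, Jey Han Lau
Affiliations: The University of Melbourne
Subjects: Computation and Language (cs.CL)
Comments: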

Abstract:English as a Second Language (ESL) speakers often struggle to engage in group discussions due to language barriers. While moderators can facilitate participation, few studies assess conversational engagement and evaluate moderation effectiveness. To address this gap, we develop a dataset comprising 17 sessions from an online ESL conversation club, which includes both moderated and non-moderated discussions. We then introduce an approach that integrates automatic ESL dialogue assessment and a framework that categorizes moderation strategies. Our findings indicate that moderators help improve the flow of topics and start/end a conversation. Interestingly, we find active acknowledgement and encouragement to be the most effective moderation strategy, while excessive information and opinion sharing by moderators has a negative impact. Ultimately, our study paves the way for analyzing ESL group discussions and the role of moderators in non-native conversation settings.

[NLP-17] Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks

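[Quick Read]: This paper addresses the unclear relationship between classic natural language processing (NLP) benchmarks and expensive, time-consuming human evaluations. Through a large-scale comparative study, it finds that most NLP benchmarks correlate strongly with human evaluations, suggesting that automated metrics can serve as reliable predictors of human preferences. It also shows that overparameterized linear regressions can use NLP scores to predict human evaluations across different model scales, offering a path to reducing costly human annotation.

Link: https://arxiv.org/abs/2502.18339
Authors: Rylan Schaeffer, Punit Singh Koura, Binh Tang, Ranjan Subramanian, Aaditya K Singh, Todor Mihaylov, Prajjwal Bhargava, Lovish Madaan, Niladri S. Chatterji, Vedanuj Goswami, Sergey Edunov, Dieuwke Hupkes, Sanmi Koyejo, Sharan Narang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: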

Abstract:The explosion of high-performing conversational language models (LMs) has spurred a shift from classic natural language processing (NLP) benchmarks to expensive, time-consuming and noisy human evaluations - yet the relationship between these two evaluation strategies remains hazy. In this paper, we conduct a large-scale study of four Chat Llama 2 models, comparing their performance on 160 standard NLP benchmarks (e.g., MMLU, ARC, BIG-Bench Hard) against extensive human preferences on more than 11k single-turn and 2k multi-turn dialogues from over 2k human annotators. Our findings are striking: most NLP benchmarks strongly correlate with human evaluations, suggesting that cheaper, automated metrics can serve as surprisingly reliable predictors of human preferences. Three human evaluations, such as adversarial dishonesty and safety, are anticorrelated with NLP benchmarks, while two are uncorrelated. Moreover, through overparameterized linear regressions, we show that NLP scores can accurately predict human evaluations across different model scales, offering a path to reduce costly human annotation without sacrificing rigor. Overall, our results affirm the continued value of classic benchmarks and illuminate how to harness them to anticipate real-world user satisfaction - pointing to how NLP benchmarks can be leveraged to meet evaluation needs of our new era of conversational AI.
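
The overparameterized regression mentioned in the abstract can be illustrated with a minimal sketch. It assumes a minimum-norm least-squares fit and uses synthetic random numbers in place of real benchmark and human-evaluation scores; the function name `fit_min_norm` and the hold-out setup are hypothetical and not taken from the paper.

```python
import numpy as np

def fit_min_norm(X, y):
    """Minimum-norm least-squares weights.

    With more features (benchmarks) than rows (models), the system is
    underdetermined and this solution fits the training rows exactly.
    """
    return np.linalg.pinv(X) @ y

rng = np.random.default_rng(0)
n_models, n_benchmarks = 4, 160  # counts taken from the abstract

# Synthetic stand-ins: benchmark scores for three "training" models
# and one human-evaluation score per model (assumption, not real data).
X_train = rng.normal(size=(n_models - 1, n_benchmarks))
y_train = rng.normal(size=n_models - 1)

w = fit_min_norm(X_train, y_train)

# Predict the human-evaluation score of a held-out model from its benchmark scores.
x_heldout = rng.normal(size=n_benchmarks)
prediction = float(x_heldout @ w)
```

Because there are far more benchmarks than models, the fit reproduces the training scores exactly, so any claim about predictive power has to rest on held-out models rather than training error.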

[NLP-18] BottleHumor: Self-Informed Humor Explanation using the Information Bottleneck Principle

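[Quick read]: This paper addresses the problem of identifying the world knowledge needed to understand humor in multimodal settings. Its key idea is BottleHumor, a method based on the information bottleneck principle that elicits relevant world knowledge from vision and language models and iteratively refines it to generate an explanation of the humor, without supervision. Experiments on three datasets confirm its advantage over a range of baselines.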

Link: https://arxiv.org/abs/2502.18331
Authors: EunJeong Hwang,Peter West,Vered Shwartz
Affiliations: University of British Columbia; Vector Institute for AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Humor is prevalent in online communications and it often relies on more than one modality (e.g., cartoons and memes). Interpreting humor in multimodal settings requires drawing on diverse types of knowledge, including metaphorical, sociocultural, and commonsense knowledge. However, identifying the most useful knowledge remains an open question. We introduce BottleHumor, a method inspired by the information bottleneck principle that elicits relevant world knowledge from vision and language models which is iteratively refined for generating an explanation of the humor in an unsupervised manner. Our experiments on three datasets confirm the advantage of our method over a range of baselines. Our method can further be adapted in the future for additional tasks that can benefit from eliciting and conditioning on relevant world knowledge and open new research avenues in this direction.

[NLP-19] Mapping of Subjective Accounts into Interpreted Clusters (MOSAIC): Topic Modelling and LLM applied to Stroboscopic Phenomenology

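[Quick read]: This paper addresses the challenges of analyzing complex visual experiences (visual hallucinations, VHs) through open subjective reports, in particular the difficulty of systematically identifying patterns. The key to the solution is a data-driven approach that uses Large Language Models and Topic Modelling to uncover and interpret latent experiential topics directly from the text reports of the Dreamachine programme. The approach confirms the simple visual hallucinations documented in earlier scientific studies and also reveals experiences of altered states of consciousness and complex hallucinations, extending the systematic study of subjective experience.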

Link: https://arxiv.org/abs/2502.18318
Authors: Romy Beauté,David J. Schwartzman,Guillaume Dumas,Jennifer Crook,Fiona Macpherson,Adam B. Barrett,Anil K. Seth
Affiliations: Sussex Center for Consciousness Science, University of Sussex; be.AI, University of Sussex; CHUSJ Azrieli Research Center/Mila-Quebec AI Institute, University of Montreal; Collective Act; Canadian Institute for Advanced Research; Centre for the Study of Perceptual Experience, University of Glasgow
Categories: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Stroboscopic light stimulation (SLS) on closed eyes typically induces simple visual hallucinations (VHs), characterised by vivid, geometric and colourful patterns. A dataset of 862 sentences, extracted from 422 open subjective reports, was recently compiled as part of the Dreamachine programme (Collective Act, 2022), an immersive multisensory experience that combines SLS and spatial sound in a collective setting. Although open reports extend the range of reportable phenomenology, their analysis presents significant challenges, particularly in systematically identifying patterns. To address this challenge, we implemented a data-driven approach leveraging Large Language Models and Topic Modelling to uncover and interpret latent experiential topics directly from the Dreamachine’s text-based reports. Our analysis confirmed the presence of simple VHs typically documented in scientific studies of SLS, while also revealing experiences of altered states of consciousness and complex hallucinations. Building on these findings, our computational approach expands the systematic study of subjective experience by enabling data-driven analyses of open-ended phenomenological reports, capturing experiences not readily identified through standard questionnaires. By revealing rich and multifaceted aspects of experiences, our study broadens our understanding of stroboscopically-induced phenomena while highlighting the potential of Natural Language Processing and Large Language Models in the emerging field of computational (neuro)phenomenology. More generally, this approach provides a practically applicable methodology for uncovering subtle hidden patterns of subjective experience across diverse research domains.

[NLP-20] WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging

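[Quick read]: This paper introduces WiCkeD, a method that increases the complexity of existing multiple-choice benchmarks. Its key idea is to randomly replace one choice with "None of the above", a technique often used in educational tests. The authors show that WiCkeD can be applied automatically to any existing benchmark, making it more challenging.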

Link: https://arxiv.org/abs/2502.18316
Authors: Ahmed Elhady,Eneko Agirre,Mikel Artetxe
Affiliations: HiTZ Center, University of the Basque Country (UPV/EHU); Reka AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract: We introduce WiCkeD, a simple method to increase the complexity of existing multiple-choice benchmarks by randomly replacing a choice with “None of the above”, a method often used in educational tests. We show that WiCkeD can be automatically applied to any existing benchmark, making it more challenging. We apply WiCkeD to 6 popular benchmarks and use it to evaluate 18 open-weight LLMs. The performance of the models drops 12.1 points on average with respect to the original versions of the datasets. When using chain-of-thought on 3 MMLU datasets, the performance drop for the WiCkeD variant is similar to the one observed when using the LLMs directly, showing that WiCkeD is also challenging for models with enhanced reasoning abilities. WiCkeD also uncovers that some models are more sensitive to the extra reasoning required, providing additional information with respect to the original benchmarks. We release our code and data at this https URL.
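
The replacement step described in the abstract can be sketched in a few lines. This is a minimal illustration that assumes the option is swapped in place, so the gold index never changes; the function name `wicked_variant` and the in-place positioning are assumptions, and the released code may place the new option differently.

```python
import random

NONE_OF_THE_ABOVE = "None of the above"

def wicked_variant(choices, answer_idx, rng):
    """Replace one randomly chosen option with "None of the above".

    Returns (new_choices, new_answer_idx, replaced_idx). If the replaced
    option was the correct one, "None of the above" becomes the correct
    answer at the same position; otherwise the original answer stays
    correct. Either way the gold index is unchanged.
    """
    replaced_idx = rng.randrange(len(choices))
    new_choices = list(choices)  # copy so the original question is untouched
    new_choices[replaced_idx] = NONE_OF_THE_ABOVE
    return new_choices, answer_idx, replaced_idx

rng = random.Random(0)  # seeded for reproducibility
question_choices = ["Paris", "Rome", "Madrid", "Berlin"]
new_choices, new_answer, replaced = wicked_variant(question_choices, 0, rng)
```

Passing a seeded `random.Random` keeps the perturbed benchmark reproducible across evaluation runs.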

[NLP-21] Looking forward: Linguistic theory and methods

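[Quick read]: This paper examines current developments in linguistic theory and methods, focusing on the increasing integration of computational, cognitive, and evolutionary perspectives. It offers a forward-looking view by connecting linguistics with computer science, psychology, neuroscience, and biology. Its core themes are: (1) the explicit testing of hypotheses about symbolic representation, such as efficiency, locality, and conceptual semantic grounding; (2) the impact of artificial neural networks on theoretical debates and linguistic analysis; (3) the importance of intersubjectivity in linguistic theory; and (4) the growth of evolutionary linguistics. The central proposal is therefore to combine multidisciplinary perspectives to advance linguistic research.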

Link: https://arxiv.org/abs/2502.18313
Authors: John Mansfield,Ethan Gotlieb Wilcox
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This chapter examines current developments in linguistic theory and methods, focusing on the increasing integration of computational, cognitive, and evolutionary perspectives. We highlight four major themes shaping contemporary linguistics: (1) the explicit testing of hypotheses about symbolic representation, such as efficiency, locality, and conceptual semantic grounding; (2) the impact of artificial neural networks on theoretical debates and linguistic analysis; (3) the importance of intersubjectivity in linguistic theory; and (4) the growth of evolutionary linguistics. By connecting linguistics with computer science, psychology, neuroscience, and biology, we provide a forward-looking perspective on the changing landscape of linguistic research.

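[NLP-22] RefuteBench 2.0 – Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction

[Quick read]: This paper addresses the evaluation of how well large language models (LLMs) incorporate user refutation feedback during multi-turn interactions. The key to the solution is RefuteBench 2.0, which integrates LLM agents as refuters and evaluators and designs transient and persistent refutation instructions with different validity periods, enabling flexible and comprehensive assessment. Experiments show that current models can satisfy a refutation effectively but fail to memorize the refutation information, and the analysis shows that LLMs struggle to retain and correctly use earlier information in long-context dialogues.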

Link: https://arxiv.org/abs/2502.18308
Authors: Jianhao Yan,Yun Luo,Yue Zhang
Affiliations: Zhejiang University; Westlake University
Categories: Computation and Language (cs.CL)
Comments: Work in progress

Abstract: In the multi-turn interaction schema, large language models (LLMs) can leverage user feedback to enhance the quality and relevance of their responses. However, evaluating an LLM’s ability to incorporate user refutation feedback is crucial yet challenging. In this study, we introduce RefuteBench 2.0, which significantly extends the original RefuteBench by incorporating LLM agents as refuters and evaluators, which allows for flexible and comprehensive assessment. We design both transient and persistent refutation instructions with different validity periods. Meta-evaluation shows that the LLM-based refuter could generate more human-like refutations and the evaluators could assign scores with high correlation with humans. Experimental results of various LLMs show that current models could effectively satisfy the refutation but fail to memorize the refutation information. Interestingly, we also observe that the performance of the initial task decreases as the refutations increase. Analysis of the attention scores further shows a potential weakness of current LLMs: they struggle to retain and correctly use previous information during long context dialogues. this https URL

[NLP-23] AMPO: Active Multi-Preference Optimization

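[Quick read]: This paper addresses the computational feasibility of multi-preference optimization when training large language models. During self-play alignment, each query yields many candidate answers, so not all responses can be included in the training objective. The key solution is Active Multi-Preference Optimization (AMPO), which combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection. Responses in a large candidate pool are scored and embedded, and a small but informative subset covering reward extremes and distinct semantic clusters is selected for preference optimization. The method identifies not only the best and worst answers but also subtle, underexplored modes that matter for robust alignment. Theoretically, the paper gives guarantees for expected reward maximization under the active selection method; empirically, AMPO achieves state-of-the-art results on AlpacaEval with Llama 8B.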

Link: https://arxiv.org/abs/2502.18293
Authors: Taneesh Gupta,Rahul Madhavan,Xuchao Zhang,Chetan Bansal,Saravan Rajmohan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: Multi-preference optimization enriches language-model alignment beyond pairwise preferences by contrasting entire sets of helpful and undesired responses, thereby enabling richer training signals for large language models. During self-play alignment, these models often produce numerous candidate answers per query, rendering it computationally infeasible to include all responses in the training objective. In this work, we propose Active Multi-Preference Optimization (AMPO), a novel approach that combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection. Specifically, we score and embed large candidate pools of responses and then select a small, yet informative, subset that covers reward extremes and distinct semantic clusters for preference optimization. Our contrastive training scheme is capable of identifying not only the best and worst answers but also subtle, underexplored modes that are crucial for robust alignment. Theoretically, we provide guarantees for expected reward maximization using our active selection method, and empirically, AMPO achieves state-of-the-art results on AlpacaEval using Llama 8B.
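
The idea of choosing a small subset that covers reward extremes and distinct regions of embedding space can be illustrated with a rough sketch. It uses greedy farthest-point selection as a stand-in for the paper's selection rule, which the abstract does not specify; the function name `select_subset` and the toy inputs are hypothetical.

```python
import numpy as np

def select_subset(rewards, embeddings, k):
    """Pick k candidate indices (k >= 2 assumed).

    Step 1: take the highest- and lowest-reward responses (reward extremes).
    Step 2: greedily add the response farthest from everything chosen so far,
            a simple proxy for covering distinct semantic clusters.
    """
    rewards = np.asarray(rewards, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)

    chosen = list(dict.fromkeys([int(np.argmax(rewards)), int(np.argmin(rewards))]))

    # Distance from every candidate to its nearest already-chosen candidate.
    diffs = embeddings[:, None, :] - embeddings[chosen][None, :, :]
    dists = np.min(np.linalg.norm(diffs, axis=-1), axis=1)
    dists[chosen] = -np.inf  # never re-select a chosen index

    while len(chosen) < min(k, len(rewards)):
        nxt = int(np.argmax(dists))
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
        dists[chosen] = -np.inf
    return chosen

# Toy pool: four responses with scalar rewards and 2-D embeddings.
subset = select_subset(
    rewards=[0.1, 0.9, 0.5, 0.4],
    embeddings=[[0.0, 0.0], [1.0, 0.0], [0.0, 5.0], [0.1, 0.0]],
    k=3,
)
```

In the toy pool, the third pick is the response whose embedding lies far from both reward extremes, rather than the near-duplicate of the worst response.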

[NLP-24] How Vital is the Jurisprudential Relevance: Law Article Intervened Legal Case Retrieval and Matching

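[Quick read]: This paper addresses the assessment of legal-rational similarity in legal case retrieval (LCR) and similar case matching (LCM). Earlier methods rely on domain expert knowledge or unrealistic assumptions, which limits their practical use. The key solution is the end-to-end model LCM-LAI, which uses a dependent multi-task learning framework to capture legal-rational information through a law article prediction (LAP) sub-task, without additional assumptions. LCM-LAI also introduces an article-aware attention mechanism based on law distribution that evaluates the legal-rational similarity between sentences across cases more effectively than conventional semantic similarity.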

Link: https://arxiv.org/abs/2502.18292
Authors: Nuo Xu,Pinghui Wang,Zi Liang,Junzhou Zhao,Xiaohong Guan
Affiliations: MOE KLINNS Lab, Xi’an Jiaotong University, China; Department of Automation and NLIST Lab, Tsinghua University, Beijing, China
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract: Legal case retrieval (LCR) aims to automatically scour for comparable legal cases based on a given query, which is crucial for offering relevant precedents to support the judgment in intelligent legal systems. Due to similar goals, it is often associated with a similar case matching (LCM) task. To address them, a daunting challenge is assessing the uniquely defined legal-rational similarity within the judicial domain, which distinctly deviates from the semantic similarities in general text retrieval. Past works either tagged domain-specific factors or incorporated reference laws to capture legal-rational information. However, their heavy reliance on expert or unrealistic assumptions restricts their practical applicability in real-world scenarios. In this paper, we propose an end-to-end model named LCM-LAI to solve the above challenges. Through meticulous theoretical analysis, LCM-LAI employs a dependent multi-task learning framework to capture legal-rational information within legal cases by a law article prediction (LAP) sub-task, without any additional assumptions in inference. Besides, LCM-LAI proposes an article-aware attention mechanism to evaluate the legal-rational similarity between across-case sentences based on law distribution, which is more effective than conventional semantic similarity. We perform a series of exhaustive experiments including two different tasks involving four real-world datasets. Results demonstrate that LCM-LAI achieves state-of-the-art performance.

[NLP-25] Uncertainty Modeling in Multimodal Speech Analysis Across the Psychosis Spectrum

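[Quick read]: This paper addresses the challenge of capturing subtle speech disruptions across the psychosis spectrum, which are hard to pin down because of the inherent variability in speech patterns. The key solution is an uncertainty-aware model that integrates acoustic and linguistic features to predict symptom severity and psychosis-related traits. By quantifying uncertainty in specific modalities, the model can cope with speech variability and improve prediction accuracy. The model adjusts dynamically to task structure, weighting features differently across interaction contexts, which strengthens early detection, personalized assessment, and clinical decision-making.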

Link: https://arxiv.org/abs/2502.18285
Authors: Morteza Rohanian,Roya M. Hüppi,Farhad Nooralahzadeh,Noemi Dannecker,Yves Pauli,Werner Surbeck,Iris Sommer,Wolfram Hinzen,Nicolas Langer,Michael Krauthammer,Philipp Homan
Affiliations: Department of Quantitative Biomedicine, University of Zurich; Department of Adult Psychiatry and Psychotherapy, University of Zurich; Department of Neuroscience, University Medical Center Groningen, Antoni Deusinglaan 2, room 117, Groningen, Netherlands; Department of Translation and Language Sciences, Universitat Pompeu Fabra, Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain; Department of Psychology, University of Zurich; Neuroscience Center Zurich, University of Zurich and ETH Zurich, Zurich, Switzerland
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Capturing subtle speech disruptions across the psychosis spectrum is challenging because of the inherent variability in speech patterns. This variability reflects individual differences and the fluctuating nature of symptoms in both clinical and non-clinical populations. Accounting for uncertainty in speech data is essential for predicting symptom severity and improving diagnostic precision. Speech disruptions characteristic of psychosis appear across the spectrum, including in non-clinical individuals. We develop an uncertainty-aware model integrating acoustic and linguistic features to predict symptom severity and psychosis-related traits. Quantifying uncertainty in specific modalities allows the model to address speech variability, improving prediction accuracy. We analyzed speech data from 114 participants, including 32 individuals with early psychosis and 82 with low or high schizotypy, collected through structured interviews, semi-structured autobiographical tasks, and narrative-driven interactions in German. The model improved prediction accuracy, reducing RMSE and achieving an F1-score of 83% with ECE = 4.5e-2, showing robust performance across different interaction contexts. Uncertainty estimation improved model interpretability by identifying reliability differences in speech markers such as pitch variability, fluency disruptions, and spectral instability. The model dynamically adjusted to task structures, weighting acoustic features more in structured settings and linguistic features in unstructured contexts. This approach strengthens early detection, personalized assessment, and clinical decision-making in psychosis-spectrum research.
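
The abstract reports calibration as ECE (expected calibration error). For readers unfamiliar with the metric, here is a minimal sketch of the common equal-width-bin estimator; the bin count and the binning scheme are assumptions, since the abstract does not say how ECE was computed.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct:     1 if the prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Map each confidence to a bin index; a confidence of exactly 1.0 goes to the last bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Four predictions at 90% confidence, three of them correct.
ece_value = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

A lower value means the stated confidences track observed accuracy more closely.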

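[NLP-26] Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases

[Quick read]: This paper investigates the political leanings of Large Language Models (LLMs) and their causes. Its key contribution is a method to quantitatively evaluate the political leanings embedded in large pretraining corpora, followed by an examination of whether the models' leanings align more closely with their pretraining data or with surveyed human opinions. Based on 32 U.S. Supreme Court cases, the study finds that LLMs strongly reflect the political leanings in their training data and shows no strong correlation with human opinions. This underscores the importance of responsible curation of training data and of robust evaluation metrics to ensure that LLMs align with human-centered values.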

Link: https://arxiv.org/abs/2502.18282
Authors: Shanshan Xu,T.Y.S.S Santosh,Yanai Elazar,Quirin Vogel,Barbara Plank,Matthias Grabmair
Affiliations: Technical University of Munich; Allen Institute for AI; University of Washington; IT University of Copenhagen; LMU Munich & Munich Center for Machine Learning (MCML)
Categories: Computation and Language (cs.CL)
Comments: under review

Abstract: The increased adoption of Large Language Models (LLMs) and their potential to shape public opinion have sparked interest in assessing these models’ political leanings. Building on previous research that compared LLMs and human opinions and observed political bias in system responses, we take a step further to investigate the underlying causes of such biases by empirically examining how the values and biases embedded in training corpora shape model outputs. Specifically, we propose a method to quantitatively evaluate political leanings embedded in the large pretraining corpora. Subsequently, we investigate whether the LLMs’ political leanings are more aligned with their pretraining corpora or with the surveyed human opinions. As a case study, we focus on probing the political leanings of LLMs in 32 U.S. Supreme Court cases, addressing contentious topics such as abortion and voting rights. Our findings reveal that LLMs strongly reflect the political leanings in their training data, and no strong correlation is observed with their alignment to human opinions as expressed in surveys. These results underscore the importance of responsible curation of training data and the need for robust evaluation metrics to ensure LLMs’ alignment with human-centered values.

[NLP-27] Self-Adjust Softmax

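[Quick read]: This paper addresses the vanishing-gradient problem of the softmax function in Transformer attention when attention scores take extreme values. The key solution is a new method called Self-Adjust Softmax (SA-Softmax), which modifies softmax to $x \cdot \text{softmax}(x)$ and its normalized variant $\frac{x - \min(x_{\min},0)}{\max(0,x_{\max})-\min(x_{\min},0)} \cdot \text{softmax}(x)$ to improve gradient properties.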

Link: https://arxiv.org/abs/2502.18277
Authors: Chuanyang Zheng,Yihang Gao,Guoxuan Chen,Han Shi,Jing Xiong,Xiaozhe Ren,Chao Huang,Xin Jiang,Zhenguo Li,Yu Li
Affiliations: The Chinese University of Hong Kong; National University of Singapore; The University of Hong Kong; Noah’s Ark Lab
Categories: Computation and Language (cs.CL)
Comments: Tech Report

Abstract: The softmax function is crucial in Transformer attention, which normalizes each row of the attention scores with summation to one, achieving superior performances over other alternative functions. However, the softmax function can face a gradient vanishing issue when some elements of the attention scores approach extreme values, such as probabilities close to one or zero. In this paper, we propose Self-Adjust Softmax (SA-Softmax) to address this issue by modifying $\text{softmax}(x)$ to $x \cdot \text{softmax}(x)$ and its normalized variant $\frac{x - \min(x_{\min},0)}{\max(0,x_{\max})-\min(x_{\min},0)} \cdot \text{softmax}(x)$. We theoretically show that SA-Softmax provides enhanced gradient properties compared to the vanilla softmax function. Moreover, SA-Softmax Attention can be seamlessly integrated into existing Transformer models to their attention mechanisms with minor adjustments. We conducted experiments to evaluate the empirical performance of Transformer models using SA-Softmax compared to the vanilla softmax function. These experiments, involving models with up to 2.7 billion parameters, are conducted across diverse datasets, language tasks, and positional encoding methods.
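
The two formulas in the abstract translate directly into array code. The following is a minimal NumPy sketch applied along the last axis of an attention-score matrix; the function names and the small `eps` added to the denominator (to avoid division by zero when every score is zero) are assumptions, not part of the paper's definition.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def sa_softmax(x, axis=-1):
    """Basic variant: x * softmax(x)."""
    return x * softmax(x, axis=axis)

def sa_softmax_normalized(x, axis=-1, eps=1e-9):
    """Normalized variant: (x - min(x_min, 0)) / (max(0, x_max) - min(x_min, 0)) * softmax(x)."""
    x_min = np.min(x, axis=axis, keepdims=True)
    x_max = np.max(x, axis=axis, keepdims=True)
    lo = np.minimum(x_min, 0.0)
    hi = np.maximum(x_max, 0.0)
    scale = (x - lo) / (hi - lo + eps)  # lies in [0, 1]; eps is an added safeguard
    return scale * softmax(x, axis=axis)

scores = np.array([[-1.0, 0.0, 2.0]])
basic = sa_softmax(scores)
normalized = sa_softmax_normalized(scores)
```

Unlike plain softmax, the rows of either variant no longer sum to one, because each probability is rescaled by a function of its own score.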

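[NLP-28] Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support

[Quick read]: This paper addresses the difficulty of deploying large language models (LLMs) in healthcare, especially for disease reasoning tasks, which stems mainly from the challenge of acquiring expert-level cognitive data. The key solution is Citrus, a model that emulates the cognitive processes of medical experts and is trained on a large corpus of simulated expert disease reasoning data. The data is synthesized with a novel approach that captures clinicians' decision-making pathways, so the model can better simulate complex diagnostic and treatment reasoning. To address the lack of publicly available medical reasoning datasets, the authors also release the last-stage training data, including a custom-built medical diagnostic dialogue dataset. This open-source contribution is intended to support further research and development in the field.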

Link: https://arxiv.org/abs/2502.18274
Authors: Guoxin Wang,Minyu Gao,Shuai Yang,Ya Zhang,Lizhi He,Liang Huang,Hanlin Xiao,Yexuan Zhang,Wanyue Li,Lu Chen,Jintao Fei,Xin Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: Large language models (LLMs), particularly those with reasoning capabilities, have rapidly advanced in recent years, demonstrating significant potential across a wide range of applications. However, their deployment in healthcare, especially in disease reasoning tasks, is hindered by the challenge of acquiring expert-level cognitive data. In this paper, we introduce Citrus, a medical language model that bridges the gap between clinical expertise and AI reasoning by emulating the cognitive processes of medical experts. The model is trained on a large corpus of simulated expert disease reasoning data, synthesized using a novel approach that accurately captures the decision-making pathways of clinicians. This approach enables Citrus to better simulate the complex reasoning processes involved in diagnosing and treating medical conditions. To further address the lack of publicly available datasets for medical reasoning tasks, we release the last-stage training data, including a custom-built medical diagnostic dialogue dataset. This open-source contribution aims to support further research and development in the field. Evaluations using authoritative benchmarks such as MedQA, covering tasks in medical reasoning and language understanding, show that Citrus achieves superior performance compared to other models of similar size. These results highlight Citrus's potential to significantly enhance medical decision support systems, providing a more accurate and efficient tool for clinical decision-making.

[NLP-29] Beyond In-Distribution Success: Scaling Curves of CoT Granularity for Language Model Generalization

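[Quick read]: This paper addresses the generalization of Transformer-based language models to novel compound tasks under distribution shift. The key solution is to use Chain-of-Thought (CoT) reasoning to improve out-of-distribution (OOD) generalization. The study finds that finer-grained CoT data correlates with better generalization and that CoT is markedly sample-efficient. A theoretical analysis further shows that CoT reasoning pushes language models to internalize valid dependency structures, which leads to better generalization.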

Link: https://arxiv.org/abs/2502.18273
Authors: Ru Wang,Wei Huang,Selena Song,Haoyu Zhang,Yusuke Iwasawa,Yutaka Matsuo,Jiaxian Guo
Affiliations: The University of Tokyo; RIKEN Center for Advanced Intelligence Project; Google Research, Australia
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Generalization to novel compound tasks under distribution shift is important for deploying transformer-based language models (LMs). This work investigates Chain-of-Thought (CoT) reasoning as a means to enhance OOD generalization. Through controlled experiments across several compound tasks, we reveal three key insights: (1) While QA-trained models achieve near-perfect in-distribution accuracy, their OOD performance degrades catastrophically, even with 10000k+ training examples; (2) the granularity of CoT data strongly correlates with generalization performance; finer-grained CoT data leads to better generalization; (3) CoT exhibits remarkable sample efficiency, matching QA performance with much less (even 80%) data. Theoretically, we demonstrate that compound tasks inherently permit shortcuts in Q-A data that misalign with true reasoning principles, while CoT forces internalization of valid dependency structures, and thus can achieve better generalization. Further, we show that transformer positional embeddings can amplify generalization by emphasizing subtask condition recurrence in long CoT sequences. Our combined theoretical and empirical analysis provides compelling evidence for CoT reasoning as a crucial training paradigm for enabling LM generalization under real-world distributional shifts for compound tasks.

[NLP-30] Iterative Counterfactual Data Augmentation AAAI2025

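[Quick read]: This paper addresses the control of information or biases in training data through data augmentation. Existing counterfactual data augmentation (CDA) methods rely on hand-crafted rules or algorithmic procedures that may not fully remove unwanted information. The key idea is iterative CDA (ICDA), which starts from high-noise interventions and converges to a low-noise state. The resulting dataset keeps high mutual information between one target signal and its label while reducing the information carried by spurious signals, so models trained on the augmented data produce rationales that align better with human annotation.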

Link: https://arxiv.org/abs/2502.18249
Authors: Mitchell Plyler,Min Chi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT)
Comments: AAAI 2025

Abstract: Counterfactual data augmentation (CDA) is a method for controlling information or biases in training datasets by generating a complementary dataset with typically opposing biases. Prior work often either relies on hand-crafted rules or algorithmic CDA methods which can leave unwanted information in the augmented dataset. In this work, we show iterative CDA (ICDA) with initial, high-noise interventions can converge to a state with significantly lower noise. Our ICDA procedure produces a dataset where one target signal in the training dataset maintains high mutual information with a corresponding label and the information of spurious signals is reduced. We show training on the augmented datasets produces rationales on documents that better align with human annotation. Our experiments include six human-produced datasets and two large language model-generated datasets.

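[NLP-31] Debt Collection Negotiations with Large Language Models: An Evaluation System and Optimizing Decision Making with Multi-Agent

[Quick read]: This paper addresses the low level of automation in debt collection negotiations (DCN) and the labor-intensive nature of traditional methods. The study finds that existing Large Language Models (LLMs) tend to over-concede when automating DCN and cannot match human negotiators. To address this, the paper proposes the Multi-Agent Debt Negotiation (MADeN) framework, which adds planning and judging modules to improve decision rationality. It also applies post-training techniques, including rejection sampling, to optimize performance. The key solution is to combine a multi-agent system with an improved decision mechanism to raise LLM performance in DCN.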

Link: https://arxiv.org/abs/2502.18228
Authors: Xiaofeng Wang,Zhixin Zhang,Jinguang Zheng,Yiming Ai,Rui Wang
Affiliations: Shanghai Jiao Tong University; Ant Group
Categories: Computation and Language (cs.CL)
Comments: 21 pages

Abstract:Debt collection negotiations (DCN) are vital for managing non-performing loans (NPLs) and reducing creditor losses. Traditional methods are labor-intensive, while large language models (LLMs) offer promising automation potential. However, prior systems lacked dynamic negotiation and real-time decision-making capabilities. This paper explores LLMs in automating DCN and proposes a novel evaluation framework with 13 metrics across 4 aspects. Our experiments reveal that LLMs tend to over-concede compared to human negotiators. To address this, we propose the Multi-Agent Debt Negotiation (MADeN) framework, incorporating planning and judging modules to improve decision rationality. We also apply post-training techniques, including DPO with rejection sampling, to optimize performance. Our studies provide valuable insights for practitioners and researchers seeking to enhance efficiency and outcomes in this domain.

[NLP-32] Connecting Voices: LoReSpeech as a Low-Resource Speech Parallel Corpus ISCA

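[Quick read]: This paper addresses the scarcity of aligned audio corpora for low-resource languages, which limits the use of NLP technologies such as automatic speech recognition (ASR) and speech translation for these languages. The key to the solution is to build LoReASR, a sub-corpus of short audio clips aligned with their transcriptions, and then to align long-form audio recordings with tools such as the MFA (Montreal Forced Aligner). The result is the LoReSpeech corpus, which supports multilingual ASR systems, direct speech-to-speech translation models, and language preservation efforts while fostering digital inclusivity.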

Link: https://arxiv.org/abs/2502.18215
Authors: Samy Ouzerrout
Affiliations: Université d’Orléans; Yanantic AI
Categories: Computation and Language (cs.CL)
Comments: ISCA/ITG Workshop on Diversity in Large Speech and Language Models

Abstract:Aligned audio corpora are fundamental to NLP technologies such as ASR and speech translation, yet they remain scarce for underrepresented languages, hindering their technological integration. This paper introduces a methodology for constructing LoReSpeech, a low-resource speech-to-speech translation corpus. Our approach begins with LoReASR, a sub-corpus of short audios aligned with their transcriptions, created through a collaborative platform. Building on LoReASR, long-form audio recordings, such as biblical texts, are aligned using tools like the MFA. LoReSpeech delivers both intra- and inter-language alignments, enabling advancements in multilingual ASR systems, direct speech-to-speech translation models, and linguistic preservation efforts, while fostering digital inclusivity. This work is conducted within Tutlayt AI project (this https URL).

[NLP-33] LAG: LLM agents for Leaderboard Auto Generation on Demanding

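[Quick read]: This paper addresses the automatic construction of leaderboards for a given research topic in rapidly evolving fields such as artificial intelligence. With a large number of papers updated daily, researchers struggle to track each paper's proposed methods, experimental results, and settings, which creates the need for efficient automatic leaderboard generation. The key solution is a systematic pipeline that covers paper collection, extraction and integration of experimental results, leaderboard generation, and quality evaluation, in order to handle challenges such as multi-document summarization, leaderboard generation, and fair comparison of experiments.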

Link: https://arxiv.org/abs/2502.18209
Authors: Jian Wu,Jiayu Zhang,Dongyuan Li,Linyi Yang,Aoxiao Zhong,Renhe Jiang,Qingsong Wen,Yue Zhang
Affiliations: Institute of Science Tokyo; Peking University, China; University of Tokyo; University College London; AI Research Institute, Squirrel AI Learning, China; School of Engineering, Westlake University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces Leaderboard Auto Generation (LAG), a novel and well-organized framework for automatic generation of leaderboards on a given research topic in rapidly evolving fields like Artificial Intelligence (AI). Faced with a large number of AI papers updated daily, it becomes difficult for researchers to track every paper’s proposed methods, experimental results, and settings, prompting the need for efficient automatic leaderboard construction. While large language models (LLMs) offer promise in automating this process, challenges such as multi-document summarization, leaderboard generation, and experiment fair comparison still remain under exploration. LAG solves these challenges through a systematic approach that involves the paper collection, experiment results extraction and integration, leaderboard generation, and quality evaluation. Our contributions include a comprehensive solution to the leaderboard construction problem, a reliable evaluation method, and experimental results showing the high quality of leaderboards.

[NLP-34] Grandes modelos de lenguaje: de la predicción de palabras a la comprensión?

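[Quick read]: This paper reviews how Large Language Models were developed and how they work, with a focus on their capabilities and limitations, and introduces the main debates around their development and use. By describing the background and fundamentals of the technology, it aims to help readers better understand what these models can do and where their potential problems lie.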

Link: https://arxiv.org/abs/2502.18205
Authors: Carlos Gómez-Rodríguez
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 26 pages, in Spanish. Chapter from book “La Inteligencia Artificial hoy y sus aplicaciones con Big Data”, (Amparo Alonso Betanzos, Daniel Peña y Pilar Poncela, eds.). Publisher: Funcas. ISBN 978-84-17609-94-8

Abstract:Large language models, such as the well-known ChatGPT, have brought about an unexpected revolution in the field of artificial intelligence. On the one hand, they have numerous practical applications and enormous potential still to be explored. On the other hand, they are also the subject of debate from scientific, philosophical, and social perspectives: there are doubts about the exact mechanisms of their functioning and their actual capacity for language comprehension, and their applications raise ethical dilemmas. In this chapter, we describe how this technology has been developed and the fundamentals of its operation, allowing us to better understand its capabilities and limitations and to introduce some of the main debates surrounding its development and use. – Los grandes modelos de lenguaje, como el conocido ChatGPT, han supuesto una inesperada revolución en el ámbito de la inteligencia artificial. Por un lado, cuentan con multitud de aplicaciones prácticas y un enorme potencial todavía por explorar. Por otro lado, son también objeto de debate, tanto desde el punto de vista científico y filosófico como social: hay dudas sobre los mecanismos exactos de su funcionamiento y su capacidad real de comprensión del lenguaje, y sus aplicaciones plantean dilemas éticos. En este capítulo describimos cómo se ha llegado a esta tecnología y los fundamentos de su funcionamiento, permitiéndonos así comprender mejor sus capacidades y limitaciones e introducir algunos de los principales debates que rodean su desarrollo y uso.
MSC classes: 68T50. ACM classes: I.2.7; K.4.0. Journal reference: Amparo Alonso Betanzos, Daniel Peña y Pilar Poncela (eds.), “La Inteligencia Artificial hoy y sus aplicaciones con Big Data”, pp. 73-98, Funcas, 2025. ISBN 978-84-17609-94-8 (digital), 978-84-17609-93-1 (printed)

[NLP-35] Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

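[Quick Read]: This paper addresses the hallucinations that large-scale audio language models (ALMs) commonly exhibit in speech emotion recognition (SER), which lead to misclassifications or irrelevant outputs. It proposes a new model, C^2SER, whose key idea is to improve the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C^2SER integrates the Whisper encoder for semantic perception and uses Emotion2Vec-S, which extends Emotion2Vec with semi-supervised learning, for acoustic perception. It processes SER step by step, drawing on speech content and speaking style to improve recognition. In addition, self-distillation from explicit CoT to implicit CoT further improves stability and reduces error accumulation.

Link: https://arxiv.org/abs/2502.18186
Authors: Zhixian Zhao, Xinfa Zhu, Xinsheng Wang, Shuiyuan Wang, Xuelong Geng, Wenjie Tian, Lei Xie
Affiliations: Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University; Hong Kong University of Science and Technology
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract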

Abstract:Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signals, performing audio analysis and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C^2SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C^2SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C^2SER employs a CoT approach, processing SER in a step-by-step manner while leveraging speech content and speaking styles to improve recognition. To further enhance stability, C^2SER introduces self-distillation from explicit CoT to implicit CoT, mitigating error accumulation and boosting recognition accuracy. Extensive experiments show that C^2SER outperforms existing popular ALMs, such as Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. We release the training code, checkpoints, and test sets to facilitate further research.

[NLP-36] Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs

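[Quick Read]: This paper addresses three core challenges of using large language models (LLMs) for information extraction (IE) from layout-rich documents: data structuring, model engagement, and output refinement. It examines the sub-problems within these challenges, such as input representation, chunking, prompting, and the choice between LLMs and multimodal models. The key finding is that well-configured general-purpose LLMs can match the performance of specialized models. A configuration found by one-factor-at-a-time (OFAT) trials gains 14.1 F1 points over the baseline, while full factorial exploration raises the gain only slightly, to 15.1 points, at roughly 36 times the token usage.

Link: https://arxiv.org/abs/2502.18179
Authors: Gaye Colakoglu, Gürkan Solmaz, Jonathan Fürst
Affiliations: Zurich University of Applied Sciences, Switzerland; NEC Laboratories Europe, Heidelberg, Germany
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract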

Abstract:This paper defines and explores the design space for information extraction (IE) from layout-rich documents using large language models (LLMs). The three core challenges of layout-aware IE with LLMs are 1) data structuring, 2) model engagement, and 3) output refinement. Our study delves into the sub-problems within these core challenges, such as input representation, chunking, prompting, and selection of LLMs and multimodal models. It examines the outcomes of different design choices through a new layout-aware IE test suite, benchmarking against the state-of-art (SoA) model LayoutLMv3. The results show that the configuration from one-factor-at-a-time (OFAT) trial achieves near-optimal results with 14.1 points F1-score gain from the baseline model, while full factorial exploration yields only a slightly higher 15.1 points gain at around 36x greater token usage. We demonstrate that well-configured general-purpose LLMs can match the performance of specialized models, providing a cost-effective alternative. Our test-suite is freely available at this https URL.
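
The gap in cost between the two exploration strategies named in the abstract comes from how many configurations each one has to run. The sketch below illustrates this with a made-up design space: the factor names and levels are hypothetical and not taken from the paper, and the OFAT variant shown is the simplest non-adaptive form (vary one factor at a time around a fixed baseline), which may differ from the paper's exact procedure.

```python
from itertools import product

# Hypothetical design space; factor names and levels are illustrative only.
DESIGN_SPACE = {
    "input_representation": ["plain_text", "markdown", "html"],
    "chunking": ["none", "page", "fixed_window"],
    "prompting": ["zero_shot", "few_shot", "chain_of_thought"],
    "model": ["text_llm", "multimodal_llm"],
}


def full_factorial(space):
    """Return every combination of every level of every factor."""
    names = list(space)
    return [dict(zip(names, combo)) for combo in product(*space.values())]


def ofat(space, baseline):
    """Return the baseline plus each config that changes exactly one factor."""
    configs = [dict(baseline)]
    for name, levels in space.items():
        for level in levels:
            if level != baseline[name]:
                configs.append({**baseline, name: level})
    return configs


# Use the first level of each factor as the (assumed) baseline.
baseline = {name: levels[0] for name, levels in DESIGN_SPACE.items()}

print(len(ofat(DESIGN_SPACE, baseline)))   # 1 + 2 + 2 + 2 + 1 = 8
print(len(full_factorial(DESIGN_SPACE)))   # 3 * 3 * 3 * 2 = 54
```

OFAT grows with the sum of the factor sizes while full factorial grows with their product, which is consistent with the abstract's observation that the small extra gain from full exploration comes at a much larger token cost.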

[NLP-37] SECURA: Sigmoid-Enhanced CUR Decomposition with Uninterrupted Retention and Low-Rank Adaptation in Large Language Models

[Quick Read]: This paper addresses two problems with full fine-tuning (FT) of large language models (LLMs): its high computational cost makes it impractical, and it can cause catastrophic forgetting. The key to the solution is SECURA (Sigmoid-Enhanced CUR Decomposition LoRA), a new parameter-efficient fine-tuning (PEFT) method. SECURA introduces a new normalization technique, SigNorm, to improve parameter retention and overall performance, which mitigates catastrophic forgetting while improving fine-tuning results.

Link: https://arxiv.org/abs/2502.18168
Authors: Zhang Yuxuan, Li Ruizhe
Affiliations: Department of Computing Science, University of Aberdeen; Aberdeen Institute of Data Science and Artificial Intelligence, South China Normal University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: New work on Parameter-Efficient Fine-Tuning (PEFT) for large language models. Includes new techniques SigNorm and CABR-LoRA for optimizing fine-tune performance and Knowledge retention

Click to view abstract

Abstract:With the rapid development of large language models (LLMs), fully fine-tuning (FT) these models has become increasingly impractical due to the high computational demands. Additionally, FT can lead to catastrophic forgetting. As an alternative, Low-Rank Adaptation (LoRA) has been proposed, which fine-tunes only a small subset of parameters, achieving similar performance to FT while significantly reducing resource requirements. However, since LoRA inherits FT’s design, the issue of catastrophic forgetting remains. To address these challenges, we propose SECURA: Sigmoid-Enhanced CUR Decomposition LoRA, a novel parameter-efficient fine-tuning (PEFT) variant that mitigates catastrophic forgetting while improving fine-tuning performance. Our method introduces a new normalization technique, SigNorm, to enhance parameter retention and overall performance. SECURA has been evaluated on a variety of tasks, including mathematical problem-solving (GSM8K), challenging question-answering (CNNDM), translation (NewsDE), and complex multiple-choice reasoning (LogiQA). Experimental results show that SECURA achieves an average fine-tuning improvement of 3.59% across four multiple-choice question (MCQ) tasks and a 2.51% improvement across five question-answering (QA) tasks on models such as Gemma2 2b, Qwen2 1.5b, Qwen 2 7b, Llama3 8b, and Llama3.1 8b, compared to DoRA. Moreover, SECURA demonstrates superior knowledge retention capabilities, maintaining more than 70% accuracy on basic LLM knowledge across 16 continual learning tests, outperforming Experience Replay (ER), Sequential Learning (SEQ), EWC, I-LoRA, and CUR-LoRA.
ACM classes: I.2.6; I.2.7
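
To make the "small subset of parameters" claim about LoRA concrete, the following sketch compares trainable parameter counts for one weight matrix. It covers plain LoRA only, where a frozen weight of shape d_out x d_in receives a learned low-rank update B·A with B of shape d_out x r and A of shape r x d_in. The matrix sizes and rank are illustrative assumptions, and SECURA's CUR decomposition and SigNorm are not modeled because the abstract does not specify them.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for a LoRA update B (d_out x r) times A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes, not taken from the paper.
d_out, d_in, rank = 4096, 4096, 8

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(full)                    # 16777216
print(lora)                    # 65536
print(f"{lora / full:.4%}")    # 0.3906%
```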

[NLP-38] Can LLMs Explain Themselves Counterfactually?

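[Quick Read]: This paper studies how well large language models (LLMs) can produce self-generated counterfactual explanations (SCEs) and designs tests to measure their efficacy. It finds that, despite their strong reasoning abilities, LLMs sometimes struggle to generate SCEs, and even when they succeed, their predictions often disagree with their own counterfactual reasoning. The key contribution is exploring and understanding the limitations of LLMs in generating SCEs.

Link: https://arxiv.org/abs/2502.18156
Authors: Zahra Dehghanighobadi, Asja Fischer, Muhammad Bilal Zafar
Affiliations: Ruhr University Bochum; UAR Research Center for Trustworthy Data Science and Security
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract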

Abstract:Explanations are an important tool for gaining insights into the behavior of ML models, calibrating user trust and ensuring regulatory compliance. The past few years have seen a flurry of post-hoc methods for generating model explanations, many of which involve computing model gradients or solving specially designed optimization problems. However, owing to the remarkable reasoning abilities of Large Language Models (LLMs), self-explanation, that is, prompting the model to explain its outputs, has recently emerged as a new paradigm. In this work, we study a specific type of self-explanations, self-generated counterfactual explanations (SCEs). We design tests for measuring the efficacy of LLMs in generating SCEs. Analysis over various LLM families, model sizes, temperature settings, and datasets reveals that LLMs sometimes struggle to generate SCEs. Even when they do, their prediction often does not agree with their own counterfactual reasoning.

[NLP-39] NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts

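[Quick Read]: This paper addresses the limited coverage of Indonesian languages and their original scripts in natural language processing (NLP), where most progress has been made on romanized text. The key solution is NusaAksara, a public benchmark that covers 8 scripts across 7 languages and spans text and image modalities, with tasks including image segmentation, optical character recognition (OCR), transliteration, translation, and language identification. The data was constructed by human experts through a rigorous process and includes low-resource languages as well as the Lampung script, which is not supported by Unicode.

Link: https://arxiv.org/abs/2502.18148
Authors: Muhammad Farid Adilazuarda, Musa Izzanardi Wijanarko, Lucky Susanto, Khumaisa Nur’aini, Derry Wijaya, Alham Fikri Aji
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract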

Abstract:Indonesia is rich in languages and scripts. However, most NLP progress has been made using romanized text. In this paper, we present NusaAksara, a novel public benchmark for Indonesian languages that includes their original scripts. Our benchmark covers both text and image modalities and encompasses diverse tasks such as image segmentation, OCR, transliteration, translation, and language identification. Our data is constructed by human experts through rigorous steps. NusaAksara covers 8 scripts across 7 languages, including low-resource languages not commonly seen in NLP benchmarks. Although unsupported by Unicode, the Lampung script is included in this dataset. We benchmark our data across several models, from LLMs and VLMs such as GPT-4o, Llama 3.2, and Aya 23 to task-specific systems such as PP-OCR and LangID, and show that most NLP technologies cannot handle Indonesia’s local scripts, with many achieving near-zero performance.

[NLP-40] Jacobian Sparse Autoencoders: Sparsify Computations Not Just Activations

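[Quick Read]: This paper addresses how to understand the computations performed by large language models (LLMs), not just their representations. Traditional sparse autoencoders (SAEs) find sparse, human-interpretable representations of latent activations but are not designed to sparsify computations. The key solution is Jacobian sparse autoencoders (JSAEs), which yield sparsity not only in the input and output activations of a model component but also in the computation (formally, the Jacobian) connecting them. The key technical contribution is an efficient way to compute these Jacobians. The study shows that JSAEs extract a relatively large degree of computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs, and that Jacobians are a reasonable proxy for computational sparsity because MLPs are approximately linear when rewritten in the JSAE basis. JSAEs also achieve greater computational sparsity on pre-trained LLMs than on the equivalent randomized LLM, which suggests that computational-graph sparsity is a property LLMs learn through training and that JSAEs may be better suited than standard SAEs to understanding learned transformer computations.

Link: https://arxiv.org/abs/2502.18147
Authors: Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract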

Abstract:Sparse autoencoders (SAEs) have been successfully used to discover sparse and human-interpretable representations of the latent activations of LLMs. However, we would ultimately like to understand the computations performed by LLMs and not just their representations. The extent to which SAEs can help us understand computations is unclear because they are not designed to “sparsify” computations in any sense, only latent activations. To solve this, we propose Jacobian SAEs (JSAEs), which yield not only sparsity in the input and output activations of a given model component but also sparsity in the computation (formally, the Jacobian) connecting them. With a naïve implementation, the Jacobians in LLMs would be computationally intractable due to their size. One key technical contribution is thus finding an efficient way of computing Jacobians in this setup. We find that JSAEs extract a relatively large degree of computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs. We also show that Jacobians are a reasonable proxy for computational sparsity because MLPs are approximately linear when rewritten in the JSAE basis. Lastly, we show that JSAEs achieve a greater degree of computational sparsity on pre-trained LLMs than on the equivalent randomized LLM. This shows that the sparsity of the computational graph appears to be a property that LLMs learn through training, and suggests that JSAEs might be more suitable for understanding learned transformer computations than standard SAEs.

[NLP-41] LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers

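[Quick Read]: This paper addresses the tight coupling between query rewriting and dense retrievers in existing Retrieval-Augmented Generation (RAG) methods, which limits compatibility with hybrid retrieval and impedes further performance gains. The key solution is a high-level searcher that decomposes complex queries into atomic queries, independent of any retriever-specific optimizations. The authors also develop a new sparse searcher that uses Lucene syntax to improve precise keyword retrieval. In LevelRAG, the high-level searcher orchestrates the retrieval logic, while the low-level searchers (sparse, web, and dense) refine the queries for optimal retrieval. This improves both the completeness and accuracy of retrieval and overcomes the challenges that current query rewriting techniques face in hybrid retrieval scenarios.

Link: https://arxiv.org/abs/2502.18139
Authors: Zhuocheng Zhang, Yang Feng, Min Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: First submit

Click to view abstract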

Abstract:Retrieval-Augmented Generation (RAG) is a crucial method for mitigating hallucinations in Large Language Models (LLMs) and integrating external knowledge into their responses. Existing RAG methods typically employ query rewriting to clarify the user intent and manage multi-hop logic, while using hybrid retrieval to expand search scope. However, the tight coupling of query rewriting to the dense retriever limits its compatibility with hybrid retrieval, impeding further RAG performance improvements. To address this challenge, we introduce a high-level searcher that decomposes complex queries into atomic queries, independent of any retriever-specific optimizations. Additionally, to harness the strengths of sparse retrievers for precise keyword retrieval, we have developed a new sparse searcher that employs Lucene syntax to enhance retrieval accuracy. Alongside web and dense searchers, these components seamlessly collaborate within our proposed method, LevelRAG. In LevelRAG, the high-level searcher orchestrates the retrieval logic, while the low-level searchers (sparse, web, and dense) refine the queries for optimal retrieval. This approach enhances both the completeness and accuracy of the retrieval process, overcoming challenges associated with current query rewriting techniques in hybrid retrieval scenarios. Empirical experiments conducted on five datasets, encompassing both single-hop and multi-hop question answering tasks, demonstrate the superior performance of LevelRAG compared to existing RAG methods. Notably, LevelRAG outperforms the state-of-the-art proprietary model, GPT4o, underscoring its effectiveness and potential impact on the RAG field.
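
The two-level split described in the abstract can be pictured as an orchestrator that only decides what to ask, with retriever-specific handling left to interchangeable low-level searchers. The sketch below is a toy illustration under heavy assumptions: `high_level_search`, `toy_decompose`, and `keyword_searcher` are hypothetical names, the decomposition is hard-coded instead of produced by an LLM, and the searchers are simple keyword lookups, not the paper's sparse, web, or dense components.

```python
def high_level_search(question, decompose, low_level_searchers):
    """Split a question into atomic queries and send each to every low-level searcher."""
    evidence = []
    for atomic_query in decompose(question):
        for search in low_level_searchers.values():
            for doc in search(atomic_query):
                if doc not in evidence:  # de-duplicate while keeping order
                    evidence.append(doc)
    return evidence


def toy_decompose(question):
    """Stand-in for an LLM-based decomposer; hard-coded for this demo."""
    return ["Who directed Inception?", "Where was Christopher Nolan born?"]


def keyword_searcher(index):
    """Build a toy searcher that returns documents whose key appears in the query."""
    def search(query):
        return [doc for key, docs in index.items() if key in query for doc in docs]
    return search


# Two toy indexes standing in for different retrieval backends.
sparse_index = {"Inception": ["Inception was directed by Christopher Nolan."]}
dense_index = {"Nolan": ["Christopher Nolan was born in London."]}

searchers = {
    "sparse": keyword_searcher(sparse_index),
    "dense": keyword_searcher(dense_index),
}
docs = high_level_search(
    "Where was the director of Inception born?", toy_decompose, searchers
)
for doc in docs:
    print(doc)
```

Because the orchestrator only sees callables, a searcher can be swapped or added without touching the decomposition logic, which is the decoupling the abstract argues for. A real multi-hop system would generate the second atomic query from the answer to the first.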

[NLP-42] HyperG: Hypergraph-Enhanced LLMs for Structured Knowledge

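[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in fully capturing structural relationships and handling sparse data when processing structured data. The key solution is HyperG, a hypergraph-based generation framework that enhances LLMs' ability to process structured knowledge. Specifically, HyperG first augments sparse data with contextual information using the generative power of LLMs, and then uses a prompt-attentive hypergraph learning (PHL) network to encode both the augmented information and the intricate structural relationships within the data.

Link: https://arxiv.org/abs/2502.18125
Authors: Sirui Huang, Hanqian Li, Yanggan Gu, Xuming Hu, Qing Li, Guandong Xu
Affiliations: Hong Kong Polytechnic University; University of Technology Sydney (UTS); Hong Kong University of Science and Technology (Guangzhou); Education University of Hong Kong
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Click to view abstract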

Abstract:Given that substantial amounts of domain-specific knowledge are stored in structured formats, such as web data organized through HTML, Large Language Models (LLMs) are expected to fully comprehend this structured information to broaden their applications in various real-world downstream tasks. Current approaches for applying LLMs to structured data fall into two main categories: serialization-based and operation-based methods. Both approaches, whether relying on serialization or using SQL-like operations as an intermediary, encounter difficulties in fully capturing structural relationships and effectively handling sparse data. To address these unique characteristics of structured data, we propose HyperG, a hypergraph-based generation framework aimed at enhancing LLMs’ ability to process structured knowledge. Specifically, HyperG first augments sparse data with contextual information, leveraging the generative power of LLMs, and incorporates a prompt-attentive hypergraph learning (PHL) network to encode both the augmented information and the intricate structural relationships within the data. To validate the effectiveness and generalization of HyperG, we conduct extensive experiments across two different downstream tasks requiring structured knowledge.

[NLP-43] Bayesian Optimization for Controlled Image Editing via LLMs

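[Quick Read]: This paper addresses imprecise content control and the difficulty of maintaining semantic consistency in image generation, particularly with respect to grounding techniques and the need for model fine-tuning. The key solution is BayesGenie, which integrates Large Language Models (LLMs) with Bayesian Optimization to enable precise, user-friendly image editing through natural language descriptions while preserving the original image's semantic integrity. BayesGenie uses an adapted Bayesian optimization strategy to automatically refine the inference process parameters, achieving high-precision image editing with minimal user intervention.

Link: https://arxiv.org/abs/2502.18116
Authors: Chengkun Cai, Haoliang Liu, Xu Zhao, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, Jenq-Neng Hwang, Serge Belongie, Lei Li
Affiliations: University of Edinburgh; University of Manchester; Tsinghua University; FancyTech; University of Washington; University of Copenhagen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 figures

Click to view abstract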

Abstract:In the rapidly evolving field of image generation, achieving precise control over generated content and maintaining semantic consistency remain significant limitations, particularly concerning grounding techniques and the necessity for model fine-tuning. To address these challenges, we propose BayesGenie, an off-the-shelf approach that integrates Large Language Models (LLMs) with Bayesian Optimization to facilitate precise and user-friendly image editing. Our method enables users to modify images through natural language descriptions without manual area marking, while preserving the original image’s semantic integrity. Unlike existing techniques that require extensive pre-training or fine-tuning, our approach demonstrates remarkable adaptability across various LLMs through its model-agnostic design. BayesGenie employs an adapted Bayesian optimization strategy to automatically refine the inference process parameters, achieving high-precision image editing with minimal user intervention. Through extensive experiments across diverse scenarios, we demonstrate that our framework significantly outperforms existing methods in both editing accuracy and semantic preservation, as validated using different LLMs including Claude3 and GPT-4.

[NLP-44] Uncertainty Quantification in Retrieval Augmented Question Answering

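[Quick Read]: This paper addresses a gap in retrieval augmented question answering (QA): incorporating retrieved evidence improves performance and reduces hallucinations, but prior work does not assess whether the retrieved passages are actually useful for answering correctly. The key solution is to quantify a QA model's uncertainty by estimating the utility of the passages it is given. A lightweight neural model is trained to predict passage utility for a target QA model. The paper shows that simple information-theoretic metrics can predict answer correctness to some extent, while the proposed approach efficiently approximates or outperforms more expensive sampling-based methods.

Link: https://arxiv.org/abs/2502.18108
Authors: Laura Perez-Beltrachini, Mirella Lapata
Affiliations: Institute for Language, Cognition and Computation; School of Informatics, University of Edinburgh
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract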

Abstract:Retrieval augmented Question Answering (QA) helps QA models overcome knowledge gaps by incorporating retrieved evidence, typically a set of passages, alongside the question at test time. Previous studies show that this approach improves QA performance and reduces hallucinations, without, however, assessing whether the retrieved passages are indeed useful at answering correctly. In this work, we propose to quantify the uncertainty of a QA model via estimating the utility of the passages it is provided with. We train a lightweight neural model to predict passage utility for a target QA model and show that while simple information theoretic metrics can predict answer correctness up to a certain extent, our approach efficiently approximates or outperforms more expensive sampling-based methods. Code and data are available at this https URL.
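
As a point of reference for the sampling-based uncertainty estimates the abstract compares against, the sketch below computes the Shannon entropy of the empirical distribution over several sampled answers. This is one common information-theoretic signal and an assumption on our part; the abstract does not say which metrics the paper uses, and the paper's own contribution (a learned passage-utility predictor) is not shown.

```python
import math
from collections import Counter


def answer_entropy(sampled_answers):
    """Shannon entropy (in nats) of the empirical distribution over normalized answers."""
    counts = Counter(answer.strip().lower() for answer in sampled_answers)
    total = sum(counts.values())
    return sum((c / total) * math.log(total / c) for c in counts.values())


# All samples agree: zero entropy, i.e. low uncertainty.
print(answer_entropy(["Paris", "paris", " Paris "]))

# Four different answers: entropy equals log(4), i.e. high uncertainty.
print(round(answer_entropy(["Paris", "Lyon", "Nice", "Rome"]), 4))
```

Each sampled answer costs one extra generation from the QA model, which is why sampling-based estimates are more expensive than a single forward pass through a small predictor.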

[NLP-45] Detecting Offensive Memes with Social Biases in Singapore Context Using Multimodal Large Language Models

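[Quick Read]: This paper addresses the difficulty traditional online content moderation systems have in classifying modern multimodal communication such as memes. The task is especially hard in a culturally diverse society like Singapore, where low-resource languages are used and extensive knowledge of local context is needed. The key solution is a pipeline comprising optical character recognition (OCR), translation, and a 7-billion-parameter-class vision-language model (VLM), with the VLM fine-tuned on a dataset of 112K memes labeled by GPT-4V to classify offensive memes in the Singapore context. The approach reaches 80.62% accuracy and 0.8192 AUROC on a held-out test set and can greatly aid human content moderators.

Link: https://arxiv.org/abs/2502.18101
Authors: Cao Yuxuan, Wu Jiayang, Alistair Cheong Liang Chuen, Bryan Shan Guanrong, Theodore Lee Chong Jen, Sherman Chann Zhi Shen
Affiliations: Nanyang Technological University; Independent Researcher; Independent Researcher; Independent Researcher; Independent Researcher
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Click to view abstract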

Abstract:Traditional online content moderation systems struggle to classify modern multimodal means of communication, such as memes, a highly nuanced and information-dense medium. This task is especially hard in a culturally diverse society like Singapore, where low-resource languages are used and extensive knowledge on local context is needed to interpret online content. We curate a large collection of 112K memes labeled by GPT-4V for fine-tuning a VLM to classify offensive memes in Singapore context. We show the effectiveness of fine-tuned VLMs on our dataset, and propose a pipeline containing OCR, translation and a 7-billion parameter-class VLM. Our solutions reach 80.62% accuracy and 0.8192 AUROC on a held-out test set, and can greatly aid humans in moderating online content. The dataset, code, and model weights will be open-sourced at this https URL.
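
The two reported metrics measure different things: accuracy depends on a decision threshold, while AUROC summarizes how well the scores rank offensive items above non-offensive ones across all thresholds. The sketch below computes both on made-up toy data. The labels, scores, and the 0.5 threshold are illustrative assumptions, and AUROC is computed by the pairwise (Mann-Whitney) definition.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of items whose thresholded score matches the binary label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = offensive, 0 = not offensive; scores are model confidences.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.4, 0.1]

print(round(accuracy(labels, scores), 4))  # 0.6667
print(round(auroc(labels, scores), 4))     # 0.7778
```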

[NLP-46] Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning

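[Quick Read]: This paper investigates whether excessively long Chains of Thought (CoTs) harm the performance of large language models (LLMs) on complex reasoning tasks. It finds that, in certain domains, scaling to longer CoTs does impair reasoning performance, and that the optimal scaled length distribution differs across domains. The paper therefore proposes a Thinking-Optimal Scaling strategy. The method first uses a small set of seed data with varying response length distributions to teach the model to adopt different reasoning efforts for deep thinking. The model then selects its shortest correct response on additional problems for self-improvement. The resulting self-improved models, built on Qwen2.5-32B-Instruct, outperform other distillation-based 32B o1-like models on various math benchmarks and perform on par with QwQ-32B-Preview.

Link: https://arxiv.org/abs/2502.18080
Authors: Wenkai Yang, Shuming Ma, Yankai Lin, Furu Wei
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Microsoft Research, Asia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract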

Abstract:Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks. While current research continues to explore the benefits of increasing test-time compute by extending the CoT lengths of Large Language Models (LLMs), we are concerned about a potential issue hidden behind the current pursuit of test-time scaling: Would excessively scaling the CoT length actually bring adverse effects to a model’s reasoning performance? Our explorations on mathematical reasoning tasks reveal an unexpected finding that scaling with longer CoTs can indeed impair the reasoning performance of LLMs in certain domains. Moreover, we discover that there exists an optimal scaled length distribution that differs across different domains. Based on these insights, we propose a Thinking-Optimal Scaling strategy. Our method first uses a small set of seed data with varying response length distributions to teach the model to adopt different reasoning efforts for deep thinking. Then, the model selects its shortest correct response under different reasoning efforts on additional problems for self-improvement. Our self-improved models built upon Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across various math benchmarks, and achieve performance on par with QwQ-32B-Preview.
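
The selection step in the second stage ("the model selects its shortest correct response") can be sketched as a simple filter-then-minimize over candidate responses. The function name, the correctness check, and the use of character count as the length measure are all assumptions for illustration; a real pipeline would more likely count tokens and verify answers against reference solutions.

```python
def shortest_correct(responses, is_correct, length=len):
    """Return the shortest response that passes the correctness check, or None."""
    correct = [r for r in responses if is_correct(r)]
    return min(correct, key=length) if correct else None


# Toy candidates for the question "What is 6 * 7?".
candidates = [
    "Let me think step by step. 6 * 7 is 6 added 7 times, so the answer is 42",
    "6 * 7 = 42",
    "6 * 7 = 41",
]

# Hypothetical checker: a response is correct if it ends with the reference answer.
picked = shortest_correct(candidates, lambda r: r.endswith("42"))
print(picked)
```

Selecting by correctness first and length second keeps short but wrong responses out of the self-improvement data.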

[NLP-47] Defining bias in AI-systems: Biased models are fair models

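[Quick Read]: This paper addresses the lack of a clear definition of "bias" in discussions of algorithmic fairness and challenges the assumption that an unbiased model is inherently fair. Its key point is to distinguish bias from discrimination and to argue that a precise conceptualization of bias is necessary to address fairness concerns effectively. This shift in focus can foster a more constructive academic debate on fairness in AI systems.

Link: https://arxiv.org/abs/2502.18060
Authors: Chiara Lindloff, Ingo Siegert
Affiliations: Otto von Guericke University Magdeburg
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: ISCA/ITG Workshop on Diversity in Large Speech and Language Models

Click to view abstract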

Abstract:The debate around bias in AI systems is central to discussions on algorithmic fairness. However, the term bias often lacks a clear definition, despite frequently being contrasted with fairness, implying that an unbiased model is inherently fair. In this paper, we challenge this assumption and argue that a precise conceptualization of bias is necessary to effectively address fairness concerns. Rather than viewing bias as inherently negative or unfair, we highlight the importance of distinguishing between bias and discrimination. We further explore how this shift in focus can foster a more constructive discourse within academic debates on fairness in AI systems.

[NLP-48] Uncertainty-aware abstention in medical diagnosis based on medical texts

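[Quick Read]: This paper addresses the reliability of AI-assisted medical diagnosis. The key is selective prediction (abstention): the diagnosis system can decline to give a decision when it is not confident in the diagnosis. The paper studies uncertainty quantification in machine learning models for medical text analysis and introduces a new method, HUQ-2, to improve reliability in selective prediction tasks. Methods are evaluated on several datasets, covering binary mortality prediction on MIMIC-III, multi-label medical code prediction on MIMIC-IV, and multi-class classification on a private outpatient visits dataset, along with mental health datasets for depression and anxiety detection drawn from various text sources. The results show that HUQ-2 is effective at capturing and evaluating uncertainty, paving the way for more reliable and interpretable applications in medical text analysis.

Link: https://arxiv.org/abs/2502.18050
Authors: Artem Vazhentsev, Ivan Sviridov, Alvard Barseghyan, Gleb Kuzmin, Alexander Panchenko, Aleksandr Nesterov, Artem Shelmanov, Maxim Panov
Affiliations: AIRI; MBZUAI; Sber AI Lab; YerevaNN
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract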

Abstract:This study addresses the critical issue of reliability for AI-assisted medical diagnosis. We focus on the selective prediction approach that allows the diagnosis system to abstain from providing the decision if it is not confident in the diagnosis. Such selective prediction (or abstention) approaches are usually based on modeling the predictive uncertainty of the machine learning models involved. This study explores uncertainty quantification in machine learning models for medical text analysis, addressing diverse tasks across multiple datasets. We focus on binary mortality prediction from textual data in MIMIC-III, multi-label medical code prediction using ICD-10 codes from MIMIC-IV, and multi-class classification with a private outpatient visits dataset. Additionally, we analyze mental health datasets targeting depression and anxiety detection, utilizing various text-based sources, such as essays, social media posts, and clinical descriptions. In addition to comparing uncertainty methods, we introduce HUQ-2, a new state-of-the-art method for enhancing reliability in selective prediction tasks. Our results provide a detailed comparison of uncertainty quantification methods. They demonstrate the effectiveness of HUQ-2 in capturing and evaluating uncertainty, paving the way for more reliable and interpretable applications in medical text analysis.
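
Selective prediction trades coverage (how many cases the system answers) against accuracy on the answered cases. The sketch below shows that trade-off with a plain confidence threshold on toy data. The function name, the confidences, and the thresholds are illustrative assumptions, and HUQ-2 itself is not implemented here because the abstract does not describe how it computes uncertainty.

```python
def selective_report(confidences, correct, threshold):
    """Return (coverage, accuracy on answered cases) when abstaining below the threshold."""
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(confidences)
    selective_accuracy = sum(kept) / len(kept) if kept else None
    return coverage, selective_accuracy


# Toy data: model confidence per case, and whether the prediction was right.
confidences = [0.95, 0.90, 0.60, 0.55, 0.30]
correct = [True, True, False, True, False]

print(selective_report(confidences, correct, threshold=0.0))  # (1.0, 0.6)
print(selective_report(confidences, correct, threshold=0.8))  # (0.4, 1.0)
```

A better uncertainty estimate is one that, at the same coverage, leaves more of the wrong predictions below the threshold.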

[NLP-49] Harnessing Multiple Large Language Models: A Survey on LLM Ensemble

[Quick Read]: This paper systematically reviews recent developments in LLM Ensemble, the combined use of multiple large language models (LLMs). Its key contribution is a taxonomy of LLM Ensemble and an in-depth classification of methods under three broad categories: ensemble-before-inference, ensemble-during-inference, and ensemble-after-inference. It also reviews related research problems, methods, benchmarks, and applications, and suggests several future research directions.

Link: https://arxiv.org/abs/2502.18036
Authors: Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Dingqi Yang, Hailong Sun, Philip S. Yu
Affiliations: State Key Laboratory of Complex & Critical Software Environment, Beihang University, Beijing, China; Zhongguancun Laboratory, Beijing, China; Aviation System Engineering Institute of China, Beijing, China; Xi’an Jiaotong University, Xi’an, China; University of Macau, Macau SAR, China; University of Illinois at Chicago, Chicago, USA
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 2 figures, codebase: this https URL

Click to view abstract

Abstract:LLM Ensemble – which involves the comprehensive use of multiple large language models (LLMs), each aimed at handling user queries during downstream inference, to benefit from their individual strengths – has gained substantial attention recently. The widespread availability of LLMs, coupled with their varying strengths and out-of-the-box usability, has profoundly advanced the field of LLM Ensemble. This paper presents the first systematic review of recent developments in LLM Ensemble. First, we introduce our taxonomy of LLM Ensemble and discuss several related research problems. Then, we provide a more in-depth classification of the methods under the broad categories of “ensemble-before-inference, ensemble-during-inference, ensemble-after-inference”, and review all relevant methods. Finally, we introduce related benchmarks and applications, summarize existing studies, and suggest several future research directions. A curated list of papers on LLM Ensemble is available at this https URL.
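
Of the three categories, ensemble-after-inference is the easiest to picture: each model answers independently and the outputs are combined afterwards. The sketch below shows one very simple instance, majority voting over normalized answers. It is an illustrative assumption and not a method singled out by the survey, which covers many other approaches; the answers are hard-coded stand-ins for outputs from different models.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer after whitespace and case normalization."""
    normalized = [answer.strip().lower() for answer in answers]
    return Counter(normalized).most_common(1)[0][0]


# Hypothetical outputs from three different models for the same query.
model_outputs = ["42", " 42", "41"]
print(majority_vote(model_outputs))  # 42
```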

[NLP-50] Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference

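[Quick Read]: This paper addresses the limitations of Visual Large Language Models (VLLMs) on questions that require real-time information or are knowledge-intensive. The key solution is a method for detecting a VLLM's knowledge boundary, which reduces indiscriminate retrieval while maintaining or even improving performance. Specifically, the method fine-tunes a VLLM on an automatically constructed dataset for boundary identification. Experiments show that it successfully depicts the knowledge boundary, enabling more efficient use of techniques such as Retrieval Augmented Generation (RAG). The paper also shows that the knowledge boundary identified for one VLLM can serve as a surrogate boundary for other VLLMs.

Link: https://arxiv.org/abs/2502.18023
Authors: Zhuo Chen, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinyu Geng, Pengjun Xie, Fei Huang, Kewei Tu
Affiliations: Institute for Intelligent Computing, Alibaba Group; School of Information Science and Technology, ShanghaiTech University; Shanghai Engineering Research Center of Intelligent Vision and Imaging
Categories: Computation and Language (cs.CL)
Comments: Under review

Click to view abstract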

Abstract:Despite the advancements made in Visual Large Language Models (VLLMs), like text Large Language Models (LLMs), they have limitations in addressing questions that require real-time information or are knowledge-intensive. Indiscriminately adopting Retrieval Augmented Generation (RAG) techniques is an effective yet expensive way to enable models to answer queries beyond their knowledge scopes. To mitigate the dependence on retrieval and simultaneously maintain, or even improve, the performance benefits provided by retrieval, we propose a method to detect the knowledge boundary of VLLMs, allowing for more efficient use of techniques like RAG. Specifically, we propose a method with two variants that fine-tunes a VLLM on an automatically constructed dataset for boundary identification. Experimental results on various types of Visual Question Answering datasets show that our method successfully depicts a VLLM’s knowledge boundary based on which we are able to reduce indiscriminate retrieval while maintaining or improving the performance. In addition, we show that the knowledge boundary identified by our method for one VLLM can be used as a surrogate boundary for other VLLMs. Code will be released at this https URL

[NLP-51] AfroXLMR-Comet: Multilingual Knowledge Distillation with Attention Matching for Low-Resource languages

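Quick read: This paper addresses the performance degradation that occurs when large multilingual language models are compressed for deployment in resource-constrained environments, especially for low-resource languages. The key to the solution is a hybrid distillation approach, designed for multilingual settings, that combines traditional knowledge distillation with a simplified attention matching mechanism. The method introduces an extremely compact student architecture, significantly smaller than conventional multilingual models, which preserves the output distribution and internal attention patterns of the teacher model (AfroXLMR-Large) while reducing model size by over 85%.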

Link: https://arxiv.org/abs/2502.18020
Authors: Joshua Sakthivel Raju, Sanjay S, Jaskaran Singh Walia, Srinivas Raghav, Vukosi Marivate
Affiliations: School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, India; Department of Computer Science, University of Pretoria, South Africa
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Language model compression through knowledge distillation has emerged as a promising approach for deploying large language models in resource-constrained environments. However, existing methods often struggle to maintain performance when distilling multilingual models, especially for low-resource languages. In this paper, we present a novel hybrid distillation approach that combines traditional knowledge distillation with a simplified attention matching mechanism, specifically designed for multilingual contexts. Our method introduces an extremely compact student model architecture, significantly smaller than conventional multilingual models. We evaluate our approach on five African languages: Kinyarwanda, Swahili, Hausa, Igbo, and Yoruba. The distilled student model, AfroXLMR-Comet, successfully captures both the output distribution and internal attention patterns of a larger teacher model (AfroXLMR-Large) while reducing the model size by over 85%. Experimental results demonstrate that our hybrid approach achieves competitive performance compared to the teacher model, maintaining an accuracy within 85% of the original model’s performance while requiring substantially fewer computational resources. Our work provides a practical framework for deploying efficient multilingual models in resource-constrained environments, particularly benefiting applications involving African languages.
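A hybrid objective of this kind can be sketched as a weighted sum of a soft-label distillation term and an attention matching term. The sketch below is an assumption-laden illustration in plain Python: the temperature, the weight `alpha`, and the use of mean squared error between attention maps are our choices, since the abstract does not give the exact loss.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def attention_mse(teacher: List[List[float]], student: List[List[float]]) -> float:
    """Mean squared error between two attention maps of the same shape."""
    diffs = [(t - s) ** 2 for t_row, s_row in zip(teacher, student)
             for t, s in zip(t_row, s_row)]
    return sum(diffs) / len(diffs)


def hybrid_distillation_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    teacher_attn: List[List[float]],
    student_attn: List[List[float]],
    temperature: float = 2.0,  # assumed value
    alpha: float = 0.5,        # assumed weighting between the two terms
) -> float:
    """alpha * soft-label KD term + (1 - alpha) * attention matching term."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kd = kl_divergence(p_teacher, p_student) * temperature ** 2
    return alpha * kd + (1.0 - alpha) * attention_mse(teacher_attn, student_attn)
```

The loss is zero when the student reproduces both the teacher's logits and its attention map, and grows as either diverges.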

[NLP-52] Verdict: A Library for Scaling Judge-Time Compute

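Quick read: This paper addresses the many reliability issues of using large language models (LLMs) as automated judges ("LLM-as-a-judge"). To tackle them, it introduces Verdict, an open-source library whose key idea is to compose modular reasoning units (such as verification, debate, and aggregation) and to spend more inference-time compute, improving the accuracy, reliability, and interpretability of automated evaluators.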

Link: https://arxiv.org/abs/2502.18018
Authors: Nimit Kalra, Leonard Tang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The use of LLMs as automated judges (“LLM-as-a-judge”) is now widespread, yet standard judges suffer from a multitude of reliability issues. To address these challenges, we introduce Verdict, an open-source library for scaling judge-time compute to enhance the accuracy, reliability, and interpretability of automated evaluators. Verdict leverages the composition of modular reasoning units – such as verification, debate, and aggregation – and increased inference-time compute to improve LLM judge quality. Across a variety of challenging tasks such as content moderation, fact-checking, and hallucination detection, Verdict judges achieve state-of-the-art (SOTA) or near-SOTA performance, surpassing orders-of-magnitude larger fine-tuned judges, prompted judges, and reasoning models. Ultimately, we hope Verdict serves as a useful framework for researchers and practitioners building scalable, interpretable, and reliable LLM-based evaluators.
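The aggregation unit mentioned in the abstract can be illustrated with a simple majority vote over several judges. This sketch does not use Verdict's actual API, which is not described here: `Judge`, `majority_vote`, and `ensemble_judge` are hypothetical names, and each judge is just a callable that maps a text to a label.

```python
from collections import Counter
from typing import Callable, List

# A judge is any callable that maps the text under evaluation to a label.
# In practice this would wrap an LLM call; here it is a hypothetical stand-in.
Judge = Callable[[str], str]


def majority_vote(labels: List[str]) -> str:
    """Return the most frequent label among the judges' verdicts."""
    if not labels:
        raise ValueError("at least one label is required")
    return Counter(labels).most_common(1)[0][0]


def ensemble_judge(judges: List[Judge], text: str) -> str:
    """Run every judge on the text and aggregate their verdicts by majority."""
    return majority_vote([judge(text) for judge in judges])
```

Adding more judges (or sampling one judge several times) is one simple way to trade extra inference-time compute for a more stable verdict.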

[NLP-53] ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

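Quick read: This paper addresses the challenges that traditional Retrieval-Augmented Generation (RAG) methods face on visually rich documents, in particular efficient retrieval, comprehension, and reasoning. Existing benchmarks focus mainly on image-based question answering (QA) and overlook these core challenges. To fill the gap, the paper introduces the ViDoSeek dataset and uses it to identify two key limitations of current RAG methods: purely visual retrieval struggles to integrate textual and visual features effectively, and previous approaches often allocate too few reasoning tokens, which limits their effectiveness.

To address these challenges, the paper proposes ViDoRAG, a multi-agent RAG framework designed for complex reasoning across visual documents. ViDoRAG uses a hybrid strategy based on a Gaussian Mixture Model (GMM) to handle multi-modal retrieval. It also introduces an iterative agent workflow of exploration, summarization, and reflection to further elicit the model's reasoning capabilities, providing a framework for studying test-time scaling. Experiments show that ViDoRAG outperforms existing methods by over 10% on the ViDoSeek benchmark, supporting its effectiveness and generalization.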

Link: https://arxiv.org/abs/2502.18017
Authors: Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, Feng Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model’s reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark.
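One plausible reading of a GMM-based hybrid strategy is sketched below: fit a two-component one-dimensional Gaussian mixture to each retriever's similarity scores, keep the documents that fall in the higher-mean component, and take the union across the textual and visual retrievers. The abstract does not spell out these details, so the cutoff rule, the union step, and the names `gmm_keep` and `hybrid_retrieve` are all assumptions.

```python
import math
from typing import Dict, Set


def _normal_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_keep(scores: Dict[str, float], iters: int = 50) -> Set[str]:
    """Keep documents assigned to the higher-mean component of a 2-component GMM."""
    if not scores:
        return set()
    xs = list(scores.values())
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return set(scores)  # no spread in the scores, nothing to separate
    means = [lo, hi]
    variances = [((hi - lo) / 4.0) ** 2] * 2
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each score.
        resp = []
        for x in xs:
            p = [weights[k] * _normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate mean, variance, and weight of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsed component
            weights[k] = nk / len(xs)
    top = 0 if means[0] > means[1] else 1
    other = 1 - top
    return {
        doc
        for doc, x in scores.items()
        if weights[top] * _normal_pdf(x, means[top], variances[top])
        > weights[other] * _normal_pdf(x, means[other], variances[other])
    }


def hybrid_retrieve(text_scores: Dict[str, float], visual_scores: Dict[str, float]) -> Set[str]:
    """Union of the documents kept by the textual and the visual retriever."""
    return gmm_keep(text_scores) | gmm_keep(visual_scores)
```

Compared with a fixed top-k, a mixture-based cutoff lets the number of retrieved documents vary per query and per modality.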

[NLP-54] Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning

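Quick read: This paper studies how to transfer Chain-of-Thought (CoT) capabilities from Large Language Models (LLMs) to Small Language Models (SLMs) effectively. It systematically examines the factors that influence CoT distillation, namely the choice of granularity, format, and teacher model, and finds that SLMs behave differently from LLMs: their relationship with granularity is non-monotonic, CoT format has little effect on them, and stronger teacher models do not always produce better student models. These findings highlight the need to tailor CoT strategies to the specific SLM in order to optimize CoT distillation.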

Link: https://arxiv.org/abs/2502.18001
Authors: Xinghao Chen, Zhijing Sun, Wenjin Guo, Miaoran Zhang, Yanjun Chen, Yirong Sun, Hui Su, Yijie Pan, Dietrich Klakow, Wenjie Li, Xiaoyu Shen
Affiliations: Department of Computing, The Hong Kong Polytechnic University; Ningbo Digital Twin Institute, Eastern Institute of Technology, Ningbo, China; Saarland University; Meituan Inc.
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) excel in reasoning tasks through Chain-of-Thought (CoT) prompting. However, CoT prompting greatly increases computational demands, which has prompted growing interest in distilling CoT capabilities into Small Language Models (SLMs). This study systematically examines the factors influencing CoT distillation, including the choice of granularity, format and teacher model. Through experiments involving four teacher models and seven student models across seven mathematical and commonsense reasoning datasets, we uncover three key findings: (1) Unlike LLMs, SLMs exhibit a non-monotonic relationship with granularity, with stronger models benefiting from finer-grained reasoning and weaker models performing better with simpler CoT supervision; (2) CoT format significantly impacts LLMs but has minimal effect on SLMs, likely due to their reliance on supervised fine-tuning rather than pretraining preferences; (3) Stronger teacher models do NOT always produce better student models, as diversity and complexity in CoT supervision can outweigh accuracy alone. These findings emphasize the need to tailor CoT strategies to the specific student model, offering actionable insights for optimizing CoT distillation in SLMs. The code and datasets are available at this https URL.

[NLP-55] MAGE: Multi-Head Attention Guided Embeddings for Low Resource Sentiment Classification

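Quick read: This paper addresses the significant challenges that the lack of quality data poses for text classification in low-resource Bantu languages. The key to the solution is combining Language-Independent Data Augmentation (LiDA) with Multi-Head Attention based weighted embeddings to selectively enhance critical data points and thereby improve classification performance. The approach addresses data scarcity and lays a foundation for future research on low-resource language processing and classification.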

Link: https://arxiv.org/abs/2502.17987
Authors: Varun Vashisht, Samar Singh, Mihir Konduskar, Jaskaran Singh Walia, Vukosi Marivate
Affiliations: School of Computer Science and Engineering, Vellore Institute of Technology; Department of Computer Science, University of Pretoria
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Due to the lack of quality data for low-resource Bantu languages, significant challenges are presented in text classification and other practical implementations. In this paper, we introduce an advanced model combining Language-Independent Data Augmentation (LiDA) with Multi-Head Attention based weighted embeddings to selectively enhance critical data points and improve text classification performance. This integration allows us to create robust data augmentation strategies that are effective across various linguistic contexts, ensuring that our model can handle the unique syntactic and semantic features of Bantu languages. This approach not only addresses the data scarcity issue but also sets a foundation for future research in low-resource language processing and classification tasks.

[NLP-56] LLM Knows Geometry Better than Algebra: Numerical Understanding of LLM-Based Agents in A Trading Arena

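Quick read: This paper addresses the limited ability of large language models (LLMs) to generalize to dynamic, unseen tasks, particularly in numerical reasoning. Existing benchmarks mainly evaluate LLMs on problems with predefined optimal solutions, which may not match real-world scenarios where clear answers are often absent. To address this, the paper designs the Agent Trading Arena, a virtual numerical game that simulates complex economic systems through zero-sum games in which agents invest in stock portfolios. The key finding is that presenting visual data (such as scatter plots or K-line charts) markedly improves the LLMs' geometric reasoning, and a reflection module further strengthens their analysis and interpretation of complex data. Validation on a NASDAQ stock dataset shows that LLMs reason more strongly with visual data than with text.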

Link: https://arxiv.org/abs/2502.17967
Authors: Tianmi Ma, Jiawei Du, Wenxin Huang, Wenjie Wang, Liang Xie, Xian Zhong, Joey Tianyi Zhou
Affiliations: Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology; Hubei Key Laboratory of Big Data Intelligent Analysis and Application, Hubei University; Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore; Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore; School of Computing, National University of Singapore; School of Science, Wuhan University of Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Statistical Finance (q-fin.ST)
Comments:

Abstract:Recent advancements in large language models (LLMs) have significantly improved performance in natural language processing tasks. However, their ability to generalize to dynamic, unseen tasks, particularly in numerical reasoning, remains a challenge. Existing benchmarks mainly evaluate LLMs on problems with predefined optimal solutions, which may not align with real-world scenarios where clear answers are absent. To bridge this gap, we design the Agent Trading Arena, a virtual numerical game simulating complex economic systems through zero-sum games, where agents invest in stock portfolios. Our experiments reveal that LLMs, including GPT-4o, struggle with algebraic reasoning when dealing with plain-text stock data, often focusing on local details rather than global trends. In contrast, LLMs perform significantly better with geometric reasoning when presented with visual data, such as scatter plots or K-line charts, suggesting that visual representations enhance numerical reasoning. This capability is further improved by incorporating the reflection module, which aids in the analysis and interpretation of complex data. We validate our findings on NASDAQ Stock dataset, where LLMs demonstrate stronger reasoning with visual data compared to text. Our code and data are publicly available at this https URL.

[NLP-57] On Synthetic Data Strategies for Domain-Specific Generative Retrieval

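Quick read: This paper addresses the scalability challenge of manually annotating in-domain queries when building generative retrieval models for domain-specific corpora. The key is a study of synthetic data generation strategies and hard negative mining within a two-stage training framework. In the first stage, an LLM generates queries at multiple granularities (e.g., chunks, sentences) and with domain-relevant search constraints, to better capture nuanced relevance signals. In the second stage, document ranking is refined through preference learning, using hard negatives mined from the initial model's predictions. Experiments demonstrate the effectiveness of the synthetic data generation and hard negative sampling approach.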

Link: https://arxiv.org/abs/2502.17957
Authors: Haoyang Wen, Jiang Guo, Yi Zhang, Jiarong Jiang, Zhiguo Wang
Affiliations: AWS AI; Language Technologies Institute, Carnegie Mellon University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:This paper investigates synthetic data generation strategies in developing generative retrieval models for domain-specific corpora, thereby addressing the scalability challenges inherent in manually annotating in-domain queries. We study the data strategies for a two-stage training framework: in the first stage, which focuses on learning to decode document identifiers from queries, we investigate LLM-generated queries across multiple granularity (e.g. chunks, sentences) and domain-relevant search constraints that can better capture nuanced relevancy signals. In the second stage, which aims to refine document ranking through preference learning, we explore the strategies for mining hard negatives based on the initial model’s predictions. Experiments on public datasets over diverse domains demonstrate the effectiveness of our synthetic data generation and hard negative sampling approach.
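Mining hard negatives from an initial model's predictions can be sketched as follows: take the model's ranked list for a query and keep the highest-ranked documents that are not labeled relevant. The function names and the pairing format are assumptions for illustration; the abstract does not state the exact mining rule.

```python
from typing import List, Set, Tuple


def mine_hard_negatives(
    ranked_doc_ids: List[str],
    positive_ids: Set[str],
    num_negatives: int = 2,
) -> List[str]:
    """Return the top-ranked documents that are NOT relevant to the query.

    `ranked_doc_ids` is assumed to be the initial model's ranking, best first.
    Highly ranked non-relevant documents are "hard" because the model
    currently confuses them with the correct ones.
    """
    negatives: List[str] = []
    for doc_id in ranked_doc_ids:
        if len(negatives) == num_negatives:
            break
        if doc_id not in positive_ids:
            negatives.append(doc_id)
    return negatives


def build_preference_pairs(
    positive_ids: Set[str], negatives: List[str]
) -> List[Tuple[str, str]]:
    """Pair each positive with each hard negative as (preferred, rejected)."""
    return [(pos, neg) for pos in sorted(positive_ids) for neg in negatives]
```

The resulting (preferred, rejected) pairs are the kind of input a preference-learning stage typically consumes.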

[NLP-58] Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments

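Quick read: This paper addresses the challenge of multi-step reasoning in multilingual settings, in particular the limitations of Chain-of-Thought (CoT) prompting in non-English languages. It studies Program-of-Thought (PoT) prompting, which separates reasoning from execution, and proposes an evaluation framework that examines how fine-tuning affects question-reasoning alignment and how reasoning quality relates to answer correctness. The results show that PoT fine-tuning substantially improves multilingual reasoning and outperforms CoT fine-tuned models, and that reasoning quality correlates strongly with answer accuracy.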

Link: https://arxiv.org/abs/2502.17956
Authors: Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong
Affiliations: School of Information Science and Technology, VISTEC; KAIST; Cohere; SCB 10X; AI Singapore; Department of Computer Engineering, Chulalongkorn University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.
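The separation of reasoning from execution in PoT can be shown with a toy example: the model writes a short program (the reasoning), and a Python interpreter computes the answer (the execution). The program string below is hand-written to stand in for model output, and `run_program` is a hypothetical helper, not part of the paper's framework.

```python
from typing import Any, Dict


def run_program(program: str, answer_var: str = "answer") -> Any:
    """Execute a generated program and read the result from `answer_var`.

    Emptying `__builtins__` limits what the program can call, but this is NOT
    a security sandbox; untrusted code needs real isolation.
    """
    namespace: Dict[str, Any] = {"__builtins__": {}}
    exec(program, namespace)
    return namespace[answer_var]


# Stand-in for a program an LLM might generate for the (possibly non-English)
# question: "3 boxes hold 12 apples each; 5 are eaten. How many remain?"
generated_program = (
    "apples_per_box = 12\n"
    "boxes = 3\n"
    "eaten = 5\n"
    "answer = apples_per_box * boxes - eaten\n"
)
```

Because the arithmetic is delegated to the interpreter, an error in the final answer can be traced to the generated program itself, which is what makes code quality usable as a proxy for reasoning quality.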

[NLP-59] Language Models' Factuality Depends on the Language of Inquiry

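Quick read: This paper addresses the failure of multilingual language models (LMs) to transfer factual knowledge effectively across languages. Its key contributions are a benchmark of 10,000 country-related facts across 13 languages and three new metrics: Factual Recall Score, Knowledge Transferability Score, and Cross-Lingual Factual Knowledge Transferability Score. These quantify factual recall and knowledge transferability across languages and expose fundamental weaknesses of today's state-of-the-art LMs in cross-lingual generalization.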

Link: https://arxiv.org/abs/2502.17955
Authors: Tushar Aggarwal, Kumar Tanmay, Ayush Agrawal, Kumar Ayush, Hamid Palangi, Paul Pu Liang
Affiliations: Microsoft Research; Harvard University; Université de Montréal; Mila; MIT; Stanford University; Google
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multilingual language models (LMs) are expected to recall factual knowledge consistently across languages, yet they often fail to transfer knowledge between languages even when they possess the correct information in one of the languages. For example, we find that an LM may correctly identify Rashed Al Shashai as being from Saudi Arabia when asked in Arabic, but consistently fails to do so when asked in English or Swahili. To systematically investigate this limitation, we introduce a benchmark of 10,000 country-related facts across 13 languages and propose three novel metrics: Factual Recall Score, Knowledge Transferability Score, and Cross-Lingual Factual Knowledge Transferability Score, to quantify factual recall and knowledge transferability in LMs across different languages. Our results reveal fundamental weaknesses in today’s state-of-the-art LMs, particularly in cross-lingual generalization where models fail to transfer knowledge effectively across different languages, leading to inconsistent performance sensitive to the language used. Our findings emphasize the need for LMs to recognize language-specific factual reliability and leverage the most trustworthy information across languages. We release our benchmark and evaluation framework to drive future research in multilingual knowledge transfer.

[NLP-60] DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning

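Quick read: This paper evaluates the accuracy and reasoning ability of DeepSeek-R1 and three other recently released Large Language Models (LLMs) on bilingual complex ophthalmology cases. It uses multiple-choice questions (MCQs) covering diagnosis and management, translated from Chinese into English, to compare the models in both languages. Accuracy is scored strictly, and reasoning is analyzed through reasoning logic and causes of error. DeepSeek-R1 achieves the highest overall accuracy in both Chinese and English, and also excels on management questions in Chinese.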

Link: https://arxiv.org/abs/2502.17947
Authors: Pusheng Xu, Yue Wu, Kai Jin, Xiaolan Chen, Mingguang He, Danli Shi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 29 pages, 4 figures, 1 table

Abstract:Purpose: To evaluate the accuracy and reasoning ability of DeepSeek-R1 and three other recently released large language models (LLMs) in bilingual complex ophthalmology cases. Methods: A total of 130 multiple-choice questions (MCQs) related to diagnosis (n = 39) and management (n = 91) were collected from the Chinese ophthalmology senior professional title examination and categorized into six topics. These MCQs were translated into English using DeepSeek-R1. The responses of DeepSeek-R1, Gemini 2.0 Pro, OpenAI o1 and o3-mini were generated under default configurations between February 15 and February 20, 2025. Accuracy was calculated as the proportion of correctly answered questions, with omissions and extra answers considered incorrect. Reasoning ability was evaluated through analyzing reasoning logic and the causes of reasoning error. Results: DeepSeek-R1 demonstrated the highest overall accuracy, achieving 0.862 in Chinese MCQs and 0.808 in English MCQs. Gemini 2.0 Pro, OpenAI o1, and OpenAI o3-mini attained accuracies of 0.715, 0.685, and 0.692 in Chinese MCQs (all P<0.001 compared with DeepSeek-R1), and 0.746 (P=0.115), 0.723 (P=0.027), and 0.577 (P<0.001) in English MCQs, respectively. DeepSeek-R1 achieved the highest accuracy across five topics in both Chinese and English MCQs. It also excelled in management questions conducted in Chinese (all P<0.05). Reasoning ability analysis showed that the four LLMs shared similar reasoning logic. Ignoring key positive history, ignoring key positive signs, misinterpretation of medical data, and being too aggressive were the most common causes of reasoning errors. Conclusion: DeepSeek-R1 demonstrated superior performance in bilingual complex ophthalmology reasoning tasks compared with three other state-of-the-art LLMs. While its clinical applicability remains challenging, it shows promise for supporting diagnosis and clinical decision-making.
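The scoring rule in the Methods (omissions and extra answers both count as incorrect) amounts to exact set matching of the selected options. A minimal sketch, assuming each answer is given as a string of option letters such as "AC"; the function names are ours, not the authors':

```python
from typing import Iterable, List


def is_exact_match(predicted: Iterable[str], gold: Iterable[str]) -> bool:
    """True only if the selected options equal the gold options exactly.

    A missing option (omission) or an additional option (extra answer)
    both make the response incorrect.
    """
    return set(predicted) == set(gold)


def accuracy(predictions: List[str], gold_answers: List[str]) -> float:
    """Proportion of questions answered exactly right."""
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must have equal length")
    if not gold_answers:
        raise ValueError("at least one question is required")
    correct = sum(is_exact_match(p, g) for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)
```

As a consistency check on the reported figures, 112 correct out of 130 questions rounds to 0.862 and 105 out of 130 rounds to 0.808. These counts are back-calculated from the reported accuracies and are not stated in the abstract.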

[NLP-61] Assessing Large Language Models in Agentic Multilingual National Bias

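Quick read: This paper addresses bias in the reasoning-based recommendations that large language models (LLMs) give across languages, an area that remains largely unexplored. It analyzes LLM responses to decision-making tasks in multiple languages, assesses bias when the models provide personalized advice, and quantifies that bias. The study focuses on local language bias and on how reasoning strategies such as Chain-of-Thought prompting affect bias patterns. It finds that GPT-4 and Sonnet reduce bias for English-speaking countries but fail to achieve robust multilingual alignment, which has broader implications for multilingual AI agents and applications such as education.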

Link: https://arxiv.org/abs/2502.17945
Authors: Qianying Liu, Katrina Qiyao Wang, Fei Cheng, Sadao Kurohashi
Affiliations: National Institute of Informatics, Japan; University of Wisconsin-Madison, USA; Kyoto University, Japan
Categories: Computation and Language (cs.CL)
Comments: 13 pages

Abstract:Large Language Models have garnered significant attention for their capabilities in multilingual natural language processing, while studies on risks associated with cross biases are limited to immediate context preferences. Cross-language disparities in reasoning-based recommendations remain largely unexplored, with a lack of even descriptive analysis. This study is the first to address this gap. We test LLM’s applicability and capability in providing personalized advice across three key scenarios: university applications, travel, and relocation. We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages. We quantify bias in model-generated scores and assess the impact of demographic factors and reasoning strategies (e.g., Chain-of-Thought prompting) on bias patterns. Our findings reveal that local language bias is prevalent across different tasks, with GPT-4 and Sonnet reducing bias for English-speaking countries compared to GPT-3.5 but failing to achieve robust multilingual alignment, highlighting broader implications for multilingual AI agents and applications such as education.

[NLP-62] CaseGen: A Benchmark for Multi-Stage Legal Case Documents Generation

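Quick read: This paper addresses the challenges of automatically generating legal case documents, given that existing benchmarks fail to capture the complexity of real-world drafting. The key contribution is CaseGen, a benchmark for multi-stage legal case document generation in the Chinese legal domain. CaseGen is built on 500 real case samples annotated by legal experts, covers seven essential case sections, and supports four key tasks: drafting defense statements, writing trial facts, composing legal reasoning, and generating judgment results. The paper also designs an LLM-as-a-judge evaluation framework, validated against human annotations, and uses it to evaluate general-domain and legal-specific LLMs, identifying their limitations and areas for improvement. The work is presented as a step toward more effective automated drafting of legal case documents.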

Link: https://arxiv.org/abs/2502.17943
Authors: Haitao Li, Jiaying Ye, Yiran Hu, Jia Chen, Qingyao Ai, Yueyue Wu, Junjie Chen, Yifan Chen, Cheng Luo, Quan Zhou, Yiqun Liu
Affiliations: DCST, Tsinghua University; Quan Cheng Laboratory; University of Waterloo; Xiaohongshu Inc; DCST, Beijing University of Posts and Telecommunications; MegaTech.AI
Categories: Computation and Language (cs.CL)
Comments: 18 pages

Abstract:Legal case documents play a critical role in judicial proceedings. As the number of cases continues to rise, the reliance on manual drafting of legal case documents is facing increasing pressure and challenges. The development of large language models (LLMs) offers a promising solution for automating document generation. However, existing benchmarks fail to fully capture the complexities involved in drafting legal case documents in real-world scenarios. To address this gap, we introduce CaseGen, the benchmark for multi-stage legal case documents generation in the Chinese legal domain. CaseGen is based on 500 real case samples annotated by legal experts and covers seven essential case sections. It supports four key tasks: drafting defense statements, writing trial facts, composing legal reasoning, and generating judgment results. To the best of our knowledge, CaseGen is the first benchmark designed to evaluate LLMs in the context of legal case document generation. To ensure an accurate and comprehensive evaluation, we design the LLM-as-a-judge evaluation framework and validate its effectiveness through human annotations. We evaluate several widely used general-domain LLMs and legal-specific LLMs, highlighting their limitations in case document generation and pinpointing areas for potential improvement. This work marks a step toward a more effective framework for automating legal case documents drafting, paving the way for the reliable application of AI in the legal field. The dataset and code are publicly available at this https URL.

[NLP-63] Advantage-Guided Distillation for Preference Alignment in Small Language Models ICLR2025

Quick read: This paper addresses the reduced effectiveness of existing alignment techniques when applied to Small Language Models (SLMs). The key idea is to use a well-aligned Large Language Model (LLM) as a teacher that guides the alignment of the SLM student through knowledge distillation, transferring the teacher's knowledge of human preferences. Two methods are proposed: Dual-Constrained Knowledge Distillation (DCKD) and Advantage-Guided Distillation for Preference Alignment (ADPA). Both improve SLM alignment and narrow the gap with larger models; ADPA performs better, especially when combined with DCKD.

Link: https://arxiv.org/abs/2502.17927
Authors: Shiping Gao, Fanqi Wan, Jiajian Guo, Xiaojun Quan, Qifan Wang
Affiliations: Sun Yat-sen University; Meta
Categories: Computation and Language (cs.CL)
Comments: Accepted by ICLR 2025 (spotlight)

Abstract:Alignment techniques enable Large Language Models (LLMs) to generate outputs that align with human preferences and play a crucial role in their effectiveness. However, their impact often diminishes when applied to Small Language Models (SLMs), likely due to the limited capacity of these models. Instead of directly applying existing alignment techniques to SLMs, we propose to utilize a well-aligned teacher LLM to guide the alignment process for these models, thereby facilitating the transfer of the teacher’s knowledge of human preferences to the student model. To achieve this, we first explore a straightforward approach, Dual-Constrained Knowledge Distillation (DCKD), that employs knowledge distillation with two KL-divergence constraints from the aligned teacher to the unaligned student. To further enhance the student’s ability to distinguish between preferred and dispreferred responses, we then propose Advantage-Guided Distillation for Preference Alignment (ADPA), which leverages an advantage function from the aligned teacher to deliver more nuanced, distribution-level reward signals for the student’s alignment. Our experimental results show that these two approaches appreciably improve the alignment of SLMs and narrow the performance gap with larger counterparts. Among them, ADPA demonstrates superior performance and achieves even greater effectiveness when integrated with DCKD. Our code is available at this https URL.

[NLP-64] FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models

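Quick read: This paper addresses the reliance of existing automated fact-checking evaluation on static datasets and classification metrics, which cannot automatically evaluate justification production or reveal the nuanced limitations of large language models (LLMs) in fact-checking. The key solution is FACT-AUDIT, a framework that uses importance sampling principles and multi-agent collaboration to generate adaptive, scalable datasets, perform iterative model-centric evaluations, and update assessments based on model-specific responses. By evaluating justification production alongside verdict prediction, it provides a comprehensive and evolving audit of LLMs' factual reasoning capabilities and trustworthiness.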

Link: https://arxiv.org/abs/2502.17924
Authors: Hongzhan Lin, Yang Deng, Yuxuan Gu, Wenxuan Zhang, Jing Ma, See-Kiong Ng, Tat-Seng Chua
Affiliations: Hong Kong Baptist University; National University of Singapore; Singapore Management University; Harbin Institute of Technology; Singapore University of Design and Technology
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have significantly advanced the fact-checking studies. However, existing automated fact-checking evaluation methods rely on static datasets and classification metrics, which fail to automatically evaluate the justification production and uncover the nuanced limitations of LLMs in fact-checking. In this work, we introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs’ fact-checking capabilities. Leveraging importance sampling principles and multi-agent collaboration, FACT-AUDIT generates adaptive and scalable datasets, performs iterative model-centric evaluations, and updates assessments based on model-specific responses. By incorporating justification production alongside verdict prediction, this framework provides a comprehensive and evolving audit of LLMs’ factual reasoning capabilities, to investigate their trustworthiness. Extensive experiments demonstrate that FACT-AUDIT effectively differentiates among state-of-the-art LLMs, providing valuable insights into model strengths and limitations in model-centric fact-checking analysis.

[NLP-65] Scaling LLM Pre-training with Vocabulary Curriculum

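Quick read: This paper addresses the gap between the static vocabularies of modern language models and the adaptive vocabulary acquisition seen in human language learning. The key solution is vocabulary curriculum learning, which alternates entropy-guided vocabulary expansion with model optimization and yields log-linear scaling gains relative to vocabulary size. The model learns representations that transfer across tokenization granularities, and compute is allocated accordingly: longer tokens capture predictable content, while shorter tokens focus on complex, harder-to-predict contexts.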

Link: https://arxiv.org/abs/2502.17910
Authors: Fangyuan Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Modern language models rely on static vocabularies, fixed before pretraining, in contrast to the adaptive vocabulary acquisition observed in human language learning. To bridge this gap, we introduce vocabulary curriculum learning, an approach that improves pretraining efficiency with log-linear scaling gains relative to vocabulary size. Our method alternates between entropy-guided vocabulary expansion and model optimization, enabling models to learn transferable representations across diverse tokenization granularities. This approach naturally gives rise to an optimal computation allocation pattern: longer tokens capture predictable content, while shorter tokens focus on more complex, harder-to-predict contexts. Experiments on small-scale GPT models demonstrate improved scaling efficiency, reinforcing the effectiveness of dynamic tokenization. We release our code to support further research and plan to extend our experiments to larger models and diverse domains.
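Entropy-guided vocabulary expansion can be sketched as follows: wherever the model predicts the next token with low entropy, the adjacent pair is a candidate for merging into one longer token. The threshold rule, the frequency ranking, and the function names are assumptions for illustration; the abstract does not give the exact procedure.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def propose_merges(
    tokens: List[str],
    entropies: List[float],
    threshold: float,
    max_new: int,
) -> List[Tuple[str, str]]:
    """Propose adjacent token pairs to merge into new vocabulary entries.

    `entropies[i]` is assumed to be the model's predictive entropy when
    predicting `tokens[i]`. A pair (tokens[i-1], tokens[i]) is a candidate
    when tokens[i] was easy to predict; the most frequent candidates win.
    """
    counts: Counter = Counter()
    for i in range(1, len(tokens)):
        if entropies[i] < threshold:
            counts[(tokens[i - 1], tokens[i])] += 1
    return [pair for pair, _ in counts.most_common(max_new)]
```

Alternating this step with further training would give the curriculum its shape: predictable spans collapse into longer tokens, while hard-to-predict positions stay at fine granularity.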

[NLP-66] Can Large Language Models Identify Implicit Suicidal Ideation? An Empirical Evaluation

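Quick read: This paper evaluates the capabilities of Large Language Models (LLMs) in suicide prevention, focusing on two aspects: the Identification of Implicit Suicidal ideation (IIS) and the Provision of Appropriate Supportive responses (PAS). It proposes a comprehensive evaluation framework and introduces a new dataset, \ourdata, of 1,308 test cases. Experiments with 8 widely used LLMs under different contextual settings show that current models struggle significantly with detecting implicit suicidal ideation and providing appropriate support. The findings underscore the need for more sophisticated approaches to developing and evaluating LLMs for sensitive psychological applications.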

Link: https://arxiv.org/abs/2502.17899
Authors: Tong Li, Shu Yang, Junchao Wu, Jiyao Wei, Lijie Hu, Mengdi Li, Derek F. Wong, Joshua R. Oltmanns, Di Wang
Affiliations: Provable Responsible AI and Data Analytics (PRADA) Lab; King Abdullah University of Science and Technology; Washington University in St. Louis; University of Macau; Institute of Computing Technology, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present a comprehensive evaluation framework for assessing Large Language Models’ (LLMs) capabilities in suicide prevention, focusing on two critical aspects: the Identification of Implicit Suicidal ideation (IIS) and the Provision of Appropriate Supportive responses (PAS). We introduce \ourdata, a novel dataset of 1,308 test cases built upon psychological frameworks including D/S-IAT and Negative Automatic Thinking, alongside real-world scenarios. Through extensive experiments with 8 widely used LLMs under different contextual settings, we find that current models struggle significantly with detecting implicit suicidal ideation and providing appropriate support, highlighting crucial limitations in applying LLMs to mental health contexts. Our findings underscore the need for more sophisticated approaches in developing and evaluating LLMs for sensitive psychological applications.

[NLP-67] RankCoT: Refining Knowledge for Retrieval-Augmented Generation through Ranking Chain-of-Thoughts

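Quick read: This paper addresses the difficulty that Large Language Models (LLMs) have in using knowledge from retrieved documents, where they are often misled by irrelevant or noisy information. The key solution is RankCoT, a knowledge refinement method that incorporates reranking signals into Chain-of-Thought (CoT) based summarization. RankCoT prompts the LLM to generate CoT candidates from the query and individual documents, then fine-tunes the LLM to reproduce the best CoT from these candidates given all retrieved documents, which requires it to filter out irrelevant documents. A self-reflection mechanism further refines the CoT outputs, producing higher-quality training data.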

Link: https://arxiv.org/abs/2502.17888
Authors: Mingyan Wu, Zhenghao Liu, Yukun Yan, Xinze Li, Shi Yu, Zheni Zeng, Yu Gu, Ge Yu
Affiliations: Department of Computer Science and Technology, Northeastern University, China; Department of Computer Science and Technology, Institute for AI, Tsinghua University, China; Beijing National Research Center for Information Science and Technology, China
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) enhances the performance of Large Language Models (LLMs) by incorporating external knowledge. However, LLMs still encounter challenges in effectively utilizing the knowledge from retrieved documents, often being misled by irrelevant or noisy information. To address this issue, we introduce RankCoT, a knowledge refinement method that incorporates reranking signals when generating CoT-based summaries from the given query and all retrieved documents. During training, RankCoT prompts the LLM to generate Chain-of-Thought (CoT) candidates based on the query and individual documents. It then fine-tunes the LLM to directly reproduce the best CoT from these candidate outputs based on all retrieved documents, which requires the LLM to filter out irrelevant documents while generating the CoT-style summary. Additionally, RankCoT incorporates a self-reflection mechanism that further refines the CoT outputs, resulting in higher-quality training data. Our experiments demonstrate the effectiveness of RankCoT, showing its superior performance over other knowledge refinement models. Further analysis reveals that RankCoT can provide shorter but effective refinement results, enabling the generator to produce more accurate answers. All code and data are available at this https URL.

[NLP-68] Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers

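[Quick read]: This paper addresses the language barrier that non-native English-speaking researchers face because academic journals publish mainly in English. The key to the solution is using Large Language Models (LLMs) to automatically translate published scientific articles while preserving their original JATS XML formatting, and evaluating translation accuracy with a new question-and-answer (QA) benchmarking method. The study shows that the approach accurately conveys key scientific details, with an average performance of 95.9%. It also shows how in-context learning can adapt translations to domain-specific preferences, further improving translation quality and usefulness.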

Link: https://arxiv.org/abs/2502.17882
Authors: Hannah Calzi Kleidermacher,James Zou
Institutions: Stanford University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:Scientific research is inherently global. However, the vast majority of academic journals are published exclusively in English, creating barriers for non-native-English-speaking researchers. In this study, we leverage large language models (LLMs) to translate published scientific articles while preserving their native JATS XML formatting, thereby developing a practical, automated approach for implementation by academic journals. Using our approach, we translate articles across multiple scientific disciplines into 28 languages. To evaluate translation accuracy, we introduce a novel question-and-answer (QA) benchmarking method, in which an LLM generates comprehension-based questions from the original text and then answers them based on the translated text. Our benchmark results show an average performance of 95.9%, showing that the key scientific details are accurately conveyed. In a user study, we translate the scientific papers of 15 researchers into their native languages, finding that the authors consistently found the translations to accurately capture the original information in their articles. Interestingly, a third of the authors found many technical terms “overtranslated,” expressing a preference to keep terminology more familiar in English untranslated. Finally, we demonstrate how in-context learning techniques can be used to align translations with domain-specific preferences such as mitigating overtranslation, highlighting the adaptability and utility of LLM-driven scientific translation. The code and translated articles are available at this https URL.

[NLP-69] Towards Enhanced Immersion and Agency for LLM-based Interactive Drama

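[Quick read]: This paper addresses insufficient Immersion and Agency in interactive drama based on Large Language Models (LLMs). The key solutions are Playwriting-guided Generation, which substantially improves story structure and narrative quality, and Plot-based Reflection, which lets LLM characters better understand and respond to the player's intentions, thereby enhancing immersion and agency.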

Link: https://arxiv.org/abs/2502.17878
Authors: Hongqiu Wu,Weiqi Wu,Tianyang Xu,Jiameng Zhang,Hai Zhao
Institutions: Department of Computer Science and Engineering, Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Notes:

Abstract:LLM-based Interactive Drama is a novel AI-based dialogue scenario, where the user (i.e. the player) plays the role of a character in the story, has conversations with characters played by LLM agents, and experiences an unfolding story. This paper begins with understanding interactive drama from two aspects: Immersion, the player’s feeling of being present in the story, and Agency, the player’s ability to influence the story world. Both are crucial to creating an enjoyable interactive experience, while they have been underexplored in previous work. To enhance these two aspects, we first propose Playwriting-guided Generation, a novel method that helps LLMs craft dramatic stories with substantially improved structures and narrative quality. Additionally, we introduce Plot-based Reflection for LLM agents to refine their reactions to align with the player’s intentions. Our evaluation relies on human judgment to assess the gains of our methods in terms of immersion and agency.

[NLP-70] SYNTHEMPATHY: A Scalable Empathy Corpus Generated Using LLMs Without Any Crowdsourcing

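[Quick read]: This paper addresses the scarcity of data for training Large Language Models (LLMs) to produce empathetic dialogue. Existing empathetic dialogue datasets rely mainly on crowdsourcing to simulate empathetic exchanges, which is expensive, time-consuming, and hard to scale to large datasets. The key solution is a data generation framework used to build the SYNTHEMPATHY dataset, which contains 105,000 LLM-generated empathetic responses to real-life situations. The core idea is to use generation to greatly increase the scale of empathetic dialogue data and so improve a model's empathy.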

Link: https://arxiv.org/abs/2502.17857
Authors: Run Chen,Jun Shin,Julia Hirschberg
Institutions: Columbia University
Categories: Computation and Language (cs.CL)
Notes: 10 pages

Abstract:Previous research has shown that humans are more receptive towards language models that exhibit empathetic behavior. While empathy is essential for developing helpful dialogue agents, very few large corpora containing empathetic dialogues are available for fine-tuning LLMs. The few existing corpora have largely relied on crowdsourcing to simulate empathetic conversations, a process that is expensive, time-consuming, and not scalable to larger datasets. We propose a data generation framework for developing SYNTHEMPATHY, a large corpus containing 105k empathetic responses to real-life situations compiled through LLM generation. A base Mistral 7B model fine-tuned on our SYNTHEMPATHY corpus exhibits an increase in the average empathy score.

[NLP-71] LR2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems

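[Quick read]: This paper addresses the lack of adequate evaluation of Long-chain Reflective Reasoning (LR²) in Large Language Models (LLMs). The key solution is a new benchmark, LR² Bench, built on six Constraint Satisfaction Problems (CSPs) that comprehensively evaluate LLM problem solving under different constraint patterns, including knowledge-based, logical, and spatial constraints. With this benchmark, the paper shows that even state-of-the-art reasoning-specific models still fall well short on these complex tasks, underscoring the importance and urgency of improving the reflective reasoning of LLMs.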

Link: https://arxiv.org/abs/2502.17848
Authors: Jianghao Chen,Zhenlin Wei,Zhenjiang Ren,Ziyong Li,Jiajun Zhang
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Recent progress in o1-like models has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making assumptions, backtracking, and self-refinement. However, effectively evaluating such reflection capabilities remains challenging due to the lack of appropriate benchmarks. To bridge this gap, we introduce LR² Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR² Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios. We conduct extensive evaluation on both conventional models and o1-like models. Our experimental results reveal that even the most advanced reasoning-specific models, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in LR² Bench, achieving an average Exact Match score of only 20.0% and 23.6%, respectively. These findings underscore the significant room for improvement in the reflective reasoning capabilities of current LLMs. The leaderboard of our benchmark is available at this https URL
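
The Exact Match figures quoted in this abstract can be read against a minimal sketch of how such a score is commonly computed. This is an illustration only: the function name, the whitespace and case normalization, and the string representation of a solution are assumptions of this sketch, not details taken from the benchmark.

```python
def exact_match_score(predictions, references):
    """Return the percentage of samples whose prediction equals the reference.

    A sample counts only if the whole answer matches; partial credit is not given.
    The normalization (lowercasing, collapsing whitespace) is an assumption.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(text):
        return " ".join(text.strip().lower().split())

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)
```

Because a constraint satisfaction solution is only useful when every constraint holds, an all-or-nothing score of this kind is a natural fit for such tasks.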

[NLP-72] Say Less Mean More: Leveraging Pragmatics in Retrieval-Augmented Generation

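[Quick read]: This paper addresses the inefficient use of retrieved context in Retrieval-Augmented Generation (RAG) frameworks. The key to the solution is a simple unsupervised method that identifies the sentences most relevant to the question and highlights them within their context, without truncating or altering that context, which makes retrieved documents more useful for question answering. Experiments on three question answering tasks (ARC-Challenge, PubHealth, and PopQA) show relative accuracy gains over a conventional RAG system of up to 19.7% on PubHealth and 10% on ARC-Challenge.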

Link: https://arxiv.org/abs/2502.17839
Authors: Haris Riaz,Ellen Riloff,Mihai Surdeanu
Institutions: University of Arizona
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 16 pages, 2 figures

Abstract:We propose a simple, unsupervised method that injects pragmatic principles in retrieval-augmented generation (RAG) frameworks such as Dense Passage Retrieval (Karpukhin et al., 2020) to enhance the utility of retrieved contexts. Our approach first identifies which sentences in a pool of documents retrieved by RAG are most relevant to the question at hand, cover all the topics addressed in the input question and no more, and then highlights these sentences within their context, before they are provided to the LLM, without truncating or altering the context in any other way. We show that this simple idea brings consistent improvements in experiments on three question answering tasks (ARC-Challenge, PubHealth and PopQA) using five different LLMs. It notably enhances relative accuracy by up to 19.7% on PubHealth and 10% on ARC-Challenge compared to a conventional RAG system.
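
The highlighting idea described above can be sketched as follows. This is a rough illustration, not the authors' method: the word-overlap relevance score, the `<hl>` markers, the naive sentence splitter, and the function name are all assumptions made for the example.

```python
import re


def highlight_relevant(context, question, top_k=1, open_tag="<hl>", close_tag="</hl>"):
    """Wrap the top_k sentences that share the most words with the question in tags.

    No sentence is dropped or reordered; only markers are added
    (whitespace between sentences is normalized to a single space).
    Word overlap is a stand-in relevance score, assumed for this sketch.
    """
    def words(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    question_words = words(question)
    scores = [len(words(s) & question_words) for s in sentences]

    # Highest score first; ties keep the original order.
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    chosen = {i for i in ranked[:top_k] if scores[i] > 0}

    return " ".join(
        f"{open_tag}{s}{close_tag}" if i in chosen else s
        for i, s in enumerate(sentences)
    )
```

Keeping the full context and only marking sentences means the LLM still sees the surrounding text, which is the property the abstract emphasizes.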

[NLP-73] A General Framework to Enhance Fine-tuning-based LLM Unlearning

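[Quick read]: This paper addresses the problem that existing fine-tuning-based methods for removing copyrighted and privacy-sensitive data from Large Language Models (LLMs) usually degrade model utility, that is, the ability to respond to normal prompts. The key solution is a framework called Gated Representation UNlearning (GRUN), which uses a soft gate function to distinguish the target data and a suppression module based on Representation Fine-tuning (ReFT) to adjust representations rather than model parameters. This significantly improves both unlearning effectiveness and model utility while keeping the method general and efficient.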

Link: https://arxiv.org/abs/2502.17823
Authors: Jie Ren,Zhenwei Dai,Xianfeng Tang,Hui Liu,Jingying Zeng,Zhen Li,Rahul Goutam,Suhang Wang,Yue Xing,Qi He,Hui Liu
Institutions: Amazon; Michigan State University; The Pennsylvania State University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes:

Abstract:Unlearning has been proposed to remove copyrighted and privacy-sensitive data from Large Language Models (LLMs). Existing approaches primarily rely on fine-tuning-based methods, which can be categorized into gradient ascent-based (GA-based) and suppression-based methods. However, they often degrade model utility (the ability to respond to normal prompts). In this work, we aim to develop a general framework that enhances the utility of fine-tuning-based unlearning methods. To achieve this goal, we first investigate the common property between GA-based and suppression-based methods. We unveil that GA-based methods unlearn by distinguishing the target data (i.e., the data to be removed) and suppressing related generations, which is essentially the same strategy employed by suppression-based methods. Inspired by this finding, we introduce Gated Representation UNlearning (GRUN) which has two components: a soft gate function for distinguishing target data and a suppression module using Representation Fine-tuning (ReFT) to adjust representations rather than model parameters. Experiments show that GRUN significantly improves the unlearning and utility. Meanwhile, it is general for fine-tuning-based methods, efficient and promising for sequential unlearning.
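
The two components named in the abstract, a soft gate and a suppression step that edits representations rather than weights, can be pictured with a toy sketch. Everything below is hypothetical: the linear gate, the fixed suppression direction, and the function name are assumptions for illustration, and a real ReFT intervention is a learned edit rather than a constant shift.

```python
import math


def gated_intervention(hidden, gate_weights, gate_bias, suppress_direction):
    """Return (hidden + g * suppress_direction, g) with g = sigmoid(w . hidden + b).

    g near 0 leaves the representation almost untouched (ordinary prompts);
    g near 1 applies the full suppression shift (data to be unlearned).
    Model parameters are never modified; only the representation is.
    """
    logit = sum(w * h for w, h in zip(gate_weights, hidden)) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-logit))
    shifted = [h + gate * d for h, d in zip(hidden, suppress_direction)]
    return shifted, gate
```

The appeal of gating is that utility on normal prompts depends on the gate staying closed for them, so the suppression module can be aggressive without affecting unrelated inputs.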

[NLP-74] Predicting Through Generation: Why Generation Is Better for Prediction

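[Quick read]: This paper addresses two main challenges of using generation for prediction tasks: exposure bias and format mismatch. The key solution is the PredGen framework, which uses scheduled sampling to mitigate exposure bias and introduces a task adapter that converts the generated discrete tokens into the required structured output. It also introduces the Writer-Director Alignment Loss (WDAL), which keeps the generated tokens consistent with the final task prediction and improves text coherence and numerical accuracy.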

Link: https://arxiv.org/abs/2502.17817
Authors: Md Kowsher,Nusrat Jahan Prottasha,Prakash Bhat,Chun-Nam Yu,Mojtaba Soltanalian,Ivan Garibay,Ozlem Garibay,Chen Chen,Niloofar Yousefi
Institutions: University of Central Florida, USA; DotStar Inc, USA; Nokia Bell Labs, USA; University of Illinois Chicago, USA
Categories: Computation and Language (cs.CL)
Notes: Preprint paper

Abstract:This paper argues that generating output tokens is more effective than using pooled representations for prediction tasks because token-level generation retains more mutual information. Since LLMs are trained on massive text corpora using next-token prediction, generation aligns naturally with their learned behavior. Using the Data Processing Inequality (DPI), we provide both theoretical and empirical evidence supporting this claim. However, autoregressive models face two key challenges when used for prediction: (1) exposure bias, where the model sees ground truth tokens during training but relies on its own predictions during inference, leading to errors, and (2) format mismatch, where discrete tokens do not always align with the task's required output structure. To address these challenges, we introduce PredGen (Predicting Through Generating), an end-to-end framework that (i) uses scheduled sampling to reduce exposure bias, and (ii) introduces a task adapter to convert the generated tokens into structured outputs. Additionally, we introduce Writer-Director Alignment Loss (WDAL), which ensures consistency between token generation and final task predictions, improving both text coherence and numerical accuracy. We evaluate PredGen on multiple classification and regression benchmarks. Our results show that PredGen consistently outperforms standard baselines, demonstrating its effectiveness in structured prediction tasks.
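
Scheduled sampling, which the abstract uses against exposure bias, can be sketched in a few lines. This is a generic illustration of the technique under stated assumptions, not PredGen's code: `predict_next` is a hypothetical callback standing in for the model, and the `<bos>` token and function name are invented for the example.

```python
import random


def scheduled_sampling_inputs(gold_tokens, predict_next, sampling_prob, bos="<bos>", rng=None):
    """Build decoder inputs for one training sequence.

    At each step the previous token fed to the decoder is the model's own
    prediction with probability sampling_prob, and the gold token otherwise.
    sampling_prob=0.0 is plain teacher forcing; 1.0 is fully free-running.
    """
    rng = rng or random.Random()
    inputs = [bos]
    for gold in gold_tokens[:-1]:
        if rng.random() < sampling_prob:
            # Expose the model to its own output, as it will see at inference time.
            inputs.append(predict_next(inputs))
        else:
            inputs.append(gold)
    return inputs
```

In practice `sampling_prob` is usually increased over training so the model starts from stable teacher forcing and gradually learns to recover from its own mistakes.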

[NLP-75] Can Multimodal LLMs Perform Time Series Anomaly Detection?

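[Quick read]: This paper explores the potential of Multimodal Large Language Models (MLLMs) for Time Series Anomaly Detection (TSAD). Its core contribution is the VisualTimeAnomaly benchmark, which evaluates different MLLMs (both proprietary and open-source) on time series rendered as images. The key approach is to convert numerical time series data into image format and use these images as input to detect point-wise, range-wise, and variate-wise anomalies, thereby testing the effectiveness and robustness of MLLMs on univariate, multivariate, and irregular time series.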

Link: https://arxiv.org/abs/2502.17812
Authors: Xiongxiao Xu,Haoran Wang,Yueqing Liang,Philip S. Yu,Yue Zhao,Kai Shu
Institutions: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 9 pages for the main content; 32 pages for the full paper including the appendix. More resources on the intersection of multimodal LLMs and time series analysis are on the website this https URL

Abstract:Large language models (LLMs) have been increasingly used in time series analysis. However, the potential of multimodal LLMs (MLLMs), particularly vision-language models, for time series remains largely under-explored. One natural way for humans to detect time series anomalies is through visualization and textual description. Motivated by this, we raise a critical and practical research question: Can multimodal LLMs perform time series anomaly detection? To answer this, we propose VisualTimeAnomaly benchmark to evaluate MLLMs in time series anomaly detection (TSAD). Our approach transforms time series numerical data into the image format and feed these images into various MLLMs, including proprietary models (GPT-4o and Gemini-1.5) and open-source models (LLaVA-NeXT and Qwen2-VL), each with one larger and one smaller variant. In total, VisualTimeAnomaly contains 12.4k time series images spanning 3 scenarios and 3 anomaly granularities with 9 anomaly types across 8 MLLMs. Starting with the univariate case (point- and range-wise anomalies), we extend our evaluation to more practical scenarios, including multivariate and irregular time series scenarios, and variate-wise anomalies. Our study reveals several key insights: 1) MLLMs detect range- and variate-wise anomalies more effectively than point-wise anomalies. 2) MLLMs are highly robust to irregular time series, even with 25% of the data missing. 3) Open-source MLLMs perform comparably to proprietary models in TSAD. While open-source MLLMs excel on univariate time series, proprietary MLLMs demonstrate superior effectiveness on multivariate time series. To the best of our knowledge, this is the first work to comprehensively investigate MLLMs for TSAD, particularly for multivariate and irregular time series scenarios. We release our dataset and code at this https URL to support future research.

[NLP-76] URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models

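[Quick read]: This paper addresses the lack of comprehensive evaluation of end-to-end Spoken Dialogue Models (SDMs) in Speech-to-Speech (S2S) scenarios. The key solution is URO-Bench, a comprehensive benchmark covering multilingualism, multi-round dialogue, and paralinguistic information. It is split into two difficulty levels, basic and pro, and evaluates a model's abilities in Understanding, Reasoning, and Oral conversation.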

Link: https://arxiv.org/abs/2502.17810
Authors: Ruiqi Yan,Xiquan Li,Wenxi Chen,Zhikang Niu,Chen Yang,Ziyang Ma,Kai Yu,Xie Chen
Institutions: Unknown
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Notes:

Abstract:In recent years, with advances in large language models (LLMs), end-to-end spoken dialogue models (SDMs) have made significant strides. Compared to text-based LLMs, the evaluation of SDMs needs to take speech-related aspects into account, such as paralinguistic information and speech quality. However, there is still a lack of comprehensive evaluations for SDMs in speech-to-speech (S2S) scenarios. To address this gap, we propose URO-Bench, an extensive benchmark for SDMs. Notably, URO-Bench is the first S2S benchmark that covers evaluations about multilingualism, multi-round dialogues, and paralinguistics. Our benchmark is divided into two difficulty levels: basic track and pro track, consisting of 16 and 20 datasets respectively, evaluating the model’s abilities in Understanding, Reasoning, and Oral conversation. Evaluations on our proposed benchmark reveal that current open-source SDMs perform rather well in daily QA tasks, but lag behind their backbone LLMs in terms of instruction-following ability and also suffer from catastrophic forgetting. Their performance in advanced evaluations of paralinguistic information and audio understanding remains subpar, highlighting the need for further research in this direction. We hope that URO-Bench can effectively facilitate the development of spoken dialogue models by providing a multifaceted evaluation of existing models and helping to track progress in this area.

[NLP-77] Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training

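[Quick read]: This paper addresses the significant performance drop of Large Language Models (LLMs) under minor variations in query phrasing, even when those variations preserve the underlying semantic meaning. The key solution is syMmetry-ENhanceD (MEND) Data Augmentation, a data-centric method that makes the model more aware of symmetry across query variations at the knowledge extraction stage, improving robustness and generalization, especially to Out-of-Distribution (OOD) data. Experiments show that MEND improves performance on logical and arithmetic reasoning tasks across diverse query variations.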

Link: https://arxiv.org/abs/2502.17800
Authors: Yihang Yao,Zhepeng Cen,Miao Li,William Han,Yuyou Zhang,Emerson Liu,Zuxin Liu,Chuang Gan,Ding Zhao
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Large Language Models (LLMs) have demonstrated strong reasoning capabilities across various tasks. However, even minor variations in query phrasing, despite preserving the underlying semantic meaning, can significantly affect their performance. To address this, we focus on enhancing LLMs’ awareness of symmetry in query variations and propose syMmetry-ENhanceD (MEND) Data Augmentation, a data-centric approach that improves the model’s ability to extract useful information from context. Unlike existing methods that emphasize reasoning chain augmentation, our approach improves model robustness at the knowledge extraction stage through query augmentations, enabling more data-efficient training and stronger generalization to Out-of-Distribution (OOD) settings. Extensive experiments on both logical and arithmetic reasoning tasks show that MEND enhances reasoning performance across diverse query variations, providing new insight into improving LLM robustness through structured dataset curation.

[NLP-78] Enhancing Human Evaluation in Machine Translation with Comparative Judgment

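[Quick read]: This paper addresses bias in human evaluation of rapidly evolving language models caused by differences in annotator proficiency and task design. The key solution is to integrate Comparative Judgment into human annotation for Machine Translation (MT), evaluating three annotation setups: point-wise Multidimensional Quality Metrics (MQM), side-by-side (SxS) MQM, and its simplified version, SxS relative ranking (RR). The study finds that the SxS settings achieve higher inter-annotator agreement than MQM and improve the consistency of translation error marking while keeping system rankings stable, and that SxS RR is a more efficient alternative to SxS MQM. The SxS settings also surface subtle errors that MQM tends to overlook, without changing absolute system evaluations.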

Link: https://arxiv.org/abs/2502.17797
Authors: Yixiao Song,Parker Riley,Daniel Deutsch,Markus Freitag
Institutions: UMass Amherst; Google
Categories: Computation and Language (cs.CL)
Notes: Preprint, 15 pages

Abstract:Human evaluation is crucial for assessing rapidly evolving language models but is influenced by annotator proficiency and task design. This study explores the integration of comparative judgment into human annotation for machine translation (MT) and evaluates three annotation setups: point-wise Multidimensional Quality Metrics (MQM), side-by-side (SxS) MQM, and its simplified version SxS relative ranking (RR). In MQM, annotators mark error spans with categories and severity levels. SxS MQM extends MQM to pairwise error annotation for two translations of the same input, while SxS RR focuses on selecting the better output without labeling errors. Key findings are: (1) the SxS settings achieve higher inter-annotator agreement than MQM; (2) SxS MQM enhances inter-translation error marking consistency compared to MQM by, on average, 38.5% for explicitly compared MT systems and 19.5% for others; (3) all annotation settings return stable system rankings, with SxS RR offering a more efficient alternative to (SxS) MQM; (4) the SxS settings highlight subtle errors overlooked in MQM without altering absolute system evaluations. To spur further research, we will release the triply annotated datasets comprising 377 ZhEn and 104 EnDe annotation examples.

[NLP-79] AIR: Complex Instruction Generation via Automatic Iterative Refinement

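[Quick read]: This paper addresses the shortcomings of Large Language Models (LLMs) in following complex instructions. Instructions generated by current methods often do not match real requirements or lack scalability and diversity, and existing methods such as back-translation fail to exploit the rich content and structure of large web corpora. The key solution is a new Automatic Iterative Refinement (AIR) framework that generates and progressively refines instructions in two stages: first, it generates an initial instruction from a document; second, guided by an LLM acting as judge, it iteratively improves the instruction by comparing the model's output with the document to incorporate valuable constraints. Finally, the paper builds the AIR-10K dataset of 10,000 complex instructions and shows that instructions generated this way significantly improve a model's ability to follow complex instructions, outperforming existing instruction generation methods.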

Link: https://arxiv.org/abs/2502.17787
Authors: Wei Liu,Yancheng He,Hui Huang,Chengwei Hu,Jiaheng Liu,Shilong Li,Wenbo Su,Bo Zheng
Institutions: Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: The first three authors contributed equally, 20 pages

Abstract:With the development of large language models, their ability to follow simple instructions has significantly improved. However, adhering to complex instructions remains a major challenge. Current approaches to generating complex instructions are often irrelevant to the current instruction requirements or suffer from limited scalability and diversity. Moreover, methods such as back-translation, while effective for simple instruction generation, fail to leverage the rich contents and structures in large web corpora. In this paper, we propose a novel automatic iterative refinement framework to generate complex instructions with constraints, which not only better reflects the requirements of real scenarios but also significantly enhances LLMs’ ability to follow complex instructions. The AIR framework consists of two stages: (1) generate an initial instruction from a document; (2) iteratively refine instructions with LLM-as-judge guidance by comparing the model’s output with the document to incorporate valuable constraints. Finally, we construct the AIR-10K dataset with 10K complex instructions and demonstrate that instructions generated with our approach significantly improve the model’s ability to follow complex instructions, outperforming existing methods for instruction generation.
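
The two-stage loop described in the abstract can be outlined as a control-flow sketch. This is only an illustration under assumptions: `draft_instruction`, `respond`, and `judge` are hypothetical callbacks standing in for LLM calls, and the stopping rule and the way constraints are appended are choices made for this example, not details from the paper.

```python
def iterative_refine(document, draft_instruction, respond, judge, max_rounds=3):
    """Stage 1: draft an instruction from a document.
    Stage 2: repeatedly compare the model's output with the document and
    append any constraint the judge proposes, until none is proposed.

    judge(document, output) returns a constraint string, or None when the
    output already reflects the document (an assumed convention).
    """
    instruction = draft_instruction(document)
    constraints = []
    for _ in range(max_rounds):
        output = respond(instruction)
        new_constraint = judge(document, output)
        if new_constraint is None:
            break
        constraints.append(new_constraint)
        instruction = f"{instruction} {new_constraint}"
    return instruction, constraints
```

Bounding the loop with `max_rounds` keeps the cost predictable even when the judge keeps finding differences between the output and the source document.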

[NLP-80] Exploring the Potential of Large Language Models for Estimating the Reading Comprehension Question Difficulty

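[Quick read]: This paper addresses the challenges of assessing reading comprehension question difficulty with traditional methods such as linguistic analysis and Item Response Theory (IRT), which require extensive human annotation and large-scale testing and therefore scale poorly. The key solution is to use Large Language Models (LLMs), specifically OpenAI's GPT-4o and o1, to automate difficulty estimation for reading comprehension questions. The results show that these models can not only answer comprehension questions effectively but also, to some extent, classify questions by IRT-defined difficulty level, although they differ in sensitivity to extreme item characteristics. This suggests that LLMs can serve as a scalable method for automated difficulty assessment, particularly in dynamic interactions between learners and Adaptive Instructional Systems (AIS), bridging traditional psychometric techniques and modern AIS and enabling more adaptive and personalized educational assessment.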

Link: https://arxiv.org/abs/2502.17785
Authors: Yoshee Jain,John Hollander,Amber He,Sunny Tang,Liang Zhang,John Sabatini
Institutions: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes: 13 pages, 2 figures

Abstract:Reading comprehension is key to individual success, yet the assessment of question difficulty remains challenging due to the extensive human annotation and large-scale testing required by traditional methods such as linguistic analysis and Item Response Theory (IRT). While these robust approaches provide valuable insights, their scalability is limited. There is potential for Large Language Models (LLMs) to automate question difficulty estimation; however, this area remains underexplored. Our study investigates the effectiveness of LLMs, specifically OpenAI’s GPT-4o and o1, in estimating the difficulty of reading comprehension questions using the Study Aid and Reading Assessment (SARA) dataset. We evaluated both the accuracy of the models in answering comprehension questions and their ability to classify difficulty levels as defined by IRT. The results indicate that, while the models yield difficulty estimates that align meaningfully with derived IRT parameters, there are notable differences in their sensitivity to extreme item characteristics. These findings suggest that LLMs can serve as a scalable method for automated difficulty assessment, particularly in dynamic interactions between learners and Adaptive Instructional Systems (AIS), bridging the gap between traditional psychometric techniques and modern AIS for reading comprehension and paving the way for more adaptive and personalized educational assessments.
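
For readers unfamiliar with IRT, the notion of item difficulty can be made concrete with the two-parameter logistic (2PL) model. The abstract does not say which IRT variant the study uses, so treating it as 2PL is an assumption of this sketch, as are the function and parameter names.

```python
import math


def irt_2pl_probability(ability, difficulty, discrimination=1.0):
    """Probability that a learner answers an item correctly under the 2PL model.

    P = 1 / (1 + exp(-a * (theta - b))), where theta is learner ability,
    b is item difficulty, and a is item discrimination.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
```

A learner whose ability equals the item difficulty has a 50% chance of answering correctly, and raising the difficulty lowers that chance; this is the scale against which model-produced difficulty estimates would be compared.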

[NLP-81] MuCoS: Efficient Drug-Target Prediction through Multi-Context-Aware Sampling

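[Quick read]: This paper addresses the limitations of traditional drug-target interaction prediction methods (such as ComplEx-SE, TransE, and DistMult) in handling unseen relationships and negative triplets, which limit their effectiveness in drug discovery. The key solution is Multi-Context-Aware Sampling (MuCoS). MuCoS reduces computational complexity by prioritizing high-density neighbors and captures informative structural patterns. These optimized neighborhood representations are combined with BERT to produce contextualized embeddings for accurately predicting missing relationships or tail entities. MuCoS avoids negative triplet sampling, reducing computation while improving prediction on unseen entities and relations.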

Link: https://arxiv.org/abs/2502.17784
Authors: Haji Gul,Abdul Gani Haji Naim,Ajaz A. Bhat
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes:

Abstract:Drug-target interactions are critical for understanding biological processes and advancing drug discovery. However, traditional methods such as ComplEx-SE, TransE, and DistMult struggle with unseen relationships and negative triplets, which limits their effectiveness in drug-target prediction. To address these challenges, we propose Multi-Context-Aware Sampling (MuCoS), an efficient and positively accurate method for drug-target prediction. MuCoS reduces computational complexity by prioritizing neighbors of higher density to capture informative structural patterns. These optimized neighborhood representations are integrated with BERT, enabling contextualized embeddings for accurate prediction of missing relationships or tail entities. MuCoS avoids the need for negative triplet sampling, reducing computation while improving performance over unseen entities and relations. Experiments on the KEGG50k biomedical dataset show that MuCoS improved over existing models by 13% on MRR, 7% on Hits@1, 4% on Hits@3, and 18% on Hits@10 for the general relationship, and by 6% on MRR, 1% on Hits@1, 3% on Hits@3, and 12% on Hits@10 for prediction of drug-target relationship.
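
The MRR and Hits@k figures reported above are standard link-prediction metrics. A minimal sketch of how they are usually computed follows; it assumes each test triple has already been reduced to the 1-based rank of the correct entity, and the function name is invented for the example.

```python
def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """Compute Mean Reciprocal Rank and Hits@k from 1-based ranks.

    ranks[i] is the position of the correct answer for test case i.
    Hits@k is the fraction of cases whose correct answer is in the top k.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return mrr, hits
```

MRR rewards placing the correct entity near the top of the list, while Hits@k only asks whether it appears within the first k candidates, which is why the two can improve by different amounts.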

[NLP-82] Tip of the Tongue Query Elicitation for Simulated Evaluation

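[Quick read]: This paper addresses the poor performance of existing search systems in Tip-of-the-tongue (TOT) search scenarios. The key solution is two methods for eliciting TOT queries: generating synthetic queries with Large Language Models (LLMs), and collecting natural queries from human participants placed in a TOT state through visual stimuli. These methods reduce reliance on data from Community Question-Answering (CQA) websites and extend coverage to underrepresented domains such as Landmark and Person.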

Link: https://arxiv.org/abs/2502.17776
Authors: Yifan He,To Eun Kim,Fernando Diaz,Jaime Arguello,Bhaskar Mitra
Institutions: Carnegie Mellon University; UNC Chapel Hill; Microsoft Research
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes:

Abstract:Tip-of-the-tongue (TOT) search occurs when a user struggles to recall a specific identifier, such as a document title. While common, existing search systems often fail to effectively support TOT scenarios. Research on TOT retrieval is further constrained by the challenge of collecting queries, as current approaches rely heavily on community question-answering (CQA) websites, leading to labor-intensive evaluation and domain bias. To overcome these limitations, we introduce two methods for eliciting TOT queries - leveraging large language models (LLMs) and human participants - to facilitate simulated evaluations of TOT retrieval systems. Our LLM-based TOT user simulator generates synthetic TOT queries at scale, achieving high correlations with how CQA-based TOT queries rank TOT retrieval systems when tested in the Movie domain. Additionally, these synthetic queries exhibit high linguistic similarity to CQA-derived queries. For human-elicited queries, we developed an interface that uses visual stimuli to place participants in a TOT state, enabling the collection of natural queries. In the Movie domain, system rank correlation and linguistic similarity analyses confirm that human-elicited queries are both effective and closely resemble CQA-based queries. These approaches reduce reliance on CQA-based data collection while expanding coverage to underrepresented domains, such as Landmark and Person. LLM-elicited queries for the Movie, Landmark, and Person domains have been released as test queries in the TREC 2024 TOT track, with human-elicited queries scheduled for inclusion in the TREC 2025 TOT track. Additionally, we provide source code for synthetic query generation and the human query collection interface, along with curated visual stimuli used for eliciting TOT queries.
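
The abstract above validates synthetic queries by checking whether they rank retrieval systems in the same order as CQA-based queries. The sketch below shows one way such a system rank correlation could be computed. The abstract does not name the statistic, so Kendall's tau (without tie correction) is an assumption here, and the scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same systems."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1   # both query sets order this pair of systems the same way
        elif sign < 0:
            discordant += 1   # the two query sets disagree on this pair
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical effectiveness scores of four retrieval systems under each query set.
cqa_scores = [0.42, 0.35, 0.28, 0.10]
synthetic_scores = [0.40, 0.30, 0.33, 0.08]

print(kendall_tau(cqa_scores, synthetic_scores))
```

A value near 1 means the two query sets agree on which systems are better. In this toy example the sets disagree on one of the six system pairs.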

[NLP-83] FoREST: Frame of Reference Evaluation in Spatial Reasoning Tasks

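Quick read: This paper addresses the insufficient understanding of Frame of Reference (FoR) by large language models (LLMs) in spatial reasoning tasks. The key solution is the FoREST benchmark together with Spatial-Guided prompting, which improves the ability of LLMs to extract essential spatial concepts and thereby improves their performance on spatial reasoning tasks.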

Link: https://arxiv.org/abs/2502.17775
Authors: Tanawan Premsri, Parisa Kordjamshidi
Affiliations: Michigan State University
Categories: Computation and Language (cs.CL)
Comments: 9 pages

Abstract:Spatial reasoning is a fundamental aspect of human intelligence. One key concept in spatial cognition is the Frame of Reference (FoR), which identifies the perspective of spatial expressions. Despite its significance, FoR has received limited attention in AI models that need spatial intelligence. There is a lack of dedicated benchmarks and in-depth evaluation of large language models (LLMs) in this area. To address this issue, we introduce the Frame of Reference Evaluation in Spatial Reasoning Tasks (FoREST) benchmark, designed to assess FoR comprehension in LLMs. We evaluate LLMs on answering questions that require FoR comprehension and layout generation in text-to-image models using FoREST. Our results reveal a notable performance gap across different FoR classes in various LLMs, affecting their ability to generate accurate layouts for text-to-image generation. This highlights critical shortcomings in FoR comprehension. To improve FoR understanding, we propose Spatial-Guided prompting, which improves LLMs' ability to extract essential spatial concepts. Our proposed method improves overall performance across spatial reasoning tasks.

[NLP-84] LLM Inference Acceleration via Efficient Operation Fusion

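Quick read: This paper addresses the collective-communication bottleneck caused by normalization operations (such as Softmax and Layernorm) in the Transformer architecture, which slows Transformer inference by approximately 20%. The key solution exploits the fact that linear layers and non-linear normalization operations can run in parallel on different hardware engines: by deferring normalization until after the linear layer is computed, the collective scaling factors needed for normalization can be computed concurrently with the matrix multiplication, which hides the communication overhead, significantly improves hardware utilization, and reduces overall latency.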

Link: https://arxiv.org/abs/2502.17728
Authors: Mahsa Salmani, Ilya Soloveychik
Affiliations: d-Matrix
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments:

Abstract:The rapid development of the Transformer-based Large Language Models (LLMs) in recent years has been closely linked to their ever-growing and already enormous sizes. Many LLMs contain hundreds of billions of parameters and require dedicated hardware resources for training and inference. One of the key challenges inherent to the Transformer architecture is the requirement to support numerous non-linear transformations that involve normalization. For instance, each decoder block typically contains at least one Softmax operation and two Layernorms. The computation of the corresponding normalization scaling factors becomes a major bottleneck as it requires spatial collective operations. In other words, when it comes to the computation of denominators for Softmax and Layernorm, all vector elements must be aggregated into a single location, requiring significant communication. These collective operations slow down inference on Transformers by approximately 20%, defeating the whole purpose of distributed in-memory compute. In this work, we propose an extremely efficient technique that can completely hide the overhead caused by such collective operations. Note that each Softmax and Layernorm operation is typically followed by a linear layer. Since non-linear and linear operations are performed on different hardware engines, they can be easily parallelized once the algebra allows such commutation. By leveraging the inherent properties of linear operations, we can defer the normalization of the preceding Softmax and Layernorm until after the linear layer is computed. Now we can compute the collective scaling factors concurrently with the matrix multiplication and completely hide the latency of the former behind the latter. Such parallelization preserves the numerical accuracy while significantly improving the hardware utilization and reducing the overall latency.
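
The algebra behind deferring normalization can be checked on a toy Softmax followed by a linear layer. Because the Softmax denominator is a scalar, dividing by it commutes with the matrix multiplication. The sketch below is a pure-Python illustration under that observation only: it does not model hardware engines, it does not cover the Layernorm case, and the function names and values are hypothetical.

```python
import math


def softmax_then_linear(x, W):
    """Baseline: normalize first, then apply the linear layer y = softmax(x) @ W."""
    m = max(x)                              # subtract the max for numerical stability
    e = [math.exp(v - m) for v in x]
    denom = sum(e)                          # reduction over all vector elements
    p = [v / denom for v in e]
    return [sum(p[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def linear_then_normalize(x, W):
    """Deferred: multiply the unnormalized exponentials by W, divide afterwards."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    # The matrix product and the denominator both depend only on e, so they
    # do not depend on each other and could in principle be computed concurrently.
    unnormalized = [sum(e[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]
    denom = sum(e)
    return [v / denom for v in unnormalized]


x = [0.5, -1.0, 2.0]
W = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]

print(softmax_then_linear(x, W))
print(linear_then_normalize(x, W))
```

The two functions return the same values up to floating-point rounding, which is the property that lets the reduction be overlapped with the matrix multiplication.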

[NLP-85] Spontaneous Giving and Calculated Greed in Language Models

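Quick read: This paper investigates how reasoning capabilities affect model behavior in social dilemmas, with a particular focus on the behavior of generative AI in social games. The core of the study is a comparison of non-reasoning and reasoning models across six economic games on cooperation and punishment, which shows that reasoning models tend to reduce cooperation and norm enforcement and prioritize individual rationality. The paper notes that the behavior of reasoning models parallels the human tendency of “spontaneous giving and calculated greed,” so that groups with more reasoning models exhibit less cooperation and lower gains through repeated interactions. The paper therefore stresses the need for AI architectures that combine social intelligence with reasoning capabilities, so that AI supports rather than disrupts human cooperative intuition.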

Link: https://arxiv.org/abs/2502.17720
Authors: Yuxuan Li, Hirokazu Shirado
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models, when trained with reinforcement learning, demonstrate advanced problem-solving capabilities through reasoning techniques like chain of thoughts and reflection. However, it is unclear how these reasoning capabilities extend to social intelligence. In this study, we investigate how reasoning influences model outcomes in social dilemmas. First, we examine the effects of chain-of-thought and reflection techniques in a public goods game. We then extend our analysis to six economic games on cooperation and punishment, comparing off-the-shelf non-reasoning and reasoning models. We find that reasoning models reduce cooperation and norm enforcement, prioritizing individual rationality. Consequently, groups with more reasoning models exhibit less cooperation and lower gains through repeated interactions. These behaviors parallel human tendencies of “spontaneous giving and calculated greed.” Our results suggest the need for AI architectures that incorporate social intelligence alongside reasoning capabilities to ensure that AI supports, rather than disrupts, human cooperative intuition.
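
The tension between cooperation and individual rationality in a public goods game can be seen in a small payoff calculation. The sketch below uses the standard textbook form of the game. The endowment and multiplier are hypothetical values, since the abstract does not state the parameters used in the experiments.

```python
def public_goods_payoffs(contributions, endowment=100, multiplier=2.0):
    """Payoff per player: what they kept plus an equal share of the multiplied pot."""
    pot = multiplier * sum(contributions)
    share = pot / len(contributions)
    return [endowment - c + share for c in contributions]


# Four players: everyone cooperates, one player free-rides, nobody contributes.
print(public_goods_payoffs([100, 100, 100, 100]))
print(public_goods_payoffs([100, 100, 100, 0]))
print(public_goods_payoffs([0, 0, 0, 0]))
```

With these values the free-rider earns more than the cooperators in the same group, yet the group as a whole earns the most when everyone contributes. That gap is what makes the game a social dilemma.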

[NLP-86] Knowledge Distillation with Training Wheels

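Quick read: This paper addresses the use of knowledge distillation in generative language models and proposes a more general knowledge distillation framework. The key idea is that the student model not only learns from the teacher model during training but also learns to ask the teacher for help at test time, following specified rules. By formulating knowledge distillation as an entropy-regularized value optimization problem and adopting Path Consistency Learning, the authors derive a new algorithm that combines on-policy and off-policy demonstrations. Constrained reinforcement learning then extends this so that the student can use the teacher as a test-time reference within given constraints, prioritizing where to seek the teacher's help. This mirrors a human learner, who must master the material and also judge the relative difficulty of its parts to decide when to ask for help.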

Link: https://arxiv.org/abs/2502.17717
Authors: Guanlin Liu, Anand Ramachandran, Tanmay Gangwani, Yan Fu, Abhinav Sethy
Affiliations: Amazon Alexa AI
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Knowledge distillation is used, in generative language modeling, to train a smaller student model using the help of a larger teacher model, resulting in improved capabilities for the student model. In this paper, we formulate a more general framework for knowledge distillation where the student learns from the teacher during training, and also learns to ask for the teacher’s help at test-time following rules specifying test-time restrictions. Towards this, we first formulate knowledge distillation as an entropy-regularized value optimization problem. Adopting Path Consistency Learning to solve this, leads to a new knowledge distillation algorithm using on-policy and off-policy demonstrations. We extend this using constrained reinforcement learning to a framework that incorporates the use of the teacher model as a test-time reference, within constraints. In this situation, akin to a human learner, the model needs to learn not only the learning material, but also the relative difficulty of different sections to prioritize for seeking teacher help. We examine the efficacy of our method through experiments in translation and summarization tasks, observing trends in accuracy and teacher use, noting that our approach unlocks operating points not available to the popular Speculative Decoding approach.

[NLP-87] Bridging Information Gaps with Comprehensive Answers: Improving the Diversity and Informativeness of Follow-Up Questions ACL2025

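Quick read: This paper addresses the difficulty existing conversational systems have in reaching human-level performance when dynamically generating contextual follow-up questions to elicit new information. The key solution is a method that uses a hypothetical “comprehensive answer” generated by a large language model (LLM) to locate unanswered information and then generates diverse and informative questions. The method is used to augment an existing follow-up questions dataset, and experiments show that language models fine-tuned on the augmented dataset generate follow-up questions of significantly higher quality and diversity.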

Link: https://arxiv.org/abs/2502.17715
Authors: Zhe Liu, Taekyu Kang, Haoyu Wang, Seyed Hossein Alavi, Vered Shwartz
Affiliations: University of British Columbia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 8 pages, 2 figures, submitted to ACL 2025

Abstract:Effective conversational systems are expected to dynamically generate contextual follow-up questions to elicit new information while maintaining the conversation flow. While humans excel at asking diverse and informative questions by intuitively assessing both obtained and missing information, existing models often fall short of human performance on this task. To mitigate this, we propose a method that generates diverse and informative questions based on targeting unanswered information using a hypothetical LLM-generated “comprehensive answer”. Our method is applied to augment an existing follow-up questions dataset. The experimental results demonstrate that language models fine-tuned on the augmented datasets produce follow-up questions of significantly higher quality and diversity. This promising approach could be effectively adopted to future work to augment information-seeking dialogues for reducing ambiguities and improving the accuracy of LLM answers.

[NLP-88] Mind the Gesture: Evaluating AI Sensitivity to Culturally Offensive Non-Verbal Gestures

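Quick read: This paper addresses the cultural offenses that AI systems may inadvertently cause in global applications. The key contribution is a dataset called Multi-Cultural Set of Inappropriate Gestures and Nonverbal Signs (MC-SIGNS), which contains 288 gesture-country pairs covering 25 gestures and 85 countries, annotated for offensiveness, cultural significance, and contextual factors. Systematic evaluation with MC-SIGNS shows that text-to-image (T2I) systems exhibit strong US-centric biases, large language models (LLMs) tend to over-flag gestures as offensive, and vision-language models (VLMs) default to US-based interpretations when handling universal concepts, which can lead to culturally inappropriate suggestions. These findings highlight the urgent need for culturally aware AI safety mechanisms to ensure equitable global deployment of AI technologies.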

Link: https://arxiv.org/abs/2502.17710
Authors: Akhila Yerukola, Saadia Gabriel, Nanyun Peng, Maarten Sap
Affiliations: Carnegie Mellon University; University of California, Los Angeles
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 40 pages, 49 figures

Abstract:Gestures are an integral part of non-verbal communication, with meanings that vary across cultures, and misinterpretations that can have serious social and diplomatic consequences. As AI systems become more integrated into global applications, ensuring they do not inadvertently perpetuate cultural offenses is critical. To this end, we introduce Multi-Cultural Set of Inappropriate Gestures and Nonverbal Signs (MC-SIGNS), a dataset of 288 gesture-country pairs annotated for offensiveness, cultural significance, and contextual factors across 25 gestures and 85 countries. Through systematic evaluation using MC-SIGNS, we uncover critical limitations: text-to-image (T2I) systems exhibit strong US-centric biases, performing better at detecting offensive gestures in US contexts than in non-US ones; large language models (LLMs) tend to over-flag gestures as offensive; and vision-language models (VLMs) default to US-based interpretations when responding to universal concepts like wishing someone luck, frequently suggesting culturally inappropriate gestures. These findings highlight the urgent need for culturally-aware AI safety mechanisms to ensure equitable global deployment of AI technologies.

[NLP-89] Contrastive Visual Data Augmentation

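Quick read: This paper addresses the difficulty large multimodal models (LMMs) have in recognizing novel or rare concepts, which stems from their reliance on pre-trained knowledge and their limited ability to capture subtle visual features. The key solution is a Contrastive visual Data Augmentation (CoDA) strategy. CoDA extracts contrastive textual and visual features between a target concept and the concepts it is misrecognized as, and uses multimodal generative models to produce targeted synthetic data, improving the model's ability to understand and reason about subtle visual features.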

Link: https://arxiv.org/abs/2502.17709
Authors: Yu Zhou, Bingxuan Li, Mohan Tang, Xiaomeng Jin, Te-Lin Wu, Kuan-Hao Huang, Heng Ji, Kai-Wei Chang, Nanyun Peng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Abstract:Large multimodal models (LMMs) often struggle to recognize novel concepts, as they rely on pre-trained knowledge and have limited ability to capture subtle visual details. Domain-specific knowledge gaps in training also make them prone to confusing visually similar, commonly misrepresented, or low-resource concepts. To help LMMs better align nuanced visual features with language, improving their ability to recognize and reason about novel or rare concepts, we propose a Contrastive visual Data Augmentation (CoDA) strategy. CoDA extracts key contrastive textual and visual features of target concepts against the known concepts they are misrecognized as, and then uses multimodal generative models to produce targeted synthetic data. Automatic filtering of extracted features and augmented images is implemented to guarantee their quality, as verified by human annotators. We show the effectiveness and efficiency of CoDA on low-resource concept and diverse scene recognition datasets including INaturalist and SUN. We additionally collect NovelSpecies, a benchmark dataset consisting of newly discovered animal species that are guaranteed to be unseen by LMMs. LLaVA-1.6 1-shot updating results on these three datasets show CoDA significantly improves SOTA visual data augmentation strategies by 12.3% (NovelSpecies), 5.1% (SUN), and 6.0% (iNat) absolute gains in accuracy.

[NLP-90] From Perceptions to Decisions: Wildfire Evacuation Decision Prediction with Behavioral Theory-informed LLMs

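Quick read: This paper addresses the complexity and diversity of wildfire evacuation decision prediction, where traditional statistical methods fail to capture the behavioral logic of different individuals. The key solution is the FLARE framework, a Large Language Model (LLM)-based approach that integrates behavioral theories and models to streamline Chain-of-Thought (CoT) reasoning and combines it with a memory-based Reinforcement Learning (RL) module to predict and explain evacuation decisions more accurately. The method addresses the limitations of existing LLMs for evacuation behavior prediction, such as limited survey data, mismatch with behavioral theory, conflicting individual preferences, implicit and complex mental states, and the intractable mapping between mental states and behavior. Experiments show that FLARE improves performance by an average of 20.47% over traditional theory-informed behavioral models, with strong cross-event generalizability.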

Link: https://arxiv.org/abs/2502.17701
Authors: Ruxiao Chen, Chenguang Wang, Yuran Sun, Xilei Zhao, Susu Xu
Affiliations: Johns Hopkins University; University of Florida
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 24 pages, 9 figures

Abstract:Evacuation decision prediction is critical for efficient and effective wildfire response by helping emergency management anticipate traffic congestion and bottlenecks, allocate resources, and minimize negative impacts. Traditional statistical methods for evacuation decision prediction fail to capture the complex and diverse behavioral logic of different individuals. In this work, for the first time, we introduce FLARE, short for facilitating LLM for advanced reasoning on wildfire evacuation decision prediction, a Large Language Model (LLM)-based framework that integrates behavioral theories and models to streamline the Chain-of-Thought (CoT) reasoning and subsequently integrate with memory-based Reinforcement Learning (RL) module to provide accurate evacuation decision prediction and understanding. Our proposed method addresses the limitations of using existing LLMs for evacuation behavioral predictions, such as limited survey data, mismatching with behavioral theory, conflicting individual preferences, implicit and complex mental states, and intractable mental state-behavior mapping. Experiments on three post-wildfire survey datasets show an average of 20.47% performance improvement over traditional theory-informed behavioral models, with strong cross-event generalizability. Our complete code is publicly available at this https URL

[NLP-91] Semantics drives analogical change in Germanic strong verb paradigms: a phylogenetic study

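Quick read: This paper examines the presence and retention of irregular patterns of morphological alternation (such as the so-called ABB pattern) in Germanic languages in which narrative past tense is expressed by the past participle. The central aim is to test whether this pattern is more likely to occur where an important binary semantic opposition (such as present versus past) needs to be marked, using data from 107 cognate verbs across 14 archaic and contemporary Germanic languages and a novel hierarchical phylogenetic model. The results show that the ABB pattern is more likely to be preserved where it marks an important semantic distinction, but there is less evidence that it is extended to verbs with other patterns. This bears on whether the cross-linguistic distribution of irregularity is due primarily to the preservation of irregular patterns or to an active drive toward irregularization, and favors the former.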

Link: https://arxiv.org/abs/2502.17670
Authors: Alexandru Craevschi, Sarah Babinski, Chundra Cathcart
Affiliations: Institute for the Interdisciplinary Study of Language Evolution; University of Zurich
Categories: Computation and Language (cs.CL)
Comments:

Abstract:A large body of research on morphological paradigms makes the prediction that irregular morphological patterns of allomorphy are more likely to emerge and persist when they serve to mark important functional distinctions. More specifically, it has been observed that in some Germanic languages in which narrative past tense is expressed by the past participle, there is a greater affinity for stem allomorphy shared by preterite forms and past participles to the exclusion of present forms (the so-called ABB pattern), as it serves to enhance marking of the binary semantic opposition between present and past. Using data from 107 cognate verbs attested across 14 archaic and contemporary Germanic languages and a novel hierarchical phylogenetic model, we show that there is a greater long-term preference for this alternation pattern in situations where narrative past tense has been extended to the past participle, confirming this hypothesis. We further elucidate the mechanisms underlying this association, demonstrating that this association holds because verbs with the ABB pattern are more likely to preserve it in situations where it marks an important binary semantic opposition; however, there is less evidence that the ABB pattern is extended to verbs with different patterns under the same circumstances. These results bear on debate as to whether the distribution of irregularity we observe cross-linguistically is due primarily to (1) the preservation of irregular patterns or (2) an active drive toward irregularization in certain contexts, and are more in line with the first hypothesis.

[NLP-92] Towards Human Cognition: Visual Context Guides Syntactic Priming in Fusion-Encoded Models

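Quick read: This paper addresses the lack of multimodal structural priming datasets and of suitable evaluation methods. It introduces PRISMATIC, the first multimodal structural priming dataset, and a reference-free evaluation metric that measures priming effects without predefined target sentences. Using this metric to compare dual-encoder and fusion-encoder models, the authors find that only fusion-encoded models show a robust positive correlation between priming effects and visual similarity, which is more consistent with human psycholinguistic patterns.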

Link: https://arxiv.org/abs/2502.17669
Authors: Bushi Xiao, Michael Bennie, Jayetri Bardhan, Daisy Zhe Wang
Affiliations: University of Florida
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 9 figures

Abstract:We introduced PRISMATIC, the first multimodal structural priming dataset, and proposed a reference-free evaluation metric that assesses priming effects without predefined target sentences. Using this metric, we constructed and tested models with different multimodal encoding architectures (dual encoder and fusion encoder) to investigate their structural preservation capabilities. Our findings show that models with both encoding methods demonstrate comparable syntactic priming effects. However, only fusion-encoded models exhibit robust positive correlations between priming effects and visual similarity, suggesting a cognitive process more aligned with human psycholinguistic patterns. This work provides new insights into evaluating and understanding how syntactic information is processed in multimodal language models.

[NLP-93] Towards Typologically Aware Rescoring to Mitigate Unfaithfulness in Lower-Resource Languages ISCA

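Quick read: This paper addresses the tendency of multilingual large language models (LLMs) to produce unfaithful output more often in resource-constrained languages. The key solution is to use computationally light auxiliary models to rescore the outputs of larger architectures. Experiments show that monolingual 4-layer BERT models, pretrained from scratch on less than 700 MB of data and not fine-tuned, identify faithful summaries with a mean accuracy of 88.33% in three genetically unrelated languages that differ in morphological complexity (Vietnamese, Polish, and Georgian), demonstrating the feasibility of the approach.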

Link: https://arxiv.org/abs/2502.17664
Authors: Tsan Tsai Chan, Xin Tong, Thi Thu Uyen Hoang, Barbare Tepnadze, Wojciech Stempniak
Affiliations: Saarland University
Categories: Computation and Language (cs.CL)
Comments: ISCA/ITG Workshop on Diversity in Large Speech and Language Models

Abstract:Multilingual large language models (LLMs) are known to more frequently generate non-faithful output in resource-constrained languages (Guerreiro et al., 2023 - arXiv:2303.16104), potentially because these typologically diverse languages are underrepresented in their training data. To mitigate unfaithfulness in such settings, we propose using computationally light auxiliary models to rescore the outputs of larger architectures. As proof of the feasibility of such an approach, we show that monolingual 4-layer BERT models pretrained from scratch on less than 700 MB of data without fine-tuning are able to identify faithful summaries with a mean accuracy of 88.33% in three genetically unrelated languages that differ in their morphological complexity - Vietnamese, Polish and Georgian. The same hyperparameter combination moreover generalises well to three other tasks, suggesting applications for rescoring beyond improving faithfulness. In order to inform typologically aware model selection, we also investigate how morphological complexity interacts with regularisation, model depth and training objectives, ultimately demonstrating that morphologically complex languages are more likely to benefit from dropout, while across languages downstream performance is enhanced most by shallow architectures as well as training using the standard BERT objectives.

[NLP-94] METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling

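Quick read: This paper addresses the complex multimodal reasoning required for automatic chart generation, which aims to produce high-quality charts that satisfy specified visual properties (such as text, layout, color, and type). The key solution is METAL, a framework that decomposes chart generation into iterative collaboration among specialized agents; it improves accuracy by 5.2% over the current best result and exhibits test-time scaling. In addition, separating different modalities during METAL's critique process strengthens the self-correction capability of vision-language models (VLMs) in multimodal settings.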

Link: https://arxiv.org/abs/2502.17651
Authors: Bingxuan Li, Yiwei Wang, Jiuxiang Gu, Kai-Wei Chang, Nanyun Peng
Affiliations: University of California, Los Angeles (UCLA); University of California, Merced; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Chart generation aims to generate code to produce charts satisfying the desired visual properties, e.g., texts, layout, color, and type. It has great potential to empower the automatic professional report generation in financial analysis, research presentation, education, and healthcare. In this work, we build a vision-language model (VLM) based multi-agent framework for effective automatic chart generation. Generating high-quality charts requires both strong visual design skills and precise coding capabilities that embed the desired visual properties into code. Such a complex multi-modal reasoning process is difficult for direct prompting of VLMs. To resolve these challenges, we propose METAL, a multi-agent framework that decomposes the task of chart generation into the iterative collaboration among specialized agents. METAL achieves 5.2% improvement in accuracy over the current best result in the chart generation task. The METAL framework exhibits the phenomenon of test-time scaling: its performance increases monotonically as the logarithmic computational budget grows from 512 to 8192 tokens. In addition, we find that separating different modalities during the critique process of METAL boosts the self-correction capability of VLMs in the multimodal context.

[NLP-95] Evaluating the Effect of Retrieval Augmentation on Social Biases

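Quick read: This paper studies social biases in text generated by Retrieval Augmented Generation (RAG) systems across three languages (English, Japanese, and Chinese) and four social bias types (gender, race, age, and religion). Using the Bias Question Answering (BBQ) benchmark datasets, it evaluates social biases in RAG responses drawn from document collections with varying levels of stereotypical bias, and finds that biases in the document collection are often amplified in the generated responses even when the generating model itself exhibits a low level of bias. This raises concerns about using RAG to inject novel facts into natural language generation systems and calls for careful evaluation of potential social biases before real-world deployment.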

Link: https://arxiv.org/abs/2502.17611
Authors: Tianhui Zhang, Yi Zhou, Danushka Bollegala
Affiliations: University of Liverpool; Cardiff University
Categories: Computation and Language (cs.CL)
Comments: 18 pages

Abstract:Retrieval Augmented Generation (RAG) has gained popularity as a method for conveniently incorporating novel facts that were not seen during the pre-training stage in Large Language Model (LLM)-based Natural Language Generation (NLG) systems. However, LLMs are known to encode significant levels of unfair social biases. The modulation of these biases by RAG in NLG systems is not well understood. In this paper, we systematically study the relationship between the different components of a RAG system and the social biases presented in the text generated across three languages (i.e. English, Japanese and Chinese) and four social bias types (i.e. gender, race, age and religion). Specifically, using the Bias Question Answering (BBQ) benchmark datasets, we evaluate the social biases in RAG responses from document collections with varying levels of stereotypical biases, employing multiple LLMs used as generators. We find that the biases in document collections are often amplified in the generated responses, even when the generating LLM exhibits a low-level of bias. Our findings raise concerns about the use of RAG as a technique for injecting novel facts into NLG systems and call for careful evaluation of potential social biases in RAG applications before their real-world deployment.

[NLP-96] Synthetic Text Generation for Training Large Language Models via Gradient Matching

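Quick read: This paper addresses the inability of existing synthetic text generation methods to produce human-readable text without compromising the privacy of real data or sacrificing performance. The key solution uses the Alternating Direction Method of Multipliers (ADMM) to iteratively optimize the embeddings of synthetic examples so that they match the gradient of the target training or validation data, and maps them to sequences of text tokens with low perplexity. This guarantees the convergence and performance of large language models (LLMs) during fine-tuning on the target task.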

Link: https://arxiv.org/abs/2502.17607
Authors: Dang Nguyen, Zeman Li, Mohammadhossein Bateni, Vahab Mirrokni, Meisam Razaviyayn, Baharan Mirzasoleiman
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 15 pages, 5 figures, 4 tables

Abstract:Synthetic data has the potential to improve the performance, training efficiency, and privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristics and cannot generate human-readable text without compromising the privacy of real data or provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that guarantees the convergence and performance of LLMs during fine-tuning on a target task. To do so, we leverage Alternating Direction Method of Multipliers (ADMM) that iteratively optimizes the embeddings of synthetic examples to match the gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text can guarantee convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data. Experiments on various classification tasks confirm the effectiveness of our proposed approach.

[NLP-97] PICASO: Permutation-Invariant Context Composition with State Space Models ICLR2025

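Quick read: This paper addresses the significant increase in online computation cost incurred when relevant contextual knowledge is supplied to large language models at inference time to improve generation quality. The key solution uses State Space Models (SSMs) to map multiple contexts in a database onto fixed-dimensional states and composes these states into one through a simple mathematical relation, efficiently approximating the effect of concatenating the textual contexts. Because the ordering of contexts is often uninformative, the paper also enforces permutation-invariance of the state representation by efficiently averaging over all possible context orderings.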

Link: https://arxiv.org/abs/2502.17605
Authors: Tian Yu Liu, Alessandro Achille, Matthew Trager, Aditya Golatkar, Luca Zancato, Stefano Soatto
Affiliations: AWS AI Labs; UCLA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published in The Thirteenth International Conference on Learning Representations, ICLR 2025

Abstract:Providing Large Language Models with relevant contextual knowledge at inference time has been shown to greatly improve the quality of their generations. This is often achieved by prepending informative passages of text, or ‘contexts’, retrieved from external knowledge bases to their input. However, processing additional contexts online incurs significant computation costs that scale with their length. State Space Models (SSMs) offer a promising solution by allowing a database of contexts to be mapped onto fixed-dimensional states from which to start the generation. A key challenge arises when attempting to leverage information present across multiple contexts, since there is no straightforward way to condition generation on multiple independent states in existing SSMs. To address this, we leverage a simple mathematical relation derived from SSM dynamics to compose multiple states into one that efficiently approximates the effect of concatenating textual contexts. Since the temporal ordering of contexts can often be uninformative, we enforce permutation-invariance by efficiently averaging states obtained via our composition algorithm across all possible context orderings. We evaluate our resulting method on WikiText and MSMARCO in both zero-shot and fine-tuned settings, and show that we can match the strongest performing baseline while enjoying on average 5.4x speedup.
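
The idea of composing precomputed states can be illustrated on a toy scalar linear recurrence h <- a*h + b. If each context is summarized by the product of its decays and by its final state from a zero start, the state of a concatenation follows from the summaries alone. This is only a sketch of that general idea: the recurrence, the function names, and the brute-force enumeration of orderings are assumptions for illustration, not the paper's algorithm, which the abstract describes as an efficient approximation for real SSMs.

```python
from itertools import permutations


def encode(context, h0=0.0):
    """Run the toy recurrence h <- a*h + b over a context of (a, b) steps.

    Returns (A, h): the product of all decays and the final state.
    """
    A, h = 1.0, h0
    for a, b in context:
        h = a * h + b
        A *= a
    return A, h


def compose(summaries):
    """State of the concatenated contexts, computed from per-context summaries."""
    h = 0.0
    for A, s in summaries:
        h = A * h + s   # decay the earlier state, then add the context's own state
    return h


def permutation_invariant_state(summaries):
    """Average the composed state over every ordering (factorial cost, toy only)."""
    orders = list(permutations(summaries))
    return sum(compose(order) for order in orders) / len(orders)


c1 = [(0.9, 1.0), (0.5, 2.0)]
c2 = [(0.8, -1.0)]
c3 = [(0.7, 0.5), (0.6, 0.25)]

print(compose([encode(c1), encode(c2)]), encode(c1 + c2)[1])
print(permutation_invariant_state([encode(c1), encode(c2), encode(c3)]))
```

In this linear toy the composition is exact, so the first line prints the same value twice. The averaged state does not change when the contexts are supplied in a different order.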

[NLP-98] MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference NAACL2025

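Quick read: This paper addresses the high resource consumption and low inference efficiency of long-context multimodal large language models (MLLMs), whose multimodal key-value (KV) caches grow substantially with input length. The key solution is MEDA, a dynamic layer-wise KV cache allocation method based on cross-modal attention entropy: it uses cross-modal attention entropy to determine the KV cache size of each layer and applies a KV pair selection scheme and a merging strategy to achieve efficient multimodal long-context inference. MEDA achieves up to 72% KV cache memory reduction and 2.82 times faster decoding while maintaining or improving performance on various multimodal tasks in long-context settings.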

Link: https://arxiv.org/abs/2502.17599
Authors: Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang
Affiliations: The Ohio State University; Imperial College London
Categories: Computation and Language (cs.CL)
Comments: NAACL 2025 Main

Abstract:Long-context Multimodal Large Language Models (MLLMs) that incorporate long text-image and text-video modalities demand substantial resources as their multimodal Key-Value (KV) caches grow with increasing input lengths, challenging inference efficiency. Existing methods for KV cache compression, in both text-only and multimodal LLMs, have neglected attention density variations across layers, thus often adopting uniform or progressive reduction strategies for layer-wise cache allocation. In this work, we propose MEDA, a dynamic layer-wise KV cache allocation method for efficient multimodal long-context inference. At its core, MEDA utilizes cross-modal attention entropy to determine the KV cache size at each MLLM layer. Given the dynamically allocated KV cache size at each layer, MEDA also employs a KV pair selection scheme to identify which KV pairs to select and a KV pair merging strategy that merges the selected and non-selected ones to preserve information from the entire context. MEDA achieves up to 72% KV cache memory reduction and 2.82 times faster decoding speed, while maintaining or enhancing performance on various multimodal tasks in long-context settings, including multi-images and long-video scenarios. Our code is released at this https URL.
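
Entropy-driven layer-wise allocation can be sketched in a few lines. The abstract does not give the allocation rule, so the sketch below makes two labeled assumptions: it uses plain Shannon entropy of attention rows as a stand-in for the cross-modal attention entropy, and it splits a fixed budget across layers in proportion to that entropy. The function names and attention values are hypothetical.

```python
import math


def attention_entropy(rows):
    """Mean Shannon entropy (in nats) of a layer's attention rows; each row sums to 1."""
    total = 0.0
    for row in rows:
        total += -sum(p * math.log(p) for p in row if p > 0)
    return total / len(rows)


def allocate_kv_budget(layer_entropies, total_budget):
    """Split a total KV-entry budget across layers in proportion to entropy (assumed rule)."""
    z = sum(layer_entropies)
    sizes = [int(total_budget * h / z) for h in layer_entropies]
    sizes[-1] += total_budget - sum(sizes)   # give rounding leftovers to the last layer
    return sizes


layers = [
    [[0.25, 0.25, 0.25, 0.25]],   # diffuse attention: high entropy
    [[0.97, 0.01, 0.01, 0.01]],   # peaked attention: low entropy
]
entropies = [attention_entropy(layer) for layer in layers]
budget = allocate_kv_budget(entropies, total_budget=100)

print(entropies)
print(budget)
```

Under these assumptions the layer with diffuse attention keeps most of the budget, on the intuition that many of its KV pairs matter, while the layer with peaked attention can be compressed more aggressively.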

[NLP-99] Hallucination Detection in LLMs Using Spectral Features of Attention Maps

[Quick read]: This paper addresses the tendency of large language models (LLMs) to hallucinate in practical applications. The key solution is a method called LapEigvals, which feeds the top-k eigenvalues of the Laplacian matrix derived from attention maps into hallucination detection probes. The study shows that the method achieves state-of-the-art hallucination detection performance among attention-based methods, and extensive ablation studies demonstrate its robustness and generalization.

Link: https://arxiv.org/abs/2502.17598
Authors: Jakub Binkowski,Denis Janiak,Albert Sawczyn,Bogdan Gabrys,Tomasz Kajdanowicz
Affiliations: Wroclaw University of Science and Technology; University of Technology Sydney
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint, under review

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across various tasks but remain prone to hallucinations. Detecting hallucinations is essential for safety-critical applications, and recent methods leverage attention map properties to this end, though their effectiveness remains limited. In this work, we investigate the spectral features of attention maps by interpreting them as adjacency matrices of graph structures. We propose the LapEigvals method, which utilises the top-$k$ eigenvalues of the Laplacian matrix derived from the attention maps as an input to hallucination detection probes. Empirical evaluations demonstrate that our approach achieves state-of-the-art hallucination detection performance among attention-based methods. Extensive ablation studies further highlight the robustness and generalisation of LapEigvals, paving the way for future advancements in the hallucination detection domain.
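The spectral feature described above can be sketched in a few lines. This is a minimal illustration, assuming a causal (lower-triangular) attention map and a degree defined as the total attention each token receives; both the degree definition and the function name are assumptions, not the paper's stated construction. With a lower-triangular adjacency matrix A and diagonal degree matrix D, the Laplacian L = D - A is also lower-triangular, so its eigenvalues are simply its diagonal entries.

```python
def laplacian_topk_eigvals(attn, k):
    """Top-k eigenvalues of L = D - A for a causal (lower-triangular) attention map."""
    n = len(attn)
    # Assumption: the degree of token j is the total attention it receives (column sum).
    degree = [sum(attn[i][j] for i in range(n)) for j in range(n)]
    # L is lower-triangular when A is, so its eigenvalues are its diagonal entries.
    eigvals = [degree[j] - attn[j][j] for j in range(n)]
    return sorted(eigvals, reverse=True)[:k]
```

In a probing setup, the resulting vectors (one per head and layer) would be concatenated and passed to a lightweight classifier; that step is omitted here.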

[NLP-100] Proactive Privacy Amnesia for Large Language Models: Safeguarding PII with Negligible Impact on Model Utility ICLR’25 ICLR2025

[Quick read]: This paper addresses the risk that large language models (LLMs) leak personally identifiable information (PII) under malicious attacks. Existing methods struggle to protect privacy while preserving model utility. The key solution is a proactive forgetting mechanism, Proactive Privacy Amnesia (PPA). PPA identifies and forgets the key memories most closely associated with PII, then implants suitable substitute memories to maintain the model's functionality, protecting privacy while preserving utility.

Link: https://arxiv.org/abs/2502.17591
Authors: Martin Kuo,Jingyang Zhang,Jianyi Zhang,Minxue Tang,Louis DiValentin,Aolin Ding,Jingwei Sun,William Chen,Amin Hass,Tianlong Chen,Yiran Chen,Hai Li
Affiliations: Center for Computational Evolutionary Intelligence, Duke University; Center for Advanced AI, Accenture; Accenture; University of North Carolina at Chapel Hill; Cary Academy
Categories: Computation and Language (cs.CL)
Comments: ICLR’25 Poster. Project page and code are available at this https URL

Abstract:With the rise of large language models (LLMs), increasing research has recognized their risk of leaking personally identifiable information (PII) under malicious attacks. Although efforts have been made to protect PII in LLMs, existing methods struggle to balance privacy protection with maintaining model utility. In this paper, inspired by studies of amnesia in cognitive science, we propose a novel approach, Proactive Privacy Amnesia (PPA), to safeguard PII in LLMs while preserving their utility. This mechanism works by actively identifying and forgetting key memories most closely associated with PII in sequences, followed by a memory implanting using suitable substitute memories to maintain the LLM’s functionality. We conduct evaluations across multiple models to protect common PII, such as phone numbers and physical addresses, against prevalent PII-targeted attacks, demonstrating the superiority of our method compared with other existing defensive techniques. The results show that our PPA method completely eliminates the risk of phone number exposure by 100% and significantly reduces the risk of physical address exposure by 9.8% - 87.6%, all while maintaining comparable model utility performance.

[NLP-101] End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models

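[Quick read]: This paper addresses two difficulties in automated chart summarization: matching generated summaries to chart data and reasoning about complex chart patterns. The key solution is an end-to-end Visual Chain-of-Thought (V-CoT) approach that directly trains a Large Vision-Language Model (LVLM) to process chart images and generate textual summaries, without an explicit chart parsing module. The method introduces a visual chain-of-thought mechanism through instruction fine-tuning, implicitly guiding the LVLM to perform visual reasoning steps during summary generation.

Link: https://arxiv.org/abs/2502.17589
Authors: Raymond Choi,Frank Burns,Chase Lawrence
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: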

Abstract:Automated chart summarization is crucial for enhancing data accessibility and enabling efficient information extraction from visual data. While recent advances in visual-language models (VLMs) have demonstrated promise, existing methods often suffer from limitations in matching the generated summary to the chart data and in reasoning about complex chart patterns. This paper introduces End-to-End Visual Chain-of-Thought (V-CoT) for chart summarization, a novel approach optimized for Large Vision-Language Models (LVLMs). Our method directly trains an LVLM to process chart images and generate textual summaries in an end-to-end fashion, eliminating the need for explicit chart parsing modules. We incorporate a visual Chain-of-Thought mechanism through instruction fine-tuning, implicitly guiding the LVLM to perform visual reasoning steps during summary generation. Evaluated on the large-scale Chart-Sum-QA dataset, our V-CoT method significantly outperforms state-of-the-art baselines across a range of automatic metrics, including BLEU, BLEURT, CIDEr, and CS, and demonstrates superior matching degree and reasoning correctness in human evaluations. Ablation studies and detailed analyses further validate the effectiveness and robustness of our proposed approach, establishing a new benchmark for end-to-end chart summarization.

[NLP-102] Towards Conditioning Clinical Text Generation for User Control

[Quick read]: This paper addresses the challenges of deploying natural language generation systems in clinical settings, in particular the hallucinations and factual inconsistencies that Large Language Models (LLMs) still exhibit and that necessitate human oversight. The key solution is automated dataset augmentation using LLMs as human proxies, which conditions LLMs for clinician control without increasing cognitive workload. The approach achieves a 9% relative improvement, rising to as much as 34% with dataset augmentation, which demonstrates its effectiveness.

Link: https://arxiv.org/abs/2502.17571
Authors: Osman Alperen Koraş,Rabi Bahnan,Jens Kleesiek,Amin Dada
Affiliations: Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen, Germany; Cancer Research Center Cologne Essen (CCCE), West German Cancer Center Essen University Hospital Essen (AöR), Essen, Germany; German Cancer Consortium (DKTK, Partner site Essen), Heidelberg, Germany; Department of Physics, TU Dortmund, Dortmund, Germany
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Deploying natural language generation systems in clinical settings remains challenging despite advances in Large Language Models (LLMs), which continue to exhibit hallucinations and factual inconsistencies, necessitating human oversight. This paper explores automated dataset augmentation using LLMs as human proxies to condition LLMs for clinician control without increasing cognitive workload. On the BioNLP ACL’24 Discharge Me! Shared Task, we achieve new state-of-the-art results with simpler methods than prior submissions through more efficient training, yielding a 9% relative improvement without augmented training and up to 34% with dataset augmentation. Preliminary human evaluation further supports the effectiveness of our approach, highlighting the potential of augmenting clinical text generation for control to enhance relevance, accuracy, and factual consistency.

[NLP-103] Training a Generally Curious Agent

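[Quick read]: This paper addresses the inefficient exploration of intelligent systems in scenarios that require strategic information gathering. The key solution is PAPRIKA, which fine-tunes language models on synthetic interaction data from different tasks so that, on a new task, they can explore and adapt their behavior in context based on environment feedback, without additional gradient updates. This lets models transfer their learned decision-making capabilities to entirely unseen tasks.

Link: https://arxiv.org/abs/2502.17543
Authors: Fahim Tajwar,Yiding Jiang,Abitha Thankaraj,Sumaita Sadia Rahman,J Zico Kolter,Jeff Schneider,Ruslan Salakhutdinov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: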

Abstract:Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present PAPRIKA, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, PAPRIKA teaches models to explore and adapt their behavior on a new task based on environment feedback in-context without more gradient updates. Experimental results show that models fine-tuned with PAPRIKA can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. Unlike traditional training, our approach’s primary bottleneck lies in sampling useful interaction data instead of model updates. To improve sample efficiency, we propose a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interactions with the external world.

[NLP-104] Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction

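[Quick read]: This paper addresses the problem that simple feature extraction methods such as prompting often fail to produce accurate and versatile descriptions for diverse datasets and lack control over feature granularity and scale. The key solution is a domain-agnostic dataset featurization method that optimizes the selection of informative binary features by evaluating how well a large language model (LLM) can reconstruct the original data from them. It gives precise control over the number of features and produces compact, descriptive representations comparable to human expert labeling.

Link: https://arxiv.org/abs/2502.17541
Authors: Michal Bravansky,Vaclav Kubon,Suhas Hariharan,Robert Kirk
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: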

Abstract:Interpreting data is central to modern research. Large language models (LLMs) show promise in providing such natural language interpretations of data, yet simple feature extraction methods such as prompting often fail to produce accurate and versatile descriptions for diverse datasets and lack control over granularity and scale. To address these limitations, we propose a domain-agnostic method for dataset featurization that provides precise control over the number of features extracted while maintaining compact and descriptive representations comparable to human expert labeling. Our method optimizes the selection of informative binary features by evaluating the ability of an LLM to reconstruct the original data using those features. We demonstrate its effectiveness in dataset modeling tasks and through two case studies: (1) Constructing a feature representation of jailbreak tactics that compactly captures both the effectiveness and diversity of a larger set of human-crafted attacks; and (2) automating the discovery of features that align with human preferences, achieving accuracy and robustness comparable to expert-crafted features. Moreover, we show that the pipeline scales effectively, improving as additional features are sampled, making it suitable for large and diverse datasets.

[NLP-105] PosterSum: A Multimodal Benchmark for Scientific Poster Summarization

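[Quick read]: This paper addresses the challenge of generating accurate and concise textual summaries from multimodal documents with visually complex content, such as scientific posters. It introduces PosterSum, a new benchmark dataset containing 16,305 scientific conference posters paired with their abstracts. The study finds that current state-of-the-art multimodal large language models (MLLMs) struggle to accurately interpret and summarize scientific posters. To address this, the paper proposes Segment & Summarize, a hierarchical method that outperforms existing MLLMs on automated metrics with a 3.14% gain in ROUGE-L, providing a starting point for future research on poster summarization.

Link: https://arxiv.org/abs/2502.17540
Authors: Rohit Saxena,Pasquale Minervini,Frank Keller
Affiliations: Institute for Language, Cognition and Computation; School of Informatics, University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: This paper includes a dataset of research posters with abstracts. We provide two cited examples (arXiv:2211.11880 and arXiv:2210.07571) to illustrate reference summaries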

Abstract:Generating accurate and concise textual summaries from multimodal documents is challenging, especially when dealing with visually complex content like scientific posters. We introduce PosterSum, a novel benchmark to advance the development of vision-language models that can understand and summarize scientific posters into research paper abstracts. Our dataset contains 16,305 conference posters paired with their corresponding abstracts as summaries. Each poster is provided in image format and presents diverse visual understanding challenges, such as complex layouts, dense text regions, tables, and figures. We benchmark state-of-the-art Multimodal Large Language Models (MLLMs) on PosterSum and demonstrate that they struggle to accurately interpret and summarize scientific posters. We propose Segment & Summarize, a hierarchical method that outperforms current MLLMs on automated metrics, achieving a 3.14% gain in ROUGE-L. This will serve as a starting point for future research on poster summarization.

[NLP-106] Policy Learning with a Natural Language Action Space: A Causal Approach

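[Quick read]: This paper addresses multi-stage decision-making in natural language action spaces, in particular delayed-reward settings where outcomes are observed only after a sequence of actions. The key solution uses Q-learning to estimate Dynamic Treatment Regimes (DTR) with a single model, enabling data-efficient policy learning via gradient ascent on language embeddings. The method also includes a decoding strategy that translates the optimized embeddings back into coherent natural language. This key technical contribution lets the model learn better policies from limited training data.

Link: https://arxiv.org/abs/2502.17538
Authors: Bohan Zhang,Yixin Wang,Paramveer S. Dhillon
Affiliations: University of Michigan
Categories: Computation and Language (cs.CL)
Comments: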

Abstract:This paper introduces a novel causal framework for multi-stage decision-making in natural language action spaces where outcomes are only observed after a sequence of actions. While recent approaches like Proximal Policy Optimization (PPO) can handle such delayed-reward settings in high-dimensional action spaces, they typically require multiple models (policy, value, and reward) and substantial training data. Our approach employs Q-learning to estimate Dynamic Treatment Regimes (DTR) through a single model, enabling data-efficient policy learning via gradient ascent on language embeddings. A key technical contribution of our approach is a decoding strategy that translates optimized embeddings back into coherent natural language. We evaluate our approach on mental health intervention, hate speech countering, and sentiment transfer tasks, demonstrating significant improvements over competitive baselines across multiple metrics. Notably, our method achieves superior transfer strength while maintaining content preservation and fluency, as validated through human evaluation. Our work provides a practical foundation for learning optimal policies in complex language tasks where training data is limited.

[NLP-107] The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve?

[Quick read]: This paper aims to improve the performance of large language models (LLMs) through retrieval-augmented generation, multi-step reasoning, and external tool use, while reducing computational and storage costs. Its key contribution is the lottery LLM hypothesis: for a given task, there exists a smaller model that, with the assistance of multi-step reasoning and external tools, can match the performance of the original large model. Based on current progress in LLM research, the paper highlights the essential capabilities that the lottery LLM and KV cache compression must possess, which existing methods often overlook.

Link: https://arxiv.org/abs/2502.17535
Authors: Zhenheng Tang,Xiang Liu,Qian Wang,Peijie Dong,Bingsheng He,Xiaowen Chu,Bo Li
Affiliations: CSE, The Hong Kong University of Science and Technology; DSA, The Hong Kong University of Science and Technology (Guangzhou); National University of Singapore
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Comments:

Abstract:Motivated by reducing the computational and storage costs of LLMs, model compression and KV cache compression have attracted much attention from researchers. However, current methods predominantly emphasize maintaining the performance of compressed LLMs, as measured by perplexity or simple accuracy on tasks of common sense knowledge QA and basic arithmetic reasoning. In this blog, we present a brief review of recent advancements in LLMs related to retrieval-augmented generation, multi-step reasoning, external tools, and computational expressivity, all of which substantially enhance LLM performance. Then, we propose a lottery LLM hypothesis suggesting that for a given LLM and task, there exists a smaller lottery LLM capable of producing the same performance as the original LLM with the assistance of multi-step reasoning and external tools. Based on the review of current progress in LLMs, we discuss and summarize the essential capabilities that the lottery LLM and KV cache compression must possess, which are currently overlooked in existing methods.

[NLP-108] Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation

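[Quick read]: This paper addresses the data contamination risk that arises because large language models (LLMs) rely on vast Internet-derived training corpora. To mitigate this risk, benchmarking research has shifted from static to dynamic benchmarks. The key contribution is a series of optimal design principles for dynamic benchmarking, together with an analysis of the limitations of existing dynamic benchmarks. In doing so, the paper fills the gap left by the lack of standardized evaluation criteria and gives clear guidance for reducing data contamination risk.

Link: https://arxiv.org/abs/2502.17521
Authors: Simin Chen,Yiming Chen,Zexin Li,Yifan Jiang,Zhongwei Wan,Yixin He,Dezhi Ran,Tianle Gu,Haizhou Li,Tao Xie,Baishakhi Ray
Affiliations: Columbia University; National University of Singapore; University of California, Riverside; University of Southern California; The Ohio State University; Peking University; Tsinghua University; The Chinese University of Hong Kong, Shenzhen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: GitHub link: this https URL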

Abstract:Data contamination has received increasing attention in the era of large language models (LLMs) due to their reliance on vast Internet-derived training corpora. To mitigate the risk of potential data contamination, LLM benchmarking has undergone a transformation from static to dynamic benchmarking. In this work, we conduct an in-depth analysis of existing static to dynamic benchmarking methods aimed at reducing data contamination risks. We first examine methods that enhance static benchmarks and identify their inherent limitations. We then highlight a critical gap: the lack of standardized criteria for evaluating dynamic benchmarks. Based on this observation, we propose a series of optimal design principles for dynamic benchmarking and analyze the limitations of existing dynamic benchmarks. This survey provides a concise yet comprehensive overview of recent advancements in data contamination research, offering valuable insights and a clear guide for future research efforts. We maintain a GitHub repository to continuously collect both static and dynamic benchmarking methods for LLMs. The repository can be found at this link.

[NLP-109] SAE-V: Interpreting Multimodal Models for Enhanced Alignment

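[Quick read]: This paper addresses the reduced interpretability and weaker alignment stability of multimodal large language models (MLLMs) caused by the more complex semantic space that results from integrating the image modality, in particular the cross-modal inconsistencies, hallucinations, and biased outputs triggered by low-quality data. The key solution is SAE-V, a new mechanistic interpretability framework that extends the Sparse Autoencoder (SAE) paradigm to multimodal settings. By identifying and analyzing interpretable features and their corresponding data, SAE-V enables fine-grained interpretation of model behavior and data quality, and provides an intrinsic data filtering mechanism that enhances model alignment without requiring additional models.

Link: https://arxiv.org/abs/2502.17514
Authors: Hantao Lou,Changye Li,Jiaming Ji,Yaodong Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: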

Abstract:With the integration of image modality, the semantic space of multimodal large language models (MLLMs) is more complex than text-only models, making their interpretability more challenging and their alignment less stable, particularly susceptible to low-quality data, which can lead to inconsistencies between modalities, hallucinations, and biased outputs. As a result, developing interpretability methods for MLLMs is crucial for improving alignment quality and efficiency. In text-only LLMs, Sparse Autoencoders (SAEs) have gained attention for their ability to interpret latent representations. However, extending SAEs to multimodal settings presents new challenges due to modality fusion and the difficulty of isolating cross-modal representations. To address these challenges, we introduce SAE-V, a mechanistic interpretability framework that extends the SAE paradigm to MLLMs. By identifying and analyzing interpretable features along with their corresponding data, SAE-V enables fine-grained interpretation of both model behavior and data quality, facilitating a deeper understanding of cross-modal interactions and alignment dynamics. Moreover, by utilizing cross-modal feature weighting, SAE-V provides an intrinsic data filtering mechanism to enhance model alignment without requiring additional models. Specifically, when applied to the alignment process of MLLMs, SAE-V-based data filtering methods could achieve more than 110% performance with less than 50% data. Our results highlight SAE-V’s ability to enhance interpretability and alignment in MLLMs, providing insights into their internal mechanisms.

[NLP-110] Recurrent Knowledge Identification and Fusion for Language Model Continual Learning

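[Quick read]: This paper addresses the balance between knowledge transfer and forgetting in continual learning (CL), in particular the challenges of deploying large language models (LLMs) in dynamic real-world environments. Current methods rely on static importance estimates, which makes effective knowledge transfer difficult during sequential training. The key solution is a new framework, Recurrent-KIF, which dynamically estimates parameter importance distributions to enhance knowledge transfer. An inner loop rapidly adapts to new tasks and identifies important parameters, while an outer loop globally manages the fusion of new and historical knowledge through redundant knowledge pruning and key knowledge merging. Iterating these inner and outer loops over multiple rounds of fusion lets Recurrent-KIF exploit intermediate training information and adapt its fusion strategy to the evolving importance distributions.

Link: https://arxiv.org/abs/2502.17510
Authors: Yujie Feng,Xujia Wang,Zexin Lu,Shenghong Fu,Guangyuan Shi,Yongxin Xu,Yasha Wang,Philip S. Yu,Xu Chu,Xiao-Ming Wu
Affiliations: The Hong Kong Polytechnic University; Tsinghua University; Peking University; Huawei Hong Kong Research Center; University of Illinois at Chicago
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: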

Abstract:Continual learning (CL) is crucial for deploying large language models (LLMs) in dynamic real-world environments without costly retraining. While recent model ensemble and model merging methods guided by parameter importance have gained popularity, they often struggle to balance knowledge transfer and forgetting, mainly due to the reliance on static importance estimates during sequential training. In this paper, we present Recurrent-KIF, a novel CL framework for Recurrent Knowledge Identification and Fusion, which enables dynamic estimation of parameter importance distributions to enhance knowledge transfer. Inspired by human continual learning, Recurrent-KIF employs an inner loop that rapidly adapts to new tasks while identifying important parameters, coupled with an outer loop that globally manages the fusion of new and historical knowledge through redundant knowledge pruning and key knowledge merging. These inner-outer loops iteratively perform multiple rounds of fusion, allowing Recurrent-KIF to leverage intermediate training information and adaptively adjust fusion strategies based on evolving importance distributions. Extensive experiments on two CL benchmarks with various model sizes (from 770M to 13B) demonstrate that Recurrent-KIF effectively mitigates catastrophic forgetting and enhances knowledge transfer.

[NLP-111] Improving Value-based Process Verifier via Structural Prior Injection

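[Quick read]: This paper addresses the noise and errors introduced when state values are estimated by Monte Carlo sampling in large language model (LLM) reasoning scenarios. The key solution injects a structural prior into the value representation and recasts the scalar value as the expectation of a predefined categorical distribution, so that noise and errors can be represented from a distributional perspective. Specifically, by treating the Monte Carlo result as a single sample from the prior ground-truth binomial distribution, the paper quantifies the sampling error as the mismatch between the posterior estimated distribution and the ground-truth distribution, and optimizes it via distribution selection optimization.

Link: https://arxiv.org/abs/2502.17498
Authors: Zetian Sun,Dongfang Li,Baotian Hu,Jun Yu,Min Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint. Under review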

Abstract:In the Large Language Model (LLM) reasoning scenario, people often estimate state value via Monte Carlo sampling. Though Monte Carlo estimation is an elegant method with less inductive bias, noise and errors are inevitably introduced due to the limited sampling. To handle the problem, we inject the structural prior into the value representation and transform the scalar value into the expectation of a pre-defined categorical distribution, representing the noise and errors from a distribution perspective. Specifically, by treating the result of Monte Carlo sampling as a single sample from the prior ground-truth Binomial distribution, we quantify the sampling error as the mismatch between posterior estimated distribution and ground-truth distribution, which is thus optimized via distribution selection optimization. We test the performance of value-based process verifiers on Best-of-N task and Beam search task. Compared with the scalar value representation, we show that reasonable structural prior injection induced by different objective functions or optimization methods can improve the performance of value-based process verifiers by about 1 to 2 points at little-to-no cost. We also show that under different structural priors, the verifiers’ performances vary greatly despite having the same optimal solution, indicating the importance of reasonable structural prior injection.
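The distributional view in the abstract can be made concrete with a small sketch. It assumes `n` Monte Carlo rollouts per state, so the observed success count follows a Binomial(n, p) distribution, and it represents the value as a categorical distribution over the bins 0, 1/n, ..., 1. The bin layout and function names are assumptions for illustration, not the paper's exact parameterization.

```python
from math import comb


def binomial_pmf(n, p):
    """P(k successes out of n rollouts) for k = 0..n, given true success rate p."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]


def expected_value(probs):
    """Scalar value recovered as the expectation of a categorical over bins k/n."""
    n = len(probs) - 1
    return sum(prob * k / n for k, prob in enumerate(probs))
```

For any `p`, `expected_value(binomial_pmf(n, p))` returns `p` up to rounding, which shows why a verifier can predict a distribution over bins and still report a scalar value. A single Monte Carlo estimate `k/n` is one draw from that distribution, so its spread is the sampling noise the abstract refers to.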

[NLP-112] ELLEN: Extremely Lightly Supervised Learning For Efficient Named Entity Recognition COLING2024 LREC

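[Quick read]: This paper addresses semi-supervised named entity recognition (NER) under extremely light supervision, where the only supervision is a lexicon containing 10 examples per class. The key solution is ELLEN, a simple, fully modular, neuro-symbolic method that blends fine-tuned language models with linguistic rules. These rules include the “One Sense Per Discourse” principle, using a masked language model as an unsupervised NER, leveraging part-of-speech tags to identify and eliminate unlabeled entities that are false negatives, and other intuitions about classifier confidence scores in local and global context. With this approach, ELLEN achieves very strong performance on the CoNLL-2003 dataset and outperforms most existing, more complex semi-supervised NER methods under the same supervision settings.

Link: https://arxiv.org/abs/2403.17385
Authors: Haris Riaz,Razvan-Gabriel Dumitru,Mihai Surdeanu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to LREC-COLING 2024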

Abstract:In this work, we revisit the problem of semi-supervised named entity recognition (NER) focusing on extremely light supervision, consisting of a lexicon containing only 10 examples per class. We introduce ELLEN, a simple, fully modular, neuro-symbolic method that blends fine-tuned language models with linguistic rules. These rules include insights such as “One Sense Per Discourse”, using a Masked Language Model as an unsupervised NER, leveraging part-of-speech tags to identify and eliminate unlabeled entities as false negatives, and other intuitions about classifier confidence scores in local and global context. ELLEN achieves very strong performance on the CoNLL-2003 dataset when using the minimal supervision from the lexicon above. It also outperforms most existing (and considerably more complex) semi-supervised NER methods under the same supervision settings commonly used in the literature (i.e., 5% of the training data). Further, we evaluate our CoNLL-2003 model in a zero-shot scenario on WNUT-17 where we find that it outperforms GPT-3.5 and achieves comparable performance to GPT-4. In a zero-shot setting, ELLEN also achieves over 75% of the performance of a strong, fully supervised model trained on gold data. Our code is available at: this https URL.
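As an illustration of the “One Sense Per Discourse” rule mentioned above, the sketch below relabels every mention of the same surface form within one document to the label it receives most often. Majority voting and case-insensitive matching are assumptions made here for simplicity; the abstract does not specify how ELLEN applies the rule, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def one_sense_per_discourse(mentions):
    """Relabel mentions so each surface form carries one label per document.

    `mentions` is a list of (surface_text, predicted_label) pairs taken from
    a single document. Each surface form (compared case-insensitively) is
    assigned the label it was predicted with most often.
    """
    votes = defaultdict(Counter)
    for text, label in mentions:
        votes[text.lower()][label] += 1
    return [(text, votes[text.lower()].most_common(1)[0][0])
            for text, _ in mentions]
```

For example, if "Washington" is tagged as a person twice and as a location once in the same article, all three mentions become person mentions. A confidence-weighted vote would be a natural variant when classifier scores are available.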

[NLP-113] Polarized Online Discourse on Abortion: Frames and Hostile Expressions among Liberals and Conservatives

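[Quick read]: This paper addresses how to systematically analyze the long-term evolution of political divides over abortion in US public discourse, and how those divides manifest around key events before and after the overturn of Roe v. Wade. The key to the approach is analyzing more than 3.5 million abortion-related tweets from 1.1 million users over one year, using state-of-the-art transformer-based classifiers to identify expressions of hostility and extracting five prominent frames of abortion discussion. With these data, the study shows that liberals and conservatives mirrored each other's use of hostile expressions and discussed distinct frames in distinct ways, reflecting differing perspectives on abortion. It also finds that frames favored by one side tend to provoke hostile reactions from the other, which signals disrespect and derogation and may further preclude understanding and exacerbate polarization.

Link: https://arxiv.org/abs/2311.16831
Authors: Ashwin Rao,Rong-Ching Chang,Qiankun Zhong,Kristina Lerman,Magdalena Wojcieszak
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: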

Abstract:Abortion has been one of the most divisive issues in the United States. Yet, missing is comprehensive longitudinal evidence on how political divides on abortion are reflected in public discourse over time, on a national scale, and in response to key events before and after the overturn of Roe v Wade. We analyze a corpus of over 3.5M tweets related to abortion over the span of one year (January 2022 to January 2023) from over 1.1M users. We estimate users’ ideology and rely on state-of-the-art transformer-based classifiers to identify expressions of hostility and extract five prominent frames surrounding abortion. We use those data to examine (a) how prevalent were expressions of hostility (i.e., anger, toxic speech, insults, obscenities, and hate speech), (b) what frames liberals and conservatives used to articulate their positions on abortion, and (c) the prevalence of hostile expressions in liberal and conservative discussions of these frames. We show that liberals and conservatives largely mirrored each other’s use of hostile expressions: as liberals used more hostile rhetoric, so did conservatives, especially in response to key events. In addition, the two groups used distinct frames and discussed them in vastly distinct contexts, suggesting that liberals and conservatives have differing perspectives on abortion. Lastly, frames favored by one side provoked hostile reactions from the other: liberals use more hostile expressions when addressing religion, fetal personhood, and exceptions to abortion bans, whereas conservatives use more hostile language when addressing bodily autonomy and women’s health. This signals disrespect and derogation, which may further preclude understanding and exacerbate polarization.

[NLP-114] An Overview of Large Language Models for Statisticians

[Quick read]: This paper addresses issues of large language models (LLMs) in uncertainty quantification, interpretability, fairness, privacy, watermarking, and model adaptation, with the aim of improving their trustworthiness and transparency. Its key idea is to bridge statistics and artificial intelligence to advance both the theoretical foundations and the practical applications of LLMs, and so help address complex societal challenges.

Link: https://arxiv.org/abs/2502.17814
Authors: Wenlong Ji,Weizhe Yuan,Emily Getzen,Kyunghyun Cho,Michael I. Jordan,Song Mei,Jason E Weston,Weijie J. Su,Jing Xu,Linjun Zhang
Affiliations: Stanford University; New York University; Meta FAIR; University of Pennsylvania; UC Berkeley; INRIA; Rutgers University
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence (AI), exhibiting remarkable capabilities across diverse tasks such as text generation, reasoning, and decision-making. While their success has primarily been driven by advances in computational power and deep learning architectures, emerging problems – in areas such as uncertainty quantification, decision-making, causal inference, and distribution shift – require a deeper engagement with the field of statistics. This paper explores potential areas where statisticians can make important contributions to the development of LLMs, particularly those that aim to engender trustworthiness and transparency for human users. Thus, we focus on issues such as uncertainty quantification, interpretability, fairness, privacy, watermarking and model adaptation. We also consider possible roles for LLMs in statistical analysis. By bridging AI and statistics, we aim to foster a deeper collaboration that advances both the theoretical foundations and practical applications of LLMs, ultimately shaping their role in addressing complex societal challenges.

[NLP-115] From Euler to AI: Unifying Formulas for Mathematical Constants

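[Quick read]: This paper investigates whether the many formulas for $\pi$ are interconnected, and proposes a systematic methodology for discovering and proving formula equivalences. The key to the solution is combining modern large language models, large-scale data processing, and novel mathematical algorithms. Analyzing 457,145 arXiv papers, the paper shows that over a third of the validated formulas can be derived from a single mathematical object, establishing connections among formulas by Euler, Gauss, and Lord Brouncker and newer ones discovered algorithmically by the Ramanujan Machine. The approach also extends to other mathematical constants, such as $e$, $\zeta(3)$, and Catalan's constant, demonstrating its broad applicability.

Link: https://arxiv.org/abs/2502.17533
Authors: Tomer Raz,Michael Shalyt,Elyasheev Leibtag,Rotem Kalisch,Yaron Hadad,Ido Kaminer
Affiliations: Unknown
Categories: History and Overview (math.HO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Number Theory (math.NT)
Comments: 50 pages, 6 figures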

Abstract:The constant $\pi$ has fascinated scholars for centuries, inspiring the derivation of countless formulas rooted in profound mathematical insight. This abundance of formulas raises a question: Are they interconnected, and can a unifying structure explain their relationships? We propose a systematic methodology for discovering and proving formula equivalences, leveraging modern large language models, large-scale data processing, and novel mathematical algorithms. Analyzing 457,145 arXiv papers, over a third of the validated formulas for $\pi$ were proven to be derivable from a single mathematical object - including formulas by Euler, Gauss, Lord Brouncker, and newer ones from algorithmic discoveries by the Ramanujan Machine. Our approach extends to other constants, such as $e$, $\zeta(3)$, and Catalan’s constant, proving its broad applicability. This work represents a step toward the automatic unification of mathematical knowledge, laying a foundation for AI-driven discoveries of connections across scientific domains.
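
To make "formula equivalence" concrete, here is a minimal sketch of one classical case: by a result of Euler, the convergents of Lord Brouncker's continued fraction for 4/\pi are exactly the reciprocals of the partial sums of the Leibniz series for \pi/4. This is not the paper's method. The function names are hypothetical and the check only uses exact rational arithmetic from the Python standard library.

```python
from fractions import Fraction
import math


def leibniz_partial_sum(n_terms: int) -> Fraction:
    """Partial sum of 1 - 1/3 + 1/5 - ... (converges to pi/4)."""
    return sum(Fraction((-1) ** k, 2 * k + 1) for k in range(n_terms))


def brouncker_convergent(depth: int) -> Fraction:
    """Truncation of 4/pi = 1 + 1^2/(2 + 3^2/(2 + 5^2/(2 + ...))).

    `depth` is the number of partial numerators (1^2, 3^2, ...) kept.
    """
    tail = Fraction(0)
    # Evaluate the continued fraction from the innermost level outward.
    for k in range(depth, 0, -1):
        tail = Fraction((2 * k - 1) ** 2) / (2 + tail)
    return 1 + tail


if __name__ == "__main__":
    for depth in range(5):
        cf = brouncker_convergent(depth)
        series = leibniz_partial_sum(depth + 1)
        # The two truncations are exact reciprocals of each other.
        print(depth, cf, series, cf * series == 1)
    print("pi estimate:", 4 / float(brouncker_convergent(200)), "vs", math.pi)
```

Both truncations therefore converge at the same (slow) rate, which is what one would expect from two presentations of the same underlying object.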

[NLP-116] Protein Large Language Models: A Comprehensive Survey

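【Quick read】: This paper gives a comprehensive overview of protein-specific large language models (Protein LLMs), covering their architectures, training datasets, evaluation metrics, and diverse applications. Through a systematic analysis of over 100 articles, it proposes a structured taxonomy of state-of-the-art Protein LLMs and examines how they leverage large-scale protein sequence data for improved accuracy, with the aim of advancing protein engineering and biomedical research. The key contribution is the structured taxonomy built from this systematic literature review, which clarifies how Protein LLMs improve prediction accuracy and where they can be applied.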

Link: https://arxiv.org/abs/2502.17504
Authors: Yijia Xiao,Wanjia Zhao,Junkai Zhang,Yiqiao Jin,Han Zhang,Zhicheng Ren,Renliang Sun,Haixin Wang,Guancheng Wan,Pan Lu,Xiao Luo,Yu Zhang,James Zou,Yizhou Sun,Wei Wang
Affiliations: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 24 pages, 4 figures, 5 tables

Abstract:Protein-specific large language models (Protein LLMs) are revolutionizing protein science by enabling more efficient protein structure prediction, function annotation, and design. While existing surveys focus on specific aspects or applications, this work provides the first comprehensive overview of Protein LLMs, covering their architectures, training datasets, evaluation metrics, and diverse applications. Through a systematic analysis of over 100 articles, we propose a structured taxonomy of state-of-the-art Protein LLMs, analyze how they leverage large-scale protein sequence data for improved accuracy, and explore their potential in advancing protein engineering and biomedical research. Additionally, we discuss key challenges and future directions, positioning Protein LLMs as essential tools for scientific discovery in protein science. Resources are maintained at this https URL.

[NLP-117] Brain-to-Text Decoding: A Non-invasive Approach via Typing

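【Quick read】: This paper addresses decoding brain activity with non-invasive methods, with the goal of restoring communication for patients who have lost the ability to speak or move. The key solution is Brain2Qwerty, a new deep learning architecture that decodes sentences from electroencephalography (EEG) or magnetoencephalography (MEG) signals. With MEG it reaches an average character error rate (CER) of 32%, substantially better than the 67% obtained with EEG. For the best participants, the model achieves a CER of 19% and can perfectly decode a variety of sentences outside the training set.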

Link: https://arxiv.org/abs/2502.17480
Authors: Jarod Lévy,Mingfang Zhang,Svetlana Pinet,Jérémy Rapin,Hubert Banville,Stéphane d’Ascoli,Jean-Rémi King
Affiliations: Meta
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 15 pages, 5 figures

Abstract:Modern neuroprostheses can now restore communication in patients who have lost the ability to speak or move. However, these invasive devices entail risks inherent to neurosurgery. Here, we introduce a non-invasive method to decode the production of sentences from brain activity and demonstrate its efficacy in a cohort of 35 healthy volunteers. For this, we present Brain2Qwerty, a new deep learning architecture trained to decode sentences from either electro- (EEG) or magneto-encephalography (MEG), while participants typed briefly memorized sentences on a QWERTY keyboard. With MEG, Brain2Qwerty reaches, on average, a character-error-rate (CER) of 32% and substantially outperforms EEG (CER: 67%). For the best participants, the model achieves a CER of 19%, and can perfectly decode a variety of sentences outside of the training set. While error analyses suggest that decoding depends on motor processes, the analysis of typographical errors suggests that it also involves higher-level cognitive factors. Overall, these results narrow the gap between invasive and non-invasive methods and thus open the path for developing safe brain-computer interfaces for non-communicating patients.
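
The character error rate quoted above is conventionally defined as the Levenshtein edit distance between the decoded sentence and the reference, divided by the reference length. The following is a minimal sketch under that assumption. The function names are hypothetical, and the paper's exact evaluation protocol (normalization, casing, averaging across sentences) is not given in the abstract.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(character_error_rate("the quick brown fox", "the quick brwn fax"))
```

Under this definition a CER of 32% means roughly one character edit for every three reference characters, and the value can exceed 100% when the hypothesis is much longer than the reference.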

[NLP-118] ECG-Expert-QA: A Benchmark for Evaluating Medical Large Language Models in Heart Disease Diagnosis

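【Quick read】: This paper addresses the shortage of datasets for evaluating ECG (electrocardiogram) diagnostic capabilities, and the limited complexity and diversity of those that exist. The key solution is ECG-Expert-QA, a comprehensive multimodal dataset that integrates real clinical data with systematically generated synthetic cases and covers six fundamental diagnostic tasks, from basic rhythm analysis to complex case interpretation. Simulating challenging clinical cases through a rigorous medical-knowledge-guided process increases the availability of annotated diagnostic data and significantly increases the complexity and diversity of clinical presentations.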

Link: https://arxiv.org/abs/2502.17475
Authors: Xu Wang,Jiaju Kang,Puyu Han
Affiliations: FUXI AI Lab; Shandong Jianzhu University; Beijing Normal University; Southern University of Science and Technology
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We present ECG-Expert-QA, a comprehensive multimodal dataset designed for evaluating diagnostic capabilities in ECG interpretation, integrating real clinical data with systematically generated synthetic cases. The dataset encompasses six fundamental diagnostic tasks, comprising 47,211 meticulously curated question-answer pairs that span a spectrum of clinical scenarios, from basic rhythm analysis to complex case interpretation. By simulating challenging clinical cases through a rigorous medical knowledge-guided process, ECG-Expert-QA not only enhances the availability of annotated diagnostic data but also significantly increases the complexity and diversity of clinical presentations, including rare cardiac conditions and temporal progression patterns. This design enables comprehensive evaluation of medical language models across multiple dimensions, including diagnostic accuracy, clinical reasoning, and knowledge integration. To facilitate global research collaboration, ECG-Expert-QA is available in both Chinese and English versions, with rigorous quality control ensuring linguistic and clinical consistency. The dataset’s challenging diagnostic tasks, which include interpretation of complex arrhythmias, identification of subtle ischemic changes, and integration of clinical context, establish it as an effective benchmark for advancing AI-assisted ECG interpretation and pushing the boundaries of current diagnostic models. Our dataset is open-source and available at this https URL.

[NLP-119] Bridging Brain Signals and Language: A Deep Learning Approach to EEG-to-Text Decoding

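【Quick read】: This paper addresses the inability of existing EEG-to-text decoding methods to handle open vocabularies and deep semantic understanding, and their neglect of subject-specific brain variability. The key solution is a new framework that moves beyond conventional closed-vocabulary EEG-to-text decoding by integrating subject-specific models with natural language processing methods. It uses deep representation learning to extract salient EEG features and trains neural networks to generate elaborate sentences that go beyond the content of the original data, producing meaningful and correct text while accounting for individual brain differences.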

Link: https://arxiv.org/abs/2502.17465
Authors: Mostafa El Gedawy,Omnia Nabil,Omar Mamdouh,Mahmoud Nady,Nour Alhuda Adel,Ahmed Fares
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 21 pages, 11 figures, and 6 tables

Abstract:Brain activity translation into human language delivers the capability to revolutionize machine-human interaction while providing communication support to people with speech disability. Electronic decoding reaches a certain level of achievement yet current EEG-to-text decoding methods fail to reach open vocabularies and depth of meaning and individual brain-specific variables. We introduce a special framework which changes conventional closed-vocabulary EEG-to-text decoding approaches by integrating subject-specific learning models with natural language processing methods to resolve detection obstacles. This method applies a deep representation learning approach to extract important EEG features which allow training of neural networks to create elaborate sentences that extend beyond original data content. The ZuCo dataset analysis demonstrates that research findings achieve higher BLEU, ROUGE and BERTScore performance when compared to current methods. The research proves how this framework functions as an effective approach to generate meaningful and correct texts while understanding individual brain variations. The proposed research aims to create a connection between open-vocabulary Text generation systems and human brain signal interpretation for developing efficacious brain-to-text systems. The research produces interdisciplinary effects through innovative assistive technology development and personalized communication systems which extend possibilities for human-computer interaction in various settings.

Computer Vision

[CV-0] K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs

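【Quick read】: This paper addresses how to fuse multiple Low-Rank Adaptation (LoRA) modules without additional training while preserving both the original subject and the style. The key is K-LoRA, a training-free fusion method: in each attention layer it compares the Top-K elements of the LoRAs to be fused and selects the better LoRA for that layer, so that the most representative subject and style features are retained and their contributions are balanced.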

Link: https://arxiv.org/abs/2502.18461
Authors: Ziheng Ouyang,Zhen Li,Qibin Hou
Affiliations: Nankai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent studies have explored combining different LoRAs to jointly generate learned style and content. However, existing methods either fail to effectively preserve both the original subject and style simultaneously or require additional training. In this paper, we argue that the intrinsic properties of LoRA can effectively guide diffusion models in merging learned subject and style. Building on this insight, we propose K-LoRA, a simple yet effective training-free LoRA fusion approach. In each attention layer, K-LoRA compares the Top-K elements in each LoRA to be fused, determining which LoRA to select for optimal fusion. This selection mechanism ensures that the most representative features of both subject and style are retained during the fusion process, effectively balancing their contributions. Experimental results demonstrate that the proposed method effectively integrates the subject and style information learned by the original LoRAs, outperforming state-of-the-art training-based approaches in both qualitative and quantitative results.
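
The per-layer Top-K comparison can be pictured with a small NumPy sketch. It assumes, purely for illustration, that each LoRA contributes a dense weight update for the layer and that "comparing the Top-K elements" means comparing the sums of the K largest absolute entries. The function names and the tie-breaking rule are hypothetical, and any additional scaling the paper may apply is not reproduced here.

```python
import numpy as np


def topk_abs_sum(delta_w: np.ndarray, k: int) -> float:
    """Sum of the k largest absolute entries of a weight update (assumed criterion)."""
    flat = np.abs(delta_w).ravel()
    k = max(1, min(k, flat.size))
    # np.partition moves the k largest values into the last k positions (unordered).
    return float(np.partition(flat, flat.size - k)[flat.size - k:].sum())


def select_lora(subject_delta: np.ndarray, style_delta: np.ndarray, k: int):
    """Pick, for one attention layer, the LoRA update with the larger Top-K mass."""
    if topk_abs_sum(subject_delta, k) >= topk_abs_sum(style_delta, k):
        return "subject", subject_delta
    return "style", style_delta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    subject = rng.normal(scale=0.02, size=(8, 8))  # hypothetical B @ A for the subject LoRA
    style = rng.normal(scale=0.05, size=(8, 8))    # hypothetical B @ A for the style LoRA
    name, _ = select_lora(subject, style, k=4)
    print("selected for this layer:", name)
```

Because the choice is made layer by layer from the weights alone, nothing in this sketch requires gradients or training data, which is the sense in which such a fusion is training-free.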

[CV-1] GHOST 2.0: generative high-fidelity one shot transfer of heads

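【Quick read】: This paper addresses head swapping, a problem related to face swapping. The proposed solution, GHOST 2.0, consists of two purpose-built modules: an enhanced Aligner model for head reenactment, which preserves identity information at multiple scales and is robust to extreme pose variations; and a Blender module, which integrates the reenacted head into the target background by transferring skin color and inpainting mismatched regions. Both modules outperform the baselines on their respective tasks, yielding state-of-the-art head swapping results, and the system handles complex cases such as large differences in hairstyle between source and target.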

Link: https://arxiv.org/abs/2502.18417
Authors: Alexander Groshev(1),Anastasiia Iashchenko(1),Pavel Paramonov(1),Denis Dimitrov(1 and 2),Andrey Kuznetsov(1 and 2)
Affiliations: (1) SberAI; (2) AIRI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While the task of face swapping has recently gained attention in the research community, a related problem of head swapping remains largely unexplored. In addition to skin color transfer, head swap poses extra challenges, such as the need to preserve structural information of the whole head during synthesis and inpaint gaps between swapped head and background. In this paper, we address these concerns with GHOST 2.0, which consists of two problem-specific modules. First, we introduce enhanced Aligner model for head reenactment, which preserves identity information at multiple scales and is robust to extreme pose variations. Secondly, we use a Blender module that seamlessly integrates the reenacted head into the target background by transferring skin color and inpainting mismatched regions. Both modules outperform the baselines on the corresponding tasks, allowing to achieve state of the art results in head swapping. We also tackle complex cases, such as large difference in hair styles of source and target.

[CV-2] MedKAN: An Advanced Kolmogorov-Arnold Network for Medical Image Classification

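【Quick read】: This paper addresses the difficulty that convolutional neural networks (CNNs) and Transformer-based architectures have in capturing intricate texture details and contextual features in medical image classification. The key solution is MedKAN, a new framework built on Kolmogorov-Arnold Networks (KANs) and their convolutional extensions. Its Local Information KAN (LIK) module performs fine-grained feature extraction and its Global Information KAN (GIK) module integrates global context, which together give robust feature modeling and fusion.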

Link: https://arxiv.org/abs/2502.18416
Authors: Zhuoqin Yang,Jiansong Zhang,Xiaoling Luo,Zheng Lu,Linlin Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in deep learning for image classification predominantly rely on convolutional neural networks (CNNs) or Transformer-based architectures. However, these models face notable challenges in medical imaging, particularly in capturing intricate texture details and contextual features. Kolmogorov-Arnold Networks (KANs) represent a novel class of architectures that enhance nonlinear transformation modeling, offering improved representation of complex features. In this work, we present MedKAN, a medical image classification framework built upon KAN and its convolutional extensions. MedKAN features two core modules: the Local Information KAN (LIK) module for fine-grained feature extraction and the Global Information KAN (GIK) module for global context integration. By combining these modules, MedKAN achieves robust feature modeling and fusion. To address diverse computational needs, we introduce three scalable variants–MedKAN-S, MedKAN-B, and MedKAN-L. Experimental results on nine public medical imaging datasets demonstrate that MedKAN achieves superior performance compared to CNN- and Transformer-based models, highlighting its effectiveness and generalizability in medical image analysis.

[CV-3] OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

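【Quick read】: This paper addresses the weak human preference alignment of open-source multi-modal large language models (MLLMs). The key solution is OmniAlign-V, a dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats. The paper also introduces MM-AlignBench, a benchmark for evaluating MLLMs' alignment with human values. Experiments show that fine-tuning MLLMs on OmniAlign-V with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) significantly improves human preference alignment while maintaining or improving performance on standard visual question answering (VQA) benchmarks.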

Link: https://arxiv.org/abs/2502.18411
Authors: Xiangyu Zhao,Shengyuan Ding,Zicheng Zhang,Haian Huang,Maosong Cao,Weiyun Wang,Jiaqi Wang,Xinyu Fang,Wenhai Wang,Guangtao Zhai,Haodong Duan,Hua Yang,Kai Chen
Affiliations: Shanghai Jiaotong University; Shanghai AI Laboratory; Nanjing University; Fudan University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs’ alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs’ alignment with human values. Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or enhancing performance on standard VQA benchmarks, preserving their fundamental capabilities. Our datasets, benchmark, code and checkpoints have been released at this https URL.

[CV-4] EgoSim: An Egocentric Multi-view Simulator and Real Dataset for Body-worn Cameras during Motion and Activity

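【Quick read】: This paper addresses the new perspectives and challenges that arise as optical sensors shrink and cameras are integrated into more devices worn at different locations on the body. These affect established computer vision tasks such as human motion tracking, body pose estimation, and action recognition, particularly for the lower body, which is typically occluded. The key solution is EgoSim, a new simulator that generates realistic egocentric renderings from multiple perspectives and uses real motion capture data to render motion artifacts, which are especially noticeable with arm- or leg-worn cameras. The authors also release MultiEgoView, a dataset of egocentric footage from six body-worn cameras with ground-truth full-body 3D poses, and show that the simulator and dataset substantially aid training for inference on real-world data.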

Link: https://arxiv.org/abs/2502.18373
Authors: Dominik Hollidt,Paul Streli,Jiaxi Jiang,Yasaman Haghighi,Changlin Qian,Xintong Liu,Christian Holz
Affiliations: ETH Zürich
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Research on egocentric tasks in computer vision has mostly focused on head-mounted cameras, such as fisheye cameras or embedded cameras inside immersive headsets. We argue that the increasing miniaturization of optical sensors will lead to the prolific integration of cameras into many more body-worn devices at various locations. This will bring fresh perspectives to established tasks in computer vision and benefit key areas such as human motion tracking, body pose estimation, or action recognition – particularly for the lower body, which is typically occluded. In this paper, we introduce EgoSim, a novel simulator of body-worn cameras that generates realistic egocentric renderings from multiple perspectives across a wearer’s body. A key feature of EgoSim is its use of real motion capture data to render motion artifacts, which are especially noticeable with arm- or leg-worn cameras. In addition, we introduce MultiEgoView, a dataset of egocentric footage from six body-worn cameras and ground-truth full-body 3D poses during several activities: 119 hours of data are derived from AMASS motion sequences in four high-fidelity virtual environments, which we augment with 5 hours of real-world motion data from 13 participants using six GoPro cameras and 3D body pose references from an Xsens motion capture suit. We demonstrate EgoSim’s effectiveness by training an end-to-end video-only 3D pose estimation network. Analyzing its domain gap, we show that our dataset and simulator substantially aid training for inference on real-world data. 
EgoSim code and MultiEgoView dataset: this https URL

[CV-5] Near-Shore Mapping for Detection and Tracking of Vessels

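【Quick read】: This paper addresses the difficulty autonomous surface vessels (ASVs) have in tracking small vessels such as kayaks close to the dock. Traditional methods rely on land masking to filter out land and the dock, but imprecise masks make it hard to track objects near the dock. The key solution is a high-precision 3D map of the docking area, created offline from LiDAR data, with image data used to detect and filter out potentially moving objects. Specifically, a neural network is trained for visual vessel detection and segmentation, which improves close-to-shore tracking.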

Link: https://arxiv.org/abs/2502.18368
Authors: Nicholas Dalhaug,Annette Stahl,Rudolf Mester,Edmund Førland Brekke
Affiliations: Norwegian University of Science and Technology (NTNU)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to FUSION 2025

Abstract:For an autonomous surface vessel (ASV) to dock, it must track other vessels close to the docking area. Kayaks present a particular challenge due to their proximity to the dock and relatively small size. Maritime target tracking has typically employed land masking to filter out land and the dock. However, imprecise land masking makes it difficult to track close-to-dock objects. Our approach uses Light Detection And Ranging (LiDAR) data and maps the docking area offline. The precise 3D measurements allow for precise map creation. However, the mapping could result in static, yet potentially moving, objects being mapped. We detect and filter out potentially moving objects from the LiDAR data by utilizing image data. The visual vessel detection and segmentation method is a neural network that is trained on our labeled data. Close-to-shore tracking improves with an accurate map and is demonstrated on a recently gathered real-world dataset. The dataset contains multiple sequences of a kayak and a day cruiser moving close to the dock, in a collision path with an autonomous ferry prototype.
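
One way to picture the step of filtering potentially moving objects out of the LiDAR data with image data is to project each point into the camera and drop the points that land on a vessel segmentation mask. The sketch below assumes a pinhole camera with intrinsics `K`, points already expressed in the camera frame, and a boolean mask produced by some segmentation network. All names are hypothetical, and the paper's actual calibration and association pipeline is not described in the abstract.

```python
import numpy as np


def filter_points_by_mask(points_cam: np.ndarray, K: np.ndarray,
                          vessel_mask: np.ndarray) -> np.ndarray:
    """Return the points that do NOT project onto a vessel pixel.

    points_cam:  (N, 3) points in the camera frame (z pointing forward).
    K:           (3, 3) pinhole intrinsic matrix.
    vessel_mask: (H, W) boolean array, True where a vessel was segmented.
    """
    h, w = vessel_mask.shape
    keep = np.ones(len(points_cam), dtype=bool)

    # Only points in front of the camera can be seen in the image.
    idx = np.flatnonzero(points_cam[:, 2] > 0)
    uvw = points_cam[idx] @ K.T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)

    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    on_vessel = np.zeros(len(idx), dtype=bool)
    on_vessel[inside] = vessel_mask[v[inside], u[inside]]

    keep[idx[on_vessel]] = False  # these points are excluded from the static map
    return points_cam[keep]


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
    mask = np.zeros((100, 100), dtype=bool)
    mask[40:60, 40:60] = True  # hypothetical kayak detection
    pts = np.array([[0.0, 0.0, 10.0], [4.0, 0.0, 10.0], [0.0, 0.0, -5.0]])
    print(filter_points_by_mask(pts, K, mask))
```

Points behind the camera or outside the image are kept here, since the image gives no evidence about them; a real system would have to decide how to treat such unobserved points.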

[CV-6] ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation

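【Quick read】: This paper addresses multi-layer image generation: directly generating a variable number of transparent image layers from a global text prompt, with efficient and precise layer-wise control. The key solution is the Anonymous Region Transformer (ART), whose anonymous region layout lets the generative model decide autonomously which visual tokens align with which text tokens, in contrast to the previously dominant semantic layout. In addition, a layer-wise region crop mechanism significantly reduces attention computation costs, enabling efficient generation of images with many distinct layers (e.g., 50+); compared with the full attention approach, ART is over 12 times faster and exhibits fewer layer conflicts.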

Link: https://arxiv.org/abs/2502.18364
Authors: Yifan Pu,Yiming Zhao,Zhicong Tang,Ruihong Yin,Haoxing Ye,Yuhui Yuan,Dong Chen,Jianmin Bao,Sirui Zhang,Yanbin Wang,Lin Liang,Lijuan Wang,Ji Li,Xiu Li,Zhouhui Lian,Gao Huang,Baining Guo
Affiliations: Microsoft Research Asia; Tsinghua University; Peking University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Multi-layer image generation is a fundamental task that enables users to isolate, select, and edit specific image layers, thereby revolutionizing interactions with generative models. In this paper, we introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images based on a global text prompt and an anonymous region layout. Inspired by Schema theory, which suggests that knowledge is organized in frameworks (schemas) that enable people to interpret and learn from new information by linking it to prior knowledge, this anonymous region layout allows the generative model to autonomously determine which set of visual tokens should align with which text tokens, which is in contrast to the previously dominant semantic layout for the image generation task. In addition, the layer-wise region crop mechanism, which only selects the visual tokens belonging to each anonymous region, significantly reduces attention computation costs and enables the efficient generation of images with numerous distinct layers (e.g., 50+). When compared to the full attention approach, our method is over 12 times faster and exhibits fewer layer conflicts. Furthermore, we propose a high-quality multi-layer transparent image autoencoder that supports the direct encoding and decoding of the transparency of variable multi-layer images in a joint manner. By enabling precise control and scalable layer generation, ART establishes a new paradigm for interactive content creation.

[CV-7] From Vision to Sound: Advancing Audio Anomaly Detection with Vision-Based Algorithms

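【Quick read】: This paper addresses Audio Anomaly Detection (AAD), evaluated on industrial and environmental benchmarks. Unlike most existing methods, which primarily classify anomalous samples, the proposed approach introduces fine-grained time-frequency localization of anomalies, which significantly improves explainability. The key point is that it identifies where and when anomalies occur, which makes the results more actionable in practice.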

Link: https://arxiv.org/abs/2502.18328
Authors: Manuel Barusco,Francesco Borsatti,Davide Dalle Pezze,Francesco Paissan,Elisabetta Farella,Gian Antonio Susto
Affiliations: University of Padova; Fondazione Bruno Kessler
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Recent advances in Visual Anomaly Detection (VAD) have introduced sophisticated algorithms leveraging embeddings generated by pre-trained feature extractors. Inspired by these developments, we investigate the adaptation of such algorithms to the audio domain to address the problem of Audio Anomaly Detection (AAD). Unlike most existing AAD methods, which primarily classify anomalous samples, our approach introduces fine-grained temporal-frequency localization of anomalies within the spectrogram, significantly improving explainability. This capability enables a more precise understanding of where and when anomalies occur, making the results more actionable for end users. We evaluate our approach on industrial and environmental benchmarks, demonstrating the effectiveness of VAD techniques in detecting anomalies in audio signals. Moreover, they improve explainability by enabling localized anomaly identification, making audio anomaly detection systems more interpretable and practical.

[CV-8] Self-Supervised Data Generation for Precision Agriculture: Blending Simulated Environments with Real Imagery

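【Quick read】: This paper addresses the scarcity of labeled data and the significant covariate shift in precision agriculture, both caused by the dynamic environment and the evolving appearance of crops. The key solution is a novel system that uses a Unity-based vineyard simulator and a cut-and-paste technique with geometric consistency considerations to generate realistic synthetic data, yielding diverse samples across viewpoints and lighting conditions for training detection algorithms. Applied to table grape cultivation, it considerably improves the performance of a state-of-the-art detector.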

Link: https://arxiv.org/abs/2502.18320
Authors: Leonardo Saraceni,Ionut Marian Motoi,Daniele Nardi,Thomas Alessandro Ciarfuglia
Affiliations: Department of Computer, Control and Management Engineering (DIAG), Sapienza University of Rome
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Presented at 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE)

Abstract:In precision agriculture, the scarcity of labeled data and significant covariate shifts pose unique challenges for training machine learning models. This scarcity is particularly problematic due to the dynamic nature of the environment and the evolving appearance of agricultural subjects as living things. We propose a novel system for generating realistic synthetic data to address these challenges. Utilizing a vineyard simulator based on the Unity engine, our system employs a cut-and-paste technique with geometrical consistency considerations to produce accurate photo-realistic images and labels from synthetic environments to train detection algorithms. This approach generates diverse data samples across various viewpoints and lighting conditions. We demonstrate considerable performance improvements in training a state-of-the-art detector by applying our method to table grapes cultivation. The combination of techniques can be easily automated, an increasingly important consideration for adoption in agricultural practice.
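
The basic cut-and-paste operation behind this kind of synthetic data can be sketched as alpha compositing plus automatic label generation. The sketch below is a generic illustration only: it assumes the object patch fits inside the background, uses hypothetical names, and leaves out the geometric consistency checks and the Unity simulator that the paper relies on.

```python
import numpy as np


def paste_object(background: np.ndarray, patch: np.ndarray,
                 alpha: np.ndarray, top: int, left: int):
    """Alpha-blend `patch` onto a copy of `background` and return (image, bbox).

    background: (H, W, 3) image.
    patch:      (h, w, 3) cut-out object, assumed to fit at (top, left).
    alpha:      (h, w) opacity in [0, 1]; 0 means fully transparent.
    bbox:       (x_min, y_min, x_max, y_max) of pixels with alpha > 0,
                in background coordinates, usable as a detection label.
    """
    h, w = alpha.shape
    out = background.astype(np.float32)  # astype returns a new array
    region = out[top:top + h, left:left + w]
    a = alpha[..., None].astype(np.float32)
    region[:] = a * patch + (1.0 - a) * region  # in-place blend on the view

    ys, xs = np.nonzero(alpha > 0)
    bbox = (int(left + xs.min()), int(top + ys.min()),
            int(left + xs.max()), int(top + ys.max()))
    return out.astype(background.dtype), bbox


if __name__ == "__main__":
    bg = np.zeros((10, 10, 3), dtype=np.uint8)
    obj = np.full((4, 4, 3), 200, dtype=np.uint8)  # hypothetical grape-bunch cut-out
    mask = np.zeros((4, 4), dtype=np.float32)
    mask[1:3, 1:3] = 1.0
    image, box = paste_object(bg, obj, mask, top=2, left=3)
    print(box)
```

Since the label is derived from the same mask that drives the blend, every synthetic image comes with a bounding box at no annotation cost, which is the appeal of this family of techniques.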

[CV-9] GCDance: Genre-Controlled 3D Full Body Dance Generation Driven By Music

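【Quick read】: This paper addresses the challenge of generating high-quality full-body dance sequences from music, in particular ensuring that the sequences are physically realistic and precisely synchronized with the beats and rhythm of the music. It proposes GCDance, a classifier-free diffusion framework that generates genre-specific dance motions conditioned on music and text prompts. The key is to extract music features by combining high-level features from a pre-trained music foundation model with hand-crafted features, and to use CLIP to embed genre-based text prompts, which provides genre controllability. As a result, GCDance can generate diverse dance styles from the same piece of music while staying coherent with its rhythm and melody.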

Link: https://arxiv.org/abs/2502.18309
Authors: Xinran Liu,Xu Dong,Diptesh Kanojia,Wenwu Wang,Zhenhua Feng
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Generating high-quality full-body dance sequences from music is a challenging task as it requires strict adherence to genre-specific choreography. Moreover, the generated sequences must be both physically realistic and precisely synchronized with the beats and rhythm of the music. To overcome these challenges, we propose GCDance, a classifier-free diffusion framework for generating genre-specific dance motions conditioned on both music and textual prompts. Specifically, our approach extracts music features by combining high-level pre-trained music foundation model features with hand-crafted features for multi-granularity feature fusion. To achieve genre controllability, we leverage CLIP to efficiently embed genre-based textual prompt representations at each time step within our dance generation pipeline. Our GCDance framework can generate diverse dance styles from the same piece of music while ensuring coherence with the rhythm and melody of the music. Extensive experimental results obtained on the FineDance dataset demonstrate that GCDance significantly outperforms the existing state-of-the-art approaches, which also achieve competitive results on the AIST++ dataset. Our ablation and inference time analysis demonstrate that GCDance provides an effective solution for high-quality music-driven dance generation.
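
"Classifier-free" usually refers to classifier-free guidance, where a single denoiser is evaluated with and without the conditioning signal and the two predictions are extrapolated at sampling time. The sketch below shows only that generic combination rule. The function name and the toy arrays are hypothetical, and how GCDance combines its music and genre conditions is not specified in the abstract.

```python
import numpy as np


def guided_noise(eps_uncond: np.ndarray, eps_cond: np.ndarray,
                 guidance_scale: float) -> np.ndarray:
    """Standard classifier-free guidance combination.

    guidance_scale = 0 gives the unconditional prediction,
    guidance_scale = 1 gives the conditional prediction,
    guidance_scale > 1 pushes further toward the condition.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


if __name__ == "__main__":
    # Toy noise predictions for a 2-frame, 3-joint motion chunk.
    eps_u = np.zeros((2, 3))
    eps_c = np.ones((2, 3))
    print(guided_noise(eps_u, eps_c, guidance_scale=2.5))
```

In practice the unconditional branch is typically obtained by randomly dropping the condition during training, so that one network can serve both roles.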

[CV-10] LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation

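【Quick read】: This paper addresses the limitations of traditional text encoders such as CLIP and T5 in multilingual processing, which hinder image generation across diverse languages. The key solution is LDGen, which leverages the capabilities of large language models (LLMs) and applies hierarchical caption optimization and human instruction techniques to derive precise semantic information. A lightweight adapter and a cross-modal refiner then provide efficient feature alignment and interaction between LLM and image features, reducing training time while enabling zero-shot multilingual image generation. Experiments indicate that the method surpasses baseline models in prompt adherence and image aesthetic quality while supporting multiple languages.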

Link: https://arxiv.org/abs/2502.18302
Authors: Pengzhi Li,Pengfei Yu,Zide Liu,Wei He,Xuhao Pan,Xudong Rao,Tao Wei,Wei Chen
Affiliations: Li Auto Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we introduce LDGen, a novel method for integrating large language models (LLMs) into existing text-to-image diffusion models while minimizing computational demands. Traditional text encoders, such as CLIP and T5, exhibit limitations in multilingual processing, hindering image generation across diverse languages. We address these challenges by leveraging the advanced capabilities of LLMs. Our approach employs a language representation strategy that applies hierarchical caption optimization and human instruction techniques to derive precise semantic information. Subsequently, we incorporate a lightweight adapter and a cross-modal refiner to facilitate efficient feature alignment and interaction between LLMs and image features. LDGen reduces training time and enables zero-shot multilingual image generation. Experimental results indicate that our method surpasses baseline models in both prompt adherence and image aesthetic quality, while seamlessly supporting multiple languages. Project page: this https URL.

[CV-11] Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models

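【Quick read】: This paper reveals a new backdoor threat to large vision language models (LVLMs): significant visual hallucinations can be induced by compromising only the self-supervised learning (SSL) vision encoder. It proposes BadVision, a method that uses novel trigger optimization and backdoor learning techniques to drive LVLMs to attacker-chosen visual hallucinations with an attack success rate above 99% while remaining stealthy.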

Link: https://arxiv.org/abs/2502.18290
Authors: Zhaoyi Liu,Huan Zhang
Affiliations: University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Self-supervised learning (SSL) vision encoders learn high-quality image representations and thus have become a vital part of developing vision modality of large vision language models (LVLMs). Due to the high cost of training such encoders, pre-trained encoders are widely shared and deployed into many LVLMs, which are security-critical or bear societal significance. Under this practical scenario, we reveal a new backdoor threat that significant visual hallucinations can be induced into these LVLMs by merely compromising vision encoders. Because of the sharing and reuse of these encoders, many downstream LVLMs may inherit backdoor behaviors from encoders, leading to widespread backdoors. In this work, we propose BadVision, the first method to exploit this vulnerability in SSL vision encoders for LVLMs with novel trigger optimization and backdoor learning techniques. We evaluate BadVision on two types of SSL encoders and LVLMs across eight benchmarks. We show that BadVision effectively drives the LVLMs to attacker-chosen hallucination with over 99% attack success rate, causing a 77.6% relative visual understanding error while maintaining the stealthiness. SoTA backdoor detection methods cannot detect our attack effectively.

[CV-12] Multi-label out-of-distribution detection via evidential learning ECCV

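【Quick read】: This paper addresses the robustness and adaptability of machine learning algorithms on out-of-distribution (OOD) data, especially in the multi-label setting. The key is an approach based on evidential deep learning: a CNN architecture with a Beta Evidential Neural Network computes both the likelihood and the predictive uncertainty of each sample, and on that basis the authors propose two new uncertainty-based OOD detection scores: (i) OOD-score Max, based on the maximum evidence; and (ii) OOD-score Sum, which considers the evidence from all outputs. Extensive experiments on three widely used datasets, PASCAL-VOC, MS-COCO, and NUS-WIDE, show that the approach outperforms several state-of-the-art methods.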

Link: https://arxiv.org/abs/2502.18224
Authors: Eduardo Aguilar,Bogdan Raducanu,Petia Radeva
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Uncertainty Quantification for Computer Vision workshop (ECCVW 2024)

Abstract:A crucial requirement for machine learning algorithms is not only to perform well, but also to show robustness and adaptability when encountering novel scenarios. One way to achieve these characteristics is to endow the deep learning models with the ability to detect out-of-distribution (OOD) data, i.e. data that belong to distributions different from the one used during their training. It is even a more complicated situation, when these data usually are multi-label. In this paper, we propose an approach based on evidential deep learning in order to meet these challenges applied to visual recognition problems. More concretely, we designed a CNN architecture that uses a Beta Evidential Neural Network to compute both the likelihood and the predictive uncertainty of the samples. Based on these results, we propose afterwards two new uncertainty-based scores for OOD data detection: (i) OOD-score Max, based on the maximum evidence; and (ii) OOD-score Sum, which considers the evidence from all outputs. Extensive experiments have been carried out to validate the proposed approach using three widely-used datasets: PASCAL-VOC, MS-COCO and NUS-WIDE, demonstrating its outperformance over several State-of-the-Art methods.
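
A rough sketch of how evidence-based scores of this kind can be computed is given below. It assumes a per-class Beta parameterization common in evidential learning (alpha = positive evidence + 1, beta = negative evidence + 1, uncertainty = 2 / (alpha + beta)) and takes the Max and Sum scores over the positive evidence. These choices and all function names are assumptions for illustration; the paper's exact definitions may differ.

```python
import numpy as np


def beta_outputs(pos_evidence: np.ndarray, neg_evidence: np.ndarray):
    """Per-label probability and uncertainty from non-negative evidence (assumed form)."""
    alpha = pos_evidence + 1.0
    beta = neg_evidence + 1.0
    prob = alpha / (alpha + beta)        # expected probability that the label is present
    uncertainty = 2.0 / (alpha + beta)   # shrinks as total evidence grows
    return prob, uncertainty


def ood_scores(pos_evidence: np.ndarray):
    """Return (max-based, sum-based) scores; low values suggest an OOD input."""
    return float(np.max(pos_evidence)), float(np.sum(pos_evidence))


if __name__ == "__main__":
    in_dist = np.array([8.0, 0.0, 0.0])   # strong evidence for one label
    out_dist = np.array([0.1, 0.2, 0.1])  # little evidence for any label
    print(ood_scores(in_dist), ood_scores(out_dist))
```

The Sum variant is the natural fit for multi-label inputs, where several labels can each contribute evidence, whereas the Max variant only needs one confident label to treat a sample as in-distribution.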

[CV-13] UASTrack: A Unified Adaptive Selection Framework with Modality-Customization in Single Object Tracking

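【Quick read】: This paper addresses the failure of unified RGB-X trackers (where X denotes the depth, event, or thermal modality) in multi-modal single-object tracking (SOT) to provide modality-adaptive perception in real-world applications. Existing methods either rely on task-specific training strategies or do not account for the distinct data distributions of different modalities. The key solution is the UASTrack framework: a Discriminative Auto-Selector (DAS) provides modality-adaptive perception, and a Task-Customized Optimization Adapter (TCOA) filters noise redundancy and mitigates background interference in the latent space, improving performance across multi-modal tracking tasks.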

Link: https://arxiv.org/abs/2502.18220
Authors: He Wang, Tianyang Xu, Zhangyong Tang, Xiao-Jun Wu, Josef Kittler
Affiliations: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China; Centre for Vision, Speech and Signal Processing, University of Surrey, GU2 7XH Guildford, U.K.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-modal tracking is essential in single-object tracking (SOT), as different sensor types contribute unique capabilities to overcome challenges caused by variations in object appearance. However, existing unified RGB-X trackers (X represents depth, event, or thermal modality) either rely on the task-specific training strategy for individual RGB-X image pairs or fail to address the critical importance of modality-adaptive perception in real-world applications. In this work, we propose UASTrack, a unified adaptive selection framework that facilitates both model and parameter unification, as well as adaptive modality discrimination across various multi-modal tracking tasks. To achieve modality-adaptive perception in joint RGB-X pairs, we design a Discriminative Auto-Selector (DAS) capable of identifying modality labels, thereby distinguishing the data distributions of auxiliary modalities. Furthermore, we propose a Task-Customized Optimization Adapter (TCOA) tailored to various modalities in the latent space. This strategy effectively filters noise redundancy and mitigates background interference based on the specific characteristics of each modality. Extensive comparisons conducted on five benchmarks including LasHeR, GTOT, RGBT234, VisEvent, and DepthTrack, covering RGB-T, RGB-E, and RGB-D tracking scenarios, demonstrate that our innovative approach achieves comparable performance while introducing only 1.87M additional training parameters and 1.95G FLOPs. The code will be available at this https URL.

[CV-14] Synthesizing Consistent Novel Views via 3D Epipolar Attention without Re-Training

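[Quick Read]: This paper addresses the difficulty diffusion models have in keeping novel views consistent with the reference view when synthesizing them from a single image. The key solution uses epipolar geometry to locate and retrieve overlapping information from the input view and incorporates it into the generation of the target views, with no training or fine-tuning required. Extending epipolar attention to a multi-view setting further improves the overall consistency of the generated views. The method needs no learnable parameters, yet markedly improves the consistency of novel view synthesis and benefits downstream applications such as 3D reconstruction.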

Link: https://arxiv.org/abs/2502.18219
Authors: Botao Ye, Sifei Liu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang
Affiliations: ETH Zurich; ETH AI Center; NVIDIA; Microsoft; UC Merced
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 3DV 2025

Abstract:Large diffusion models demonstrate remarkable zero-shot capabilities in novel view synthesis from a single image. However, these models often face challenges in maintaining consistency across novel and reference views. A crucial factor leading to this issue is the limited utilization of contextual information from reference views. Specifically, when there is an overlap in the viewing frustum between two views, it is essential to ensure that the corresponding regions maintain consistency in both geometry and appearance. This observation leads to a simple yet effective approach, where we propose to use epipolar geometry to locate and retrieve overlapping information from the input view. This information is then incorporated into the generation of target views, eliminating the need for training or fine-tuning, as the process requires no learnable parameters. Furthermore, to enhance the overall consistency of generated views, we extend the utilization of epipolar attention to a multi-view setting, allowing retrieval of overlapping information from the input view and other target views. Qualitative and quantitative experimental results demonstrate the effectiveness of our method in significantly improving the consistency of synthesized views without the need for any fine-tuning. Moreover, this enhancement also boosts the performance of downstream applications such as 3D reconstruction. The code is available at this https URL.
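
The geometric fact behind epipolar attention can be shown with a small sketch: a pixel in the input view constrains its match in the target view to a single line, so attention only needs to look along that line. The code below is not the authors' implementation; it assumes normalized cameras (identity intrinsics), a made-up relative pose, and hypothetical helper names, and it only verifies the constraint itself.

```python
import math

def skew(t):
    # Cross-product matrix [t]_x, so that skew(t) @ v == t x v.
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def essential_matrix(R, t):
    # E = [t]_x R for a pose mapping view-1 coordinates to view-2: X2 = R X1 + t.
    return matmul(skew(t), R)

def epipolar_line(E, x1):
    # Line (a, b, c) in view 2 on which the match of x1 must lie.
    return matvec(E, x1)

def point_line_distance(line, x):
    a, b, c = line
    return abs(a * x[0] + b * x[1] + c) / math.hypot(a, b)

# Assumed relative pose: 10 degree rotation about the y axis plus a translation.
theta = math.radians(10.0)
R = [[math.cos(theta), 0.0, math.sin(theta)],
     [0.0, 1.0, 0.0],
     [-math.sin(theta), 0.0, math.cos(theta)]]
t = [0.5, 0.0, 0.1]

# Project one 3D point into both views (homogeneous normalized coordinates).
X1 = [0.2, -0.1, 4.0]
X2 = [r + ti for r, ti in zip(matvec(R, X1), t)]
x1 = [X1[0] / X1[2], X1[1] / X1[2], 1.0]
x2 = [X2[0] / X2[2], X2[1] / X2[2], 1.0]

line = epipolar_line(essential_matrix(R, t), x1)
residual = point_line_distance(line, x2)  # numerically zero
```

In a training-free attention scheme of the kind the abstract describes, target-view queries would then be restricted to, or weighted toward, reference features sampled along such lines; how exactly that is done is not specified in the abstract.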

[CV-15] Learning Structure-Supporting Dependencies via Keypoint Interactive Transformer for General Mammal Pose Estimation

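[Quick Read]: This paper addresses the large appearance and pose variations in general mammal pose estimation. The key is a Keypoint Interactive Transformer (KIT) that learns instance-level structure-supporting dependencies. KIT consists of two coupled components: the first extracts keypoint features and generates body part prompts, with the features supervised by a dedicated generalised heatmap regression loss (GHRL); the second is a novel interactive transformer that takes feature slices as input tokens without spatial splitting. An adaptive weight strategy addresses the imbalance among different keypoints.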

Link: https://arxiv.org/abs/2502.18214
Authors: Tianyang Xu, Jiyong Rao, Xiaoning Song, Zhenhua Feng, Xiao-Jun Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by IJCV 2025

Abstract:General mammal pose estimation is an important and challenging task in computer vision, which is essential for understanding mammal behaviour in real-world applications. However, existing studies are at their preliminary research stage, which focus on addressing the problem for only a few specific mammal species. In principle, from specific to general mammal pose estimation, the biggest issue is how to address the huge appearance and pose variances for different species. We argue that given appearance context, instance-level prior and the structural relation among keypoints can serve as complementary evidence. To this end, we propose a Keypoint Interactive Transformer (KIT) to learn instance-level structure-supporting dependencies for general mammal pose estimation. Specifically, our KITPose consists of two coupled components. The first component is to extract keypoint features and generate body part prompts. The features are supervised by a dedicated generalised heatmap regression loss (GHRL). Instead of introducing external visual/text prompts, we devise keypoints clustering to generate body part biases, aligning them with image context to generate corresponding instance-level prompts. Second, we propose a novel interactive transformer that takes feature slices as input tokens without performing spatial splitting. In addition, to enhance the capability of the KIT model, we design an adaptive weight strategy to address the imbalance issue among different keypoints.

[CV-16] Training Consistency Models with Variational Noise Coupling

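[Quick Read]: This paper addresses the high variance and instability that non-distillation Consistency Training (CT) often exhibits in image generation. The key solution is a novel CT training approach based on the Flow Matching framework, together with a noise-coupling scheme inspired by the architecture of Variational Autoencoders (VAE). By training a data-dependent noise emission model, implemented as an encoder architecture, the method indirectly learns the geometry of the noise-to-data mapping, which classical CT fixes through the choice of forward process. The approach yields significant generative improvements across several image datasets, reaching state-of-the-art non-distillation CT results on CIFAR-10 and results on par with the state of the art on ImageNet.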

Link: https://arxiv.org/abs/2502.18197
Authors: Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 11 figures

Abstract:Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks. However, non-distillation consistency training often suffers from high variance and instability, and analyzing and improving its training dynamics is an active area of research. In this work, we propose a novel CT training approach based on the Flow Matching framework. Our main contribution is a trained noise-coupling scheme inspired by the architecture of Variational Autoencoders (VAE). By training a data-dependent noise emission model implemented as an encoder architecture, our method can indirectly learn the geometry of the noise-to-data mapping, which is instead fixed by the choice of the forward process in classical CT. Empirical results across diverse image datasets show significant generative improvements, with our model outperforming baselines and achieving the state-of-the-art (SoTA) non-distillation CT FID on CIFAR-10, and attaining FID on par with SoTA on ImageNet at 64×64 resolution in 2-step generation. Our code is available at this https URL.
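
The difference between a fixed and a learned noise coupling can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's method: the path is the linear Flow Matching interpolation with t = 0 at the data and t = 1 at the noise, `toy_encoder` is a hypothetical stand-in for the trained encoder, and the KL regularizer is borrowed from standard VAE practice rather than taken from the abstract.

```python
import math
import random

def interpolate(x, z, t):
    # Linear path between data x (t = 0) and noise z (t = 1).
    return [(1.0 - t) * xi + t * zi for xi, zi in zip(x, z)]

def toy_encoder(x):
    # Hypothetical data-dependent noise model: returns mean and log-std.
    mu = [0.1 * xi for xi in x]
    log_sigma = [0.0 for _ in x]
    return mu, log_sigma

def coupled_noise(x, rng):
    # Reparameterized sample z = mu(x) + sigma(x) * eps, eps ~ N(0, 1).
    mu, log_sigma = toy_encoder(x)
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]

def kl_to_standard_normal(mu, log_sigma):
    # KL(N(mu, sigma^2) || N(0, 1)), summed over dimensions.
    return sum(0.5 * (math.exp(2.0 * s) + m * m - 1.0) - s
               for m, s in zip(mu, log_sigma))

rng = random.Random(0)
x = [1.0, -2.0, 0.5]
z_independent = [rng.gauss(0.0, 1.0) for _ in x]  # classical fixed coupling
z_coupled = coupled_noise(x, rng)                 # data-dependent coupling
x_mid = interpolate(x, z_coupled, 0.5)
```

With the independent coupling every data point is paired with unrelated noise; with the encoder the pairing itself becomes trainable, which is what lets the noise-to-data mapping adapt during training.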

[CV-17] Multi-Perspective Data Augmentation for Few-shot Object Detection ICLR2025

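[Quick Read]: This paper addresses the limited diversity of synthetic samples in few-shot object detection (FSOD), in particular their poor representativeness with respect to foreground-background relationships. It proposes a Multi-Perspective Data Augmentation (MPAD) framework with three components: 1) In-Context learning for Object Synthesis (ICOS), which adjusts bounding boxes to enhance the detail and spatial information of synthetic samples; 2) a Harmonic Prompt Aggregation Scheduler (HPAS), which mixes prompt embeddings at each time step of the diffusion generation process to produce hard novel samples; 3) a Background Proposal method (BAP), which samples typical and hard backgrounds. Together these improve the diversity and representativeness of synthetic samples and thereby FSOD performance.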

Link: https://arxiv.org/abs/2502.18195
Authors: Anh-Khoa Nguyen Vu, Quoc-Truong Truong, Vinh-Tiep Nguyen, Thanh Duc Ngo, Thanh-Toan Do, Tam V. Nguyen
Affiliations: University of Information Technology, Vietnam; Vietnam National University, Vietnam; Monash University, Clayton, VIC 3800, Australia; University of Dayton, Dayton, OH 45469, United States
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025

Abstract:Recent few-shot object detection (FSOD) methods have focused on augmenting synthetic samples for novel classes, showing promising results thanks to the rise of diffusion models. However, the diversity of such datasets is often limited in representativeness because they lack awareness of typical and hard samples, especially in the context of foreground and background relationships. To tackle this issue, we propose a Multi-Perspective Data Augmentation (MPAD) framework. In terms of foreground-foreground relationships, we propose in-context learning for object synthesis (ICOS) with bounding box adjustments to enhance the detail and spatial information of synthetic samples. Inspired by the large margin principle, support samples play a vital role in defining class boundaries. Therefore, we design a Harmonic Prompt Aggregation Scheduler (HPAS) to mix prompt embeddings at each time step of the generation process in diffusion models, producing hard novel samples. For foreground-background relationships, we introduce a Background Proposal method (BAP) to sample typical and hard backgrounds. Extensive experiments on multiple FSOD benchmarks demonstrate the effectiveness of our approach. Our framework significantly outperforms traditional methods, achieving an average increase of 17.5% in nAP50 over the baseline on PASCAL VOC. Code is available at this https URL.

[CV-18] CLIPure: Purification in Latent Space via CLIP for Adversarially Robust Zero-Shot Classification ICLR2025

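[Quick Read]: This paper aims to build an adversarially robust zero-shot image classifier. The authors build on CLIP (Contrastive Language-Image Pre-training), a vision-language pre-trained encoder that performs zero-shot classification by matching an image with text prompts such as "a photo of a class-name". The key solution is purification, which requires no adversarial training on specific attack types and can therefore cope with any foreseeable attack. Purification risk is formulated as the KL divergence between the joint distributions of the purification process (denoising adversarial samples) and the attack process (perturbing benign samples), modeled through bidirectional Stochastic Differential Equations (SDEs). The derived result motivates purification in CLIP's multi-modal latent space. Two variants are proposed: CLIPure-Diff models the likelihood of image latent vectors with the DiffusionPrior module of DaLLE-2, while CLIPure-Cos models it with the cosine similarity between the image embedding and the embedding of the text prompt "a photo of a.". To the authors' knowledge, CLIPure is the first purification method in multi-modal latent space, and CLIPure-Cos is the first purification method not based on generative models, which substantially improves defense efficiency.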

Link: https://arxiv.org/abs/2502.18176
Authors: Mingkun Zhang, Keping Bi, Wei Chen, Jiafeng Guo, Xueqi Cheng
Affiliations: CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: accepted by ICLR 2025

Abstract:In this paper, we aim to build an adversarially robust zero-shot image classifier. We ground our work on CLIP, a vision-language pre-trained encoder model that can perform zero-shot classification by matching an image with text prompts "a photo of a class-name.". Purification is the path we choose since it does not require adversarial training on specific attack types and thus can cope with any foreseen attacks. We then formulate purification risk as the KL divergence between the joint distributions of the purification process of denoising the adversarial samples and the attack process of adding perturbations to benign samples, through bidirectional Stochastic Differential Equations (SDEs). The final derived results inspire us to explore purification in the multi-modal latent space of CLIP. We propose two variants for our CLIPure approach: CLIPure-Diff which models the likelihood of images' latent vectors with the DiffusionPrior module in DaLLE-2 (modeling the generation process of CLIP's latent vectors), and CLIPure-Cos which models the likelihood with the cosine similarity between the embeddings of an image and "a photo of a.". As far as we know, CLIPure is the first purification method in multi-modal latent space and CLIPure-Cos is the first purification method that is not based on generative models, which substantially improves defense efficiency. We conducted extensive experiments on CIFAR-10, ImageNet, and 13 datasets that previous CLIP-based defense methods used for evaluating zero-shot classification robustness. Results show that CLIPure boosts the SOTA robustness by a large margin, e.g., from 71.7% to 91.1% on CIFAR10, from 59.6% to 72.6% on ImageNet, and 108% relative improvements of average robustness on the 13 datasets over previous SOTA. The code is available at this https URL.
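
The idea behind the CLIPure-Cos variant, using cosine similarity to a generic prompt embedding as a stand-in for likelihood, can be sketched as gradient ascent in the latent space. The abstract does not state the update rule, so the step size, the number of steps, and the toy 3-dimensional vectors below are all assumptions; in practice the embeddings would come from CLIP's image and text encoders.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def cosine_grad(z, u):
    # Gradient of cos(z, u) with respect to z.
    nz, nu = norm(z), norm(u)
    c = cosine(z, u)
    return [ui / (nz * nu) - c * zi / (nz * nz) for zi, ui in zip(z, u)]

def purify(z, u, steps=10, lr=0.5):
    # Assumed procedure: plain gradient ascent on the cosine similarity.
    z = list(z)
    for _ in range(steps):
        g = cosine_grad(z, u)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z

# Toy stand-ins for an adversarial image embedding and the prompt embedding.
image_embedding = [1.0, 0.2, -0.4]
text_embedding = [0.0, 1.0, 0.0]
purified = purify(image_embedding, text_embedding)
```

Because the update needs only two embeddings and no generative model, each purification step is cheap, which matches the efficiency argument made in the abstract.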

[CV-19] Monitoring snow avalanches from SAR data with deep learning

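[Quick Read]: This paper addresses the complexity and variability involved in monitoring snow avalanches, in particular detecting and segmenting them effectively over large areas and in all weather conditions. The key is the application of deep learning, especially pixel-level segmentation, to improve detection accuracy and spatial resolution, with the models validated on a large-scale dataset.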

Link: https://arxiv.org/abs/2502.18157
Authors: Filippo Maria Bianchi, Jakob Grahn
Affiliations: NORCE, Norwegian Research Centre AS; UiT the Arctic University of Norway
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract:Snow avalanches present significant risks to human life and infrastructure, particularly in mountainous regions, making effective monitoring crucial. Traditional monitoring methods, such as field observations, are limited by accessibility, weather conditions, and cost. Satellite-borne Synthetic Aperture Radar (SAR) data has become an important tool for large-scale avalanche detection, as it can capture data in all weather conditions and across remote areas. However, traditional processing methods struggle with the complexity and variability of avalanches. This chapter reviews the application of deep learning for detecting and segmenting snow avalanches from SAR data. Early efforts focused on the binary classification of SAR images, while recent advances have enabled pixel-level segmentation, providing greater accuracy and spatial resolution. A case study using Sentinel-1 SAR data demonstrates the effectiveness of deep learning models for avalanche segmentation, achieving superior results over traditional methods. We also present an extension of this work, testing recent state-of-the-art segmentation architectures on an expanded dataset of over 4,500 annotated SAR images. The best-performing model among those tested was applied for large-scale avalanche detection across the whole of Norway, revealing important spatial and temporal patterns over several winter seasons.

[CV-20] Joint Reconstruction of Spatially-Coherent and Realistic Clothed Humans and Objects from a Single Image

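[Quick Read]: This paper addresses high-fidelity 3D reconstruction of scenes in which humans and objects co-occur in a single-view image, in particular human-object occlusions and the depth ambiguity caused by the lack of 3D spatial awareness. The key solution is an attention-based neural implicit model that leverages image pixel alignment to retrieve high-quality details and incorporates semantic features extracted from the human-object pose to enable 3D spatial awareness. A generative diffusion model handles human-object occlusions.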

Link: https://arxiv.org/abs/2502.18150
Authors: Ayushi Dutta, Marco Pesavento, Marco Volino, Adrian Hilton, Armin Mustafa
Affiliations: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in human shape learning have focused on achieving accurate human reconstruction from single-view images. However, in the real world, humans share space with other objects. Reconstructing images with humans and objects is challenging due to the occlusions and lack of 3D spatial awareness, which leads to depth ambiguity in the reconstruction. Existing methods in monocular human-object reconstruction fail to capture intricate details of clothed human bodies and object surfaces due to their template-based nature. In this paper, we jointly reconstruct clothed humans and objects in a spatially coherent manner from single-view images, while addressing human-object occlusions. A novel attention-based neural implicit model is proposed that leverages image pixel alignment to retrieve high-quality details, and incorporates semantic features extracted from the human-object pose to enable 3D spatial awareness. A generative diffusion model is used to handle human-object occlusions. For training and evaluation, we introduce a synthetic dataset with rendered scenes of inter-occluded 3D human scans and diverse objects. Extensive evaluation on both synthetic and real datasets demonstrates the superior quality of proposed human-object reconstructions over competitive methods.

[CV-21] LightFC-X: Lightweight Convolutional Tracker for RGB-X Tracking

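[Quick Read]: This paper addresses the excessive computational burden and cost of multimodal tracking on resource-constrained devices. The key solution is LightFC-X, a family of lightweight convolutional RGB-X trackers that achieves lightweight multimodal tracking with a unified convolutional architecture. Specifically, the paper proposes an efficient cross-attention module (ECAM) and a spatiotemporal template aggregation module (STAM). ECAM performs lightweight cross-modal interaction of the template-search area features with only 0.08M parameters, while STAM improves the model's use of temporal information through a module fine-tuning paradigm. Experiments show that LightFC-X achieves the optimal balance between parameters, performance, and speed.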

Link: https://arxiv.org/abs/2502.18143
Authors: Yunfeng Li, Bo Wang, Ye Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite great progress in multimodal tracking, these trackers remain too heavy and expensive for resource-constrained devices. To alleviate this problem, we propose LightFC-X, a family of lightweight convolutional RGB-X trackers that explores a unified convolutional architecture for lightweight multimodal tracking. Our core idea is to achieve lightweight cross-modal modeling and joint refinement of the multimodal features and the spatiotemporal appearance features of the target. Specifically, we propose a novel efficient cross-attention module (ECAM) and a novel spatiotemporal template aggregation module (STAM). The ECAM achieves lightweight cross-modal interaction of template-search area integrated feature with only 0.08M parameters. The STAM enhances the model's utilization of temporal information through a module fine-tuning paradigm. Comprehensive experiments show that our LightFC-X achieves state-of-the-art performance and the optimal balance between parameters, performance, and speed. For example, LightFC-T-ST outperforms CMD by 4.3% and 5.7% in SR and PR on the LasHeR benchmark, while achieving a 2.6x reduction in parameters and a 2.7x speedup. It runs in real-time on the CPU at a speed of 22 fps. The code is available at this https URL.

[CV-22] SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference

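[Quick Read]: This paper addresses the efficient implementation of attention in large models. Attention has quadratic time complexity, which makes it computationally expensive. Although attention maps are usually sparse, most existing methods optimize attention only within specific models and cannot guarantee generality and end-to-end performance. The key is SpargeAttn, a universal sparse and quantized attention method: a two-stage online filter rapidly and accurately predicts the attention map and, without extra overhead, skips further matrix multiplications, significantly accelerating language, image, and video generation models without degrading end-to-end performance.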

Link: https://arxiv.org/abs/2502.18137
Authors: Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
Comments:

Abstract:An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The codes are available at this https URL.
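
Why skipping near-zero regions of the attention map is cheap and nearly lossless can be seen in a small single-query sketch. This is not SpargeAttn itself: the block predictor below (dot product with each block's mean key, kept if within a margin of the best block) is a hypothetical stand-in for the first-stage filter, quantization is omitted, and the numbers are made up. The online softmax update is the standard running-max formulation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_attention(q, keys, values):
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def predict_keep_blocks(q, keys, block_size, margin):
    # Hypothetical predictor: score each key block by its mean key.
    block_scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        block_scores.append(dot(q, mean_key))
    best = max(block_scores)
    return [s >= best - margin for s in block_scores]

def block_sparse_attention(q, keys, values, block_size, keep):
    # Online softmax over kept blocks only; skipped blocks cost nothing.
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        if not keep[start // block_size]:
            continue
        stop = start + block_size
        for k, v in zip(keys[start:stop], values[start:stop]):
            s = dot(q, k)
            new_max = max(running_max, s)
            rescale = math.exp(running_max - new_max)  # 0.0 on the first key
            w = math.exp(s - new_max)
            denom = denom * rescale + w
            acc = [a * rescale + w * vd for a, vd in zip(acc, v)]
            running_max = new_max
    return [a / denom for a in acc]

q = [1.0, 0.0]
keys = [[4.0, 0.0], [3.5, 0.5], [-6.0, 0.0], [-5.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [-5.0, 5.0]]
keep = predict_keep_blocks(q, keys, block_size=2, margin=5.0)
sparse_out = block_sparse_attention(q, keys, values, 2, keep)
dense_out = dense_attention(q, keys, values)
```

In this toy case the second key block is dropped and the output changes only in the fourth decimal place, since the skipped keys carried almost no softmax weight.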

[CV-23] Personalized Federated Learning for Egocentric Video Gaze Estimation with Comprehensive Parameter Freezing

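[Quick Read]: This paper addresses model adaptability and accuracy in personalized egocentric video gaze estimation. The key solution adopts a transformer-based architecture and integrates it into a personalized federated learning (PFL) framework, in which only the parameters that change the most during training are selected and frozen to personalize client models. Extensive experiments on the EGTEA Gaze+ and Ego4D datasets show that FedCPF significantly outperforms previously reported federated learning methods, achieving higher recall, precision, and F1-score. These results confirm the effectiveness of the proposed comprehensive parameter freezing strategy for model personalization.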

Link: https://arxiv.org/abs/2502.18123
Authors: Yuhu Feng, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Egocentric video gaze estimation requires models to capture individual gaze patterns while adapting to diverse user data. Our approach leverages a transformer-based architecture, integrating it into a personalized federated learning (PFL) framework where only the most significant parameters, those exhibiting the highest rate of change during training, are selected and frozen for personalization in client models. Through extensive experimentation on the EGTEA Gaze+ and Ego4D datasets, we demonstrate that FedCPF significantly outperforms previously reported federated learning methods, achieving superior recall, precision, and F1-score. These results confirm the effectiveness of our comprehensive parameter freezing strategy in enhancing model personalization, making FedCPF a promising approach for tasks requiring both adaptability and accuracy in federated learning settings.
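
The selection rule described in the abstract (pick the parameters with the highest rate of change and freeze them for personalization) can be sketched on plain dictionaries. Several details are assumptions here because the abstract leaves them open: the rate of change is measured as relative L1 change per named parameter group, a fixed fraction of groups is selected, and "frozen" is interpreted as "kept local instead of being overwritten by the server aggregate". All names are hypothetical.

```python
def rate_of_change(before, after, eps=1e-12):
    # Relative L1 change of each named parameter group during local training.
    rates = {}
    for name in before:
        diff = sum(abs(a - b) for a, b in zip(after[name], before[name]))
        scale = sum(abs(b) for b in before[name]) + eps
        rates[name] = diff / scale
    return rates

def select_frozen(rates, fraction):
    # Keep the top `fraction` of groups, at least one.
    k = max(1, int(len(rates) * fraction))
    ranked = sorted(rates, key=rates.get, reverse=True)
    return set(ranked[:k])

def apply_global_update(local, global_params, frozen):
    # Frozen groups stay personalized; the rest follow the server model.
    return {name: (local[name] if name in frozen else global_params[name])
            for name in local}

before = {"head": [1.0, 1.0], "backbone": [1.0, 1.0], "embed": [2.0, 2.0]}
after = {"head": [2.0, 0.0], "backbone": [1.1, 1.0], "embed": [2.0, 2.2]}
server = {"head": [0.0, 0.0], "backbone": [0.5, 0.5], "embed": [1.0, 1.0]}

frozen = select_frozen(rate_of_change(before, after), fraction=0.34)
client_model = apply_global_update(after, server, frozen)
```

Here the `head` group changes most during local training, so it is the one kept on the client while the remaining groups are synchronized with the server.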

[CV-24] Enhancing Reusability of Learned Skills for Robot Manipulation via Gaze and Bottleneck

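[Quick Read]: This paper addresses the challenge of reusing a robot's learned dexterous manipulation skills across different scenarios. The key solution is the Gaze-based Bottleneck-aware Robot Manipulation (GazeBot) algorithm, which exploits gaze information and motion bottlenecks to keep learned motions highly reusable even when object positions and end-effector poses differ from those in the demonstrations. GazeBot also generalizes better than state-of-the-art imitation learning methods without sacrificing dexterity or reactivity. Training is entirely data-driven and requires only a demonstration dataset that includes gaze data.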

Link: https://arxiv.org/abs/2502.18121
Authors: Ryo Takizawa, Izumi Karino, Koki Nakagawa, Yoshiyuki Ohmura, Yasuo Kuniyoshi
Affiliations: The University of Tokyo
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Autonomous agents capable of diverse object manipulations should be able to acquire a wide range of manipulation skills with high reusability. Although advances in deep learning have made it increasingly feasible to replicate the dexterity of human teleoperation in robots, generalizing these acquired skills to previously unseen scenarios remains a significant challenge. In this study, we propose a novel algorithm, Gaze-based Bottleneck-aware Robot Manipulation (GazeBot), which enables high reusability of the learned motions even when the object positions and end-effector poses differ from those in the provided demonstrations. By leveraging gaze information and motion bottlenecks, both crucial features for object manipulation, GazeBot achieves high generalization performance compared with state-of-the-art imitation learning methods, without sacrificing its dexterity and reactivity. Furthermore, the training process of GazeBot is entirely data-driven once a demonstration dataset with gaze data is provided. Videos and code are available at this https URL.

[CV-25] PromptMID: Modal Invariant Descriptors Based on Diffusion and Vision Foundation Models for Optical-SAR Image Matching

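[Quick Read]: This paper addresses the stability and efficiency of optical and synthetic aperture radar (SAR) image matching in unseen domains. Existing learning-based methods are effective in specific scenarios but generalize poorly and adapt badly to practical applications. The key solution is PromptMID, which builds modality-invariant descriptors using text prompts based on land use classification as prior information. PromptMID extracts multi-scale modality-invariant features with pre-trained diffusion models and visual foundation models (VFMs) and fuses features of different granularities through specially designed feature aggregation modules, significantly improving the cross-domain generalization of optical-SAR image matching.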

Link: https://arxiv.org/abs/2502.18104
Authors: Han Nie, Bin Luo, Jun Liu, Zhitao Fu, Huan Zhou, Shuo Zhang, Weixing Liu
Affiliations: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China; School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China; Faculty of Land Resources Engineering, Kunming University of Science and Technology, Kunming 650031, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures

Abstract:The ideal goal of image matching is to achieve stable and efficient performance in unseen domains. However, many existing learning-based optical-SAR image matching methods, despite their effectiveness in specific scenarios, exhibit limited generalization and struggle to adapt to practical applications. Repeatedly training or fine-tuning matching models to address domain differences is not only inelegant but also introduces additional computational overhead and data production costs. In recent years, general foundation models have shown great potential for enhancing generalization. However, the disparity in visual domains between natural and remote sensing images poses challenges for their direct application. Therefore, effectively leveraging foundation models to improve the generalization of optical-SAR image matching remains a challenge. To address the above challenges, we propose PromptMID, a novel approach that constructs modality-invariant descriptors using text prompts based on land use classification as prior information for optical and SAR image matching. PromptMID extracts multi-scale modality-invariant features by leveraging pre-trained diffusion models and visual foundation models (VFMs), while specially designed feature aggregation modules effectively fuse features across different granularities. Extensive experiments on optical-SAR image datasets from four diverse regions demonstrate that PromptMID outperforms state-of-the-art matching methods, achieving superior results in both seen and unseen domains and exhibiting strong cross-domain generalization capabilities. The source code will be made publicly available at this https URL.

[CV-26] FwNet-ECA: Facilitating Window Attention with Global Receptive Fields through Fourier Filtering Operations

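[Quick Read]: This paper addresses the high computational complexity of window attention mechanisms and proposes a new method, FwNet-ECA. The method uses Fourier transforms with learnable weight matrices to enhance the spectral features of images, facilitating inter-window connectivity and thereby maximizing the receptive field. The key is to use frequency-domain enhancement, rather than physically shifted windows, to pass information implicitly across spatial regions. This lowers the parameter count and computational overhead while maintaining competitive accuracy.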

Link: https://arxiv.org/abs/2502.18094
Authors: Shengtian Mian, Ya Wang, Nannan Gu, Yuping Wang, Xiaoqing Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Windowed attention mechanisms were introduced to mitigate the issue of excessive computation inherent in global attention mechanisms. In this paper, we present FwNet-ECA, a novel method that utilizes Fourier transforms paired with learnable weight matrices to enhance the spectral features of images. This strategy facilitates inter-window connectivity, thereby maximizing the receptive field. Additionally, we incorporate the Efficient Channel Attention (ECA) module to improve communication between different channels. Instead of relying on physically shifted windows, our approach leverages frequency domain enhancement to implicitly bridge information across spatial regions. We validate our model on the iCartoonFace dataset and conduct downstream tasks on ImageNet, demonstrating that our model achieves lower parameter counts and computational overheads compared to shifted window approaches, while maintaining competitive accuracy. This work offers a more efficient and effective alternative for leveraging attention mechanisms in visual processing tasks, alleviating the challenges associated with windowed attention models. Code is available at this https URL.
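
Why filtering in the frequency domain gives a global receptive field can be shown with a one-dimensional sketch: each Fourier coefficient depends on every input position, so scaling coefficients by weights mixes information across the whole signal. The code is a simplification under stated assumptions, not the FwNet-ECA layer: it works on a 1D signal instead of a 2D feature map, uses a naive DFT rather than an FFT, and the weights are fixed numbers standing in for learnable parameters.

```python
import cmath

def dft(x, inverse=False):
    # Naive O(n^2) discrete Fourier transform (forward or inverse).
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for j in range(n))
        out.append(s / n if inverse else s)
    return out

def fourier_filter(signal, weights):
    # Transform, scale each frequency by its weight, transform back.
    spectrum = dft(signal)
    filtered = [w * c for w, c in zip(weights, spectrum)]
    # The real part is kept; weights that are not conjugate-symmetric
    # would otherwise leave an imaginary component.
    return [v.real for v in dft(filtered, inverse=True)]

signal = [1.0, 2.0, 3.0, 4.0]
identity = fourier_filter(signal, [1.0, 1.0, 1.0, 1.0])
dc_only = fourier_filter(signal, [1.0, 0.0, 0.0, 0.0])
```

Keeping only the zero-frequency weight turns every output position into the mean of the whole input, which illustrates how a single frequency-domain multiplication can connect positions that sit in different attention windows.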

[CV-27] A Fusion Model for Art Style and Author Recognition Based on Convolutional Neural Networks and Transformers

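[Quick Read]: This paper addresses the image classification challenges in art style and author recognition. The key solution is a fusion model that combines the strengths of convolutional neural networks (CNNs) and Transformers: CNNs first extract local features, a Transformer then captures global context, and a feature fusion mechanism improves classification accuracy. Experiments show that on Chinese painting and oil painting datasets the fusion model outperforms standalone CNN and Transformer models, improving classification accuracy by 9.7% and 7.1%, respectively, and raising F1 scores.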

Link: https://arxiv.org/abs/2502.18083
Authors: Zhenyu Wang, Heng Song
Affiliations: The School of Control and Computer Engineering, North China Electric Power University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The recognition of art styles and authors is crucial in areas like cultural heritage protection, art market analysis, and historical research. With the advancement of deep learning, Convolutional Neural Networks (CNNs) and Transformer models have become key tools for image classification. While CNNs excel in local feature extraction, they struggle with global context, and Transformers are strong in capturing global dependencies but weak in fine-grained local details. To address these challenges, this paper proposes a fusion model combining CNNs and Transformers for art style and author recognition. The model first extracts local features using CNNs, then captures global context with a Transformer, followed by a feature fusion mechanism to enhance classification accuracy. Experiments on Chinese and oil painting datasets show the fusion model outperforms individual CNN and Transformer models, improving classification accuracy by 9.7% and 7.1%, respectively, and increasing F1 scores by 0.06 and 0.05. The results demonstrate the model’s effectiveness and potential for future improvements, such as multimodal integration and architecture optimization.

[CV-28] Examining the Threat Landscape: Foundation Models and Model Stealing BMVC2024

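[Quick Read]: This paper examines the vulnerability of applications built on foundation models (FMs) to model stealing attacks. It finds that models fine-tuned from FMs are more susceptible than conventional vision architectures such as ResNets. The key factor is that FMs comprehensively encode visual patterns and features during pre-training, which attackers can exploit, leading to markedly higher attack success. Experiments show that, with a Vision Transformer (ViT-L/16) as the thief model, an attacker obtains 94.28% prediction agreement against a victim of the same architecture on CIFAR-10, compared with only 73.20% for a ResNet-18 victim. The paper stresses the need for effective security measures to protect such models from theft.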

Link: https://arxiv.org/abs/2502.18077
Authors: Ankita Raj, Deepankar Varma, Chetan Arora
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted to BMVC 2024

Abstract:Foundation models (FMs) for computer vision learn rich and robust representations, enabling their adaptation to task/domain-specific deployments with little to no fine-tuning. However, we posit that the very same strength can make applications based on FMs vulnerable to model stealing attacks. Through empirical analysis, we reveal that models fine-tuned from FMs harbor heightened susceptibility to model stealing, compared to conventional vision architectures like ResNets. We hypothesize that this behavior is due to the comprehensive encoding of visual patterns and features learned by FMs during pre-training, which are accessible to both the attacker and the victim. We report that an attacker is able to obtain 94.28% agreement (matched predictions with victim) for a Vision Transformer based victim model (ViT-L/16) trained on CIFAR-10 dataset, compared to only 73.20% agreement for a ResNet-18 victim, when using ViT-L/16 as the thief model. We arguably show, for the first time, that utilizing FMs for downstream tasks may not be the best choice for deployment in commercial APIs due to their susceptibility to model theft. We thereby alert model owners towards the associated security risks, and highlight the need for robust security measures to safeguard such models against theft. Code is available at this https URL.
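
The agreement figure quoted in the abstract is defined there as the share of inputs on which the thief model's prediction matches the victim's. A minimal sketch of that metric follows; the function name and the toy label lists are illustrative, not taken from the paper's code.

```python
def agreement(victim_preds, thief_preds):
    # Fraction of inputs where thief and victim predict the same class.
    if len(victim_preds) != len(thief_preds):
        raise ValueError("prediction lists must have the same length")
    if not victim_preds:
        raise ValueError("prediction lists must not be empty")
    matches = sum(v == t for v, t in zip(victim_preds, thief_preds))
    return matches / len(victim_preds)

# Toy example with made-up class labels for four test images.
victim = [0, 1, 2, 2]
thief = [0, 1, 1, 2]
score = agreement(victim, thief)  # 3 of 4 predictions match
```

Note that agreement is measured against the victim's outputs rather than the ground-truth labels, so a thief can score highly even on inputs the victim classifies incorrectly.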

[CV-29] Escaping The Big Data Paradigm in Self-Supervised Representation Learning

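[Quick Read]: This paper addresses the dependence of visual representation learning on large-scale datasets and extensive computational resources, especially in data-scarce domains. The key solution is SCOTT (Sparse Convolutional Tokenizer for Transformers), a shallow tokenization architecture that is compatible with Masked Image Modeling (MIM) tasks and injects convolutional inductive biases into Vision Transformers (ViTs) to make them more effective on small-scale datasets. The paper also proposes a Joint-Embedding Predictive Architecture within a MIM framework (MIM-JEPA), which operates in latent representation space to capture richer semantic features. With these, ViTs can be trained from scratch on datasets far smaller than traditionally required, without relying on large-scale external datasets for pretraining.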

链接: https://arxiv.org/abs/2502.18056
作者: Carlos Vélez García,Miguel Cazorla,Jorge Pomares
机构: INESCOP(INESCOP); University of Alicante (阿尔利冈大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code and implementation available at: this https URL

Abstract:The reliance on large-scale datasets and extensive computational resources has become a major barrier to advancing representation learning in vision, especially in data-scarce domains. In this paper, we address the critical question: Can we escape the big data paradigm in self-supervised representation learning from images? We introduce SCOTT (Sparse Convolutional Tokenizer for Transformers), a shallow tokenization architecture that is compatible with Masked Image Modeling (MIM) tasks. SCOTT injects convolutional inductive biases into Vision Transformers (ViTs), enhancing their efficacy in small-scale data regimes. Alongside this, we propose to train on a Joint-Embedding Predictive Architecture within a MIM framework (MIM-JEPA), operating in latent representation space to capture more semantic features. Our approach enables ViTs to be trained from scratch on datasets orders of magnitude smaller than traditionally required, without relying on massive external datasets for pretraining. We validate our method on three small-size, standard-resolution, fine-grained datasets: Oxford Flowers-102, Oxford IIIT Pets-37, and ImageNet-100. Despite the challenges of limited data and high intra-class similarity, frozen SCOTT models pretrained with MIM-JEPA significantly outperform fully supervised methods and achieve competitive results with SOTA approaches that rely on large-scale pretraining, complex image augmentations and bigger model sizes. By demonstrating that robust off-the-shelf representations can be learned with limited data, compute, and model sizes, our work paves the way for computer applications in resource constrained environments such as medical imaging or robotics. Our findings challenge the prevailing notion that vast amounts of data are indispensable for effective representation learning in vision, offering a new pathway toward more accessible and inclusive advancements in the field.
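
Masked Image Modeling hides a subset of image patches and asks the model to predict the hidden content. The sketch below shows only the generic patch-selection step, as a rough illustration. The 14 x 14 patch grid, the 60% mask ratio, and the function name are assumptions chosen for the example, not values reported for SCOTT or MIM-JEPA.

```python
import random


def random_patch_mask(num_patches: int, mask_ratio: float, seed: int = 0) -> list[int]:
    """Return the sorted indices of the patches that are hidden from the encoder."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    num_masked = int(num_patches * mask_ratio)
    rng = random.Random(seed)  # seeded so that the example is reproducible
    return sorted(rng.sample(range(num_patches), num_masked))


# Hypothetical setting: a 14 x 14 grid of patches with 60% of them masked.
num_patches = 14 * 14
masked = random_patch_mask(num_patches, mask_ratio=0.6)
masked_set = set(masked)
visible = [i for i in range(num_patches) if i not in masked_set]
print(len(masked), "masked patches,", len(visible), "visible patches")
```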

[CV-30] Progressive Local Alignment for Medical Multimodal Pre-training

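【Quick Read】: This paper addresses the challenge of local alignment between medical images and text, which is hampered by the absence of natural local pairings and the limitations of rigid region recognition methods. It proposes the Progressive Local Alignment Network (PLAN), which uses a contrastive learning-based approach to establish meaningful word-pixel relationships and a progressive learning strategy to refine them iteratively, improving alignment precision and robustness. The method improves soft region recognition while suppressing noise interference.

Link: https://arxiv.org/abs/2502.18047
Authors: Huimin Yan, Xian Yang, Liang Bai, Jiye Liang
Affiliations: Institute of Intelligent Information Processing, Shanxi University, Taiyuan, 030006, China; Alliance Manchester Business School, The University of Manchester, Manchester, M13 9PL, UK
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: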

Abstract:Local alignment between medical images and text is essential for accurate diagnosis, though it remains challenging due to the absence of natural local pairings and the limitations of rigid region recognition methods. Traditional approaches rely on hard boundaries, which introduce uncertainty, whereas medical imaging demands flexible soft region recognition to handle irregular structures. To overcome these challenges, we propose the Progressive Local Alignment Network (PLAN), which designs a novel contrastive learning-based approach for local alignment to establish meaningful word-pixel relationships and introduces a progressive learning strategy to iteratively refine these relationships, enhancing alignment precision and robustness. By combining these techniques, PLAN effectively improves soft region recognition while suppressing noise interference. Extensive experiments on multiple medical datasets demonstrate that PLAN surpasses state-of-the-art methods in phrase grounding, image-text retrieval, object detection, and zero-shot classification, setting a new benchmark for medical image-text alignment.

[CV-31] S-Graphs 2.0 – A Hierarchical-Semantic Optimization and Loop Closure for SLAM

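【Quick Read】: This paper addresses the failure of localization and mapping methods to exploit the semantic-relational information in the environment, which leads to inefficient management of robot poses and map elements and to high computational complexity in large-scale scenes. The key solution, Situational Graphs 2.0, builds a scene graph with four layers (Keyframes, Walls, Rooms, and Floors) and uses the hierarchical structure of indoor scenes for efficient data management and optimization. Its novelties are a front-end floor detection module and optimization strategies at different levels of the hierarchy (local over a window of keyframes, floor-global over the current floor, and room-local). In particular, floor-level semantic relations enable floor-based loop closure, which reduces false detections in visually similar areas.

Link: https://arxiv.org/abs/2502.18044
Authors: Hriday Bavle, Jose Luis Sanchez-Lopez, Muhammad Shaheer, Javier Civera, Holger Voos
Affiliations: Automation and Robotics Research Group, Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg; I3A, Universidad de Zaragoza, Spain
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: 8 pages, 9 figures, RAL submission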

Abstract:Works based on localization and mapping do not exploit the inherent semantic-relational information from the environment for faster and more efficient management and optimization of the robot poses and its map elements, often leading to pose and map inaccuracies and computational inefficiencies in large-scale environments. 3D scene graph representations, which organize the environment in a hierarchical manner, can be exploited to enhance the management and optimization of the underlying robot poses and map. In this direction, we present our work Situational Graphs 2.0, which leverages the hierarchical structure of indoor scenes for efficient data management and optimization. Our algorithm begins by constructing a situational graph that organizes the environment into four layers: Keyframes, Walls, Rooms, and Floors. Our first novelty lies in the front-end, which includes a floor detection module capable of identifying stairways and assigning floor-level semantic relations to the underlying layers. These floor-level semantics enable a floor-based loop closure strategy, rejecting false-positive loop closures in visually similar areas on different floors. Our second novelty is in exploiting the hierarchy for an improved optimization. It consists of: (1) local optimization, optimizing a window of recent keyframes and their connected components, (2) floor-global optimization, which focuses only on keyframes and their connections within the current floor during loop closures, and (3) room-local optimization, marginalizing redundant keyframes that share observations within the room. We validate our algorithm extensively in different real multi-floor environments. Our approach demonstrates state-of-the-art results in large-scale multi-floor environments, creating hierarchical maps while bounding the computational complexity, where several baseline works fail to execute efficiently.
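
The floor-based loop closure idea can be illustrated with a small filter that discards loop closure candidates lying on a different floor from the query keyframe. This is a simplified sketch under stated assumptions: the `Keyframe` class, its fields, and `filter_loop_candidates` are hypothetical names, and the actual S-Graphs 2.0 pipeline is not described at this level in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyframe:
    kf_id: int
    floor_id: int  # floor-level semantic label, assumed to come from the front-end


def filter_loop_candidates(query: Keyframe, candidates: list[Keyframe]) -> list[Keyframe]:
    """Keep only the candidates that lie on the same floor as the query keyframe."""
    return [c for c in candidates if c.floor_id == query.floor_id]


# Toy data: candidates 7 and 23 look similar but are on other floors.
query = Keyframe(kf_id=42, floor_id=1)
candidates = [Keyframe(7, 0), Keyframe(19, 1), Keyframe(23, 2), Keyframe(31, 1)]
kept = filter_loop_candidates(query, candidates)
print([c.kf_id for c in kept])
```

Filtering by floor before any appearance or geometry check is what removes false positives from corridors that look alike on different storeys.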

[CV-32] VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion

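【Quick Read】: This paper addresses the loss of critical semantic information when current autonomous driving systems convert 2D observations into 3D space, which hinders their deployment in dynamic and complex environments. The key solution is VLM-E2E, a framework that uses the scene understanding and reasoning abilities of Vision-Language Models (VLMs) to enhance training by providing attentional cues. It also introduces a BEV-Text learnable weighted fusion strategy to address the imbalance in modality importance when fusing multimodal information. The approach explicitly captures the driver's attentional semantics and ensures that complementary information from the visual and textual modalities is used effectively, giving a more holistic and robust representation of the driving environment.

Link: https://arxiv.org/abs/2502.18042
Authors: Pei Liu(1), Haipeng Liu(2), Haichao Liu(1), Xin Liu(1), Jinxin Ni(3), Jun Ma(1 and 4) ((1) The Hong Kong University of Science and Technology (Guangzhou), (2) Li Auto Inc., (3) Xiamen University, (4) The Hong Kong University of Science and Technology)
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Li Auto Inc.; School of Aeronautics and Astronautics, Xiamen University; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: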

Abstract:Human drivers adeptly navigate complex scenarios by utilizing rich attentional semantics, but the current autonomous systems struggle to replicate this ability, as they often lose critical semantic information when converting 2D observations into 3D space. In this sense, it hinders their effective deployment in dynamic and complex environments. Leveraging the superior scene understanding and reasoning abilities of Vision-Language Models (VLMs), we propose VLM-E2E, a novel framework that uses the VLMs to enhance training by providing attentional cues. Our method integrates textual representations into Bird’s-Eye-View (BEV) features for semantic supervision, which enables the model to learn richer feature representations that explicitly capture the driver’s attentional semantics. By focusing on attentional semantics, VLM-E2E better aligns with human-like driving behavior, which is critical for navigating dynamic and complex environments. Furthermore, we introduce a BEV-Text learnable weighted fusion strategy to address the issue of modality importance imbalance in fusing multimodal information. This approach dynamically balances the contributions of BEV and text features, ensuring that the complementary information from visual and textual modality is effectively utilized. By explicitly addressing the imbalance in multimodal fusion, our method facilitates a more holistic and robust representation of driving environments. We evaluate VLM-E2E on the nuScenes dataset and demonstrate its superiority over state-of-the-art approaches, showcasing significant improvements in performance.
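
One plausible reading of a "learnable weighted fusion" is a pair of trainable logits that are normalised with a softmax and used to mix the two feature vectors. The sketch below illustrates that reading only. The abstract does not specify the fusion formula, so the softmax weighting, the function names, and the toy vectors are all assumptions.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(bev_feat: list[float], text_feat: list[float],
                    logits: list[float]) -> list[float]:
    """Mix two same-length feature vectors with softmax-normalised weights.

    `logits` stands in for two learnable parameters (an assumption of this sketch).
    """
    if len(bev_feat) != len(text_feat):
        raise ValueError("feature vectors must have the same length")
    w_bev, w_text = softmax(logits)
    return [w_bev * b + w_text * t for b, t in zip(bev_feat, text_feat)]


# With equal logits both modalities receive weight 0.5.
bev = [1.0, 2.0, 3.0]
text = [3.0, 2.0, 1.0]
fused = weighted_fusion(bev, text, logits=[0.0, 0.0])
print(fused)
```

Because the weights always sum to one, training can shift emphasis between modalities without changing the overall feature scale.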

[CV-33] OpenFly: A Versatile Toolchain and Large-scale Benchmark for Aerial Vision-Language Navigation

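【Quick Read】: This paper addresses the shortage of data for outdoor aerial Vision-Language Navigation (VLN). Outdoor aerial views cover vast areas, which makes data collection harder and has left the field without benchmarks. The paper proposes OpenFly, a platform built around a highly automated data collection toolchain covering point cloud acquisition, scene semantic segmentation, flight trajectory creation, and instruction generation. Using this toolchain, the authors build a large-scale outdoor aerial VLN dataset with 100,000 trajectories, with high-quality visual data generated by several rendering engines and techniques. They also propose OpenFly-Agent, a keyframe-aware VLN model that outputs flight actions directly. The key elements are the automated toolchain, the large-scale dataset, and the new model design.

Link: https://arxiv.org/abs/2502.18041
Authors: Yunpeng Gao, Chenhui Li, Zhongrui You, Junli Liu, Zhen Li, Pengan Chen, Qizhi Chen, Zhonghan Tang, Liansheng Wang, Penghui Yang, Yiwen Tang, Yuhang Tang, Shuai Liang, Songyi Zhu, Ziqin Xiong, Yifei Su, Xinyi Ye, Jianan Li, Yan Ding, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li
Affiliations: Shanghai AI Laboratory; Northwestern Polytechnical University; Beijing University of Posts and Telecommunications; Shanghai Jiao Tong University; The University of Hong Kong; Zhejiang University; University of Science and Technology of China; East China University of Science and Technology; Fudan University; Institute of Automation, Chinese Academy of Sciences; TeleAI
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: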

Abstract:Vision-Language Navigation (VLN) aims to guide agents through an environment by leveraging both language instructions and visual cues, playing a pivotal role in embodied AI. Indoor VLN has been extensively studied, whereas outdoor aerial VLN remains underexplored. The potential reason is that outdoor aerial view encompasses vast areas, making data collection more challenging, which results in a lack of benchmarks. To address this problem, we propose OpenFly, a platform comprising a versatile toolchain and large-scale benchmark for aerial VLN. Firstly, we develop a highly automated toolchain for data collection, enabling automatic point cloud acquisition, scene semantic segmentation, flight trajectory creation, and instruction generation. Secondly, based on the toolchain, we construct a large-scale aerial VLN dataset with 100k trajectories, covering diverse heights and lengths across 18 scenes. The corresponding visual data are generated using various rendering engines and advanced techniques, including Unreal Engine, GTA V, Google Earth, and 3D Gaussian Splatting (3D GS). All data exhibit high visual quality. Particularly, 3D GS supports real-to-sim rendering, further enhancing the realism of the dataset. Thirdly, we propose OpenFly-Agent, a keyframe-aware VLN model, which takes language instructions, current observations, and historical keyframes as input, and outputs flight actions directly. Extensive analyses and experiments are conducted, showcasing the superiority of our OpenFly platform and OpenFly-Agent. The toolchain, dataset, and codes will be open-sourced.

[CV-34] High-precision visual navigation device calibration method based on collimator

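【Quick Read】: This paper addresses the time-consuming camera calibration and complex attitude adjustment required by visual navigation devices. The key solution is a collimator-based calibration method and system: it introduces a single-image camera calibration algorithm and combines it with a precision adjustment mechanism to build a rotation transfer model between coordinate systems, enabling efficient attitude calibration.

Link: https://arxiv.org/abs/2502.18012
Authors: Shunkun Liang, Dongcai Tan, Banglei Guan, Zhang Li, Guangcheng Dai, Nianpeng Pan, Liang Shen, Yang Shang, Qifeng Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: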

Abstract:Visual navigation devices require precise calibration to achieve high-precision localization and navigation, which includes camera and attitude calibration. To address the limitations of time-consuming camera calibration and complex attitude adjustment processes, this study presents a collimator-based calibration method and system. Based on the optical characteristics of the collimator, a single-image camera calibration algorithm is introduced. In addition, integrated with the precision adjustment mechanism of the calibration frame, a rotation transfer model between coordinate systems enables efficient attitude calibration. Experimental results demonstrate that the proposed method achieves accuracy and stability comparable to traditional multi-image calibration techniques. Specifically, the re-projection errors are less than 0.1463 pixels, and average attitude angle errors are less than 0.0586 degrees with a standard deviation less than 0.0257 degrees, demonstrating high precision and robustness.

[CV-35] Shedding Light on the Polymers Identity: Microplastic Detection and Identification Through Nile Red Staining and Multispectral Imaging (FIMAP)

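【Quick Read】: This paper addresses the detection and identification challenges posed by the widespread distribution of microplastics (MPs) in the environment. The key solution is the Fluorescence Imaging Microplastic Analysis Platform (FIMAP), a multispectral camera with four optical filters and five excitation wavelengths. FIMAP uses K-means clustering for robust segmentation (Intersection over Union = 0.877) and a 20-dimensional color coordinate multivariate nearest neighbor approach for MP classification (3.14 mm), achieving 90% precision, 90% accuracy, 100% recall, and an F1 score of 94.7%. The platform effectively excludes interference from natural organic matter (NOM).

Link: https://arxiv.org/abs/2502.17997
Authors: Derek Ho, Haotian Feng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 20 pages (with additional supplementary material), 5 Figures, 3 Tables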

Abstract:The widespread distribution of microplastics (MPs) in the environment presents significant challenges for their detection and identification. Fluorescence imaging has emerged as a promising technique for enhancing plastic particle detectability and enabling accurate classification based on fluorescence behavior. However, conventional segmentation techniques face limitations, including poor signal-to-noise ratio, inconsistent illumination, thresholding difficulties, and false positives from natural organic matter (NOM). To address these challenges, this study introduces the Fluorescence Imaging Microplastic Analysis Platform (FIMAP), a retrofitted multispectral camera with four optical filters and five excitation wavelengths. FIMAP enables comprehensive characterization of the fluorescence behavior of ten Nile Red-stained MPs: HDPE, LDPE, PP, PS, EPS, ABS, PVC, PC, PET, and PA, while effectively excluding NOM. Using K-means clustering for robust segmentation (Intersection over Union = 0.877) and a 20-dimensional color coordinate multivariate nearest neighbor approach for MP classification (3.14 mm), FIMAP achieves 90% precision, 90% accuracy, 100% recall, and an F1 score of 94.7%. Only PS was occasionally misclassified as EPS. For smaller MPs (35-104 microns), classification accuracy declined, likely due to reduced stain sorption, fewer detectable pixels, and camera instability. Integrating FIMAP with higher-magnification instruments, such as a microscope, may enhance MP identification. This study presents FIMAP as an automated, high-throughput framework for detecting and classifying MPs across large environmental sample volumes.
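
The reported F1 score follows from the reported precision and recall, since F1 is their harmonic mean. The short check below reproduces it and also shows how an Intersection over Union value is computed for two segmentation masks. The helper names and the toy pixel sets are assumptions made for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def iou(mask_a: set[tuple[int, int]], mask_b: set[tuple[int, int]]) -> float:
    """Intersection over Union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 0.0  # convention chosen for this sketch when both masks are empty
    return len(mask_a & mask_b) / len(union)


# Precision 90% and recall 100%, as reported in the abstract.
print(f"F1 = {f1_score(0.90, 1.00):.1%}")

# Toy masks (hypothetical): two 2 x 2 squares that share one column.
mask_a = {(0, 0), (0, 1), (1, 0), (1, 1)}
mask_b = {(0, 1), (1, 1), (0, 2), (1, 2)}
print(f"IoU = {iou(mask_a, mask_b):.3f}")
```

The first line prints 94.7%, matching the value in the abstract: 2 x 0.9 x 1.0 / 1.9 is approximately 0.947.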

[CV-36] Improved YOLOv7x-Based Defect Detection Algorithm for Power Equipment

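【Quick Read】: This paper addresses anomaly detection for power equipment by proposing an improved YOLOv7x algorithm. The solution has three key parts. First, the ACmix convolutional mixed attention module is introduced to suppress background noise and irrelevant features, strengthening the network's feature extraction. Second, the Biformer attention mechanism is added to sharpen the focus on key features and improve the network's ability to recognize feature images flexibly. Third, the original loss function is replaced with the MPDIoU function, which evaluates the relationship between predicted and ground-truth boxes more comprehensively and addresses mismatched predicted boxes. These changes improve detection accuracy, reaching an mAP@0.5 of 93.5% across all target categories, a precision of 97.1%, and a recall of 97%.

Link: https://arxiv.org/abs/2502.17961
Authors: Jin Hou, Hao Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: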

Abstract:The normal operation of power equipment plays a critical role in the power system, making anomaly detection for power equipment highly significant. This paper proposes an improved YOLOv7x-based anomaly detection algorithm for power equipment. First, the ACmix convolutional mixed attention mechanism module is introduced to effectively suppress background noise and irrelevant features, thereby enhancing the network’s feature extraction capability. Second, the Biformer attention mechanism is added to the network to strengthen the focus on key features, improving the network’s ability to flexibly recognize feature images. Finally, to more comprehensively evaluate the relationship between predicted and ground truth bounding boxes, the original loss function is replaced with the MPDIoU function, addressing the issue of mismatched predicted bounding boxes. The improved algorithm enhances detection accuracy, achieving an mAP@0.5 of 93.5% for all target categories, a precision of 97.1%, and a recall of 97%.
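
As a rough illustration of the MPDIoU idea, the sketch below uses the commonly cited definition: plain IoU minus the squared distances between the two boxes' top-left corners and between their bottom-right corners, each normalised by the squared image diagonal. Whether this paper uses exactly this variant is an assumption, and the function name and toy boxes are illustrative.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2 and y1 < y2


def mpdiou(pred: Box, gt: Box, img_w: float, img_h: float) -> float:
    """IoU penalised by normalised squared corner distances (assumed MPDIoU form)."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    iou = inter / union if union > 0 else 0.0
    d1_sq = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2  # top-left corners
    d2_sq = (pred[2] - gt[2]) ** 2 + (pred[3] - gt[3]) ** 2  # bottom-right corners
    norm = img_w ** 2 + img_h ** 2
    return iou - d1_sq / norm - d2_sq / norm


# Toy boxes in a 100 x 100 image; the loss is taken as 1 - MPDIoU.
gt_box = (10.0, 10.0, 50.0, 50.0)
shifted = (20.0, 20.0, 60.0, 60.0)
print("loss for a perfect match:", 1.0 - mpdiou(gt_box, gt_box, 100, 100))
print("loss for a shifted box:", round(1.0 - mpdiou(shifted, gt_box, 100, 100), 4))
```

Unlike plain IoU, the corner-distance terms keep changing when two boxes do not overlap at all, which is one reason such penalties are used for box regression.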

[CV-37] Robust Polyp Detection and Diagnosis through Compositional Prompt-Guided Diffusion Models

【Quick Read】: This paper addresses the poor generalization of deep learning models across multi-center datasets in colorectal cancer (CRC) screening, and the inability of traditional data augmentation to capture the complexity of medical images. The key solution is the Progressive Spectrum Diffusion Model (PSDM), which integrates diverse clinical annotations (such as segmentation masks, bounding boxes, and colonoscopy reports) by transforming them into compositional prompts. The prompts are split into coarse and fine components, letting the model capture both broad structures and fine details and generate clinically accurate synthetic images. Adding PSDM-generated samples to the training data significantly improves polyp detection, classification, and segmentation.

Link: https://arxiv.org/abs/2502.17951
Authors: Jia Yu, Yan Zhu, Peiyao Fu, Tianyi Chen, Junbo Huang, Quanlin Li, Pinghong Zhou, Zhihua Wang, Fei Wu, Shuo Wang, Xian Yang
Affiliations: Zhejiang University, Shanghai Institute for Advanced Study of Zhejiang University; Shanghai Key Laboratory of MICCAI, Zhongshan Hospital, Fudan University, Shanghai, China; Shanghai Collaborative Innovation Center of Endoscopy, Shanghai, China; Alliance Manchester Business School, The University of Manchester, Manchester, UK; Data Science Institute, Imperial College London, London, UK
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Colorectal cancer (CRC) is a significant global health concern, and early detection through screening plays a critical role in reducing mortality. While deep learning models have shown promise in improving polyp detection, classification, and segmentation, their generalization across diverse clinical environments, particularly with out-of-distribution (OOD) data, remains a challenge. Multi-center datasets like PolypGen have been developed to address these issues, but their collection is costly and time-consuming. Traditional data augmentation techniques provide limited variability, failing to capture the complexity of medical images. Diffusion models have emerged as a promising solution for generating synthetic polyp images, but the image generation process in current models mainly relies on segmentation masks as the condition, limiting their ability to capture the full clinical context. To overcome these limitations, we propose a Progressive Spectrum Diffusion Model (PSDM) that integrates diverse clinical annotations-such as segmentation masks, bounding boxes, and colonoscopy reports-by transforming them into compositional prompts. These prompts are organized into coarse and fine components, allowing the model to capture both broad spatial structures and fine details, generating clinically accurate synthetic images. By augmenting training data with PSDM-generated samples, our model significantly improves polyp detection, classification, and segmentation. For instance, on the PolypGen dataset, PSDM increases the F1 score by 2.12% and the mean average precision by 3.09%, demonstrating superior performance in OOD scenarios and enhanced generalization.

[CV-38] InVDriver: Intra-Instance Aware Vectorized Query-Based Autonomous Driving Transformer

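【Quick Read】: This paper addresses a shortcoming of existing vectorized query-based frameworks: they overlook the inherent spatial correlations among the points within an instance, which produces geometrically inconsistent outputs such as fragmented HD map elements or oscillatory trajectories. The key solution is InVDriver, a system that explicitly models intra-instance spatial dependencies through masked self-attention layers, improving planning accuracy and trajectory smoothness.

Link: https://arxiv.org/abs/2502.17949
Authors: Bo Zhang, Heye Huang, Chunyang Liu, Yaqin Zhang, Zhenhua Xu
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to JICV (Journal of Intelligent and Connected Vehicles)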

Abstract:End-to-end autonomous driving, with its holistic optimization capabilities, has gained increasing traction in academia and industry. Vectorized representations, which preserve instance-level topological information while reducing computational overhead, have emerged as a promising paradigm. However, existing vectorized query-based frameworks often overlook the inherent spatial correlations among intra-instance points, resulting in geometrically inconsistent outputs (e.g., fragmented HD map elements or oscillatory trajectories). To address these limitations, we propose InVDriver, a novel vectorized query-based system that systematically models intra-instance spatial dependencies through masked self-attention layers, thereby enhancing planning accuracy and trajectory smoothness. Across all core modules, i.e., perception, prediction, and planning, InVDriver incorporates masked self-attention mechanisms that restrict attention to intra-instance point interactions, enabling coordinated refinement of structural elements while suppressing irrelevant inter-instance noise. Experimental results on the nuScenes benchmark demonstrate that InVDriver achieves state-of-the-art performance, surpassing prior methods in both accuracy and safety, while maintaining high computational efficiency. Our work validates that explicit modeling of intra-instance geometric coherence is critical for advancing vectorized autonomous driving systems, bridging the gap between theoretical advantages of end-to-end frameworks and practical deployment requirements.
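
Restricting attention to points of the same instance amounts to a block-structured attention mask. The sketch below builds such a mask from per-point instance ids and applies it to one row of attention scores. It is a generic illustration of masked attention; the function names and toy values are assumptions and do not come from the InVDriver implementation.

```python
import math


def intra_instance_mask(instance_ids: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when points i and j belong to the same instance."""
    return [[a == b for b in instance_ids] for a in instance_ids]


def masked_softmax(scores: list[float], allowed: list[bool]) -> list[float]:
    """Softmax over the allowed positions only; masked positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    if not kept:
        raise ValueError("at least one position must be allowed")
    m = max(kept)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Five points: the first two belong to instance 0, the last three to instance 1.
ids = [0, 0, 1, 1, 1]
mask = intra_instance_mask(ids)
for row in mask:
    print("".join("1" if ok else "0" for ok in row))

# Point 0 ignores the high scores of the points in the other instance.
weights = masked_softmax([1.0, 1.0, 5.0, 5.0, 5.0], mask[0])
print(weights)
```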

[CV-39] Optimal Brain Apoptosis ICLR2025

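【Quick Read】: This paper addresses the computational efficiency and resource challenges that growing parameter counts create for Convolutional Neural Networks (CNNs) and Transformers. The key solution is Optimal Brain Apoptosis (OBA), a pruning method that improves parameter importance estimation by directly computing the Hessian-vector product value for each parameter. It decomposes the Hessian matrix across layers and identifies the conditions under which inter-layer Hessian submatrices are non-zero, which allows the second-order Taylor expansion of the parameters to be computed efficiently and makes pruning more precise.

Link: https://arxiv.org/abs/2502.17941
Authors: Mingyuan Sun, Zheng Fang, Jiaxu Wang, Junjie Jiang, Delei Kong, Chenming Hu, Yuetong Fang, Renjing Xu
Affiliations: Northeastern University (China); The Hong Kong University of Science and Technology (Guangzhou); Hunan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ICLR 2025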

Abstract:The increasing complexity and parameter count of Convolutional Neural Networks (CNNs) and Transformers pose challenges in terms of computational efficiency and resource demands. Pruning has been identified as an effective strategy to address these challenges by removing redundant elements such as neurons, channels, or connections, thereby enhancing computational efficiency without heavily compromising performance. This paper builds on the foundational work of Optimal Brain Damage (OBD) by advancing the methodology of parameter importance estimation using the Hessian matrix. Unlike previous approaches that rely on approximations, we introduce Optimal Brain Apoptosis (OBA), a novel pruning method that calculates the Hessian-vector product value directly for each parameter. By decomposing the Hessian matrix across network layers and identifying conditions under which inter-layer Hessian submatrices are non-zero, we propose a highly efficient technique for computing the second-order Taylor expansion of parameters. This approach allows for a more precise pruning process, particularly in the context of CNNs and Transformers, as validated in our experiments including VGG19, ResNet32, ResNet50, and ViT-B/16 on CIFAR10, CIFAR100 and Imagenet datasets. Our code is available at this https URL.
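
For context, the second-order Taylor expansion that OBD-style importance estimates build on can be written as below. This is the standard textbook form rather than the paper's own derivation, and the symbols (loss L, parameters θ, gradient g, Hessian H) are notation chosen here.

```latex
\delta L \;=\; \sum_i g_i\,\delta\theta_i
\;+\; \frac{1}{2}\sum_i\sum_j H_{ij}\,\delta\theta_i\,\delta\theta_j
\;+\; O\!\left(\lVert\delta\theta\rVert^{3}\right),
\qquad
g_i = \frac{\partial L}{\partial\theta_i},\quad
H_{ij} = \frac{\partial^{2} L}{\partial\theta_i\,\partial\theta_j}.

% OBD keeps only the diagonal of H and assumes g is approximately 0 at a trained
% minimum, so removing parameter i (\delta\theta_i = -\theta_i) has saliency
s_i^{\mathrm{OBD}} \;=\; \frac{1}{2}\,H_{ii}\,\theta_i^{2}.

% The full quadratic term needs only a Hessian-vector product:
\frac{1}{2}\,\delta\theta^{\top} H\,\delta\theta
\;=\; \frac{1}{2}\,\delta\theta^{\top}\left(H\,\delta\theta\right).
```

The last identity shows why Hessian-vector products are attractive here: the off-diagonal terms that the diagonal approximation drops can be included without ever forming the full Hessian.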

[CV-40] Deep-JGAC: End-to-End Deep Joint Geometry and Attribute Compression for Dense Colored Point Clouds

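【Quick Read】: This paper addresses the need for effective compression of colored point clouds, driven by their huge data volume. The key solution is an end-to-end Deep Joint Geometry and Attribute point cloud Compression (Deep-JGAC) framework that exploits the correlation between geometry and attribute information for high compression efficiency. The framework is compatible with both learning-based and non-learning-based geometry and attribute encoders. It adds an attribute-assisted deep geometry encoder and an Attribute Information Fusion Module to enhance the geometry representation, and an optimized re-colorization module that resolves the attribute mismatch caused by geometry compression distortion, improving colorization and lowering computational complexity.

Link: https://arxiv.org/abs/2502.17939
Authors: Yun Zhang, Zixi Guo, Linwei Zhu, C.-C. Jay Kuo
Affiliations: School of Electronics and Communication Engineering, Shenzhen Campus, Sun Yat-Sen University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Department of Electrical and Computer Engineering, University of Southern California
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: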

Abstract:Colored point clouds have become a fundamental representation in the realm of 3D vision. Effective Point Cloud Compression (PCC) is urgently needed due to the huge amount of data. In this paper, we propose an end-to-end Deep Joint Geometry and Attribute point cloud Compression (Deep-JGAC) framework for dense colored point clouds, which exploits the correlation between the geometry and attribute for high compression efficiency. Firstly, we propose a flexible Deep-JGAC framework, where the geometry and attribute sub-encoders are compatible with either learning or non-learning based geometry and attribute encoders. Secondly, we propose an attribute-assisted deep geometry encoder that enhances the geometry latent representation with the help of attribute, where the geometry decoding remains unchanged. Moreover, an Attribute Information Fusion Module (AIFM) is proposed to fuse attribute information in geometry coding. Thirdly, to solve the mismatch between the point cloud geometry and attribute caused by the geometry compression distortion, we present an optimized re-colorization module to attach the attribute to the geometrically distorted point cloud for attribute coding. It enhances the colorization and lowers the computational complexity. Extensive experimental results demonstrate that in terms of the geometry quality metric D1-PSNR, the proposed Deep-JGAC achieves an average of 82.96%, 36.46%, 41.72%, and 31.16% bit-rate reductions as compared to the state-of-the-art G-PCC, V-PCC, GRASP, and PCGCv2, respectively. In terms of the perceptual joint quality metric MS-GraphSIM, the proposed Deep-JGAC achieves an average of 48.72%, 14.67%, and 57.14% bit-rate reductions compared to the G-PCC, V-PCC, and IT-DL-PCC, respectively. The encoding/decoding time costs are also reduced by 94.29%/24.70%, and 96.75%/91.02% on average as compared with the V-PCC and IT-DL-PCC.

[CV-41] AirCast: Improving Air Pollution Forecasting Through Multi-Variable Data Alignment

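【Quick Read】: This paper addresses the difficulty of forecasting extreme air pollution events, which are rare and unevenly distributed in historical data and therefore hard for conventional methods to predict. It proposes AirCast, a multi-variable air quality forecasting model built on a multi-task learning architecture, a Frequency-weighted Mean Absolute Error (fMAE) loss, and domain-informed selection of key variables, which together enable more accurate pollution forecasts.

Link: https://arxiv.org/abs/2502.17919
Authors: Vishal Nedungadi, Muhammad Akhtar Munir, Marc Rußwurm, Ron Sarafian, Ioannis N. Athanasiadis, Yinon Rudich, Fahad Shahbaz Khan, Salman Khan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: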

Abstract:Air pollution remains a leading global health risk, exacerbated by rapid industrialization and urbanization, contributing significantly to morbidity and mortality rates. In this paper, we introduce AirCast, a novel multi-variable air pollution forecasting model, by combining weather and air quality variables. AirCast employs a multi-task head architecture that simultaneously forecasts atmospheric conditions and pollutant concentrations, improving its understanding of how weather patterns affect air quality. Predicting extreme pollution events is challenging due to their rare occurrence in historic data, resulting in a heavy-tailed distribution of pollution levels. To address this, we propose a novel Frequency-weighted Mean Absolute Error (fMAE) loss, adapted from the class-balanced loss for regression tasks. Informed by domain knowledge, we investigate the selection of key variables known to influence pollution levels. Additionally, we align existing weather and chemical datasets across spatial and temporal dimensions. AirCast’s integrated approach, combining multi-task learning, frequency-weighted loss and domain-informed variable selection, enables more accurate pollution forecasts. Our source code and models are made public here (this https URL).
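
A frequency-weighted MAE can be sketched by binning the targets and giving each sample a weight that shrinks as its bin becomes more populated. The version below borrows the "effective number of samples" weighting from the class-balanced loss, (1 - beta) / (1 - beta^n). The exact fMAE formulation is not given in the abstract, so the binning scheme, `bin_width`, `beta`, and the function name are all assumptions.

```python
from collections import Counter


def frequency_weighted_mae(y_true: list[float], y_pred: list[float],
                           bin_width: float = 10.0, beta: float = 0.9) -> float:
    """MAE in which samples from rare target bins receive larger weights."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [int(y // bin_width) for y in y_true]
    counts = Counter(bins)
    # Class-balanced style weight: (1 - beta) / (1 - beta ** n_bin).
    weights = [(1 - beta) / (1 - beta ** counts[b]) for b in bins]
    total_weight = sum(weights)
    weighted_errors = sum(w * abs(t - p) for w, t, p in zip(weights, y_true, y_pred))
    return weighted_errors / total_weight


# Toy data: four common low readings predicted perfectly, one rare extreme missed.
y_true = [5.0, 6.0, 7.0, 8.0, 95.0]
y_pred = [5.0, 6.0, 7.0, 8.0, 45.0]
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
fmae = frequency_weighted_mae(y_true, y_pred)
print(f"plain MAE = {mae:.2f}, frequency-weighted MAE = {fmae:.2f}")
```

In this toy case the weighted loss is larger than the plain MAE because the single missed extreme value sits alone in its bin and therefore carries more weight.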

[CV-42] BD Currency Detection: A CNN Based Approach with Mobile App Integration

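【Quick Read】: This paper addresses the accuracy and efficiency limitations of traditional currency recognition methods. The key solution uses deep learning, specifically Convolutional Neural Networks (CNNs), to classify Bangladeshi banknotes with high accuracy. An optimized CNN model is trained, converted to TensorFlow Lite format for real-time and offline use, and integrated into an Android mobile application, providing a fast, secure, and accessible currency recognition solution.

Link: https://arxiv.org/abs/2502.17907
Authors: Syed Jubayer Jaman, Md. Zahurul Haque, Md Robiul Islam, Usama Abdun Noor
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: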

Abstract:Currency recognition plays a vital role in banking, commerce, and assistive technology for visually impaired individuals. Traditional methods, such as manual verification and optical scanning, often suffer from limitations in accuracy and efficiency. This study introduces an advanced currency recognition system utilizing Convolutional Neural Networks (CNNs) to accurately classify Bangladeshi banknotes. A dataset comprising 50,334 images was collected, preprocessed, and used to train a CNN model optimized for high performance classification. The trained model achieved an accuracy of 98.5%, surpassing conventional image based currency recognition approaches. To enable real time and offline functionality, the model was converted into TensorFlow Lite format and integrated into an Android mobile application. The results highlight the effectiveness of deep learning in currency recognition, providing a fast, secure, and accessible solution that enhances financial transactions and assistive technologies.

[CV-43] FetchBot: Object Fetching in Cluttered Shelves via Zero-Shot Sim2Real

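【Quick Read】: This paper addresses safety and generalization when robots fetch objects from cluttered shelves. The key solution is FetchBot, a sim-to-real framework that generates diverse simulated cluttered scenes at scale with an efficient voxel-based method and trains a dynamics-aware reinforcement learning policy to produce object fetching trajectories. Depth estimated by full-fledged depth foundation models is used as the input to the vision-based policy to narrow the sim-to-real gap. A novel multi-view representation learning architecture encodes complex scenes comprehensively, so the robot can reduce collisions across varying positions and depths and operate safely and robustly.

Link: https://arxiv.org/abs/2502.17894
Authors: Weiheng Liu, Yuxuan Wan, Jilong Wang, Yuxuan Kuang, Xuesong Shi, Haoran Li, Dongbin Zhao, Zhizheng Zhang, He Wang
Affiliations: Institute of Automation, Chinese Academy of Sciences; CFCS, School of Computer Science, Peking University; Galbot; Beijing Academy of Artificial Intelligence
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: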

Abstract:Object fetching from cluttered shelves is an important capability for robots to assist humans in real-world scenarios. Achieving this task demands robotic behaviors that prioritize safety by minimizing disturbances to surrounding objects, an essential but highly challenging requirement due to restricted motion space, limited fields of view, and complex object dynamics. In this paper, we introduce FetchBot, a sim-to-real framework designed to enable zero-shot generalizable and safety-aware object fetching from cluttered shelves in real-world settings. To address data scarcity, we propose an efficient voxel-based method for generating diverse simulated cluttered shelf scenes at scale and train a dynamics-aware reinforcement learning (RL) policy to generate object fetching trajectories within these scenes. This RL policy, which leverages oracle information, is subsequently distilled into a vision-based policy for real-world deployment. Considering that sim-to-real discrepancies stem from texture variations mostly while from geometric dimensions rarely, we propose to adopt depth information estimated by full-fledged depth foundation models as the input for the vision-based policy to mitigate sim-to-real gap. To tackle the challenge of limited views, we design a novel architecture for learning multi-view representations, allowing for comprehensive encoding of cluttered shelf scenes. This enables FetchBot to effectively minimize collisions while fetching objects from varying positions and depths, ensuring robust and safety-aware operation. Both simulation and real-robot experiments demonstrate FetchBot’s superior generalization ability, particularly in handling a broad range of real-world scenarios, includ

[CV-44] From underwater to aerial: a novel multi-scale knowledge distillation approach for coral reef monitoring

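[Quick read]: This paper addresses high-precision mapping and monitoring of coral reef ecosystems. The key to the solution is a novel multi-scale approach that combines underwater and aerial imagery and uses deep learning models to transfer information across scales. Underwater images are acquired with an Autonomous Surface Vehicle (ASV) and aerial images with a drone. A transformer-based deep learning model is first trained to recognize 31 classes in the underwater images, covering coral morphotypes, associated fauna, and habitats. Its predictions are then used to train a second model that is applied to the aerial images. Information is transferred across scales with a weighted footprint method that accounts for partial overlaps between underwater image footprints and aerial image tiles. The approach extends fine-grained classification to larger reef areas and predicts coral morphotypes and associated habitats with high accuracy; agreement with ground truth data is good, with an AUC of 0.9251.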

Link: https://arxiv.org/abs/2502.17883
Authors: Matteo Contini,Victor Illien,Julien Barde,Sylvain Poulain,Serge Bernard,Alexis Joly,Sylvain Bonhommeau
Affiliations: IFREMER Délégation Océan Indien (DOI); INRIA; LIRMM; Université de Montpellier; CNRS; UMR Marbec, IRD
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Abstract:Drone-based remote sensing combined with AI-driven methodologies has shown great potential for accurate mapping and monitoring of coral reef ecosystems. This study presents a novel multi-scale approach to coral reef monitoring, integrating fine-scale underwater imagery with medium-scale aerial imagery. Underwater images are captured using an Autonomous Surface Vehicle (ASV), while aerial images are acquired with an aerial drone. A transformer-based deep-learning model is trained on underwater images to detect the presence of 31 classes covering various coral morphotypes, associated fauna, and habitats. These predictions serve as annotations for training a second model applied to aerial images. The transfer of information across scales is achieved through a weighted footprint method that accounts for partial overlaps between underwater image footprints and aerial image tiles. The results show that the multi-scale methodology successfully extends fine-scale classification to larger reef areas, achieving a high degree of accuracy in predicting coral morphotypes and associated habitats. The method showed a strong alignment between underwater-derived annotations and ground truth data, reflected by an AUC (Area Under the Curve) score of 0.9251. This shows that the integration of underwater and aerial imagery, supported by deep-learning models, can facilitate scalable and accurate reef assessments. This study demonstrates the potential of combining multi-scale imaging and AI to facilitate the monitoring and conservation of coral reefs. Our approach leverages the strengths of underwater and aerial imagery, ensuring the precision of fine-scale analysis while extending it to cover a broader reef area.
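
The weighted footprint idea in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes axis-aligned rectangular footprints and uses the overlap area as the weight, and the names `overlap_area` and `tile_scores` are hypothetical.

```python
from typing import Dict, List, Tuple

# Axis-aligned box as (xmin, ymin, xmax, ymax); an assumption made for illustration.
Box = Tuple[float, float, float, float]


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def tile_scores(tile: Box,
                footprints: List[Tuple[Box, Dict[str, float]]]) -> Dict[str, float]:
    """Overlap-weighted average of underwater class scores for one aerial tile.

    Each footprint contributes in proportion to the area it shares with the
    tile, so a footprint that only partially overlaps the tile counts less.
    """
    weighted: Dict[str, float] = {}
    total = 0.0
    for box, scores in footprints:
        weight = overlap_area(tile, box)
        if weight == 0.0:
            continue  # footprint does not touch this tile
        total += weight
        for cls, score in scores.items():
            weighted[cls] = weighted.get(cls, 0.0) + weight * score
    if total == 0.0:
        return {}  # no underwater annotation available for this tile
    return {cls: value / total for cls, value in weighted.items()}
```

A tile with no overlapping footprint returns an empty dictionary, which a training pipeline could treat as "unlabeled" rather than as a negative example.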

[CV-45] VVRec: Reconstruction Attacks on DL-based Volumetric Video Upstreaming via Latent Diffusion Model with Gamma Distribution

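[Quick read]: This paper addresses privacy threats in deep-learning-based volumetric video compression, specifically reconstruction attacks, which aim to recover the original input point cloud from intercepted intermediate transmission results. The key contribution is an attack scheme named VVRec, which uses four trained neural network modules together with a latent diffusion model with Gamma distribution and a refinement algorithm to reconstruct high-quality point clouds, substantially reducing distortion and surpassing existing defenses in reconstruction accuracy and color recovery.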

Link: https://arxiv.org/abs/2502.17880
Authors: Rui Lu,Bihai Zhang,Dan Wang
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Notes:

Abstract: With the popularity of 3D volumetric video applications, such as Autonomous Driving, Virtual Reality, and Mixed Reality, developers have turned to deep learning for compressing volumetric video frames, i.e., point clouds, for video upstreaming. The latest deep learning-based solutions offer higher efficiency, lower distortion, and better hardware support compared to traditional ones like MPEG and JPEG. However, privacy threats arise, especially reconstruction attacks that aim to recover the original input point cloud from the intermediate results. In this paper, we design VVRec, which is, to the best of our knowledge, the first reconstruction attack scheme targeting DL-based volumetric video. VVRec demonstrates the ability to reconstruct high-quality point clouds from intercepted transmission intermediate results using four well-trained neural network modules we design. Leveraging the latest latent diffusion models with Gamma distribution and a refinement algorithm, VVRec excels in reconstruction quality and color recovery, and surpasses existing defenses. We evaluate VVRec using three volumetric video datasets. The results demonstrate that VVRec achieves 64.70dB reconstruction accuracy, with a 46.39% reduction of distortion over baselines.

[CV-46] Dual Classification Head Self-training Network for Cross-scene Hyperspectral Image Classification

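[Quick read]: This paper addresses the performance challenges in cross-scene hyperspectral image (HSI) classification caused by differences in reflectance spectra between the source domain (SD) and the target domain (TD) and by differing feature distributions for the same land cover class. The key solution is a dual classification head self-training network (DHSNet). It applies a dual classification head self-training strategy, used for the first time in cross-scene HSI classification, to align class-wise features across domains, and adds a novel central feature attention mechanism to strengthen the model's ability to learn domain-invariant features. The method narrows the domain gap while preventing incorrect pseudo-labels from accumulating in the model.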

Link: https://arxiv.org/abs/2502.17879
Authors: Rong Liu,Junye Liang,Jiaqi Yang,Jiang He,Peng Zhu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract: Due to the difficulty of obtaining labeled data for hyperspectral images (HSIs), cross-scene classification has emerged as a widely adopted approach in the remote sensing community. It involves training a model using labeled data from a source domain (SD) and unlabeled data from a target domain (TD), followed by inference on the TD. However, variations in the reflectance spectrum of the same object between the SD and the TD, as well as differences in the feature distribution of the same land cover class, pose significant challenges to the performance of cross-scene classification. To address this issue, we propose a dual classification head self-training network (DHSNet). This method aligns class-wise features across domains, ensuring that the trained classifier can accurately classify TD data of different classes. We introduce a dual classification head self-training strategy for the first time in the cross-scene HSI classification field. The proposed approach mitigates the domain gap while preventing the accumulation of incorrect pseudo-labels in the model. Additionally, we incorporate a novel central feature attention mechanism to enhance the model’s capacity to learn scene-invariant features across domains. Experimental results on three cross-scene HSI datasets demonstrate that the proposed DHSNet significantly outperforms other state-of-the-art approaches. The code for DHSNet will be available at this https URL.

[CV-47] A Survey: Spatiotemporal Consistency in Video Generation

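[Quick read]: This paper addresses spatiotemporal consistency in video generation, a topic for which the lack of a dedicated review leaves a gap in understanding the mechanisms behind high-quality video generation. The survey systematically reviews recent advances in video generation across five aspects: foundation models, information representations, generation schemes, post-processing techniques, and evaluation metrics, with particular emphasis on how each contributes to maintaining spatiotemporal consistency.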

Link: https://arxiv.org/abs/2502.17863
Authors: Zhiyu Yin,Kehai Chen,Xuefeng Bai,Ruili Jiang,Juntao Li,Hongdong Li,Jin Liu,Yang Xiang,Jun Yu,Min Zhang
Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen; School of Computer Science and Technology, Soochow University; School of Computer Science and Technology, Central South University; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Abstract: Video generation, by leveraging a dynamic visual generation method, pushes the boundaries of Artificial Intelligence Generated Content (AIGC). Video generation presents unique challenges beyond static image generation, requiring both high-quality individual frames and temporal coherence to maintain consistency across the spatiotemporal sequence. Recent works have aimed at addressing the spatiotemporal consistency issue in video generation, but few literature reviews have been organized from this perspective. This gap hinders a deeper understanding of the underlying mechanisms for high-quality video generation. In this survey, we systematically review the recent advances in video generation, covering five key aspects: foundation models, information representations, generation schemes, post-processing techniques, and evaluation metrics. We particularly focus on their contributions to maintaining spatiotemporal consistency. Finally, we discuss the future directions and challenges in this field, hoping to inspire further efforts to advance the development of video generation.

[CV-48] HRR: Hierarchical Retrospection Refinement for Generated Image Detection

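[Quick read]: This paper addresses a central question in generated image detection: is an image real or produced by a generative model? The proposed solution is the Hierarchical Retrospection Refinement (HRR) framework. It introduces a multi-scale style retrospection module and a feature refinement module based on the principle of the correntropy sparse additive machine, which alleviate the learning biases introduced by dataset styles and generative models, reduce the impact of redundant features, and capture the intrinsic structure and patterns of the data, improving performance on generated image detection.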

Link: https://arxiv.org/abs/2502.17862
Authors: Peipei Yuan,Zijing Xie,Shuo Ye,Hong Chen,Yulong Wang
Affiliations: School of Artificial Intelligence, Jianghan University, Wuhan, China; School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China, and also with the Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan 430070, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract: Generative artificial intelligence holds significant potential for abuse, and generative image detection has become a key focus of research. However, existing methods primarily focus on detecting a specific generative model and emphasize the localization of synthetic regions, while neglecting the interference caused by image size and style on model learning. Our goal is to reach a fundamental conclusion: Is the image real or generated? To this end, we propose a diffusion model-based generative image detection framework termed Hierarchical Retrospection Refinement (HRR). It designs a multi-scale style retrospection module that encourages the model to generate detailed and realistic multi-scale representations, while alleviating the learning biases introduced by dataset styles and generative models. Additionally, based on the principle of correntropy sparse additive machine, a feature refinement module is designed to reduce the impact of redundant features on learning and capture the intrinsic structure and patterns of the data, thereby improving the model’s generalization ability. Extensive experiments demonstrate that the HRR framework consistently delivers significant performance improvements, outperforming state-of-the-art methods in the generated image detection task.

[CV-49] UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting ICLR2025

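[Quick read]: This paper addresses shortcomings of existing multi-modal 3D pre-training methods when learning joint representations of text, images, and point clouds: point clouds fail to fully capture the complexity of the 3D world and show a noticeable gap from the dense 2D pixels of images. The paper proposes UniGS, which integrates 3D Gaussian Splatting (3DGS) into multi-modal pre-training to enhance the 3D representation. The 3DGS representation first models the 3D world as a set of 3D Gaussians with color and opacity, establishing a strong connection with 2D images; a novel Gaussian-Aware Guidance module then guides the learning of fine-grained representations in the 3D domain for better cross-modal alignment.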

Link: https://arxiv.org/abs/2502.17860
Authors: Haoyuan Li,Yanpeng Zhou,Tao Tang,Jifei Song,Yihan Zeng,Michael Kampffmeyer,Hang Xu,Xiaodan Liang
Affiliations: Shenzhen campus of Sun Yat-sen University; Huawei Noah’s Ark Lab; UiT The Arctic University of Norway; Peng Cheng Laboratory; Guangdong Key Laboratory of Big Data Analysis and Processing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: ICLR 2025

Abstract: Recent advancements in multi-modal 3D pre-training methods have shown promising efficacy in learning joint representations of text, images, and point clouds. However, adopting point clouds as 3D representation fails to fully capture the intricacies of the 3D world and exhibits a noticeable gap between the discrete points and the dense 2D pixels of images. To tackle this issue, we propose UniGS, integrating 3D Gaussian Splatting (3DGS) into multi-modal pre-training to enhance the 3D representation. We first rely on the 3DGS representation to model the 3D world as a collection of 3D Gaussians with color and opacity, incorporating all the information of the 3D scene while establishing a strong connection with 2D images. Then, to achieve Language-Image-3D pretraining, UniGS starts with a pre-trained vision-language model to establish a shared visual and textual space through extensive real-world image-text pairs. Subsequently, UniGS employs a 3D encoder to align the optimized 3DGS with the Language-Image representations to learn unified multi-modal representations. To facilitate the extraction of global explicit 3D features by the 3D encoder and achieve better cross-modal alignment, we additionally introduce a novel Gaussian-Aware Guidance module that guides the learning of fine-grained representations of the 3D domain. Through extensive experiments across the Objaverse, ABO, MVImgNet and SUN RGBD datasets with zero-shot classification, text-driven retrieval and open-world understanding tasks, we demonstrate the effectiveness of UniGS in learning a more general and stronger aligned multi-modal representation. Specifically, UniGS achieves leading results across different 3D tasks with remarkable improvements over previous SOTA, Uni3D, including on zero-shot classification (+9.36%), text-driven retrieval (+4.3%) and open-world understanding (+7.92%).

[CV-50] Sketch-1-to-3: One Single Sketch to 3D Detailed Face Reconstruction

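[Quick read]: This paper addresses the key challenges of realistic 3D face reconstruction from a single sketch: accurately extracting facial keypoints, preserving rich expression details and fine textures, and training a high-performing model with limited data. The proposed Sketch-1-to-3 framework includes a Geometric Contour and Texture Detail (GCTD) module that improves the extraction of geometric contours and texture details from facial sketches. It also uses a deep learning architecture with a domain adaptation module and a tailored loss function to align sketches with the 3D facial space, enabling high-fidelity expression and texture reconstruction.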

Link: https://arxiv.org/abs/2502.17852
Authors: Liting Wen,Zimo Yang,Xianlin Zhang,Chi Ding,Yue Zhang,Mingdao Wang,Xueming Li
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:3D face reconstruction from a single sketch is a critical yet underexplored task with significant practical applications. The primary challenges stem from the substantial modality gap between 2D sketches and 3D facial structures, including: (1) accurately extracting facial keypoints from 2D sketches; (2) preserving diverse facial expressions and fine-grained texture details; and (3) training a high-performing model with limited data. In this paper, we propose Sketch-1-to-3, a novel framework for realistic 3D face reconstruction from a single sketch, to address these challenges. Specifically, we first introduce the Geometric Contour and Texture Detail (GCTD) module, which enhances the extraction of geometric contours and texture details from facial sketches. Additionally, we design a deep learning architecture with a domain adaptation module and a tailored loss function to align sketches with the 3D facial space, enabling high-fidelity expression and texture reconstruction. To facilitate evaluation and further research, we construct SketchFaces, a real hand-drawn facial sketch dataset, and Syn-SketchFaces, a synthetic facial sketch dataset. Extensive experiments demonstrate that Sketch-1-to-3 achieves state-of-the-art performance in sketch-based 3D face reconstruction.

[CV-51] A Novel Retinal Image Contrast Enhancement – Fuzzy-Based Method DATE

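[Quick read]: This paper addresses the dependence of vessel segmentation accuracy in retinal images on image quality. It proposes a novel model that linearly blends Fuzzy Contrast Enhancement (FCE) and Contrast Limited Adaptive Histogram Equalization (CLAHE) to improve retinal image quality and thereby the accuracy of vascular structure segmentation.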

Link: https://arxiv.org/abs/2502.17850
Authors: Adnan Shaout,Jiho Han
Affiliations: University of Michigan-Dearborn
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: This UPDATED version of the paper, accepted at the 2023 24th International Arab Conference on Information Technology (ACIT), includes corrections for typographical and grammatical errors, a joint authorship section with a detailed CRediT author statement, improvements in graphics and figure references, and refinements in citations

Abstract: The vascular structure in retinal images plays a crucial role in ophthalmic diagnostics, and the accuracy of its segmentation is directly influenced by the quality of the retinal image. Contrast enhancement is one of the crucial steps in any segmentation algorithm, all the more so because retinal images are used for medical diagnosis. It not only intensifies the darkness of the blood vessels but also prevents minor capillaries from being disregarded during the process. This paper proposes a novel model that utilizes the linear blending of Fuzzy Contrast Enhancement (FCE) and Contrast Limited Adaptive Histogram Equalization (CLAHE) to enhance the retinal image for retinal vascular structure segmentation. The scheme is tested using the Digital Retinal Images for Vessel Extraction (DRIVE) dataset. It was then evaluated through a performance comparison with other methodologies: gray-scaling, Histogram Equalization (HE), FCE, and CLAHE. The combination of the FCE and CLAHE methods showed major improvement. FCE and CLAHE dominated as the better enhancement methods with 88%, indicating that preprocessing through fuzzy logic is effective.
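
The linear blending step mentioned in the abstract can be illustrated with a small sketch. It assumes the FCE and CLAHE outputs have already been computed as 8-bit grayscale images of the same size (neither enhancement is implemented here), and the blend weight `alpha` is a hypothetical parameter, since the abstract does not state the weights used.

```python
from typing import List

Image = List[List[int]]  # rows of 8-bit grayscale pixel values


def linear_blend(fce: Image, clahe: Image, alpha: float = 0.5) -> Image:
    """Pixel-wise blend: alpha * fce + (1 - alpha) * clahe, clipped to [0, 255]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(fce) != len(clahe) or any(len(a) != len(b) for a, b in zip(fce, clahe)):
        raise ValueError("images must have the same shape")
    return [
        [min(255, max(0, round(alpha * a + (1.0 - alpha) * b)))
         for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(fce, clahe)
    ]
```

With `alpha = 1.0` the result is the FCE image unchanged, and with `alpha = 0.0` it is the CLAHE image, so the parameter controls how much each enhancement contributes.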

[CV-52] Automatic Vehicle Detection using DETR: A Transformer-Based Approach for Navigating Treacherous Roads

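[Quick read]: This paper addresses the challenges of Automatic Vehicle Detection (AVD) in complex and varied driving environments, where diverse lighting conditions, road types, and vehicle types make traditional methods such as YOLO and Faster R-CNN struggle. The key solution applies the Detection Transformer (DETR) with a Collaborative Hybrid Assignments Training scheme, Co-DETR, to enhance feature learning and attention mechanisms. Versatile label assignment strategies and multiple parallel auxiliary heads provide more effective supervision, improving training efficiency and detection accuracy. Extensive experiments on the BadODD dataset show superior performance across conditions, making the method suitable for real-world deployment.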

Link: https://arxiv.org/abs/2502.17843
Authors: Istiaq Ahmed Fahad,Abdullah Ibne Hanif Arean,Nazmus Sakib Ahmed,Mahmudul Hasan
Affiliations: Institute of Information Technology; University of Dhaka; Dept. of Computer Science and Engineering
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Automatic Vehicle Detection (AVD) in diverse driving environments presents unique challenges due to varying lighting conditions, road types, and vehicle types. Traditional methods, such as YOLO and Faster R-CNN, often struggle to cope with these complexities. As computer vision evolves, combining Convolutional Neural Networks (CNNs) with Transformer-based approaches offers promising opportunities for improving detection accuracy and efficiency. This study is the first to experiment with Detection Transformer (DETR) for automatic vehicle detection in complex and varied settings. We employ a Collaborative Hybrid Assignments Training scheme, Co-DETR, to enhance feature learning and attention mechanisms in DETR. By leveraging versatile label assignment strategies and introducing multiple parallel auxiliary heads, we provide more effective supervision during training and extract positive coordinates to boost training efficiency. Through extensive experiments on DETR variants and YOLO models, conducted using the BadODD dataset, we demonstrate the advantages of our approach. Our method achieves superior results, and improved accuracy in diverse conditions, making it practical for real-world deployment. This work significantly advances autonomous navigation technology and opens new research avenues in object detection for autonomous vehicles. By integrating the strengths of CNNs and Transformers, we highlight the potential of DETR for robust and efficient vehicle detection in challenging driving environments.

[CV-53] MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

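[Quick read]: This paper exposes the safety risks that multimodal large language models (MLLMs) face when combined with Retrieval-Augmented Generation (RAG) for question answering, in particular knowledge poisoning attacks, which inject incorrect or irrelevant information into the external knowledge base to manipulate model outputs. The paper proposes MM-PoisonRAG, a new attack framework with two strategies: Localized Poisoning Attack (LPA) and Globalized Poisoning Attack (GPA). LPA injects query-specific misinformation in text and images, whereas GPA provides false guidance during generation so that the model produces nonsensical answers. Both strategies effectively degrade model performance, underscoring the urgency of developing robust defenses against knowledge poisoning.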

Link: https://arxiv.org/abs/2502.17832
Authors: Hyeonjeong Ha,Qiusi Zhan,Jeonghwan Kim,Dimitrios Bralios,Saikrishna Sanniboina,Nanyun Peng,Kai-wei Chang,Daniel Kang,Heng Ji
Affiliations: University of Illinois Urbana-Champaign; University of California Los Angeles
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Notes: Code is available at this https URL

Abstract:Multimodal large language models (MLLMs) equipped with Retrieval Augmented Generation (RAG) leverage both their rich parametric knowledge and the dynamic, external knowledge to excel in tasks such as Question Answering. While RAG enhances MLLMs by grounding responses in query-relevant external knowledge, this reliance poses a critical yet underexplored safety risk: knowledge poisoning attacks, where misinformation or irrelevant knowledge is intentionally injected into external knowledge bases to manipulate model outputs to be incorrect and even harmful. To expose such vulnerabilities in multimodal RAG, we propose MM-PoisonRAG, a novel knowledge poisoning attack framework with two attack strategies: Localized Poisoning Attack (LPA), which injects query-specific misinformation in both text and images for targeted manipulation, and Globalized Poisoning Attack (GPA) to provide false guidance during MLLM generation to elicit nonsensical responses across all queries. We evaluate our attacks across multiple tasks, models, and access settings, demonstrating that LPA successfully manipulates the MLLM to generate attacker-controlled answers, with a success rate of up to 56% on MultiModalQA. Moreover, GPA completely disrupts model generation to 0% accuracy with just a single irrelevant knowledge injection. Our results highlight the urgent need for robust defenses against knowledge poisoning to safeguard multimodal RAG frameworks.

[CV-54] Weakly Supervised Pixel-Level Annotation with Visual Interpretability

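[Quick read]: This paper addresses the fact that medical image annotation is time-consuming, costly, and inconsistent across experts. It proposes an automated explainable annotation system that integrates ensemble learning, visual explainability, and uncertainty quantification. The system combines three pre-trained deep learning models (ResNet50, EfficientNet, and DenseNet), uses XGrad-CAM for visual explanations, and uses Monte Carlo Dropout for uncertainty quantification. The ensemble mimics the consensus of multiple radiologists and flags uncertain predictions for human review, improving annotation accuracy and interpretability.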

Link: https://arxiv.org/abs/2502.17824
Authors: Basma Nasir,Tehseen Zia,Muhammad Nawaz,Catarina Moreira
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes:

Abstract:Medical image annotation is essential for diagnosing diseases, yet manual annotation is time-consuming, costly, and prone to variability among experts. To address these challenges, we propose an automated explainable annotation system that integrates ensemble learning, visual explainability, and uncertainty quantification. Our approach combines three pre-trained deep learning models - ResNet50, EfficientNet, and DenseNet - enhanced with XGrad-CAM for visual explanations and Monte Carlo Dropout for uncertainty quantification. This ensemble mimics the consensus of multiple radiologists by intersecting saliency maps from models that agree on the diagnosis while uncertain predictions are flagged for human review. We evaluated our system using the TBX11K medical imaging dataset and a Fire segmentation dataset, demonstrating its robustness across different domains. Experimental results show that our method outperforms baseline models, achieving 93.04% accuracy on TBX11K and 96.4% accuracy on the Fire dataset. Moreover, our model produces precise pixel-level annotations despite being trained with only image-level labels, achieving Intersection over Union IoU scores of 36.07% and 64.7%, respectively. By enhancing the accuracy and interpretability of image annotations, our approach offers a reliable and transparent solution for medical diagnostics and other image analysis tasks.
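
The consensus step described in the abstract (intersecting saliency maps, then scoring with IoU) can be sketched roughly as below. This is an illustration under assumptions, not the paper's pipeline: saliency maps are taken as already computed and normalized to [0, 1], the 0.5 threshold is hypothetical, and the model-agreement check and Monte Carlo Dropout step are omitted.

```python
from typing import List, Sequence

Saliency = List[List[float]]  # per-pixel saliency in [0, 1]
Mask = List[List[bool]]       # per-pixel binary annotation


def binarize(saliency: Saliency, threshold: float = 0.5) -> Mask:
    """Turn a saliency map into a binary mask (threshold is an assumed value)."""
    return [[value >= threshold for value in row] for row in saliency]


def intersect_masks(masks: Sequence[Mask]) -> Mask:
    """Keep only pixels that every model marks as salient."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[all(m[r][c] for m in masks) for c in range(cols)] for r in range(rows)]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over Union of two binary masks (1.0 if both are empty)."""
    inter = sum(p and t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    union = sum(p or t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return inter / union if union else 1.0
```

Intersecting the masks is a conservative choice: it trades recall for precision, since a pixel is annotated only when all models agree on it.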

[CV-55] Easy-Poly: An Easy Polyhedral Framework For 3D Multi-Object Tracking

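[Quick read]: This paper addresses high false positives (FP), missed detections (FN), and identity switches (IDS) in 3D multi-object tracking (3D MOT), particularly in crowded scenes, small-object configurations, and adverse weather. It proposes Easy-Poly, a real-time, filter-based 3D MOT framework. Its key components are: (1) an Augmented Proposal Generator that uses multi-modal data augmentation and refined SpConv operations to significantly improve mean average precision (mAP) and the nuScenes detection score (NDS) on nuScenes; (2) a Dynamic Track-Oriented (DTO) data association algorithm that manages uncertainty and occlusion through optimal assignment and multiple hypothesis handling; (3) Dynamic Motion Modeling (DMM), which combines a confidence-weighted Kalman filter with adaptive noise covariances to improve multi-object tracking accuracy (MOTA) and average multi-object tracking accuracy (AMOTA) under challenging conditions; and (4) an extended life-cycle management system with adjustable thresholds to reduce identity switches and false terminations. Together these contributions improve overall tracking robustness.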

Link: https://arxiv.org/abs/2502.17822
Authors: Peng Zhang,Xin Li,Xin Lin,Liang He
Affiliations: East China Normal University, Shanghai, China; Shanghai Jiao Tong University, Shanghai, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 8 pages, 3 figures, 5 tables

Abstract:Recent advancements in 3D multi-object tracking (3D MOT) have predominantly relied on tracking-by-detection pipelines. However, these approaches often neglect potential enhancements in 3D detection processes, leading to high false positives (FP), missed detections (FN), and identity switches (IDS), particularly in challenging scenarios such as crowded scenes, small-object configurations, and adverse weather conditions. Furthermore, limitations in data preprocessing, association mechanisms, motion modeling, and life-cycle management hinder overall tracking robustness. To address these issues, we present Easy-Poly, a real-time, filter-based 3D MOT framework for multiple object categories. Our contributions include: (1) An Augmented Proposal Generator utilizing multi-modal data augmentation and refined SpConv operations, significantly improving mAP and NDS on nuScenes; (2) A Dynamic Track-Oriented (DTO) data association algorithm that effectively manages uncertainties and occlusions through optimal assignment and multiple hypothesis handling; (3) A Dynamic Motion Modeling (DMM) incorporating a confidence-weighted Kalman filter and adaptive noise covariances, enhancing MOTA and AMOTA in challenging conditions; and (4) An extended life-cycle management system with adjustive thresholds to reduce ID switches and false terminations. Experimental results show that Easy-Poly outperforms state-of-the-art methods such as Poly-MOT and Fast-Poly, achieving notable gains in mAP (e.g., from 63.30% to 64.96% with LargeKernel3D) and AMOTA (e.g., from 73.1% to 74.5%), while also running in real-time. These findings highlight Easy-Poly’s adaptability and robustness in diverse scenarios, making it a compelling choice for autonomous driving and related 3D MOT applications. The source code of this paper will be published upon acceptance.
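
One common way to make a Kalman filter confidence-weighted is to inflate the measurement noise for low-confidence detections. The scalar sketch below illustrates that idea only; the abstract does not specify how Easy-Poly weights its filter, so the `r_base / confidence` rule and the function name are assumptions.

```python
from typing import Tuple


def kalman_update(x: float, p: float, z: float,
                  r_base: float, confidence: float) -> Tuple[float, float]:
    """One scalar Kalman measurement update with confidence-scaled noise.

    x, p       : prior state estimate and its variance
    z          : measurement (for example, a detected position)
    r_base     : measurement noise variance for a fully confident detection
    confidence : detector score in (0, 1]; lower scores inflate the noise
    """
    if not 0.0 < confidence <= 1.0:
        raise ValueError("confidence must lie in (0, 1]")
    r = r_base / confidence        # assumed weighting rule, for illustration
    gain = p / (p + r)             # Kalman gain
    x_new = x + gain * (z - x)     # move the estimate toward the measurement
    p_new = (1.0 - gain) * p       # posterior variance shrinks by the gain
    return x_new, p_new
```

With a prior of `x = 0`, `p = 1` and a measurement `z = 10`, a confidence of 1.0 moves the estimate halfway to the measurement, while a confidence of 0.25 moves it only a fifth of the way, so unreliable detections perturb the track less.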

[CV-56] LAM: Large Avatar Model for One-shot Animatable Gaussian Head

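[Quick read]: This paper addresses the reconstruction of an animatable Gaussian head avatar from a single image. It proposes LAM (Large Avatar Model), which generates a Gaussian head that can be animated and rendered directly in a single forward pass, with no additional networks or post-processing. The core component is a canonical Gaussian attributes generator that uses FLAME canonical points as queries; these interact with multi-scale image features through a Transformer to predict Gaussian attributes accurately in canonical space. The reconstructed canonical Gaussian avatar can then be animated with standard Linear Blend Skinning (LBS) and rendered in real time.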

Link: https://arxiv.org/abs/2502.17796
Authors: Yisheng He,Xiaodong Gu,Xiaodan Ye,Chao Xu,Zhengyi Zhao,Yuan Dong,Weihao Yuan,Zilong Dong,Liefeng Bo
Affiliations: Tongyi Lab, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:We present LAM, an innovative Large Avatar Model for animatable Gaussian head reconstruction from a single image. Unlike previous methods that require extensive training on captured video sequences or rely on auxiliary neural networks for animation and rendering during inference, our approach generates Gaussian heads that are immediately animatable and renderable. Specifically, LAM creates an animatable Gaussian head in a single forward pass, enabling reenactment and rendering without additional networks or post-processing steps. This capability allows for seamless integration into existing rendering pipelines, ensuring real-time animation and rendering across a wide range of platforms, including mobile phones. The centerpiece of our framework is the canonical Gaussian attributes generator, which utilizes FLAME canonical points as queries. These points interact with multi-scale image features through a Transformer to accurately predict Gaussian attributes in the canonical space. The reconstructed canonical Gaussian avatar can then be animated utilizing standard linear blend skinning (LBS) with corrective blendshapes as the FLAME model did and rendered in real-time on various platforms. Our experimental results demonstrate that LAM outperforms state-of-the-art methods on existing benchmarks.
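
Standard linear blend skinning, which the abstract uses to animate the canonical Gaussians, moves each point by a weighted sum of per-bone rigid transforms. The sketch below shows that formula for a single 3D point; it omits the corrective blendshapes mentioned in the abstract, and the data layout and function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = List[List[float]]        # 3x3 rotation matrix, row-major
Bone = Tuple[Mat3, Vec3]        # (rotation, translation) of one bone


def transform(rot: Mat3, trans: Vec3, v: Vec3) -> Vec3:
    """Apply a rigid transform: rot @ v + trans."""
    x, y, z = (sum(rot[i][j] * v[j] for j in range(3)) + trans[i] for i in range(3))
    return (x, y, z)


def linear_blend_skinning(vertex: Vec3, weights: Sequence[float],
                          bones: Sequence[Bone]) -> Vec3:
    """Deform one point as sum_i w_i * (R_i @ vertex + t_i)."""
    if len(weights) != len(bones):
        raise ValueError("need one weight per bone")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("skinning weights must sum to 1")
    out = [0.0, 0.0, 0.0]
    for w, (rot, trans) in zip(weights, bones):
        moved = transform(rot, trans, vertex)
        for i in range(3):
            out[i] += w * moved[i]
    return (out[0], out[1], out[2])
```

A point bound equally to a static bone and a bone translated by 2 units along x ends up shifted by 1 unit, which is the blending behavior that lets a single set of canonical points follow an articulated skeleton.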

[CV-57] Synthia: Novel Concept Design with Affordance Composition

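[Quick read]: This paper addresses the difficulty text-to-image (T2I) models have in generating functionally coherent designs. Current work focuses on semantic and stylistic variations of a given design concept and overlooks functional coherence, the integration of multiple affordances into a single coherent design. The paper proposes SYNTHIA, a framework built on a hierarchical concept ontology that decomposes concepts into parts and affordances, which serve as the building blocks of functionally coherent design. A curriculum learning scheme based on this ontology contrastively fine-tunes T2I models to learn affordance composition progressively while maintaining visual novelty.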

Link: https://arxiv.org/abs/2502.17793
Authors: Xiaomeng Jin,Hyeonjeong Ha,Jeonghwan Kim,Jiateng Liu,Zhenhailong Wang,Khanh Duy Nguyen,Ansel Blume,Nanyun Peng,Kai-wei Chang,Heng Ji
Affiliations: University of Illinois Urbana-Champaign; University of California Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Code is available at this https URL

Abstract:Text-to-image (T2I) models enable rapid concept design, making them widely used in AI-driven design. While recent studies focus on generating semantic and stylistic variations of given design concepts, functional coherence–the integration of multiple affordances into a single coherent concept–remains largely overlooked. In this paper, we introduce SYNTHIA, a framework for generating novel, functionally coherent designs based on desired affordances. Our approach leverages a hierarchical concept ontology that decomposes concepts into parts and affordances, serving as a crucial building block for functionally coherent design. We also develop a curriculum learning scheme based on our ontology that contrastively fine-tunes T2I models to progressively learn affordance composition while maintaining visual novelty. To elaborate, we (i) gradually increase affordance distance, guiding models from basic concept-affordance association to complex affordance compositions that integrate parts of distinct affordances into a single, coherent form, and (ii) enforce visual novelty by employing contrastive objectives to push learned representations away from existing concepts. Experimental results show that SYNTHIA outperforms state-of-the-art T2I models, demonstrating absolute gains of 25.1% and 14.7% for novelty and functional coherence in human evaluation, respectively.

[CV-58] Sample Selection via Contrastive Fragmentation for Noisy Label Regression NEURIPS2024

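[Quick read]: This paper addresses label noise, which is pervasive in real-world regression tasks. It proposes ConFrag, a method that collectively models regression data by transforming it into disjoint yet contrasting fragmentation pairs. This trains more distinctive representations and improves the ability to select clean samples. The ConFrag framework uses a mixture of neighboring fragments and identifies noisy labels through neighborhood agreement among expert feature extractors.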

Link: https://arxiv.org/abs/2502.17771
Authors: Chris Dongjoo Kim,Sangwoo Moon,Jihwan Moon,Dongyeon Woo,Gunhee Kim
Affiliations: Seoul National University; LG AI Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: NeurIPS 2024

Abstract:As with many other problems, real-world regression is plagued by the presence of noisy labels, an inevitable issue that demands our attention. Fortunately, much real-world data often exhibits an intrinsic property of continuously ordered correlations between labels and features, where data points with similar labels are also represented with closely related features. In response, we propose a novel approach named ConFrag, where we collectively model the regression data by transforming them into disjoint yet contrasting fragmentation pairs. This enables the training of more distinctive representations, enhancing the ability to select clean samples. Our ConFrag framework leverages a mixture of neighboring fragments to discern noisy labels through neighborhood agreement among expert feature extractors. We extensively perform experiments on six newly curated benchmark datasets of diverse domains, including age prediction, price prediction, and music production year estimation. We also introduce a metric called Error Residual Ratio (ERR) to better account for varying degrees of label noise. Our approach consistently outperforms fourteen state-of-the-art baselines, being robust against symmetric and random Gaussian label noise.

[CV-59] Improving Transformer Based Line Segment Detection with Matched Predicting and Re-ranking

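[Quick read]: This paper addresses two problems in classical Transformer-based line segment detection: high-quality detections are suppressed by low confidence scores at prediction time, and the models usually need long training schedules. The key solutions are RANK-LETR, which uses learnable geometric information to raise the confidence scores of high-quality predictions, and a new line segment proposal method in which feature points directly predict line segment locations, improving training efficiency and stability. A line segment ranking loss is also introduced to stabilize rankings during training and thereby improve generalization.

Link: https://arxiv.org/abs/2502.17766
Authors: Xin Tong,Shi Peng,Baojie Tian,Yufei Guo,Xuhui Huang,Zhe Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: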

Abstract:Classical Transformer-based line segment detection methods have delivered impressive results. However, we observe that some accurately detected line segments are assigned low confidence scores during prediction, causing them to be ranked lower and potentially suppressed. Additionally, these models often require prolonged training periods to achieve strong performance, largely due to the necessity of bipartite matching. In this paper, we introduce RANK-LETR, a novel Transformer-based line segment detection method. Our approach leverages learnable geometric information to refine the ranking of predicted line segments by enhancing the confidence scores of high-quality predictions in a posterior verification step. We also propose a new line segment proposal method, wherein the feature point nearest to the centroid of the line segment directly predicts the location, significantly improving training efficiency and stability. Moreover, we introduce a line segment ranking loss to stabilize rankings during training, thereby enhancing the generalization capability of the model. Experimental results demonstrate that our method outperforms other Transformer-based and CNN-based approaches in prediction accuracy while requiring fewer training epochs than previous Transformer-based models.

[CV-60] A digital eye-fixation biomarker using a deep anomaly scheme to classify Parkisonian patterns

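[Quick read]: This paper addresses the use of video analysis to quantify eye movement patterns in patients with Parkinson's disease (PD). Current methods rely mainly on global, simplified trajectory analysis, whereas this work proposes a video analysis scheme built on an anomaly detection framework. The key is one-class learning, which avoids the need for large amounts of labeled data: the model learns only the Parkinsonian representation and treats samples of any other class as anomalies of the distribution. Evaluated on an ocular fixation task with 13 control subjects and 13 patients at different stages of the disease, the method reached an average sensitivity of 0.97, a specificity of 0.63, and an AUC-ROC of 0.95.

Link: https://arxiv.org/abs/2502.17762
Authors: Juan Niño,Luis Guayacán,Santiago Gómez,Fabio Martínez
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
Comments: 6 pages, 4 images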

Abstract:Oculomotor alterations constitute a promising biomarker to detect and characterize Parkinson’s disease (PD), even in prodromal stages. Currently, only global and simplified eye movement trajectories are employed to approximate the complex and hidden kinematic relationships of the oculomotor function. Recent advances in machine learning and video analysis have encouraged novel characterizations of eye movement patterns to quantify PD. These schemes enable the identification of spatiotemporal segments primarily associated with PD. However, they rely on discriminative models that require large training datasets and depend on balanced class distributions. This work introduces a novel video analysis scheme to quantify Parkinsonian eye fixation patterns with an anomaly detection framework. Contrary to classical deep discriminative schemes that learn differences among labeled classes, the proposed approach is focused on one-class learning, avoiding the necessity of a significant amount of data. The proposed approach focuses only on Parkinson’s representation, considering any other class sample as an anomaly of the distribution. This approach was evaluated for an ocular fixation task, in a total of 13 control subjects and 13 patients at different stages of the disease. The proposed digital biomarker achieved an average sensitivity and specificity of 0.97 and 0.63, respectively, yielding an AUC-ROC of 0.95. A statistical test shows significant differences (p < 0.05) among predicted classes, evidencing a discrimination between patients and control subjects.
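
The sensitivity and specificity quoted in this abstract are ratios over confusion-matrix counts. The following is a minimal sketch of that arithmetic only. The counts are hypothetical, chosen to match the cohort size of 13 patients and 13 controls, and are not the paper's results; the function name is also an assumption.

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    sensitivity = tp / (tp + fn)  # fraction of patients correctly flagged
    specificity = tn / (tn + fp)  # fraction of controls correctly rejected
    return sensitivity, specificity


# Hypothetical counts (NOT from the paper): 13 patients, 13 controls.
sens, spec = sensitivity_specificity(tp=12, fn=1, tn=8, fp=5)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

With only 13 subjects per group, a single misclassified subject moves either ratio by about 0.077, which is worth keeping in mind when reading the reported averages.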

[CV-61] AI-driven 3D Spatial Transcriptomics

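[Quick read]: This paper addresses the limitations of three-dimensional spatial transcriptomics (3D ST) in resolving tissue volumes. Most spatial transcriptomics methods are limited to two-dimensional (2D) sections, and existing 3D methods typically require extensive tissue sectioning, are incompatible with non-destructive 3D tissue imaging, and lack scalability. The key solution is an AI framework called VOlumetrically Resolved Transcriptomics EXpression (VORTEX). VORTEX uses 3D tissue morphology and minimal 2D spatial transcriptomics data to predict 3D spatial transcriptomics; through pretraining and fine-tuning it learns both generic and sample-specific morphological correlates of gene expression, enabling dense, high-throughput, and fast 3D ST that scales to large tissue volumes. The approach offers a cost-effective and minimally destructive route to volumetric molecular insights and is expected to accelerate biomarker discovery and the understanding of morphomolecular associations and cell states in complex tissues.

Link: https://arxiv.org/abs/2502.17761
Authors: Cristina Almagro-Pérez,Andrew H. Song,Luca Weishaupt,Ahrong Kim,Guillaume Jaume,Drew F.K. Williamson,Konstantin Hemker,Ming Y. Lu,Kritika Singh,Bowen Chen,Long Phi Le,Alexander S. Baras,Sizun Jiang,Ali Bashashati,Jonathan T.C. Liu,Faisal Mahmood
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: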

Abstract:A comprehensive three-dimensional (3D) map of tissue architecture and gene expression is crucial for illuminating the complexity and heterogeneity of tissues across diverse biomedical applications. However, most spatial transcriptomics (ST) approaches remain limited to two-dimensional (2D) sections of tissue. Although current 3D ST methods hold promise, they typically require extensive tissue sectioning, are complex, are not compatible with non-destructive 3D tissue imaging technologies, and often lack scalability. Here, we present VOlumetrically Resolved Transcriptomics EXpression (VORTEX), an AI framework that leverages 3D tissue morphology and minimal 2D ST to predict volumetric 3D ST. By pretraining on diverse 3D morphology-transcriptomic pairs from heterogeneous tissue samples and then fine-tuning on minimal 2D ST data from a specific volume of interest, VORTEX learns both generic tissue-related and sample-specific morphological correlates of gene expression. This approach enables dense, high-throughput, and fast 3D ST, scaling seamlessly to large tissue volumes far beyond the reach of existing 3D ST techniques. By offering a cost-effective and minimally destructive route to obtaining volumetric molecular insights, we anticipate that VORTEX will accelerate biomarker discovery and our understanding of morphomolecular associations and cell states in complex tissues. Interactive 3D ST volumes can be viewed at this https URL

[CV-62] Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos

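[Quick read]: This paper aims to improve on hand-crafted methods by introducing a gradient-based approach for learning task graphs from procedural activities. The key is to optimize edge weights directly via maximum likelihood, which allows integration into neural architectures. The method improves F1 score by 14.5%, 10.2%, and 13.6% on CaptainCook4D, EgoPER, and EgoProceL, respectively. A feature-based variant predicts task graphs from textual or video embeddings, and the approach also delivers strong gains on the Ego-Exo4D benchmark and on online mistake detection (Assembly101-O and EPIC-Tent-O).

Link: https://arxiv.org/abs/2502.17753
Authors: Luigi Seminara,Giovanni Maria Farinella,Antonino Furnari
Affiliations: Department of Mathematics and Computer Science, University of Catania, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2406.01486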

Abstract:We introduce a gradient-based approach for learning task graphs from procedural activities, improving over hand-crafted methods. Our method directly optimizes edge weights via maximum likelihood, enabling integration into neural architectures. We validate our approach on CaptainCook4D, EgoPER, and EgoProceL, achieving +14.5%, +10.2%, and +13.6% F1-score improvements. Our feature-based approach for predicting task graphs from textual/video embeddings demonstrates emerging video understanding abilities. We also achieved top performance on the procedure understanding benchmark on Ego-Exo4D and significantly improved online mistake detection (+19.8% on Assembly101-O, +6.4% on EPIC-Tent-O). Code: this https URL.

[CV-63] Can Score-Based Generative Modeling Effectively Handle Medical Image Classification?

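[Quick read]: This paper addresses the unstable performance of deep learning classifiers on complex medical imaging datasets where data is scarce and lacks diversity, such as mammographic images. The key is to explore score-based generative models as classifiers: the paper proposes a new approach to image classification and reports superior classification results on the CBIS-DDSM, INbreast, and Vin-Dr Mammo datasets.

Link: https://arxiv.org/abs/2502.17727
Authors: Sushmita Sarker,Prithul Sarker,George Bebis,Alireza Tavakkoli
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the International Symposium on Biomedical Imaging (ISBI) 2025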

Abstract:The remarkable success of deep learning in recent years has prompted applications in medical image classification and diagnosis tasks. While classification models have demonstrated robustness in classifying simpler datasets like MNIST or natural images such as ImageNet, this resilience is not consistently observed in complex medical image datasets where data is more scarce and lacks diversity. Moreover, previous findings on natural image datasets have indicated a potential trade-off between data likelihood and classification accuracy. In this study, we explore the use of score-based generative models as classifiers for medical images, specifically mammographic images. Our findings suggest that our proposed generative classifier model not only achieves superior classification results on CBIS-DDSM, INbreast and Vin-Dr Mammo datasets, but also introduces a novel approach to image classification in a broader context. Our code is publicly available at this https URL

[CV-64] IBURD: Image Blending for Underwater Robotic Detection

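[Quick read]: This paper addresses the scarcity and limited variety of training data for marine debris detection on autonomous underwater vehicles (AUVs). The key solution is an image blending pipeline called IBURD, which generates realistic synthetic images together with pixel-level annotations. Using Poisson editing and style transfer, it robustly blends even transparent objects into arbitrary backgrounds and automatically adjusts the style of the blended image to match the blurriness of the target background image. The generated images of marine debris on real underwater backgrounds help deep-learned vision algorithms cope with challenging underwater conditions, supporting the use of AUVs in environmental cleanup missions.

Link: https://arxiv.org/abs/2502.17706
Authors: Jungseok Hong,Sakshi Singh,Junaed Sattar
Affiliations: University of Minnesota–Twin Cities
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: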

Abstract:We present an image blending pipeline, IBURD, that creates realistic synthetic images to assist in the training of deep detectors for use on underwater autonomous vehicles (AUVs) for marine debris detection tasks. Specifically, IBURD generates both images of underwater debris and their pixel-level annotations, using source images of debris objects, their annotations, and target background images of marine environments. With Poisson editing and style transfer techniques, IBURD is even able to robustly blend transparent objects into arbitrary backgrounds and automatically adjust the style of blended images using the blurriness metric of target background images. These generated images of marine debris in actual underwater backgrounds address the data scarcity and data variety problems faced by deep-learned vision algorithms in challenging underwater conditions, and can enable the use of AUVs for environmental cleanup missions. Both quantitative and robotic evaluations of IBURD demonstrate the efficacy of the proposed approach for robotic detection of marine debris.

[CV-65] Semi-Supervised Weed Detection in Vegetable Fields: In-domain and Cross-domain Experiments

[Quick read]: This paper addresses the challenge of robust weed detection for precision weeding, focusing on the lack of large-scale labeled data. The key solution is a YOLOv8-based semi-supervised object detection (SSOD) method called WeedTeacher, which leverages unlabeled data to improve weed detection and is evaluated in both in-domain and cross-domain settings. In the in-domain experiment, WeedTeacher improves on its supervised baseline (YOLOv8l) by 2.6% mAP@50 and 3.1% mAP@50:95.

Link: https://arxiv.org/abs/2502.17673
Authors: Boyang Deng,Yuzhen Lu
Affiliations: Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures

Abstract:Robust weed detection remains a challenging task in precision weeding, requiring not only potent weed detection models but also large-scale, labeled data. However, the labeled data adequate for model training is practically difficult to come by due to the time-consuming, labor-intensive process that requires specialized expertise to recognize plant species. This study introduces semi-supervised object detection (SSOD) methods for leveraging unlabeled data for enhanced weed detection and proposes a new YOLOv8-based SSOD method, i.e., WeedTeacher. An experimental comparison of four SSOD methods, including three existing frameworks (i.e., DenseTeacher, EfficientTeacher, and SmallTeacher) and WeedTeacher, alongside fully supervised baselines, was conducted for weed detection in both in-domain and cross-domain contexts. A new, diverse weed dataset was created as the testbed, comprising a total of 19,931 field images from two differing domains, including 8,435 labeled (basic-domain) images acquired by handheld devices from 2021 to 2023 and 11,496 unlabeled (new-domain) images acquired by a ground-based mobile platform in 2024. The in-domain experiment, with models trained using 10% of the labeled, basic-domain images and tested on the remaining 90% of the data, showed that the YOLOv8-based WeedTeacher achieved the highest accuracy among all four SSOD methods, with an improvement of 2.6% mAP@50 and 3.1% mAP@50:95 over its supervised baseline (i.e., YOLOv8l). In the cross-domain experiment where the unlabeled new-domain data was incorporated, all four SSOD methods, however, resulted in no or limited improvements over their supervised counterparts. Research is needed to address the difficulty of cross-domain data utilization for robust weed detection.

[CV-66] CalibRefine: Deep Learning-Based Online Automatic Targetless LiDAR-Camera Calibration with Iterative and Attention-Driven Post-Refinement

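[Quick read]: This paper addresses the accuracy and degree of automation of multi-sensor calibration in practical applications. Existing methods usually depend on manually placed targets, preliminary parameter estimates, or intensive data preprocessing, which limits their scalability and adaptability in real-world settings. The paper proposes CalibRefine, a fully automatic, targetless, online calibration framework that registers LiDAR and camera data in four stages: (1) a Common Feature Discriminator trained on relative positions, appearance embeddings, and semantic classes to generate reliable LiDAR-camera correspondences; (2) a coarse homography-based calibration; (3) an iterative refinement that progressively improves alignment; and (4) an attention-based refinement that uses a Vision Transformer and cross-attention to handle non-planar distortions. Experiments show that CalibRefine delivers high-precision calibration with minimal human involvement, outperforms existing targetless methods, and is competitive with or better than manually tuned baselines under complex real-world conditions.

Link: https://arxiv.org/abs/2502.17648
Authors: Lei Cheng,Lihao Guo,Tianya Zhang,Tam Bang,Austin Harris,Mustafa Hajij,Mina Sartipi,Siyang Cao
Affiliations: Arizona State University; Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: Submitted to Transportation Research Part C: Emerging Technologies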

Abstract:Accurate multi-sensor calibration is essential for deploying robust perception systems in applications such as autonomous driving, robotics, and intelligent transportation. Existing LiDAR-camera calibration methods often rely on manually placed targets, preliminary parameter estimates, or intensive data preprocessing, limiting their scalability and adaptability in real-world settings. In this work, we propose a fully automatic, targetless, and online calibration framework, CalibRefine, which directly processes raw LiDAR point clouds and camera images. Our approach is divided into four stages: (1) a Common Feature Discriminator that trains on automatically detected objects–using relative positions, appearance embeddings, and semantic classes–to generate reliable LiDAR-camera correspondences, (2) a coarse homography-based calibration, (3) an iterative refinement to incrementally improve alignment as additional data frames become available, and (4) an attention-based refinement that addresses non-planar distortions by leveraging a Vision Transformer and cross-attention mechanisms. Through extensive experiments on two urban traffic datasets, we show that CalibRefine delivers high-precision calibration results with minimal human involvement, outperforming state-of-the-art targetless methods and remaining competitive with, or surpassing, manually tuned baselines. Our findings highlight how robust object-level feature matching, together with iterative and self-supervised attention-based adjustments, enables consistent sensor fusion in complex, real-world conditions without requiring ground-truth calibration matrices or elaborate data preprocessing.
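
Stage (2) of the abstract, the coarse homography-based calibration, can be illustrated with the standard direct linear transform (DLT). This is only a sketch under assumptions: how CalibRefine turns LiDAR points into 2D coordinates is not stated in the abstract, so `src` here is simply a hypothetical set of 2D points (for example ground-plane coordinates) and `dst` their matched image pixels. The function names are illustrative, not the paper's code.

```python
import numpy as np


def estimate_homography(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Estimate the 3x3 homography H with dst ~ H @ src from N >= 4 point pairs of shape (N, 2)."""
    if src.shape != dst.shape or src.ndim != 2 or src.shape[1] != 2 or src.shape[0] < 4:
        raise ValueError("src and dst must both have shape (N, 2) with N >= 4")
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on the 9 entries of H.
        rows.append([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y, u])
        rows.append([0.0, 0.0, 0.0, -x, -y, -1.0, v * x, v * y, v])
    a = np.asarray(rows, dtype=float)
    # The least-squares solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(a)
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]  # fix the scale; assumes h[2, 2] is not close to zero


def apply_homography(h: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Map (N, 2) points through H using homogeneous coordinates."""
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))])
    mapped = homog @ h.T
    return mapped[:, :2] / mapped[:, 2:3]


# Synthetic check with a made-up homography (not calibration data from the paper).
true_h = np.array([[1.2, 0.1, 5.0], [-0.05, 0.9, -3.0], [1e-3, 2e-3, 1.0]])
src = np.array([[0, 0], [10, 0], [10, 10], [0, 10], [5, 3], [2, 8]], dtype=float)
dst = apply_homography(true_h, src)
est_h = estimate_homography(src, dst)
print(np.abs(est_h - true_h).max())
```

A homography is exact only for points on a common plane, which fits the abstract's remark that the later attention-based stage handles non-planar distortions. With noisy correspondences one would typically wrap this estimate in a robust scheme such as RANSAC.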

[CV-67] A Priori Generalizability Estimate for a CNN

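[Quick read]: This paper addresses performance assessment and diagnosis of convolutional neural networks in image classification and segmentation. The key is to formulate truncated singular value decompositions (TSVD) of entire networks and to define two metrics, the Right Projection Ratio and the Left Projection Ratio. Using the computed left and right singular vectors, these metrics evaluate how faithfully an image or a label projects onto those vectors, which helps identify images on which the model is likely to perform poorly as well as class imbalance in the data. The Right Projection Ratio, which needs only unlabeled data, is correlated with model performance in image segmentation, suggesting it could serve as an estimate of how well the model will do on a sample.

Link: https://arxiv.org/abs/2502.17622
Authors: Cito Balsells,Beatrice Riviere,David Fuentes
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: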

Abstract:We formulate truncated singular value decompositions of entire convolutional neural networks. We demonstrate the computed left and right singular vectors are useful in identifying which images the convolutional neural network is likely to perform poorly on. To create this diagnostic tool, we define two metrics: the Right Projection Ratio and the Left Projection Ratio. The Right (Left) Projection Ratio evaluates the fidelity of the projection of an image (label) onto the computed right (left) singular vectors. We observe that both ratios are able to identify the presence of class imbalance for an image classification problem. Additionally, the Right Projection Ratio, which only requires unlabeled data, is found to be correlated to the model’s performance when applied to image segmentation. This suggests the Right Projection Ratio could be a useful metric to estimate how likely the model is to perform well on a sample.
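
The abstract does not give a formula for the Right Projection Ratio, so the sketch below assumes one plausible reading: the norm of an input's projection onto the top-k right singular vectors, divided by the norm of the input. A random matrix stands in for the network, whereas the paper decomposes an entire CNN. All names are hypothetical.

```python
import numpy as np


def right_projection_ratio(x: np.ndarray, v_k: np.ndarray) -> float:
    """Fraction of the norm of x captured by the orthonormal columns of v_k (assumed definition)."""
    coeffs = v_k.T @ x  # coordinates of x in the retained subspace
    return float(np.linalg.norm(coeffs) / np.linalg.norm(x))


rng = np.random.default_rng(0)
w = rng.standard_normal((5, 20))  # toy linear "network": 20 inputs, 5 outputs
_, _, vt = np.linalg.svd(w, full_matrices=False)
v_k = vt[:3].T  # top-3 right singular vectors, shape (20, 3)

in_span = v_k @ np.array([1.0, -2.0, 0.5])  # input lying inside the retained subspace
generic = rng.standard_normal(20)           # generic input, mostly outside it

ratio_in_span = right_projection_ratio(in_span, v_k)
ratio_generic = right_projection_ratio(generic, v_k)
print(ratio_in_span, ratio_generic)
```

Under this reading, a ratio near 1 means the truncated decomposition represents the input well, and a low ratio flags an input the retained singular vectors barely describe. No labels are involved, which matches the abstract's note that the right ratio needs only unlabeled data.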

[CV-68] Laplace-Beltrami Operator for Gaussian Splatting

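[Quick read]: This paper addresses geometry processing directly on the Gaussian splatting representation, in particular the difficulty of computing the Laplace-Beltrami operator. The key is a formulation that computes the Laplace-Beltrami operator directly on Gaussian splatting using the Mahalanobis distance; it achieves high accuracy, copes with the large number of outliers typical of Gaussian splatting, and can be used to evaluate output quality during optimization.

Link: https://arxiv.org/abs/2502.17531
Authors: Hongyu Zhou,Zorah Lähner
Affiliations: University of Bonn; Lamarr Institute
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages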

Abstract:With the rising popularity of 3D Gaussian splatting and the expansion of its applications from rendering to 3D reconstruction, there also comes a need for geometry processing applications directly on this new representation. While considering the centers of Gaussians as a point cloud or meshing them is an option that allows existing algorithms to be applied, this might ignore information present in the data or be unnecessarily expensive. Additionally, Gaussian splatting tends to contain a large number of outliers which do not affect the rendering quality but need to be handled correctly in order not to produce noisy results in geometry processing applications. In this work, we propose a formulation to compute the Laplace-Beltrami operator, a widely used tool in geometry processing, directly on Gaussian splatting using the Mahalanobis distance. While conceptually similar to a point cloud Laplacian, our experiments show superior accuracy on the point clouds encoded in the Gaussian splatting centers and, additionally, the operator can be used to evaluate the quality of the output during optimization.
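
To make the idea of a Mahalanobis-based Laplacian concrete, here is a generic sketch. It is not the paper's formulation. It builds a dense graph Laplacian over Gaussian centers with heat-kernel weights, replacing the Euclidean distance of a point-cloud Laplacian with a symmetrised Mahalanobis distance from each Gaussian's covariance. The kernel, the symmetrisation, and the parameter `t` are all assumptions.

```python
import numpy as np


def mahalanobis_laplacian(means: np.ndarray, covs: np.ndarray, t: float = 1.0) -> np.ndarray:
    """Dense graph Laplacian L = D - W over Gaussian centers.

    means: (n, 3) centers; covs: (n, 3, 3) symmetric positive-definite covariances.
    """
    inv_covs = np.linalg.inv(covs)                # batched inverse, shape (n, 3, 3)
    diff = means[None, :, :] - means[:, None, :]  # diff[i, j] = mu_j - mu_i
    # d2[i, j] = diff[i, j]^T  Sigma_i^{-1}  diff[i, j]  (squared Mahalanobis distance)
    d2 = np.einsum("ijk,ikl,ijl->ij", diff, inv_covs, diff)
    d2 = 0.5 * (d2 + d2.T)                        # symmetrise, since Sigma_i != Sigma_j in general
    w = np.exp(-d2 / t)                           # heat-kernel weights
    np.fill_diagonal(w, 0.0)                      # no self-loops
    return np.diag(w.sum(axis=1)) - w


# Toy data: 6 random Gaussians with random SPD covariances (illustrative only).
rng = np.random.default_rng(0)
means = rng.standard_normal((6, 3))
a = rng.standard_normal((6, 3, 3))
covs = a @ a.transpose(0, 2, 1) + 0.1 * np.eye(3)
lap = mahalanobis_laplacian(means, covs)
print(np.round(np.linalg.eigvalsh(lap), 4))
```

The anisotropy matters here: two centers lying along a splat's elongated axis count as close, while an outlier floating off the surface is far in Mahalanobis terms and gets a small weight. That is one intuition for why such a distance could help with the outliers the abstract mentions.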

[CV-69] On Neural Inertial Classification Networks for Pedestrian Activity Recognition

[Quick read]: This paper aims to improve neural inertial classification networks by defining and analyzing ten data-driven techniques, filling the gap left by the absence of a standardized benchmark. It examines three aspects of neural networks: network architecture, data augmentation, and data preprocessing. Experiments show that data augmentation through rotation and a multi-head architecture consistently yield the most significant improvements.

Link: https://arxiv.org/abs/2502.17520
Authors: Zeev Yampolsky,Ofir Kruzel,Victoria Khalfin Fekson,Itzik Klein
Affiliations: The Hatter Department of Marine Technologies, University of Haifa, Israel
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: arXiv admin note: substantial text overlap with arXiv:2501.01327

Abstract:Inertial sensors are crucial for recognizing pedestrian activity. Recent advances in deep learning have greatly improved inertial sensing performance and robustness. Different domains and platforms use deep-learning techniques to enhance network performance, but there is no common benchmark. The latter is crucial for fair comparison and evaluation within a standardized framework. The aim of this paper is to fill this gap by defining and analyzing ten data-driven techniques for improving neural inertial classification networks. In order to accomplish this, we focused on three aspects of neural networks: network architecture, data augmentation, and data preprocessing. The experiments were conducted across four datasets collected from 78 participants. In total, over 936 minutes of inertial data sampled between 50-200Hz were analyzed. Data augmentation through rotation and multi-head architecture consistently yields the most significant improvements. Additionally, this study outlines benchmarking strategies for enhancing neural inertial classification networks.
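
Rotation-based augmentation of inertial data, one of the two techniques the abstract singles out, can be sketched as applying the same random 3D rotation to the accelerometer and gyroscope channels of a window. The window layout (columns 0-2 accelerometer, 3-5 gyroscope) and the rotation sampler are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np


def random_rotation_matrix(rng: np.random.Generator) -> np.ndarray:
    """Draw a random 3x3 rotation via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))  # remove the sign ambiguity of the QR factorisation
    if np.linalg.det(q) < 0:     # flip one axis so that det = +1 (proper rotation)
        q[:, 0] = -q[:, 0]
    return q


def rotate_window(window: np.ndarray, rotation: np.ndarray) -> np.ndarray:
    """Rotate an inertial window of shape (T, 6): columns 0-2 accel, 3-5 gyro (assumed layout)."""
    accel = window[:, :3] @ rotation.T
    gyro = window[:, 3:] @ rotation.T
    return np.hstack([accel, gyro])


rng = np.random.default_rng(0)
rotation = random_rotation_matrix(rng)
window = rng.standard_normal((100, 6))  # synthetic stand-in for a recorded window
augmented = rotate_window(window, rotation)
print(augmented.shape)
```

The rotation changes the apparent sensor orientation but keeps the magnitude of every sample, so the augmented window imitates the same activity recorded with the device held differently.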

[CV-70] Doctor-in-the-Loop: An Explainable Multi-View Deep Learning Framework for Predicting Pathological Response in Non-Small Cell Lung Cancer

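[Quick read]: This paper addresses the high post-surgical recurrence rate of non-small cell lung cancer (NSCLC) by improving the accuracy of pathological response prediction to guide personalized treatment. The key solution is a framework called Doctor-in-the-Loop, which combines expert-driven domain knowledge with explainable AI techniques to direct the model toward clinically relevant anatomical regions, improving interpretability and trustworthiness. It uses a gradual multi-view strategy that progressively narrows the focus from broad contextual features to finer, lesion-specific details, incorporating domain insights at every stage to improve predictive accuracy while keeping the model's decision process aligned with clinical reasoning.

Link: https://arxiv.org/abs/2502.17503
Authors: Alice Natalina Caragliano,Filippo Ruffini,Carlo Greco,Edy Ippolito,Michele Fiore,Claudia Tacconi,Lorenzo Nibid,Giuseppe Perrone,Sara Ramella,Paolo Soda,Valerio Guarrasi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: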

Abstract:Non-small cell lung cancer (NSCLC) remains a major global health challenge, with high post-surgical recurrence rates underscoring the need for accurate pathological response predictions to guide personalized treatments. Although artificial intelligence models show promise in this domain, their clinical adoption is limited by the lack of medically grounded guidance during training, often resulting in non-explainable intrinsic predictions. To address this, we propose Doctor-in-the-Loop, a novel framework that integrates expert-driven domain knowledge with explainable artificial intelligence techniques, directing the model toward clinically relevant anatomical regions and improving both interpretability and trustworthiness. Our approach employs a gradual multi-view strategy, progressively refining the model’s focus from broad contextual features to finer, lesion-specific details. By incorporating domain insights at every stage, we enhance predictive accuracy while ensuring that the model’s decision-making process aligns more closely with clinical reasoning. Evaluated on a dataset of NSCLC patients, Doctor-in-the-Loop delivers promising predictive performance and provides transparent, justifiable outputs, representing a significant step toward clinically explainable artificial intelligence in oncology.

[CV-71] FantasyID: Face Knowledge Enhanced ID-Preserving Video Generation

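[Quick read]: This paper addresses insufficient facial dynamics in tuning-free identity-preserving text-to-video generation (IPT2V) built on large-scale pre-trained video diffusion models. The key is to incorporate a 3D facial geometry prior to ensure plausible facial structures during video synthesis, and to use a multi-view face augmentation strategy to increase variation in facial expressions and head poses. In addition, after the 2D and 3D features are fused, a learnable layer-aware adaptive mechanism selectively injects them into each individual diffusion transformer (DiT) layer, balancing identity preservation and motion dynamics.

Link: https://arxiv.org/abs/2502.13995
Authors: Yunpeng Zhang,Qiang Wang,Fan Jiang,Yaqi Fan,Mu Xu,Yonggang Qi
Affiliations: AMAP, Alibaba Group; Beijing University of Posts and Telecommunications
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: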

Abstract:Tuning-free approaches adapting large-scale pre-trained video diffusion models for identity-preserving text-to-video generation (IPT2V) have gained popularity recently due to their efficacy and scalability. However, significant challenges remain in achieving satisfactory facial dynamics while keeping the identity unchanged. In this work, we present a novel tuning-free IPT2V framework by enhancing face knowledge of the pre-trained video model built on diffusion transformers (DiT), dubbed FantasyID. Essentially, a 3D facial geometry prior is incorporated to ensure plausible facial structures during video synthesis. To prevent the model from learning copy-paste shortcuts that simply replicate the reference face across frames, a multi-view face augmentation strategy is devised to capture diverse 2D facial appearance features, hence increasing the dynamics over the facial expressions and head poses. Additionally, after blending the 2D and 3D features as guidance, instead of naively employing cross-attention to inject guidance cues into DiT layers, a learnable layer-aware adaptive mechanism is employed to selectively inject the fused features into each individual DiT layer, facilitating balanced modeling of identity preservation and motion dynamics. Experimental results validate our model’s superiority over the current tuning-free IPT2V methods.

[CV-72] A Reverse Mamba Attention Network for Pathological Liver Segmentation

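[Quick read]: This paper addresses the difficulty of recognizing complex morphological patterns in pathological liver segmentation, especially in CT and MRI where tissue variations cause traditional segmentation methods to fail. The key is a novel architecture, RMA-Mamba, which integrates the efficient sequence modeling of Vision Mamba (VMamba) with the targeted feature refinement of a reverse mamba attention (RMA) module to achieve superior feature learning across multiple scales. This dual-mechanism approach handles complex morphological patterns robustly while remaining computationally efficient.

Link: https://arxiv.org/abs/2502.18232
Authors: Jun Zeng,Ulas Bagci,Debesh Jha
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 3 figures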

Abstract:We present RMA-Mamba, a novel architecture that advances the capabilities of vision state space models through a specialized reverse mamba attention module (RMA). The key innovation lies in RMA-Mamba’s ability to capture long-range dependencies while maintaining precise local feature representation through its hierarchical processing pipeline. By integrating Vision Mamba (VMamba)'s efficient sequence modeling with RMA’s targeted feature refinement, our architecture achieves superior feature learning across multiple scales. This dual-mechanism approach enables robust handling of complex morphological patterns while maintaining computational efficiency. We demonstrate RMA-Mamba’s effectiveness in the challenging domain of pathological liver segmentation (from both CT and MRI), where traditional segmentation approaches often fail due to tissue variations. When evaluated on a newly introduced cirrhotic liver dataset (CirrMRI600+) of T2-weighted MRI scans, RMA-Mamba achieves the state-of-the-art performance with a Dice coefficient of 92.08%, mean IoU of 87.36%, and recall of 92.96%. The architecture’s generalizability is further validated on the cancerous liver segmentation from CT scans (LiTS: Liver Tumor Segmentation dataset), yielding a Dice score of 92.9% and mIoU of 88.99%. The source code of the proposed RMA-Mamba is available at this https URL.
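
The Dice coefficient and IoU reported in this abstract are overlap scores between a predicted mask and a reference mask. A minimal sketch of both for binary masks follows. The toy masks are made up, and the convention of returning 1.0 when both masks are empty is an assumption.

```python
import numpy as np


def dice_and_iou(pred: np.ndarray, target: np.ndarray) -> tuple[float, float]:
    """Return (Dice, IoU) for two binary masks of the same shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement (a convention)
    return float(2 * intersection / total), float(intersection / union)


# Toy 4x4 masks: a 2x2 prediction inside a 2x3 reference region.
pred = np.zeros((4, 4), dtype=int)
pred[1:3, 1:3] = 1
target = np.zeros((4, 4), dtype=int)
target[1:3, 1:4] = 1
dice, iou = dice_and_iou(pred, target)
print(dice, iou)
```

For a single pair of masks the two scores are tied by Dice = 2·IoU / (1 + IoU). Scores averaged over many cases, as dataset-level results such as those in the abstract presumably are, need not satisfy that identity.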

[CV-73] Liver Cirrhosis Stage Estimation from MRI with Deep Learning

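[Quick read]: This paper addresses automated staging of liver cirrhosis. The key is an end-to-end deep learning framework that combines multi-scale feature learning with sequence-specific attention mechanisms to capture subtle tissue changes as cirrhosis progresses. Validated on the CirrMRI600+ dataset, the model significantly outperforms traditional radiomics-based approaches on both T1-weighted (T1W) and T2-weighted (T2W) sequences.

Link: https://arxiv.org/abs/2502.18225
Authors: Jun Zeng,Debesh Jha,Ertugrul Aktas,Elif Keles,Alpay Medetalibeyoglu,Matthew Antalek,Amir A. Borhani,Daniela P. Ladner,Gorkem Durak,Ulas Bagci
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 1 figure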

Abstract:We present an end-to-end deep learning framework for automated liver cirrhosis stage estimation from multi-sequence MRI. Cirrhosis is the severe scarring (fibrosis) of the liver and a common endpoint of various chronic liver diseases. Early diagnosis is vital to prevent complications such as decompensation and cancer, which significantly decreases life expectancy. However, diagnosing cirrhosis in its early stages is challenging, and patients often present with life-threatening complications. Our approach integrates multi-scale feature learning with sequence-specific attention mechanisms to capture subtle tissue variations across cirrhosis progression stages. Using CirrMRI600+, a large-scale publicly available dataset of 628 high-resolution MRI scans from 339 patients, we demonstrate state-of-the-art performance in three-stage cirrhosis classification. Our best model achieves 72.8% accuracy on T1W and 63.8% on T2W sequences, significantly outperforming traditional radiomics-based approaches. Through extensive ablation studies, we show that our architecture effectively learns stage-specific imaging biomarkers. We establish new benchmarks for automated cirrhosis staging and provide insights for developing clinically applicable deep learning systems. The source code will be available at this https URL.

[CV-74] VesselSAM: Leveraging SAM for Aortic Vessel Segmentation with LoRA and Atrous Attention

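[Quick read]: This paper addresses aortic vessel segmentation for clinical diagnosis and treatment planning, a challenging task for complex anatomical structures such as vessels. The key solution is VesselSAM, a modified version of the Segment Anything Model (SAM) designed specifically for aortic vessel segmentation. Its core innovation is the AtrousLoRA module, which combines Atrous Attention with Low-Rank Adaptation (LoRA) to improve segmentation performance. Atrous Attention captures multi-scale context, preserving both local detail and global information, while LoRA enables efficient fine-tuning with fewer trainable parameters, keeping computation efficient.

Link: https://arxiv.org/abs/2502.18185
Authors: Adnan Iltaf,Rayan Merghani Ahmed,Bin Li,Shoujun Zhou
Affiliations: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE JBHI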

Abstract:Medical image segmentation is crucial for clinical diagnosis and treatment planning, particularly for complex anatomical structures like vessels. In this work, we propose VesselSAM, a modified version of the Segment Anything Model (SAM), specifically designed for aortic vessel segmentation. VesselSAM incorporates AtrousLoRA, a novel module that combines Atrous Attention with Low-Rank Adaptation (LoRA), to improve segmentation performance. Atrous Attention enables the model to capture multi-scale contextual information, preserving both fine local details and broader global context. At the same time, LoRA facilitates efficient fine-tuning of the frozen SAM image encoder, reducing the number of trainable parameters and ensuring computational efficiency. We evaluate VesselSAM on two challenging datasets: the Aortic Vessel Tree (AVT) dataset and the Type-B Aortic Dissection (TBAD) dataset. VesselSAM achieves state-of-the-art performance with DSC scores of 93.50%, 93.25%, 93.02%, and 93.26% across multiple medical centers. Our results demonstrate that VesselSAM delivers high segmentation accuracy while significantly reducing computational overhead compared to existing large-scale models. This development paves the way for enhanced AI-based aortic vessel segmentation in clinical environments. The code and models will be released at this https URL.
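
The parameter saving that LoRA brings to a frozen encoder can be seen in a few lines. The sketch below shows plain LoRA on one linear layer in NumPy: the frozen weight W is augmented by a low-rank update (alpha / r)·B·A, and only A and B would be trained. It does not model AtrousLoRA or Atrous Attention, whose details are not in the abstract. The class name, layer size, rank, and alpha are illustrative assumptions.

```python
import numpy as np


class LoRALinear:
    """Linear layer y = x W^T + (alpha / r) * x A^T B^T with W frozen (plain LoRA sketch)."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float, rng: np.random.Generator):
        d_out, d_in = weight.shape
        self.weight = weight                                   # frozen pre-trained weight
        self.a = rng.standard_normal((rank, d_in)) * 0.01      # trainable, small random init
        self.b = np.zeros((d_out, rank))                       # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.weight.T + self.scale * (x @ self.a.T) @ self.b.T

    def trainable_parameters(self) -> int:
        return self.a.size + self.b.size


rng = np.random.default_rng(0)
frozen = rng.standard_normal((768, 768))  # hypothetical projection size
layer = LoRALinear(frozen, rank=4, alpha=8.0, rng=rng)
x = rng.standard_normal((2, 768))
print(layer.trainable_parameters(), frozen.size)
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, and fine-tuning then moves only the 6,144 adapter parameters rather than the 589,824 entries of W in this toy configuration.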

[CV-75] 3D Anatomical Structure-guided Deep Learning for Accurate Diffusion Microstructure Imaging

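[Quick read]: This paper addresses the challenge of accurately estimating brain tissue microstructure from clinically feasible diffusion magnetic resonance imaging (dMRI) scans. The key is a novel framework that simultaneously leverages anatomical information from macro-level priors and mutual information across parameters to achieve high-fidelity, rapid diffusion microstructure imaging. Experiments show that the method outperforms four state-of-the-art techniques in estimating parametric maps of multiple diffusion models, reaching a peak signal-to-noise ratio (PSNR) of 30.51 ± 0.58 and a structural similarity index measure (SSIM) of 0.97 ± 0.004, with a 15× acceleration over the dense sampling approach.

Link: https://arxiv.org/abs/2502.17933
Authors: Xinrui Ma,Jian Cheng,Wenxin Fan,Ruoyou Wu,Yongquan Ye,Shanshan Wang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: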

Abstract:Diffusion magnetic resonance imaging (dMRI) is a crucial non-invasive technique for exploring the microstructure of the living human brain. Traditional hand-crafted and model-based tissue microstructure reconstruction methods often require extensive diffusion gradient sampling, which can be time-consuming and limits the clinical applicability of tissue microstructure information. Recent advances in deep learning have shown promise in microstructure estimation; however, accurately estimating tissue microstructure from clinically feasible dMRI scans remains challenging without appropriate constraints. This paper introduces a novel framework that achieves high-fidelity and rapid diffusion microstructure imaging by simultaneously leveraging anatomical information from macro-level priors and mutual information across parameters. This approach enhances time efficiency while maintaining accuracy in microstructure estimation. Experimental results demonstrate that our method outperforms four state-of-the-art techniques, achieving a peak signal-to-noise ratio (PSNR) of 30.51 ± 0.58 and a structural similarity index measure (SSIM) of 0.97 ± 0.004 in estimating parametric maps of multiple diffusion models. Notably, our method achieves a 15× acceleration compared to the dense sampling approach, which typically utilizes 270 diffusion gradients.
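
PSNR, one of the two fidelity metrics quoted above, is derived from the mean squared error between a reference map and an estimate. The sketch below shows the standard definition. The `data_range` argument (the maximum possible value of the map) is an assumption about normalisation that the abstract does not specify, and the arrays are synthetic.

```python
import numpy as np


def psnr(reference: np.ndarray, estimate: np.ndarray, data_range: float) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    mse = float(np.mean((reference - estimate) ** 2))
    if mse == 0.0:
        return float("inf")  # identical maps
    return float(10.0 * np.log10(data_range ** 2 / mse))


# Synthetic example: a constant error of 0.1 on a map scaled to [0, 1].
reference = np.zeros((8, 8))
estimate = np.full((8, 8), 0.1)
value = psnr(reference, estimate, data_range=1.0)
print(f"{value:.2f} dB")
```

On this scale every 10 dB corresponds to a tenfold reduction in mean squared error, so a PSNR near 30 dB on a unit-range map would mean a root-mean-square error of roughly 0.03.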

[CV-76] A graph neural network-based multispectral-view learning model for diabetic macular ischemia detection from color fundus photographs

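[Quick Read]: This paper addresses the difficulty of detecting diabetic macular ischemia (DMI), a vision-impairing condition caused by the loss of retinal capillaries in the macular area. Although color fundus photographs (CFPs) combined with artificial intelligence (AI) have been widely used to detect many eye diseases, including diabetic retinopathy (DR), their use for DMI detection remains largely unexplored. The key contribution is a graph neural network-based multispectral view learning (GNN-MSVL) model, which uses computational multispectral imaging (CMI) to reconstruct 24-wavelength multispectral fundus images from CFPs and applies a graph neural network with a customized jumper connection strategy to strengthen cross-spectral relationships, increasing sensitivity to DMI-related features.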

Link: https://arxiv.org/abs/2502.17886
Authors: Qinghua He,Hongyang Jiang,Danqi Fang,Dawei Yang,Truong X. Nguyen,Anran Ran,Clement C. Tham,Simon K. H. Szeto,Sobha Sivaprasad,Carol Y. Cheung
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diabetic macular ischemia (DMI), marked by the loss of retinal capillaries in the macular area, contributes to vision impairment in patients with diabetes. Although color fundus photographs (CFPs), combined with artificial intelligence (AI), have been extensively applied in detecting various eye diseases, including diabetic retinopathy (DR), their applications in detecting DMI remain unexplored, partly due to skepticism among ophthalmologists regarding its feasibility. In this study, we propose a graph neural network-based multispectral view learning (GNN-MSVL) model designed to detect DMI from CFPs. The model leverages higher spectral resolution to capture subtle changes in fundus reflectance caused by ischemic tissue, enhancing sensitivity to DMI-related features. The proposed approach begins with computational multispectral imaging (CMI) to reconstruct 24-wavelength multispectral fundus images from CFPs. ResNeXt101 is employed as the backbone for multi-view learning to extract features from the reconstructed images. Additionally, a GNN with a customized jumper connection strategy is designed to enhance cross-spectral relationships, facilitating comprehensive and efficient multispectral view learning. The study included a total of 1,078 macula-centered CFPs from 1,078 eyes of 592 patients with diabetes, of which 530 CFPs from 530 eyes of 300 patients were diagnosed with DMI. The model achieved an accuracy of 84.7 percent and an area under the receiver operating characteristic curve (AUROC) of 0.900 (95 percent CI: 0.852-0.937) on eye-level, outperforming both the baseline model trained from CFPs and human experts (p-values less than 0.01). These findings suggest that AI-based CFP analysis holds promise for detecting DMI, contributing to its early and low-cost screening.

[CV-77] TagGAN: A Generative Model for Data Tagging

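[Quick Read]: This paper addresses the opaque decision-making and poor performance of conventional diagnostic AI systems in settings that lack pixel-level annotations. The key to the solution is TagGAN, a framework based on generative adversarial networks (GANs) that performs weakly supervised, fine-grained disease map generation from data with image-level labels only. TagGAN thereby produces precise visualizations of disease-specific regions without pixel-level annotations and automatically generates binary masks to assist radiologists.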

Link: https://arxiv.org/abs/2502.17836
Authors: Muhammad Nawaz,Basma Nasir,Tehseen Zia,Zawar Hussain,Catarina Moreira
Affiliations: Data Science Institute, University of Technology Sydney, Australia; COMSATS University Islamabad, Pakistan; Medical Imaging and Diagnostics Lab, National Center of Artificial Intelligence, Pakistan; Macquarie University, Sydney, Australia
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Precise identification and localization of disease-specific features at the pixel-level are particularly important for early diagnosis, disease progression monitoring, and effective treatment in medical image analysis. However, conventional diagnostic AI systems lack decision transparency and cannot operate well in environments where there is a lack of pixel-level annotations. In this study, we propose a novel Generative Adversarial Networks (GANs)-based framework, TagGAN, which is tailored for weakly-supervised fine-grained disease map generation from purely image-level labeled data. TagGAN generates a pixel-level disease map during domain translation from an abnormal image to a normal representation. Later, this map is subtracted from the input abnormal image to convert it into its normal counterpart while preserving all the critical anatomical details. Our method is the first to generate fine-grained disease maps to visualize disease lesions in a weakly supervised setting without requiring pixel-level annotations. This development enhances the interpretability of diagnostic AI by providing precise visualizations of disease-specific regions. It also introduces automated binary mask generation to assist radiologists. Empirical evaluations carried out on the benchmark datasets, CheXpert, TBX11K, and COVID-19, demonstrate the capability of TagGAN to outperform current top models in accurately identifying disease-specific pixels. This outcome highlights the capability of the proposed model to tag medical images, significantly reducing the workload for radiologists by eliminating the need for binary masks during training.

[CV-78] Label-free Prediction of Vascular Connectivity in Perfused Microvascular Networks in vitro

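[Quick Read]: This paper addresses the reliance of microvascular connectivity assessment on fluorescent labels, which can raise biocompatibility concerns or disrupt normal cell growth. The key to the solution is a label-free assessment method named Vessel Connectivity Network (VC-Net), which uses Vessel Queue Contrastive Learning (VQCL) and a class imbalance algorithm to cope with limited sample size, indistinct class features, and imbalanced class distribution, enabling label-free and continuous assessment of microvascular network connectivity.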

Link: https://arxiv.org/abs/2502.17759
Authors: Liang Xu,Pengwu Song,Shilu Zhu,Yang Zhang,Ru Zhang,Zhiyuan Zheng,Qingdong Zhang,Jie Gao,Chen Han,Mingzhai Sun,Peng Yao,Min Ye,Ronald X. Xu
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Continuous monitoring and in-situ assessment of microvascular connectivity have significant implications for culturing vascularized organoids and optimizing the therapeutic strategies. However, commonly used methods for vascular connectivity assessment heavily rely on fluorescent labels that may either raise biocompatibility concerns or interrupt the normal cell growth process. To address this issue, a Vessel Connectivity Network (VC-Net) was developed for label-free assessment of vascular connectivity. To validate the VC-Net, microvascular networks (MVNs) were cultured in vitro and their microscopic images were acquired at different culturing conditions as a training dataset. The VC-Net employs a Vessel Queue Contrastive Learning (VQCL) method and a class imbalance algorithm to address the issues of limited sample size, indistinctive class features and imbalanced class distribution in the dataset. The VC-Net successfully evaluated the vascular connectivity with no significant deviation from that by fluorescence imaging. In addition, the proposed VC-Net successfully differentiated the connectivity characteristics between normal and tumor-related MVNs. In comparison with those cultured in the regular microenvironment, the averaged connectivity of MVNs cultured in the tumor-related microenvironment decreased by 30.8%, whereas the non-connected area increased by 37.3%. This study provides a new avenue for label-free and continuous assessment of organoid or tumor vascularization in vitro.

[CV-79] SynthRAD2025 Grand Challenge dataset: generating synthetic CTs for radiotherapy

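[Quick Read]: This paper aims to advance synthetic imaging, in particular synthetic computed tomography (sCT), for radiotherapy. The key contribution is the SynthRAD2025 dataset and its benchmarking platform. The dataset contains 2362 cases from five European university medical centers, covering diverse scanners and protocols. Pre-processing steps such as rigid and deformable image registration ensure high-quality, modality-aligned images, and extensive quality assurance validates image consistency and usability. The dataset is split into training, validation, and test sets to maintain its integrity. It supports algorithm benchmarking and development, fostering robust and generalizable image synthesis algorithms and advancing personalized cancer care and adaptive radiotherapy.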

Link: https://arxiv.org/abs/2502.17609
Authors: Adrian Thummerer,Erik van der Bijl,Arthur Jr Galapon,Florian Kamp,Mark Savenije,Christina Muijs,Shafak Aluwini,Roel J.H.M. Steenbakkers,Stephanie Beuel,Martijn P.W. Intven,Johannes A. Langendijk,Stefan Both,Stefanie Corradini,Viktor Rogowski,Maarten Terpstra,Niklas Wahl,Christopher Kurz,Guillaume Landry,Matteo Maspero
Affiliations: Unknown
Subjects: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 22 pages, 8 tables, 4 figures; Under submission to Medical Physics, as dataset paper for the SynthRAD2025 Grand Challenge this https URL

Abstract:Medical imaging is essential in modern radiotherapy, supporting diagnosis, treatment planning, and monitoring. Synthetic imaging, particularly synthetic computed tomography (sCT), is gaining traction in radiotherapy. The SynthRAD2025 dataset and Grand Challenge promote advancements in sCT generation by providing a benchmarking platform for algorithms using cone-beam CT (CBCT) and magnetic resonance imaging (MRI). The dataset includes 2362 cases: 890 MRI-CT and 1472 CBCT-CT pairs from head-and-neck, thoracic, and abdominal cancer patients treated at five European university medical centers (UMC Groningen, UMC Utrecht, Radboud UMC, LMU University Hospital Munich, and University Hospital of Cologne). Data were acquired with diverse scanners and protocols. Pre-processing, including rigid and deformable image registration, ensures high-quality, modality-aligned images. Extensive quality assurance validates image consistency and usability. All imaging data is provided in MetaImage (.mha) format, ensuring compatibility with medical image processing tools. Metadata, including acquisition parameters and registration details, is available in structured CSV files. To maintain dataset integrity, SynthRAD2025 is divided into training (65%), validation (10%), and test (25%) sets. The dataset is accessible at this https URL under the SynthRAD2025 collection. This dataset supports benchmarking and the development of synthetic imaging techniques for radiotherapy applications. Use cases include sCT generation for MRI-only and MR-guided photon/proton therapy, CBCT-based dose calculations, and adaptive radiotherapy workflows. By integrating diverse acquisition settings, SynthRAD2025 fosters robust, generalizable image synthesis algorithms, advancing personalized cancer care and adaptive radiotherapy. 
Cite as: arXiv:2502.17609v1 [physics.med-ph], https://doi.org/10.48550/arXiv.2502.17609. Submitted by Matteo Maspero on Mon, 24 Feb 2025 19:53:09 UTC (v1, 985 KB).

[CV-80] Data-Driven Pseudo-spectral Full Waveform Inversion via Deep Neural Networks

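[Quick Read]: This paper addresses the gap in incorporating the pseudo-spectral approach into deep learning frameworks. The key to the solution is reformulating the pseudo-spectral full waveform inversion (FWI) problem as a deep neural network (DNN) algorithm for a data-driven pseudo-spectral approach. The proposed DNN framework is formulated theoretically, assessed qualitatively on synthetic data, applied to a two-dimensional Marmousi dataset, and compared against deterministic and time-domain approaches. The results indicate that the data-driven pseudo-spectral DNN outperforms classical FWI in deeper and over-thrust areas, which the authors attribute to its nature as a global approximator that is not bound by the forward-modelling physical constraints of ray tracing.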

Link: https://arxiv.org/abs/2502.17608
Authors: Christopher Zerafa,Pauline Galea,Cristiana Sebu
Affiliations: University of Malta
Subjects: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 6 pages, review paper

Abstract:Full waveform inversion (FWI) seeks to achieve a high-resolution model of the subsurface through the application of multi-variate optimization to the seismic inverse problem. Although now a mature technology, FWI has limitations related to the choice of the appropriate solver for the forward problem in challenging environments requiring complex assumptions, and very wide angle and multi-azimuth data necessary for full reconstruction are often not available. Deep Learning techniques have emerged as excellent optimization frameworks. These exist between data and theory-guided methods. Data-driven methods do not impose a wave propagation model and are not exposed to modelling errors. On the contrary, deterministic models are governed by the laws of physics. Application of seismic FWI has recently started to be investigated within Deep Learning. This has focussed on the time-domain approach, while the pseudo-spectral domain has not yet been explored. However, classical FWI experienced major breakthroughs when pseudo-spectral approaches were employed. This work addresses the lacuna that exists in incorporating the pseudo-spectral approach within Deep Learning. This has been done by re-formulating the pseudo-spectral FWI problem as a Deep Learning algorithm for a data-driven pseudo-spectral approach. A novel DNN framework is proposed. This is formulated theoretically, qualitatively assessed on synthetic data, applied to a two-dimensional Marmousi dataset and evaluated against deterministic and time-based approaches. Inversion of data-driven pseudo-spectral DNN was found to outperform classical FWI for deeper and over-thrust areas. This is due to the global approximator nature of the technique, which is hence not bound by forward-modelling physical constraints from ray-tracing.
Cite as: arXiv:2502.17608v1 [physics.geo-ph], https://doi.org/10.48550/arXiv.2502.17608. Submitted by Christopher Zerafa on Mon, 24 Feb 2025 19:50:36 UTC (v1, 1,589 KB).

[CV-81] Using Graph Convolutional Networks to Address fMRI Small Data Problems

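[Quick Read]: This paper addresses learning from small data in medical image analysis, in particular the limited data available for hard tasks such as predicting treatment response (prognosis). The key to the solution is to apply graph neural networks to patient data that are themselves graphs (region-of-interest connectivity graphs) and to use a spectral representation for efficient propagation, yielding roughly a 12% improvement over traditional deep learning methods on the same data. The authors attribute the gain to a data-smoothing effect, which they measure in terms of triangle inequalities and the resulting transitivity.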

Link: https://arxiv.org/abs/2502.17489
Authors: Thomas Screven,Andras Necz,Jason Smucny,Ian Davidson
Affiliations: University of California, Davis
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages

Abstract:Although great advances in the analysis of neuroimaging data have been made, a major challenge is a lack of training data. This is less problematic in tasks such as diagnosis, where much data exists, but particularly prevalent in harder problems such as predicting treatment responses (prognosis), where data is focused and hence limited. Here, we address the learning from small data problems for medical imaging using graph neural networks. This is particularly challenging as the patient data are themselves graphs (regions of interest connectivity graphs). We show how a spectral representation of the connectivity data allows for efficient propagation that can yield approximately 12% improvement over traditional deep learning methods using the exact same data. We show that our method’s superior performance is due to a data smoothing result that can be measured by closing the number of triangle inequalities and thereby satisfying transitivity.
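To make the idea of propagation over a connectivity graph concrete, here is a minimal sketch of one smoothing step using the widely used symmetric normalization D^(-1/2)(A + I)D^(-1/2). This is the generic graph-convolution propagation rule, swapped in for illustration; it is an assumption and not necessarily the spectral representation used by the authors. The names `normalized_adjacency` and `propagate` and the three-node graph are hypothetical.

```python
import math


def normalized_adjacency(adjacency):
    """Return D^(-1/2) (A + I) D^(-1/2) for a square adjacency matrix A."""
    n = len(adjacency)
    # Add self-loops so each node keeps part of its own signal.
    with_loops = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    degree = [sum(row) for row in with_loops]
    return [[with_loops[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def propagate(adjacency, features):
    """One propagation step: multiply node features by the normalized adjacency."""
    norm = normalized_adjacency(adjacency)
    n, dims = len(features), len(features[0])
    return [[sum(norm[i][k] * features[k][c] for k in range(n)) for c in range(dims)]
            for i in range(n)]


# Hypothetical path graph 0 - 1 - 2 with a signal placed on node 0 only.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0], [0.0], [0.0]]
smoothed = propagate(adjacency, features)
```

After one step the signal has spread from node 0 to its neighbor node 1 but has not yet reached node 2, which is the smoothing behavior that repeated propagation builds on.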

Artificial Intelligence

[AI-0] Scalable Equilibrium Sampling with Sequential Boltzmann Generators

Link: https://arxiv.org/abs/2502.18462
Authors: Charlie B. Tan,Avishek Joey Bose,Chen Lin,Leon Klein,Michael M. Bronstein,Alexander Tong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Scalable sampling of molecular states in thermodynamic equilibrium is a long-standing challenge in statistical physics. Boltzmann generators tackle this problem by pairing powerful normalizing flows with importance sampling to obtain statistically independent samples under the target distribution. In this paper, we extend the Boltzmann generator framework and introduce Sequential Boltzmann generators (SBG) with two key improvements. The first is a highly efficient non-equivariant Transformer-based normalizing flow operating directly on all-atom Cartesian coordinates. In contrast to equivariant continuous flows of prior methods, we leverage exactly invertible non-equivariant architectures which are highly efficient both during sample generation and likelihood computation. As a result, this unlocks more sophisticated inference strategies beyond standard importance sampling. More precisely, as a second key improvement we perform inference-time scaling of flow samples using annealed Langevin dynamics which transports samples toward the target distribution leading to lower variance (annealed) importance weights which enable higher fidelity resampling with sequential Monte Carlo. SBG achieves state-of-the-art performance w.r.t. all metrics on molecular systems, demonstrating the first equilibrium sampling in Cartesian coordinates of tri, tetra, and hexapeptides that were so far intractable for prior Boltzmann generators.
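The importance-sampling step mentioned above can be illustrated with a short sketch of self-normalized importance weights and the effective sample size (ESS), a common diagnostic for weight variance. It assumes the log-density of the (unnormalized) target and of the flow proposal are already available per sample; the function names and the numbers are hypothetical and do not come from the SBG code.

```python
import math


def normalized_weights(log_target, log_proposal):
    """Self-normalized importance weights from per-sample log-densities."""
    log_w = [lp - lq for lp, lq in zip(log_target, log_proposal)]
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_w)
    unnormalized = [math.exp(lw - shift) for lw in log_w]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for weights that sum to one."""
    return 1.0 / sum(w * w for w in weights)


# Proposal matches the target exactly: uniform weights, ESS equals sample count.
matched = normalized_weights([-1.0, -2.0, -3.0, -1.5], [-1.0, -2.0, -3.0, -1.5])
# Proposal is badly mismatched: one sample dominates and ESS collapses toward 1.
skewed = normalized_weights([0.0, -10.0, -10.0, -10.0], [0.0, 0.0, 0.0, 0.0])
ess_matched = effective_sample_size(matched)
ess_skewed = effective_sample_size(skewed)
```

Lower-variance weights, as the abstract describes, correspond to an ESS closer to the number of samples, which makes resampling more reliable.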

[AI-1] MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning

Link: https://arxiv.org/abs/2502.18439
Authors: Chanwoo Park,Seungju Han,Xingzhi Guo,Asuman Ozdaglar,Kaiqing Zhang,Joo-Kyung Kim
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Leveraging multiple large language models (LLMs) to build collaborative multi-agentic workflows has demonstrated significant potential. However, most previous studies focus on prompting the out-of-the-box LLMs, relying on their innate capability for collaboration, which may not improve LLMs’ performance as shown recently. In this paper, we introduce a new post-training paradigm MAPoRL (Multi-Agent Post-co-training for collaborative LLMs with Reinforcement Learning), to explicitly elicit the collaborative behaviors and further unleash the power of multi-agentic LLM frameworks. In MAPoRL, multiple LLMs first generate their own responses independently and engage in a multi-turn discussion to collaboratively improve the final answer. In the end, a MAPoRL verifier evaluates both the answer and the discussion, by assigning a score that verifies the correctness of the answer, while adding incentives to encourage corrective and persuasive discussions. The score serves as the co-training reward, and is then maximized through multi-agent RL. Unlike existing LLM post-training paradigms, MAPoRL advocates the co-training of multiple LLMs together using RL for better generalization. Accompanied by analytical insights, our experiments demonstrate that training individual LLMs alone is insufficient to induce effective collaboration. In contrast, multi-agent co-training can boost the collaboration performance across benchmarks, with generalization to unseen domains.

[AI-2] ToMCAT: Theory-of-Mind for Cooperative Agents in Teams via Multiagent Diffusion Policies

Link: https://arxiv.org/abs/2502.18438
Authors: Pedro Sequeira,Vidyasagar Sadhu,Melinda Gervasio
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In this paper we present ToMCAT (Theory-of-Mind for Cooperative Agents in Teams), a new framework for generating ToM-conditioned trajectories. It combines a meta-learning mechanism, that performs ToM reasoning over teammates’ underlying goals and future behavior, with a multiagent denoising-diffusion model, that generates plans for an agent and its teammates conditioned on both the agent’s goals and its teammates’ characteristics, as computed via ToM. We implemented an online planning system that dynamically samples new trajectories (replans) from the diffusion model whenever it detects a divergence between a previously generated plan and the current state of the world. We conducted several experiments using ToMCAT in a simulated cooking domain. Our results highlight the importance of the dynamic replanning mechanism in reducing the usage of resources without sacrificing team performance. We also show that recent observations about the world and teammates’ behavior collected by an agent over the course of an episode combined with ToM inferences are crucial to generate team-aware plans for dynamic adaptation to teammates, especially when no prior information is provided about them.

[AI-3] PyEvalAI: AI-assisted evaluation of Jupyter Notebooks for immediate personalized feedback

Link: https://arxiv.org/abs/2502.18425
Authors: Nils Wandel,David Stotko,Alexander Schier,Reinhard Klein
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Grading student assignments in STEM courses is a laborious and repetitive task for tutors, often requiring a week to assess an entire class. For students, this delay of feedback prevents iterating on incorrect solutions, hampers learning, and increases stress when exercise scores determine admission to the final exam. Recent advances in AI-assisted education, such as automated grading and tutoring systems, aim to address these challenges by providing immediate feedback and reducing grading workload. However, existing solutions often fall short due to privacy concerns, reliance on proprietary closed-source models, lack of support for combining Markdown, LaTeX and Python code, or excluding course tutors from the grading process. To overcome these limitations, we introduce PyEvalAI, an AI-assisted evaluation system, which automatically scores Jupyter notebooks using a combination of unit tests and a locally hosted language model to preserve privacy. Our approach is free, open-source, and ensures tutors maintain full control over the grading process. A case study demonstrates its effectiveness in improving feedback speed and grading efficiency for exercises in a university-level course on numerics.

[AI-4] Comparative Analysis of MDL-VAE vs. Standard VAE on 202 Years of Gynecological Data

Link: https://arxiv.org/abs/2502.18412
Authors: Paula Santos
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures, 9th International Conference on Signal, Image Processing (SIPO 2025), Vancouver CA

Abstract:This study presents a comparative evaluation of a Variational Autoencoder (VAE) enhanced with Minimum Description Length (MDL) regularization against a Standard Autoencoder for reconstructing high-dimensional gynecological data. The MDL-VAE exhibits significantly lower reconstruction errors (MSE, MAE, RMSE) and more structured latent representations, driven by effective KL divergence regularization. Statistical analyses confirm these performance improvements are significant. Furthermore, the MDL-VAE shows consistent training and validation losses and achieves efficient inference times, underscoring its robustness and practical viability. Our findings suggest that incorporating MDL principles into VAE architectures can substantially improve data reconstruction and generalization, making it a promising approach for advanced applications in healthcare data modeling and analysis.

[AI-5] TSKANMixer: Kolmogorov-Arnold Networks with MLP-Mixer Model for Time Series Forecasting AAAI2025

Link: https://arxiv.org/abs/2502.18410
Authors: Young-Chae Hong,Bei Xiao,Yangho Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures, 7 tables and accepted at the AI4TS: AI for Time Series Analysis workshop, AAAI 2025

Abstract:Time series forecasting has long been a focus of research across diverse fields, including economics, energy, healthcare, and traffic management. Recent works have introduced innovative architectures for time series models, such as the Time-Series Mixer (TSMixer), which leverages multi-layer perceptrons (MLPs) to enhance prediction accuracy by effectively capturing both spatial and temporal dependencies within the data. In this paper, we investigate the capabilities of the Kolmogorov-Arnold Networks (KANs) for time-series forecasting by modifying TSMixer with a KAN layer (TSKANMixer). Experimental results demonstrate that TSKANMixer tends to improve prediction accuracy over the original TSMixer across multiple datasets, ranking among the top-performing models compared to other time series approaches. Our results show that the KANs are promising alternatives to improve the performance of time series forecasting by replacing or extending traditional MLPs.

[AI-6] The Gradient of Algebraic Model Counting AAAI2025

Link: https://arxiv.org/abs/2502.18406
Authors: Jaron Maene,Luc De Raedt
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at AAAI 2025

Abstract:Algebraic model counting unifies many inference tasks on logic formulas by exploiting semirings. Rather than focusing on inference, we consider learning, especially in statistical-relational and neurosymbolic AI, which combine logical, probabilistic and neural representations. Concretely, we show that the very same semiring perspective of algebraic model counting also applies to learning. This allows us to unify various learning algorithms by generalizing gradients and backpropagation to different semirings. Furthermore, we show how cancellation and ordering properties of a semiring can be exploited for more memory-efficient backpropagation. This allows us to obtain some interesting variations of state-of-the-art gradient-based optimisation methods for probabilistic logical models. We also discuss why algebraic model counting on tractable circuits does not lead to more efficient second-order optimization. Empirically, our algebraic backpropagation exhibits considerable speed-ups as compared to existing approaches.

[AI-7] How Far are LLMs from Real Search? A Comprehensive Study on Efficiency, Completeness, and Inherent Capabilities

Link: https://arxiv.org/abs/2502.18387
Authors: Minhua Lin,Hui Liu,Xianfeng Tang,Jingying Zeng,Zhenwei Dai,Chen Luo,Zheng Li,Xiang Zhang,Qi He,Suhang Wang
Subjects: Artificial Intelligence (cs.AI)
Comments: 31 pages, 9 figures, 18 tables

Abstract:Search plays a fundamental role in problem-solving across various domains, with most real-world decision-making problems being solvable through systematic search. Drawing inspiration from recent discussions on search and learning, we systematically explore the complementary relationship between search and Large Language Models (LLMs) from three perspectives. First, we analyze how learning can enhance search efficiency and propose Search via Learning (SeaL), a framework that leverages LLMs for effective and efficient search. Second, we further extend SeaL to SeaL-C to ensure rigorous completeness during search. Our evaluation across three real-world planning tasks demonstrates that SeaL achieves near-perfect accuracy while reducing search spaces by up to 99.1% compared to traditional approaches. Finally, we explore how far LLMs are from real search by investigating whether they can develop search capabilities independently. Our analysis reveals that while current LLMs struggle with efficient search in complex problems, incorporating systematic search strategies significantly enhances their problem-solving capabilities. These findings not only validate the effectiveness of our approach but also highlight the need for improving LLMs’ search abilities for real-world applications.

[AI-8] MindMem: Multimodal for Predicting Advertisement Memorability Using LLMs and Deep Learning AAAI2025

Link: https://arxiv.org/abs/2502.18371
Authors: Sepehr Asgarian,Qayam Jetha,Jouhyun Jeon
Subjects: Artificial Intelligence (cs.AI)
Comments: 7 pages, 5 figures, 4 Tables, AAAI 2025 Economics of Modern ML: Markets, Incentives, and Generative AI Workshop

Abstract:In the competitive landscape of advertising, success hinges on effectively navigating and leveraging complex interactions among consumers, advertisers, and advertisement platforms. These multifaceted interactions compel advertisers to optimize strategies for modeling consumer behavior, enhancing brand recall, and tailoring advertisement content. To address these challenges, we present MindMem, a multimodal predictive model for advertisement memorability. By integrating textual, visual, and auditory data, MindMem achieves state-of-the-art performance, with a Spearman’s correlation coefficient of 0.631 on the LAMBDA and 0.731 on the Memento10K dataset, consistently surpassing existing methods. Furthermore, our analysis identified key factors influencing advertisement memorability, such as video pacing, scene complexity, and emotional resonance. Expanding on this, we introduced MindMem-ReAd (MindMem-Driven Re-generated Advertisement), which employs Large Language Model-based simulations to optimize advertisement content and placement, resulting in up to a 74.12% improvement in advertisement memorability. Our results highlight the transformative potential of Artificial Intelligence in advertising, offering advertisers a robust tool to drive engagement, enhance competitiveness, and maximize impact in a rapidly evolving market.
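The memorability results above are reported as Spearman's rank correlation between predicted and observed scores. A minimal sketch of that statistic follows: it is the Pearson correlation computed on ranks, with tied values sharing their average rank. The helper names and the four toy scores are hypothetical and unrelated to the LAMBDA or Memento10K data.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predicted vs. observed memorability scores for four ads.
predicted = [0.2, 0.9, 0.5, 0.7]
observed = [0.1, 0.8, 0.6, 0.4]
rho = spearman(predicted, observed)
```

Because only the ordering matters, the statistic rewards a model for ranking ads correctly even when its absolute scores are miscalibrated.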

[AI-9] Which Contributions Deserve Credit? Perceptions of Attribution in Human-AI Co-Creation

Link: https://arxiv.org/abs/2502.18357
Authors: Jessica He,Stephanie Houde,Justin D. Weisz
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 30 pages, 5 figures. In CHI Conference on Human Factors in Computing Systems (CHI '25), April 26-May 1, 2025, Yokohama, Japan

Abstract:AI systems powered by large language models can act as capable assistants for writing and editing. In these tasks, the AI system acts as a co-creative partner, making novel contributions to an artifact-under-creation alongside its human partner(s). One question that arises in these scenarios is the extent to which AI should be credited for its contributions. We examined knowledge workers’ views of attribution through a survey study (N=155) and found that they assigned different levels of credit across different contribution types, amounts, and initiative. Compared to a human partner, we observed a consistent pattern in which AI was assigned less credit for equivalent contributions. Participants felt that disclosing AI involvement was important and used a variety of criteria to make attribution judgments, including the quality of contributions, personal values, and technology considerations. Our results motivate and inform new approaches for crediting AI contributions to co-created work.

[AI-10] GraphRank Pro: Advancing Talent Analytics Through Knowledge Graphs and Sentiment-Enhanced Skill Profiling

Link: https://arxiv.org/abs/2502.18315
Authors: Sirisha Velampalli,Chandrashekar Muniyappa
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The extraction of information from semi-structured text, such as resumes, has long been a challenge due to the diverse formatting styles and subjective content organization. Conventional solutions rely on specialized logic tailored for specific use cases. However, we propose a revolutionary approach leveraging structured Graphs, Natural Language Processing (NLP), and Deep Learning. By abstracting intricate logic into Graph structures, we transform raw data into a comprehensive Knowledge Graph. This innovative framework enables precise information extraction and sophisticated querying. We systematically construct dictionaries assigning skill weights, paving the way for nuanced talent analysis. Our system not only benefits job recruiters and curriculum designers but also empowers job seekers with targeted query-based filtering and ranking capabilities.

[AI-11] Smart and Efficient IoT-Based Irrigation System Design: Utilizing a Hybrid Agent-Based and System Dynamics Approach

Link: https://arxiv.org/abs/2502.18298
Authors: Taha Ahmadi Pargo,Mohsen Akbarpour Shirazi,Dawud Fadai
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Applications (stat.AP)
Comments: 50 pages, 22 figures

Abstract:Owing to problems such as reduced precipitation and population growth, water resource scarcity has become one of the most critical problems in modern-day societies. As a consequence, there is a shortage of available water resources for irrigation in arid and semi-arid countries. On the other hand, it is possible to utilize modern technologies to control irrigation and reduce water loss. One of these technologies is the Internet of Things (IoT). Despite the possibility of using the IoT in irrigation control systems, there are complexities in designing such systems. Considering this issue, it is possible to use agent-oriented software engineering (AOSE) methodologies to design complex cyber-physical systems such as IoT-based systems. In this research, a smart irrigation system is designed based on Prometheus AOSE methodology, to reduce water loss by maintaining soil moisture in a suitable interval. The designed system comprises sensors, a central agent, and irrigation nodes. These agents follow defined rules to maintain soil moisture at a desired level cooperatively. For system simulation, a hybrid agent-based and system dynamics model was designed. In this hybrid model, soil moisture dynamics were modeled based on the system dynamics approach. The proposed model was implemented in AnyLogic computer simulation software. Utilizing the simulation model, irrigation rules were examined. The system’s functionality in automatic irrigation mode was tested based on a 256-run, fractional factorial design, and the effects of important factors such as soil properties on total irrigated water and total operation time were analyzed. Based on the tests, the system consistently irrigated nearly optimal water amounts in all tests. Moreover, the results were also used to minimize the system’s energy consumption by reducing the system’s operational time.

[AI-12] Mixing Any Cocktail with Limited Ingredients: On the Structure of Payoff Sets in Multi-Objective MDPs and its Impact on Randomised Strategies

链接: https://arxiv.org/abs/2502.18296
作者: James C. A. Main,Mickael Randour
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO); Probability (math.PR)
*备注: 64 pages

点击查看摘要

Abstract:We consider multi-dimensional payoff functions in Markov decision processes, and ask whether a given expected payoff vector can be achieved or not. In general, pure strategies (i.e., not resorting to randomisation) do not suffice for this problem. We study the structure of the set of expected payoff vectors of all strategies given a multi-dimensional payoff function and its consequences regarding randomisation requirements for strategies. In particular, we prove that for any payoff for which the expectation is well-defined under all strategies, it is sufficient to mix (i.e., randomly selecting a pure strategy at the start of a play and committing to it for the rest of the play) finitely many pure strategies to approximate any expected payoff vector up to any precision. Furthermore, for any payoff for which the expected payoff is finite under all strategies, any expected payoff can be obtained exactly by mixing finitely many strategies.
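
The notion of mixing used in this abstract can be illustrated with a minimal Python sketch. It assumes the expected payoff vector of each pure strategy is already known. The function name `mix_payoffs` and the toy payoff values are hypothetical and are not taken from the paper. Because a mixed strategy picks one pure strategy at the start of a play and commits to it, its expected payoff is the convex combination of the pure strategies' payoff vectors.

```python
from typing import Sequence, Tuple


def mix_payoffs(
    pure_payoffs: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> Tuple[float, ...]:
    """Expected payoff vector of a mixture over finitely many pure strategies.

    pure_payoffs[k] is the (assumed known) expected payoff vector of pure
    strategy k; weights[k] is the probability of selecting it at the start.
    """
    if len(pure_payoffs) != len(weights):
        raise ValueError("need exactly one weight per pure strategy")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must form a probability distribution")
    dim = len(pure_payoffs[0])
    # Linearity of expectation: the mixture's payoff is the weighted sum.
    return tuple(
        sum(w * payoff[i] for w, payoff in zip(weights, pure_payoffs))
        for i in range(dim)
    )


# Toy example: neither pure strategy achieves (0.5, 0.5), but a fair mix does.
target = mix_payoffs([(1.0, 0.0), (0.0, 1.0)], [0.5, 0.5])
```

In this toy case the target vector lies strictly between the two pure payoff vectors, which is the simplest situation in which randomisation is needed.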

[AI-13] DenoMAE2.0: Improving Denoising Masked Autoencoders by Classifying Local Patches

链接: https://arxiv.org/abs/2502.18202
作者: Atik Faysal,Mohammad Rostami,Taha Boushine,Reihaneh Gh. Roshan,Huaxia Wang,Nikhil Muralidhar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce DenoMAE2.0, an enhanced denoising masked autoencoder that integrates a local patch classification objective alongside traditional reconstruction loss to improve representation learning and robustness. Unlike conventional Masked Autoencoders (MAE), which focus solely on reconstructing missing inputs, DenoMAE2.0 introduces position-aware classification of unmasked patches, enabling the model to capture fine-grained local features while maintaining global coherence. This dual-objective approach is particularly beneficial in semi-supervised learning for wireless communication, where high noise levels and data scarcity pose significant challenges. We conduct extensive experiments on modulation signal classification across a wide range of signal-to-noise ratios (SNRs), from extremely low to moderately high conditions and in a low data regime. Our results demonstrate that DenoMAE2.0 surpasses its predecessor, DenoMAE, and other baselines in both denoising quality and downstream classification accuracy. DenoMAE2.0 achieves a 1.1% improvement over DenoMAE on our dataset and significant accuracy gains of 11.83% and 16.55% over DenoMAE on the RadioML benchmark for constellation diagram classification of modulation signals.

[AI-14] ChatMotion: A Multimodal Multi-Agent for Human Motion Analysis

链接: https://arxiv.org/abs/2502.18180
作者: Li Lei,Jia Sen,Wang Jianhao,An Zhaochong,Li Jiaang,Hwang Jenq-Neng,Belongie Serge
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Advancements in Multimodal Large Language Models (MLLMs) have improved human motion understanding. However, these models remain constrained by their “instruct-only” nature, lacking interactivity and adaptability for diverse analytical perspectives. To address these challenges, we introduce ChatMotion, a multimodal multi-agent framework for human motion analysis. ChatMotion dynamically interprets user intent, decomposes complex tasks into meta-tasks, and activates specialized function modules for motion comprehension. It integrates multiple specialized modules, such as the MotionCore, to analyze human motion from various perspectives. Extensive experiments demonstrate ChatMotion’s precision, adaptability, and user engagement for human motion understanding.

[AI-15] iTrash: Incentivized Token Rewards for Automated Sorting and Handling IROS2025

链接: https://arxiv.org/abs/2502.18161
作者: Pablo Ortega,Eduardo Castelló Ferrer
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注: Article submitted to IROS 2025

点击查看摘要

Abstract:As robotic systems (RS) become more autonomous, they are becoming increasingly used in small spaces and offices to automate tasks such as cleaning, infrastructure maintenance, or resource management. In this paper, we propose iTrash, an intelligent trashcan that aims to improve recycling rates in small office spaces. To that end, we ran a 5-day experiment and found that iTrash can produce an efficiency increase of more than 30% compared to traditional trashcans. The findings derived from this work indicate that using iTrash not only increases recycling rates, but also provides valuable data such as user behaviour or bin usage patterns, which cannot be obtained from a normal trashcan. This information can be used to predict and optimize some tasks in these spaces. Finally, we explored the potential of using blockchain technology to create economic incentives for recycling, following a Save-as-you-Throw (SAYT) model.

[AI-16] SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation

链接: https://arxiv.org/abs/2502.18153
作者: Dahun Shin,Dongyeop Lee,Jinseok Chung,Namhoon Lee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Approximate second-order optimization methods often exhibit poorer generalization compared to first-order approaches. In this work, we look into this issue through the lens of the loss landscape and find that existing second-order methods tend to converge to sharper minima compared to SGD. In response, we propose Sassha, a novel second-order method designed to enhance generalization by explicitly reducing sharpness of the solution, while stabilizing the computation of approximate Hessians along the optimization trajectory. In fact, this sharpness minimization scheme is crafted also to accommodate lazy Hessian updates, so as to secure efficiency besides flatness. To validate its effectiveness, we conduct a wide range of standard deep learning experiments where Sassha demonstrates its outstanding generalization performance that is comparable to, and mostly better than, other methods. We provide a comprehensive set of analyses including convergence, robustness, stability, efficiency, and cost.

[AI-17] A Real-time Spatio-Temporal Trajectory Planner for Autonomous Vehicles with Semantic Graph Optimization

链接: https://arxiv.org/abs/2502.18151
作者: Shan He,Yalong Ma,Tao Song,Yongzhi Jiang,Xinkai Wu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: This work has been accepted for publication in IEEE Robotics and Automation Letters (RA-L). The final published version is available in IEEE Xplore (DOI: https://doi.org/10.1109/LRA.2024.3504239 )

点击查看摘要

Abstract:Planning a safe and feasible trajectory for autonomous vehicles in real-time by fully utilizing perceptual information in complex urban environments is challenging. In this paper, we propose a spatio-temporal trajectory planning method based on graph optimization. It efficiently extracts the multi-modal information of the perception module by constructing a semantic spatio-temporal map through separation processing of static and dynamic obstacles, and then quickly generates feasible trajectories via sparse graph optimization based on a semantic spatio-temporal hypergraph. Extensive experiments have proven that the proposed method can effectively handle complex urban public road scenarios and perform in real time. We will also release our code to accommodate benchmarking for the research community.

[AI-18] Large Language Model Driven Agents for Simulating Echo Chamber Formation

链接: https://arxiv.org/abs/2502.18138
作者: Chenhao Gu,Ling Luo,Zainab Razia Zaidi,Shanika Karunasekera
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rise of echo chambers on social media platforms has heightened concerns about polarization and the reinforcement of existing beliefs. Traditional approaches for simulating echo chamber formation have often relied on predefined rules and numerical simulations, which, while insightful, may lack the nuance needed to capture complex, real-world interactions. In this paper, we present a novel framework that leverages large language models (LLMs) as generative agents to simulate echo chamber dynamics within social networks. The novelty of our approach is that it incorporates both opinion updates and network rewiring behaviors driven by LLMs, allowing for a context-aware and semantically rich simulation of social interactions. Additionally, we utilize real-world Twitter (now X) data to benchmark the LLM-based simulation against actual social media behaviors, providing insights into the accuracy and realism of the generated opinion trends. Our results demonstrate the efficacy of LLMs in modeling echo chamber formation, capturing both structural and semantic dimensions of opinion clustering.

[AI-19] EU-Nets: Enhanced Explainable and Parsimonious U-Nets

链接: https://arxiv.org/abs/2502.18122
作者: B. Sun,P. Liò
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this study, we propose MHEX+, a framework adaptable to any U-Net architecture. Built upon MHEX+, we introduce novel U-Net variants, EU-Nets, which enhance explainability and uncertainty estimation, addressing the limitations of traditional U-Net models while improving performance and stability. A key innovation is the Equivalent Convolutional Kernel, which unifies consecutive convolutional layers, boosting interpretability. For uncertainty estimation, we propose the collaboration gradient approach, measuring gradient consistency across decoder layers. Notably, EU-Nets achieve an average accuracy improvement of 1.389% and a variance reduction of 0.83% across all networks and datasets in our experiments, requiring fewer than 0.1M parameters.

[AI-20] The Built-In Robustness of Decentralized Federated Averaging to Bad Data

链接: https://arxiv.org/abs/2502.18097
作者: Samuele Sabella,Chiara Boldrini,Lorenzo Valerio,Andrea Passarella,Marco Conti
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Funding: SoBigData PPP (101079043), this http URL (PNRR IR0000013), FAIR (PNRR PE00000013), RESTART (PNRR PE00000001)

点击查看摘要

Abstract:Decentralized federated learning (DFL) enables devices to collaboratively train models over complex network topologies without relying on a central controller. In this setting, local data remains private, but its quality and quantity can vary significantly across nodes. The extent to which a fully decentralized system is vulnerable to poor-quality or corrupted data remains unclear, but several factors could contribute to potential risks. Without a central authority, there can be no unified mechanism to detect or correct errors, and each node operates with a localized view of the data distribution, making it difficult for the node to assess whether its perspective aligns with the true distribution. Moreover, models trained on low-quality data can propagate through the network, amplifying errors. To explore the impact of low-quality data on DFL, we simulate two scenarios with degraded data quality – one where the corrupted data is evenly distributed in a subset of nodes and one where it is concentrated on a single node – using a decentralized implementation of FedAvg. Our results reveal that averaging-based decentralized learning is remarkably robust to localized bad data, even when the corrupted data resides in the most influential nodes of the network. Counterintuitively, this robustness is further enhanced when the corrupted data is concentrated on a single node, regardless of its centrality in the communication network topology. This phenomenon is explained by the averaging process, which ensures that no single node – however central – can disproportionately influence the overall learning process.
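
The averaging argument at the end of this abstract can be illustrated with a minimal Python sketch. It is a deliberate simplification and not the paper's experimental setup: each node holds a single scalar "parameter", there is no local training, and every node simply averages with its neighbours each round. The function name `gossip_average` and the five-node ring topology are hypothetical choices made for illustration.

```python
from typing import Dict, List


def gossip_average(
    values: List[float],
    neighbors: Dict[int, List[int]],
    rounds: int,
) -> List[float]:
    """Repeatedly replace each node's value by the mean over itself and its neighbours."""
    current = list(values)
    for _ in range(rounds):
        updated = []
        for node in range(len(current)):
            group = [node] + neighbors[node]
            updated.append(sum(current[j] for j in group) / len(group))
        current = updated
    return current


# Assumed toy topology: a ring of five nodes, each linked to two neighbours.
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}

# Node 4 starts with a "corrupted" value; all other nodes start at 0.
start = [0.0, 0.0, 0.0, 0.0, 100.0]
after_one = gossip_average(start, ring, rounds=1)
after_many = gossip_average(start, ring, rounds=50)
```

In this toy setting the outlier held by node 4 is cut to a third of its size after one round, and all nodes settle near the network-wide mean of 20, so the corrupted node contributes only its 1/5 share to the consensus.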

[AI-21] MRBTP: Efficient Multi-Robot Behavior Tree Planning and Collaboration

链接: https://arxiv.org/abs/2502.18072
作者: Yishuai Cai,Xinglin Chen,Zhongxuan Cai,Yunxin Mao,Minglong Li,Wenjing Yang,Ji Wang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Multi-robot task planning and collaboration are critical challenges in robotics. While Behavior Trees (BTs) have been established as a popular control architecture and are plannable for a single robot, the development of effective multi-robot BT planning algorithms remains challenging due to the complexity of coordinating diverse action spaces. We propose the Multi-Robot Behavior Tree Planning (MRBTP) algorithm, with theoretical guarantees of both soundness and completeness. MRBTP features cross-tree expansion to coordinate heterogeneous actions across different BTs to achieve the team’s goal. For homogeneous actions, we retain backup structures among BTs to ensure robustness and prevent redundant execution through intention sharing. While MRBTP is capable of generating BTs for both homogeneous and heterogeneous robot teams, its efficiency can be further improved. We then propose an optional plugin for MRBTP when Large Language Models (LLMs) are available to reason goal-related actions for each robot. These relevant actions can be pre-planned to form long-horizon subtrees, significantly enhancing the planning speed and collaboration efficiency of MRBTP. We evaluate our algorithm in warehouse management and everyday service scenarios. Results demonstrate MRBTP’s robustness and execution efficiency under varying settings, as well as the ability of the pre-trained LLM to generate effective task-specific subtrees for MRBTP.

[AI-22] HEROS-GAN: Honed-Energy Regularized and Optimal Supervised GAN for Enhancing Accuracy and Range of Low-Cost Accelerometers AAAI

链接: https://arxiv.org/abs/2502.18064
作者: Yifeng Wang,Yi Zhao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Signal Processing (eess.SP); Probability (math.PR)
*备注: AAAI Oral; AI for Sensors; Generative Deep Learning

点击查看摘要

Abstract:Low-cost accelerometers play a crucial role in modern society due to their advantages of small size, ease of integration, wearability, and mass production, making them widely applicable in automotive systems, aerospace, and wearable technology. However, this widely used sensor suffers from severe accuracy and range limitations. To this end, we propose a honed-energy regularized and optimal supervised GAN (HEROS-GAN), which transforms low-cost sensor signals into high-cost equivalents, thereby overcoming the precision and range limitations of low-cost accelerometers. Due to the lack of frame-level paired low-cost and high-cost signals for training, we propose an Optimal Transport Supervision (OTS), which leverages optimal transport theory to explore potential consistency between unpaired data, thereby maximizing supervisory information. Moreover, we propose a Modulated Laplace Energy (MLE), which injects appropriate energy into the generator to encourage it to break range limitations, enhance local changes, and enrich signal details. Given the absence of a dedicated dataset, we specifically establish a Low-cost Accelerometer Signal Enhancement Dataset (LASED) containing tens of thousands of samples, which is the first dataset serving to improve the accuracy and range of accelerometers and is released on GitHub. Experimental results demonstrate that a GAN combined with either OTS or MLE alone can surpass the previous signal enhancement SOTA methods by an order of magnitude. Integrating both OTS and MLE, HEROS-GAN achieves remarkable results, doubling the accelerometer range while reducing signal noise by two orders of magnitude, establishing a benchmark in accelerometer signal processing.

[AI-23] AutoCas: Autoregressive Cascade Predictor in Social Networks via Large Language Models

链接: https://arxiv.org/abs/2502.18040
作者: Yuhao Zheng,Chenghua Gong,Rui Sun,Juyuan Zhang,Liming Pan,Linyuan Lv
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注: 12 pages

点击查看摘要

Abstract:Popularity prediction in information cascades plays a crucial role in social computing, with broad applications in viral marketing, misinformation control, and content recommendation. However, information propagation mechanisms, user behavior, and temporal activity patterns exhibit significant diversity, necessitating a foundational model capable of adapting to such variations. At the same time, the amount of available cascade data remains relatively limited compared to the vast datasets used for training large language models (LLMs). Recent studies have demonstrated the feasibility of leveraging LLMs for time-series prediction by exploiting commonalities across different time-series domains. Building on this insight, we introduce the Autoregressive Information Cascade Predictor (AutoCas), an LLM-enhanced model designed specifically for cascade popularity prediction. Unlike natural language sequences, cascade data is characterized by complex local topologies, diffusion contexts, and evolving dynamics, requiring specialized adaptations for effective LLM integration. To address these challenges, we first tokenize cascade data to align it with sequence modeling principles. Next, we reformulate cascade diffusion as an autoregressive modeling task to fully harness the architectural strengths of LLMs. Beyond conventional approaches, we further introduce prompt learning to enhance the synergy between LLMs and cascade prediction. Extensive experiments demonstrate that AutoCas significantly outperforms baseline models in cascade popularity prediction while exhibiting scaling behavior inherited from LLMs. Code is available at this repository: this https URL

[AI-24] ExPath: Towards Explaining Targeted Pathways for Biological Knowledge Bases

链接: https://arxiv.org/abs/2502.18026
作者: Rikuto Kotoge,Ziwei Yang,Zheng Chen,Yushun Dong,Yasuko Matsubara,Jimeng Sun,Yasushi Sakurai
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:Biological knowledge bases provide systemically functional pathways of cells or organisms in terms of molecular interaction. However, recognizing more targeted pathways, particularly when incorporating wet-lab experimental data, remains challenging and typically requires downstream biological analyses and expertise. In this paper, we frame this challenge as a solvable graph learning and explaining task and propose a novel pathway inference framework, ExPath, that explicitly integrates experimental data, specifically amino acid sequences (AA-seqs), to classify various graphs (bio-networks) in biological databases. The links (representing pathways) that contribute more to classification can be considered as targeted pathways. Technically, ExPath comprises three components: (1) a large protein language model (pLM) that encodes and embeds AA-seqs into graphs, overcoming traditional obstacles in processing AA-seq data, such as BLAST; (2) PathMamba, a hybrid architecture combining graph neural networks (GNNs) with state-space sequence modeling (Mamba) to capture both local interactions and global pathway-level dependencies; and (3) PathExplainer, a subgraph learning module that identifies functionally critical nodes and edges through trainable pathway masks. We also propose ML-oriented biological evaluations and a new metric. Experiments involving evaluations on 301 bio-networks demonstrate that pathways inferred by ExPath maintain biological meaningfulness. We will publicly release the curated data for the 301 bio-networks soon.

[AI-25] NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms

链接: https://arxiv.org/abs/2502.18008
作者: Yashan Wang,Shangda Wu,Jianhuai Hu,Xingjian Du,Yueqi Peng,Yongxin Huang,Shuai Fan,Xiaobing Li,Feng Yu,Maosong Sun
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:We introduce NotaGen, a symbolic music generation model aiming to explore the potential of producing high-quality classical sheet music. Inspired by the success of Large Language Models (LLMs), NotaGen adopts pre-training, fine-tuning, and reinforcement learning paradigms (henceforth referred to as the LLM training paradigms). It is pre-trained on 1.6M pieces of music, and then fine-tuned on approximately 9K high-quality classical compositions conditioned on “period-composer-instrumentation” prompts. For reinforcement learning, we propose the CLaMP-DPO method, which further enhances generation quality and controllability without requiring human annotations or predefined rewards. Our experiments demonstrate the efficacy of CLaMP-DPO in symbolic music generation models with different architectures and encoding schemes. Furthermore, subjective A/B tests show that NotaGen outperforms baseline models against human compositions, greatly advancing musical aesthetics in symbolic music generation. The project homepage is this https URL.

[AI-26] Radon-Nikodým Derivative: Re-imagining Anomaly Detection from a Measure Theoretic Perspective

链接: https://arxiv.org/abs/2502.18002
作者: Shlok Mehendale,Aditya Challa,Rahul Yedida,Sravan Danda,Santonu Sarkar,Snehanshu Saha
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Which principle underpins the design of an effective anomaly detection loss function? The answer lies in the Radon-Nikodým theorem, a fundamental concept in measure theory. The key insight is that multiplying the vanilla loss function by the Radon-Nikodým derivative improves performance across the board. We refer to this as RN-Loss. This is established using PAC learnability of anomaly detection. We further show that the Radon-Nikodým derivative offers important insights into unsupervised clustering-based anomaly detection as well. We evaluate our algorithm on 96 datasets, including univariate and multivariate data from diverse domains, including healthcare, cybersecurity, and finance. We show that RN-Derivative algorithms outperform state-of-the-art methods on 68% of multivariate datasets (based on F1 scores) and also achieve peak F1 scores on 72% of time series (univariate) datasets.

[AI-27] GNN-XAR: A Graph Neural Network for Explainable Activity Recognition in Smart Homes

链接: https://arxiv.org/abs/2502.17999
作者: Michele Fiori,Davide Mor,Gabriele Civitarese,Claudio Bettini
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This is a preprint. Paper accepted for publication at the 21st EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services (Mobiquitous)

点击查看摘要

Abstract:Sensor-based Human Activity Recognition (HAR) in smart home environments is crucial for several applications, especially in the healthcare domain. The majority of the existing approaches leverage deep learning models. While these approaches are effective, the rationale behind their outputs is opaque. Recently, eXplainable Artificial Intelligence (XAI) approaches emerged to provide intuitive explanations to the output of HAR models. To the best of our knowledge, these approaches leverage classic deep models like CNNs or RNNs. Recently, Graph Neural Networks (GNNs) proved to be effective for sensor-based HAR. However, existing approaches are not designed with explainability in mind. In this work, we propose the first explainable Graph Neural Network explicitly designed for smart home HAR. Our results on two public datasets show that this approach provides better explanations than state-of-the-art methods while also slightly improving the recognition rate.

[AI-28] Broadening Discovery through Structural Models: Multimodal Combination of Local and Structural Properties for Predicting Chemical Features

链接: https://arxiv.org/abs/2502.17986
作者: Nikolai Rekut,Alexey Orlov,Klea Ziu,Elizaveta Starykh,Martin Takac,Aleksandr Beznosikov
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, machine learning has profoundly reshaped the field of chemistry, facilitating significant advancements across various applications, including the prediction of molecular properties and the generation of molecular structures. Language models and graph-based models are extensively utilized within this domain, consistently achieving state-of-the-art results across an array of tasks. However, the prevailing practice of representing chemical compounds in the SMILES format – used by most datasets and many language models – presents notable limitations as a training data format. In contrast, chemical fingerprints offer a more physically informed representation of compounds, thereby enhancing their suitability for model training. This study aims to develop a language model that is specifically trained on fingerprints. Furthermore, we introduce a bimodal architecture that integrates this language model with a graph model. Our proposed methodology synthesizes these approaches, utilizing RoBERTa as the language model and employing Graph Isomorphism Networks (GIN), Graph Convolutional Networks (GCN) and Graphormer as graph models. This integration results in a significant improvement in predictive performance compared to conventional strategies for tasks such as Quantitative Structure-Activity Relationship (QSAR) and the prediction of nuclear magnetic resonance (NMR) spectra, among others.

[AI-29] Integrating Boosted learning with Differential Evolution (DE) Optimizer: A Prediction of Groundwater Quality Risk Assessment in Odisha

链接: https://arxiv.org/abs/2502.17929
作者: Sonalika Subudhi,Alok Kumar Pati,Sephali Bose,Subhasmita Sahoo,Avipsa Pattanaik,Biswa Mohan Acharya
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 Figures (8 figs in paper and one additional graphical abstract), 9 Tables

点击查看摘要

Abstract:Groundwater is being degraded by human activities, such as rapid industrialization, urbanization, over-extraction, and contamination from agricultural and urban sources. Among the different contaminants, heavy metals like cadmium (Cd), chromium (Cr), arsenic (As), and lead (Pb) pose serious dangers when present in high concentrations in groundwater. Long-term consumption of these toxic elements may lead to neurological disorders, kidney failure, and various types of cancer. To address these issues, this study developed a machine learning-based predictive model to evaluate the Groundwater Quality Index (GWQI) and identify the main contaminants affecting water quality. This was achieved with a hybrid machine learning model, LCBoost Fusion. The model underwent several stages, including data preprocessing, hyperparameter tuning using Differential Evolution (DE) optimization, and evaluation through cross-validation. The LCBoost Fusion model outperforms the individual models (CatBoost and LightGBM), achieving low RMSE (0.6829), MSE (0.5102), and MAE (0.3147) and a high R^2 score of 0.9809. Feature importance analysis highlights Potassium (K), Fluoride (F) and Total Hardness (TH) as the most influential indicators of groundwater contamination. This research successfully demonstrates the application of machine learning in assessing groundwater quality risks in Odisha. The proposed LCBoost Fusion model offers a reliable and efficient approach for real-time groundwater monitoring and risk mitigation. These findings will help environmental organizations and policy makers to identify target locations for sustainable groundwater management. Future work will focus on using remote sensing data and developing an interactive decision-making system for groundwater quality assessment.
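
For readers unfamiliar with the four regression metrics quoted in this abstract, the following minimal Python sketch shows their standard definitions. The function name `regression_metrics` and the toy numbers are hypothetical; nothing here reproduces the paper's data or model.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Standard MSE, RMSE, MAE, and R^2 computed on one set of predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),  # RMSE is the square root of MSE on the same data
        "mae": mae,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy values only: one prediction out of three is off by 1.
metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

Lower MSE, RMSE, and MAE indicate smaller errors, while an R^2 closer to 1 means the model explains more of the variance in the target.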

[AI-30] Structure-prior Informed Diffusion Model for Graph Source Localization with Limited Data

链接: https://arxiv.org/abs/2502.17928
作者: Hongyi Chen,Jingtao Ding,Xiaojun Liang,Yong Li,Xiao-Ping Zhang
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The source localization problem in graph information propagation is crucial for managing various network disruptions, from misinformation spread to infrastructure failures. While recent deep generative approaches have shown promise in this domain, their effectiveness is limited by the scarcity of real-world propagation data. This paper introduces SIDSL (Structure-prior Informed Diffusion model for Source Localization), a novel framework that addresses three key challenges in limited-data scenarios: unknown propagation patterns, complex topology-propagation relationships, and class imbalance between source and non-source nodes. SIDSL incorporates topology-aware priors through graph label propagation and employs a propagation-enhanced conditional denoiser with a GNN-parameterized label propagation module (GNN-LP). Additionally, we propose a structure-prior biased denoising scheme that initializes from structure-based source estimations rather than random noise, effectively countering class imbalance issues. Experimental results across four real-world datasets demonstrate SIDSL’s superior performance, achieving 7.5-13.3% improvements in F1 scores compared to state-of-the-art methods. Notably, when pretrained with simulation data of synthetic patterns, SIDSL maintains robust performance with only 10% of training data, surpassing baselines by more than 18.8%. These results highlight SIDSL’s effectiveness in real-world applications where labeled data is scarce.

[AI-31] LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction

链接: https://arxiv.org/abs/2502.17925
作者: Suozhi Huang,Peiyang Song,Robert Joseph George,Anima Anandkumar
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Mathematical reasoning remains a significant challenge for Large Language Models (LLMs) due to hallucinations. When combined with formal proof assistants like Lean, these hallucinations can be eliminated through rigorous verification, making theorem proving reliable. However, even with formal verification, LLMs still struggle with long proofs and complex mathematical formalizations. While Lean with LLMs offers valuable assistance with retrieving lemmas, generating tactics, or even complete proofs, it lacks a crucial capability: providing a sense of proof progress. This limitation particularly impacts the overall development efficiency in large formalization projects. We introduce LeanProgress, a method that predicts the progress in a proof, that is, how many steps remain to complete it. We train and evaluate our models on a large corpus of Lean proofs from Lean Workbook Plus and Mathlib4, and employ data preprocessing and balancing techniques to handle the skewed distribution of proof lengths. Our experiments show that LeanProgress achieves an overall prediction accuracy of 75.1% in predicting the amount of progress and, hence, the remaining number of steps. When integrated into a best-first search framework using Reprover, our method shows a 3.8% improvement on Mathlib4 compared to baseline performances of 41.2%, particularly for longer proofs. These results demonstrate how proof progress prediction can enhance both automated and interactive theorem proving, enabling users to make more informed decisions about proof strategies.
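
The idea of guiding best-first search with a prediction of remaining steps can be sketched in Python as follows. This is a generic illustration under stated assumptions, not the LeanProgress or Reprover implementation: the abstract does not say how the predicted progress is combined with other scores, so the sketch simply uses the predicted number of remaining steps as the priority. The names `best_first_search`, `expand`, and `predict_remaining`, as well as the integer toy problem, are hypothetical.

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Optional


def best_first_search(
    start: Hashable,
    expand: Callable[[Hashable], Iterable[Hashable]],
    is_goal: Callable[[Hashable], bool],
    predict_remaining: Callable[[Hashable], float],
    max_nodes: int = 10_000,
) -> Optional[List[Hashable]]:
    """Always expand the state whose predicted number of remaining steps is smallest."""
    counter = 0  # tie-breaker so the heap never has to compare states directly
    frontier = [(predict_remaining(start), counter, start, [start])]
    seen = {start}
    while frontier and counter < max_nodes:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        for nxt in expand(state):
            if nxt in seen:
                continue
            seen.add(nxt)
            counter += 1
            heapq.heappush(frontier, (predict_remaining(nxt), counter, nxt, path + [nxt]))
    return None


# Toy stand-in for a proof search: reach 0 from 10 by decrementing or halving.
# The "predictor" is the bit length, a hand-written estimate of remaining steps.
path = best_first_search(
    start=10,
    expand=lambda n: [n - 1, n // 2] if n > 0 else [],
    is_goal=lambda n: n == 0,
    predict_remaining=lambda n: n.bit_length(),
)
```

In a real prover, the states would be proof states, `expand` would apply candidate tactics, and `predict_remaining` would be a learned model; those components are outside the scope of this sketch.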

[AI-32] Unmasking Gender Bias in Recommendation Systems and Enhancing Category-Aware Fairness

Link: https://arxiv.org/abs/2502.17921
Authors: Tahsin Alamgir Kheya, Mohamed Reda Bouadjenek, Sunil Aryal
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments:

Abstract:Recommendation systems are now an integral part of our daily lives. We rely on them for tasks such as discovering new movies, finding friends on social media, and connecting job seekers with relevant opportunities. Given their vital role, we must ensure these recommendations are free from societal stereotypes. Therefore, evaluating and addressing such biases in recommendation systems is crucial. Previous work evaluating the fairness of recommended items fails to capture certain nuances as they mainly focus on comparing performance metrics for different sensitive groups. In this paper, we introduce a set of comprehensive metrics for quantifying gender bias in recommendations. Specifically, we show the importance of evaluating fairness on a more granular level, which can be achieved using our metrics to capture gender bias using categories of recommended items like genres for movies. Furthermore, we show that employing a category-aware fairness metric as a regularization term along with the main recommendation loss during training can help effectively minimize bias in the models’ output. We experiment on three real-world datasets, using five baseline models alongside two popular fairness-aware models, to show the effectiveness of our metrics in evaluating gender bias. Our metrics help provide an enhanced insight into bias in recommended items compared to previous metrics. Additionally, our results demonstrate how incorporating our regularization term significantly improves the fairness in recommendations for different categories without substantial degradation in overall recommendation performance.

[AI-33] Decoupled Graph Energy-based Model for Node Out-of-Distribution Detection on Heterophilic Graphs ICLR2025

Link: https://arxiv.org/abs/2502.17912
Authors: Yuhan Chen, Yihong Luo, Yifan Song, Pengwen Dai, Jing Tang, Xiaochun Cao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The first two authors contributed equally to this work; ICLR 2025

Abstract:Despite extensive research efforts focused on OOD detection on images, OOD detection on nodes in graph learning remains underexplored. The dependence among graph nodes hinders the trivial adaptation of existing approaches on images that assume inputs to be i.i.d. sampled, since many unique features and challenges specific to graphs are not considered, such as the heterophily issue. Recently, GNNSafe, which considers node dependence, adapted energy-based detection to the graph domain with state-of-the-art performance. However, it has two serious issues: 1) it derives node energy from classification logits without specifically tailored training for modeling data distribution, making it less effective at recognizing OOD data; 2) it relies heavily on energy propagation, which is based on the homophily assumption and causes significant performance degradation on heterophilic graphs, where a node tends to have a dissimilar distribution from its neighbors. To address the above issues, we suggest training EBMs by MLE to enhance data distribution modeling and removing energy propagation to overcome the heterophily issues. However, training EBMs via MLE requires performing MCMC sampling on both node features and node neighbors, which is challenging due to the node interdependence and discrete graph topology. To tackle the sampling challenge, we introduce DeGEM, which decomposes the learning process into two parts: a graph encoder that leverages topology information for node representations and an energy head that operates in latent space. Extensive experiments validate that DeGEM, without OOD exposure during training, surpasses previous state-of-the-art methods, achieving an average AUROC improvement of 6.71% on homophilic graphs and 20.29% on heterophilic graphs, and even outperforms methods trained with OOD exposure. Our code is available at: this https URL.
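
For readers unfamiliar with energy scores derived from classification logits (the approach the abstract attributes to prior work, not DeGEM's learned energy head), a minimal sketch follows. It uses the commonly cited form E = -T * logsumexp(logits / T); the function name and the example logits are illustrative assumptions.

```python
import math
from typing import Sequence


def energy_from_logits(logits: Sequence[float], temperature: float = 1.0) -> float:
    """Energy E = -T * logsumexp(logits / T).

    Lower energy is usually read as "more in-distribution".
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


# Peaked logits (a confident prediction) versus flat, small logits.
confident = energy_from_logits([8.0, 0.5, -1.0])
uncertain = energy_from_logits([0.3, 0.2, 0.1])
```

A detector built on this score would flag nodes whose energy exceeds some threshold; choosing that threshold is outside the scope of this sketch.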

[AI-34] Enhancing Speech Quality through the Integration of BGRU and Transformer Architectures

Link: https://arxiv.org/abs/2502.17911
Authors: Souliman Alghnam, Mohammad Alhussien, Khaled Shaheen
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Speech enhancement plays an essential role in improving the quality of speech signals in noisy environments. This paper investigates the efficacy of integrating Bidirectional Gated Recurrent Units (BGRU) and Transformer models for speech enhancement tasks. Through a comprehensive experimental evaluation, our study demonstrates the superiority of this hybrid architecture over traditional methods and standalone models. The combined BGRU-Transformer framework excels in capturing temporal dependencies and learning complex signal patterns, leading to enhanced noise reduction and improved speech quality. Results show significant performance gains compared to existing approaches, highlighting the potential of this integrated model in real-world applications. The seamless integration of BGRU and Transformer architectures not only enhances system robustness but also opens the road for advanced speech processing techniques. This research contributes to the ongoing efforts in speech enhancement technology and sets a solid foundation for future investigations into optimizing model architectures, exploring many application scenarios, and advancing the field of speech processing in noisy environments.

[AI-35] FactFlow: Automatic Fact Sheet Generation and Customization from Tabular Dataset via AI Chain Design Implementation

Link: https://arxiv.org/abs/2502.17909
Authors: Minh Duc Vu, Jieshan Chen, Zhenchang Xing, Qinghua Lu, Xiwei Xu, Qian Fu
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures

Abstract:With the proliferation of data across various domains, there is a critical demand for tools that enable non-experts to derive meaningful insights without deep data analysis skills. To address this need, existing automatic fact sheet generation tools offer heuristic-based solutions to extract facts and generate stories. However, they inadequately grasp the semantics of data and struggle to generate narratives that fully capture the semantics of the dataset or align the fact sheet with specific user needs. Addressing these shortcomings, this paper introduces FactFlow, a novel tool designed for the automatic generation and customisation of fact sheets. FactFlow applies the concept of collaborative AI workers to transform raw tabular datasets into comprehensive, visually compelling fact sheets. We define an effective taxonomy to profile AI workers for specialised tasks. Furthermore, FactFlow empowers users to refine these fact sheets through intuitive natural language commands, ensuring the final outputs align closely with individual preferences and requirements. Our user evaluation with 18 participants confirms that FactFlow not only surpasses state-of-the-art baselines in automated fact sheet production but also provides a positive user experience during customization tasks.

[AI-36] Towards Sustainable Web Agents: A Plea for Transparency and Dedicated Metrics for Energy Consumption

Link: https://arxiv.org/abs/2502.17903
Authors: Lars Krupp, Daniel Geißler, Paul Lukowicz, Jakob Karolus
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Improvements in the area of large language models have shifted towards the construction of models capable of using external tools and interpreting their outputs. These so-called web agents have the ability to interact autonomously with the internet. This allows them to become powerful daily assistants handling time-consuming, repetitive tasks while supporting users in their daily activities. While web agent research is thriving, the sustainability aspect of this research direction remains largely unexplored. We provide an initial exploration of the energy and CO2 cost associated with web agents. Our results show how different philosophies in web agent creation can severely impact the associated expended energy. We highlight lacking transparency regarding the disclosure of model parameters and processes used for some web agents as a limiting factor when estimating energy consumption. As such, our work advocates a change in thinking when evaluating web agents, warranting dedicated metrics for energy consumption and sustainability.

[AI-37] Knowledge-enhanced Multimodal ECG Representation Learning with Arbitrary-Lead Inputs

Link: https://arxiv.org/abs/2502.17900
Authors: Che Liu, Cheng Ouyang, Zhongwei Wan, Haozhe Wang, Wenjia Bai, Rossella Arcucci
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in multimodal ECG representation learning center on aligning ECG signals with paired free-text reports. However, suboptimal alignment persists due to the complexity of medical language and the reliance on a full 12-lead setup, which is often unavailable in under-resourced settings. To tackle these issues, we propose K-MERL, a knowledge-enhanced multimodal ECG representation learning framework. K-MERL leverages large language models to extract structured knowledge from free-text reports and employs a lead-aware ECG encoder with dynamic lead masking to accommodate arbitrary lead inputs. Evaluations on six external ECG datasets show that K-MERL achieves state-of-the-art performance in zero-shot classification and linear probing tasks, while delivering an average 16% AUC improvement over existing methods in partial-lead zero-shot classification.

[AI-38] VeriPlan: Integrating Formal Verification and LLMs into End-User Planning

Link: https://arxiv.org/abs/2502.17898
Authors: Christine Lee, David Porfirio, Xinyu Jessica Wang, Kevin Zhao, Bilge Mutlu
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: In CHI Conference on Human Factors in Computing Systems (CHI '25), April 26-May 1, 2025, Yokohama, Japan. ACM, New York, NY, USA, 19 pages

Abstract:Automated planning is traditionally the domain of experts, utilized in fields like manufacturing and healthcare with the aid of expert planning tools. Recent advancements in LLMs have made planning more accessible to everyday users due to their potential to assist users with complex planning tasks. However, LLMs face several application challenges within end-user planning, including consistency, accuracy, and user trust issues. This paper introduces VeriPlan, a system that applies formal verification techniques, specifically model checking, to enhance the reliability and flexibility of LLMs for end-user planning. In addition to the LLM planner, VeriPlan includes three additional core features – a rule translator, flexibility sliders, and a model checker – that engage users in the verification process. Through a user study (n=12), we evaluate VeriPlan, demonstrating improvements in the perceived quality, usability, and user satisfaction of LLMs. Our work shows the effective integration of formal verification and user-control features with LLMs for end-user planning tasks.

[AI-39] Sample-efficient diffusion-based control of complex nonlinear systems

Link: https://arxiv.org/abs/2502.17893
Authors: Hongyi Chen, Jingtao Ding, Jianhai Shu, Xinchun Yu, Xiaojun Liang, Yong Li, Xiao-Ping Zhang
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Complex nonlinear system control faces challenges in achieving sample-efficient, reliable performance. While diffusion-based methods have demonstrated advantages over classical and reinforcement learning approaches in long-term control performance, they are limited by sample efficiency. This paper presents SEDC (Sample-Efficient Diffusion-based Control), a novel diffusion-based control framework addressing three core challenges: high-dimensional state-action spaces, nonlinear system dynamics, and the gap between non-optimal training data and near-optimal control solutions. Through three innovations - Decoupled State Diffusion, Dual-Mode Decomposition, and Guided Self-finetuning - SEDC achieves 39.5%-49.4% better control accuracy than baselines while using only 10% of the training samples, as validated across three complex nonlinear dynamic systems. Our approach represents a significant advancement in sample-efficient control of complex nonlinear systems. The implementation of the code can be found at this https URL.

[AI-40] Arrhythmia Classification from 12-Lead ECG Signals Using Convolutional and Transformer-Based Deep Learning Models

Link: https://arxiv.org/abs/2502.17887
Authors: Andrei Apostol, Maria Nutu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 34 pages, 17 figures

Abstract:In Romania, cardiovascular problems are the leading cause of death, accounting for nearly one-third of annual fatalities. The severity of this situation calls for innovative diagnostic methods for cardiovascular diseases. This article aims to explore efficient, lightweight and rapid methods for arrhythmia diagnosis in resource-constrained healthcare settings. Due to the lack of Romanian public medical data, we trained our systems using international public datasets, bearing in mind that ECG signals are the same regardless of the patients’ nationality. For this purpose, we combined multiple datasets commonly used in the field of arrhythmia classification: the PTB-XL electrocardiography dataset, PTB Diagnostic ECG Database, China 12-Lead ECG Challenge Database, Georgia 12-Lead ECG Challenge Database, and St. Petersburg INCART 12-lead Arrhythmia Database. For the input data, we employed ECG signal processing methods, specifically a variant of the Pan-Tompkins algorithm, useful in arrhythmia classification because it provides a robust and efficient method for detecting QRS complexes in ECG signals. Additionally, we used machine learning techniques widely used for classification, including convolutional neural networks (1D CNNs, 2D CNNs, ResNet) and Vision Transformers (ViTs). The systems were evaluated in terms of accuracy and F1 score. We analysed our dataset from two perspectives. First, we fed the systems with the ECG signals, and the GRU-based 1D CNN model achieved the highest accuracy of 93.4% among all the tested architectures. Second, we transformed ECG signals into images, and the CNN2D model achieved an accuracy of 92.16%.
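
The abstract does not say which Pan-Tompkins variant was used, so the following is only a simplified sketch of the classic pipeline's core stages (derivative, squaring, moving-window integration, thresholding). It omits the band-pass filter and replaces the adaptive thresholds with a fixed one; the function name, default parameters, and synthetic signal are assumptions made for illustration.

```python
from typing import List, Sequence


def detect_qrs(signal: Sequence[float], fs: int, window_s: float = 0.15,
               threshold_ratio: float = 0.5, refractory_s: float = 0.2) -> List[int]:
    """Return sample indices of QRS-like peaks in the integrated signal.

    Simplified: no band-pass stage and a fixed (not adaptive) threshold.
    Indices lag the true R-peaks by up to one integration window.
    """
    # 1) Derivative emphasizes the steep slopes of the QRS complex.
    deriv = [signal[i] - signal[i - 1] for i in range(1, len(signal))]
    # 2) Squaring makes every sample positive and amplifies large slopes.
    squared = [d * d for d in deriv]
    # 3) Moving-window integration merges the slopes into one bump per beat.
    win = max(1, round(window_s * fs))
    integrated: List[float] = []
    running = 0.0
    for i, v in enumerate(squared):
        running += v
        if i >= win:
            running -= squared[i - win]
        integrated.append(running / win)
    if not integrated:
        return []
    # 4) Fixed threshold plus a refractory period to avoid double counting.
    thr = threshold_ratio * max(integrated)
    refractory = max(1, round(refractory_s * fs))
    peaks: List[int] = []
    i = 1
    while i < len(integrated) - 1:
        is_peak = (integrated[i] >= thr
                   and integrated[i] >= integrated[i - 1]
                   and integrated[i] > integrated[i + 1])
        if is_peak:
            peaks.append(i)
            i += refractory
        else:
            i += 1
    return peaks


# Synthetic "ECG": a flat line with three triangular spikes one second apart.
fs = 100
ecg = [0.0] * 300
for r in (50, 150, 250):
    ecg[r - 1], ecg[r], ecg[r + 1] = 0.5, 1.0, 0.5
peaks = detect_qrs(ecg, fs)
```

On the synthetic signal the detector reports one peak per spike, each delayed by the integration window, so the spacing between detections still matches the one-second beat interval.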

[AI-41] Contrastive Learning with Nasty Noise

Link: https://arxiv.org/abs/2502.17872
Authors: Ziruo Zhao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Contrastive learning has emerged as a powerful paradigm for self-supervised representation learning. This work analyzes the theoretical limits of contrastive learning under nasty noise, where an adversary modifies or replaces training samples. Using PAC learning and VC-dimension analysis, lower and upper bounds on sample complexity in adversarial settings are established. Additionally, data-dependent sample complexity bounds based on the l2-distance function are derived.

[AI-42] A Combinatorial Identities Benchmark for Theorem Proving via Automated Theorem Generation

Link: https://arxiv.org/abs/2502.17840
Authors: Beibei Xiong, Hangyu Lv, Haojia Shan, Jianlin Wang, Zhengfeng Yang, Lihong Zhi
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have significantly advanced formal theorem proving, yet the scarcity of high-quality training data constrains their capabilities in complex mathematical domains. Combinatorics, a cornerstone of mathematics, provides essential tools for analyzing discrete structures and solving optimization problems. However, its inherent complexity makes automated theorem proving (ATP) for combinatorial identities particularly challenging. To address this, we manually construct LeanComb, a combinatorial identities benchmark in Lean, which is, to our knowledge, the first formalized theorem proving benchmark built for combinatorial identities. We develop an Automated Theorem Generator for Combinatorial Identities, ATG4CI, which combines candidate tactics suggested by a self-improving large language model with a Reinforcement Learning Tree Search approach for tactic prediction. By utilizing ATG4CI, we generate a LeanComb-Enhanced dataset comprising 260K combinatorial identity theorems, each with a complete formal proof in Lean, and experimental evaluations demonstrate that models trained on this dataset can generate more effective tactics, thereby improving success rates in automated theorem proving for combinatorial identities.

[AI-43] CAML: Collaborative Auxiliary Modality Learning for Multi-Agent Systems

Link: https://arxiv.org/abs/2502.17821
Authors: Rui Liu, Yu Shen, Peng Gao, Pratap Tokekar, Ming Lin
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Multi-modality learning has become a crucial technique for improving the performance of machine learning applications across domains such as autonomous driving, robotics, and perception systems. While existing frameworks such as Auxiliary Modality Learning (AML) effectively utilize multiple data sources during training and enable inference with reduced modalities, they primarily operate in a single-agent context. This limitation is particularly critical in dynamic environments, such as connected autonomous vehicles (CAV), where incomplete data coverage can lead to decision-making blind spots. To address these challenges, we propose Collaborative Auxiliary Modality Learning (CAML), a novel multi-agent multi-modality framework that enables agents to collaborate and share multimodal data during training while allowing inference with reduced modalities per agent during testing. We systematically analyze the effectiveness of CAML from the perspective of uncertainty reduction and data coverage, providing theoretical insights into its advantages over AML. Experimental results in collaborative decision-making for CAV in accident-prone scenarios demonstrate that CAML achieves up to a 58.13% improvement in accident detection. Additionally, we validate CAML on real-world aerial-ground robot data for collaborative semantic segmentation, achieving up to a 10.61% improvement in mIoU.

[AI-44] DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities

Link: https://arxiv.org/abs/2502.17807
Authors: Tianyi Zhuang, Chuqiao Kuang, Xiaoguang Li, Yihua Teng, Jihao Wu, Yasheng Wang, Lifeng Shang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We present DocPuzzle, a rigorously constructed benchmark for evaluating long-context reasoning capabilities in large language models (LLMs). This benchmark comprises 100 expert-level QA problems requiring multi-step reasoning over long real-world documents. To ensure the task quality and complexity, we implement a human-AI collaborative annotation-validation pipeline. DocPuzzle introduces an innovative evaluation framework that mitigates guessing bias through checklist-guided process analysis, establishing new standards for assessing reasoning capacities in LLMs. Our evaluation results show that: 1) advanced slow-thinking reasoning models like o1-preview (69.7%) and DeepSeek-R1 (66.3%) significantly outperform the best general instruct models like Claude 3.5 Sonnet (57.7%); 2) distilled reasoning models like DeepSeek-R1-Distill-Qwen-32B (41.3%) fall far behind the teacher model, suggesting that it is challenging to maintain the generalization of reasoning capabilities when relying solely on distillation.

[AI-45] Research on Enhancing Cloud Computing Network Security using Artificial Intelligence Algorithms

Link: https://arxiv.org/abs/2502.17801
Authors: Yuqing Wang, Xiao Yang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cloud computing environments are increasingly vulnerable to security threats such as distributed denial-of-service (DDoS) attacks and SQL injection. Traditional security mechanisms, based on rule matching and feature recognition, struggle to adapt to evolving attack strategies. This paper proposes an adaptive security protection framework leveraging deep learning to construct a multi-layered defense architecture. The proposed system is evaluated in a real-world business environment, achieving a detection accuracy of 97.3%, an average response time of 18 ms, and an availability rate of 99.999%. Experimental results demonstrate that the proposed method significantly enhances detection accuracy, response efficiency, and resource utilization, offering a novel and effective approach to cloud computing security.

[AI-46] DeepSeek vs. ChatGPT: A Comparative Study for Scientific Computing and Scientific Machine Learning Tasks

Link: https://arxiv.org/abs/2502.17764
Authors: Qile Jiang, Zhiwei Gao, George Em Karniadakis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have emerged as powerful tools for tackling a wide range of problems, including those in scientific computing, particularly in solving partial differential equations (PDEs). However, different models exhibit distinct strengths and preferences, resulting in varying levels of performance. In this paper, we compare the capabilities of the most advanced LLMs–ChatGPT and DeepSeek–along with their reasoning-optimized versions in addressing computational challenges. Specifically, we evaluate their proficiency in solving traditional numerical problems in scientific computing as well as leveraging scientific machine learning techniques for PDE-based problems. We designed all our experiments so that a non-trivial decision is required, e.g. defining the proper space of input functions for neural operator learning. Our findings reveal that the latest model, ChatGPT o3-mini-high, usually delivers the most accurate results while also responding significantly faster than its reasoning counterpart, DeepSeek R1. This enhanced speed and accuracy make ChatGPT o3-mini-high a more practical and efficient choice for diverse computational tasks at this juncture.

[AI-47] Design and implementation of a distributed security threat detection system integrating federated learning and multimodal LLM

Link: https://arxiv.org/abs/2502.17763
Authors: Yuqing Wang, Xiao Yang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments:

Abstract:Traditional security protection methods struggle to address sophisticated attack vectors in large-scale distributed systems, particularly when balancing detection accuracy with data privacy concerns. This paper presents a novel distributed security threat detection system that integrates federated learning with multimodal large language models (LLMs). Our system leverages federated learning to ensure data privacy while employing multimodal LLMs to process heterogeneous data sources including network traffic, system logs, images, and sensor data. Experimental evaluation on a 10TB distributed dataset demonstrates that our approach achieves 96.4% detection accuracy, outperforming traditional baseline models by 4.1 percentage points. The system reduces both false positive and false negative rates by 1.8 and 2.4 percentage points respectively. Performance analysis shows that our system maintains efficient processing capabilities in distributed environments, requiring 180 seconds for model training and 3.8 seconds for threat detection across the distributed network. These results demonstrate significant improvements in detection accuracy and computational efficiency while preserving data privacy, suggesting strong potential for real-world deployment in large-scale security systems.

[AI-48] Graded Neural Networks

Link: https://arxiv.org/abs/2502.17751
Authors: Tony Shaska
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents a novel framework for graded neural networks (GNNs) built over graded vector spaces $\mathcal{V}_{\mathbf{w}}^n$, extending classical neural architectures by incorporating algebraic grading. Leveraging a coordinate-wise grading structure with scalar action $\lambda \star \mathbf{x} = (\lambda^{q_i} x_i)$, defined by a tuple $\mathbf{w} = (q_0, \ldots, q_{n-1})$, we introduce graded neurons, layers, activation functions, and loss functions that adapt to feature significance. Theoretical properties of graded spaces are established, followed by a comprehensive GNN design, addressing computational challenges like numerical stability and gradient scaling. Potential applications span machine learning and photonic systems, exemplified by high-speed laser-based implementations. This work offers a foundational step toward graded computation, unifying mathematical rigor with practical potential, with avenues for future empirical and hardware exploration.
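
The graded scalar action in the abstract scales each coordinate by a different power of the scalar. A minimal sketch, assuming integer weights and plain Python lists (the function name and the example values are illustrative, not taken from the paper):

```python
from typing import List, Sequence


def graded_action(lam: float, x: Sequence[float], weights: Sequence[int]) -> List[float]:
    """Graded scalar action: coordinate i of x is multiplied by lam ** q_i."""
    if len(x) != len(weights):
        raise ValueError("x and weights must have the same length")
    return [(lam ** q) * xi for xi, q in zip(x, weights)]


w = (1, 2, 3)            # example grading weights (q_0, q_1, q_2)
x = [1.0, 1.0, 1.0]
y = graded_action(2.0, x, w)

# The action is compatible with multiplication of scalars:
# acting by 2 and then by 3 matches acting once by 6.
twice = graded_action(3.0, graded_action(2.0, x, w), w)
once = graded_action(6.0, x, w)
```

With all weights equal to 1 the action reduces to ordinary scalar multiplication, which is the sense in which grading extends the classical setting.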

[AI-49] Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features

Link: https://arxiv.org/abs/2502.17749
Authors: Shinwoo Park, Hyundong Jin, Jeong-won Cha, Yo-Sub Han
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent progress in large language models (LLMs) for code generation has raised serious concerns about intellectual property protection. Malicious users can exploit LLMs to produce paraphrased versions of proprietary code that closely resemble the original. While the potential for LLM-assisted code paraphrasing continues to grow, research on detecting it remains limited, underscoring an urgent need for detection systems. We respond to this need by proposing two tasks. The first task is to detect whether code generated by an LLM is a paraphrased version of original human-written code. The second task is to identify which LLM is used to paraphrase the original code. For these tasks, we construct a dataset LPcode consisting of pairs of human-written code and LLM-paraphrased code using various LLMs. We statistically confirm significant differences in the coding styles of human-written and LLM-paraphrased code, particularly in terms of naming consistency, code structure, and readability. Based on these findings, we develop LPcodedec, a detection method that identifies paraphrase relationships between human-written and LLM-generated code and discovers which LLM is used for the paraphrasing. LPcodedec outperforms the best baselines in two tasks, improving F1 scores by 2.64% and 15.17% while achieving speedups of 1,343x and 213x, respectively.

[AI-50] The GigaMIDI Dataset with Features for Expressive Music Performance Detection

Link: https://arxiv.org/abs/2502.17726
Authors: Keon Ju Maverick Lee, Jeff Ens, Sara Adkins, Pedro Sarmento, Mathieu Barthet, Philippe Pasquier
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
Comments: Published at Transactions of the International Society for Music Information Retrieval (TISMIR), 8(1), 1-19

Abstract:The Musical Instrument Digital Interface (MIDI), introduced in 1983, revolutionized music production by allowing computers and instruments to communicate efficiently. MIDI files encode musical instructions compactly, facilitating convenient music sharing. They benefit Music Information Retrieval (MIR), aiding in research on music understanding, computational musicology, and generative music. The GigaMIDI dataset contains over 1.4 million unique MIDI files, encompassing 1.8 billion MIDI note events and over 5.3 million MIDI tracks. GigaMIDI is currently the largest collection of symbolic music in MIDI format available for research purposes under fair dealing. Distinguishing between non-expressive and expressive MIDI tracks is challenging, as MIDI files do not inherently make this distinction. To address this issue, we introduce a set of innovative heuristics for detecting expressive music performance. These include the Distinctive Note Velocity Ratio (DNVR) heuristic, which analyzes MIDI note velocity; the Distinctive Note Onset Deviation Ratio (DNODR) heuristic, which examines deviations in note onset times; and the Note Onset Median Metric Level (NOMML) heuristic, which evaluates onset positions relative to metric levels. Our evaluation demonstrates these heuristics effectively differentiate between non-expressive and expressive MIDI tracks. Furthermore, after evaluation, we create the most substantial expressive MIDI dataset, employing our heuristic, NOMML. This curated iteration of GigaMIDI encompasses expressively-performed instrument tracks detected by NOMML, containing all General MIDI instruments, constituting 31% of the GigaMIDI dataset, totalling 1,655,649 tracks.

[AI-51] Aligning Compound AI Systems via System-level DPO AAAI25

Link: https://arxiv.org/abs/2502.17721
Authors: Xiangwen Wang, Yibo Jacky Zhang, Zhoujie Ding, Katherine Tsai, Sanmi Koyejo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted to workshops MARW and WMAC (Oral) at AAAI25

Abstract:Compound AI systems, comprising multiple interacting components such as LLM agents and external tools, demonstrate state-of-the-art results across diverse tasks. It is hence crucial to align components within the system to produce consistent results that match human expectations. However, conventional alignment methods, such as Direct Preference Optimization (DPO), are not directly applicable to compound AI systems. These challenges include the non-differentiable interactions between components, making end-to-end gradient optimization infeasible. Additionally, system-level preferences cannot be directly translated into component-level preferences, further complicating alignment. We address the issues by formulating compound AI systems as Directed Acyclic Graphs (DAGs), capturing the connections between agents and the data generation processes. We propose a system-level DPO (SysDPO) to jointly align compound systems by adapting the DPO to operate on these DAGs. We study the joint alignment of an LLM and a diffusion model to demonstrate the effectiveness of our approach. Our exploration provides insights into the alignment of compound AI systems and lays a foundation for future advancements.
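
As background for the system-level variant, the standard DPO loss for one preference pair can be sketched as below. The `system_logp` helper reflects only one plausible reading of the DAG formulation, namely that the system's log-likelihood is the sum of its components' log-probabilities; that factorization, the helper names, and the numbers are assumptions for illustration and may differ from SysDPO itself.

```python
import math
from typing import Sequence


def dpo_loss(logp_w: float, ref_logp_w: float,
             logp_l: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, dispreferred) pair.

    loss = -log(sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]))
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def system_logp(component_logps: Sequence[float]) -> float:
    """Assumption: the system likelihood factorizes over DAG nodes, so log-probs add."""
    return sum(component_logps)


# Hypothetical two-component system (e.g. an LLM followed by a diffusion model).
policy_w = system_logp([-1.0, -0.5])   # preferred output under the policy
policy_l = system_logp([-2.0, -1.5])   # dispreferred output under the policy
ref_w = system_logp([-1.5, -1.0])      # same outputs under the frozen reference
ref_l = system_logp([-1.5, -1.0])
loss = dpo_loss(policy_w, ref_w, policy_l, ref_l)
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)  # policy identical to the reference
```

When the policy favors the preferred output more than the reference does, the margin is positive and the loss drops below the neutral value of log 2.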

[AI-52] On the usability of generative AI: Human generative AI

Link: https://arxiv.org/abs/2502.17714
Authors: Anna Ravera, Cristina Gena
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative AI systems are transforming content creation, but their usability remains a key challenge. This paper examines usability factors such as user experience, transparency, control, and cognitive load. Common challenges include unpredictability and difficulties in fine-tuning outputs. We review evaluation metrics like efficiency, learnability, and satisfaction, highlighting best practices from various domains. Improving interpretability, intuitive interfaces, and user feedback can enhance usability, making generative AI more accessible and effective.

[AI-53] To Patch or Not to Patch: Motivations, Challenges, and Implications for Cybersecurity

Link: https://arxiv.org/abs/2502.17703
Authors: Jason R. C. Nurse
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 7th International Conference HCI for Cybersecurity, Privacy and Trust (27th HCI International Conference)

Abstract:As technology has become more embedded into our society, the security of modern-day systems is paramount. One topic which is constantly under discussion is that of patching, or more specifically, the installation of updates that remediate security vulnerabilities in software or hardware systems. This continued deliberation is motivated by complexities involved with patching; in particular, the various incentives and disincentives for organizations and their cybersecurity teams when deciding whether to patch. In this paper, we take a fresh look at the question of patching and critically explore why organizations and IT/security teams choose to patch or decide against it (either explicitly or due to inaction). We tackle this question by aggregating and synthesizing prominent research and industry literature on the incentives and disincentives for patching, specifically considering the human aspects in the context of these motives. Through this research, this study identifies key motivators such as organizational needs, the IT/security team’s relationship with vendors, and legal and regulatory requirements placed on the business and its staff. There are also numerous significant reasons discovered for why the decision is taken not to patch, including limited resources (e.g., person-power), challenges with manual patch management tasks, human error, bad patches, unreliable patch management tools, and the perception that related vulnerabilities would not be exploited. These disincentives, in combination with the motivators above, highlight the difficult balance that organizations and their security teams need to maintain on a daily basis. Finally, we conclude by discussing implications of these findings and important future considerations.

[AI-54] Yes, Q-learning Helps Offline In-Context RL

Link: https://arxiv.org/abs/2502.17666
Authors: Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Andrei Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Igor Kiselev, Vladislav Kurenkov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this work, we explore the integration of Reinforcement Learning (RL) approaches within a scalable offline In-Context RL (ICRL) framework. Through experiments across more than 150 datasets derived from GridWorld and MuJoCo environments, we demonstrate that optimizing RL objectives improves performance by approximately 40% on average compared to the widely established Algorithm Distillation (AD) baseline across various dataset coverages, structures, expertise levels, and environmental complexities. Our results also reveal that offline RL-based methods outperform online approaches, which are not specifically designed for offline scenarios. These findings underscore the importance of aligning the learning objectives with RL’s reward-maximization goal and demonstrate that offline RL is a promising direction for application in ICRL settings.

[AI-55] Wearable Meets LLM for Stress Management: A Duoethnographic Study Integrating Wearable-Triggered Stressors and LLM Chatbots for Personalized Interventions

Link: https://arxiv.org/abs/2502.17650
Authors: Sameer Neupane (University of Memphis), Poorvesh Dongre (Virginia Tech), Denis Gracanin (Virginia Tech), Santosh Kumar (University of Memphis)
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: In CHI '25 Proceedings of the CHI Conference on Human Factors in Computing Systems Yokohama, Japan

Abstract:We use a duoethnographic approach to study how wearable-integrated LLM chatbots can assist with personalized stress management, addressing the growing need for immediacy and tailored interventions. Two researchers interacted with custom chatbots over 22 days, responding to wearable-detected physiological prompts, recording stressor phrases, and using them to seek tailored interventions from their LLM-powered chatbots. They recorded their experiences in autoethnographic diaries and analyzed them during weekly discussions, focusing on the relevance, clarity, and impact of chatbot-generated interventions. Results showed that even though most events triggered by the wearable were meaningful, only one in five warranted an intervention. It also showed that interventions tailored with brief event descriptions were more effective than generic ones. By examining the intersection of wearables and LLM, this research contributes to developing more effective, user-centric mental health tools for real-time stress relief and behavior change.

[AI-56] Socratic: Enhancing Human Teamwork via AI-enabled Coaching AAMAS2025

Link: https://arxiv.org/abs/2502.17643
Authors: Sangwon Seo, Bing Han, Rayan E. Harari, Roger D. Dias, Marco A. Zenati, Eduardo Salas, Vaibhav Unhelkar
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Extended version of an identically-titled paper accepted at AAMAS 2025

Abstract:Coaches are vital for effective collaboration, but cost and resource constraints often limit their availability during real-world tasks. This limitation poses serious challenges in life-critical domains that rely on effective teamwork, such as healthcare and disaster response. To address this gap, we propose and realize an innovative application of AI: task-time team coaching. Specifically, we introduce Socratic, a novel AI system that complements human coaches by providing real-time guidance during task execution. Socratic monitors team behavior, detects misalignments in team members’ shared understanding, and delivers automated interventions to improve team performance. We validated Socratic through two human subject experiments involving dyadic collaboration. The results demonstrate that the system significantly enhances team performance with minimal interventions. Participants also perceived Socratic as helpful and trustworthy, supporting its potential for adoption. Our findings also suggest promising directions both for AI research and its practical applications to enhance human teamwork.

[AI-57] Requirements for Quality Assurance of AI Models for Early Detection of Lung Cancer KR

Link: https://arxiv.org/abs/2502.17639
Authors: Horst K. Hahn, Matthias S. May, Volker Dicken, Michael Walz, Rainer Eßeling, Bianca Lassen-Schmidt, Robert Rischen, Jens Vogel-Claussen, Konstantin Nikolaou, Jörg Barkhausen
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 12 pages incl. 2 figures, 2 charts, and references, summary in English (page 2), article in German (original title: Anforderungen an die Qualitätssicherung von KI-Modellen für die Lungenkrebs-Früherkennung)

Abstract:Lung cancer is the second most common cancer and the leading cause of cancer-related deaths worldwide. Survival largely depends on tumor stage at diagnosis, and early detection with low-dose CT can significantly reduce mortality in high-risk patients. AI can improve the detection, measurement, and characterization of pulmonary nodules while reducing assessment time. However, the training data, functionality, and performance of available AI systems vary considerably, complicating software selection and regulatory evaluation. Manufacturers must specify intended use and provide test statistics, but they can choose their training and test data, limiting standardization and comparability. Under the EU AI Act, consistent quality assurance is required for AI-based nodule detection, measurement, and characterization. This position paper proposes systematic quality assurance grounded in a validated reference dataset, including real screening cases plus phantom data to verify volume and growth rate measurements. Regular updates shall reflect demographic shifts and technological advances, ensuring ongoing relevance. Consequently, ongoing AI quality assurance is vital. Regulatory challenges are also addressed. While the MDR and the EU AI Act set baseline requirements, they do not adequately address self-learning algorithms or their updates. A standardized, transparent quality assessment - based on sensitivity, specificity, and volumetric accuracy - enables an objective evaluation of each AI solution’s strengths and weaknesses. Establishing clear testing criteria and systematically using updated reference data lay the groundwork for comparable performance metrics, informing tenders, guidelines, and recommendations.
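The three assessment criteria named in the abstract (sensitivity, specificity, volumetric accuracy) can be written down in a few lines. The sketch below is only an illustration using the standard textbook definitions; the function names and the relative-error form of volumetric accuracy are assumptions, not taken from the paper.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real nodules that the system detected."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of nodule-free cases that the system correctly left unflagged."""
    return true_neg / (true_neg + false_pos)


def relative_volume_error(measured_mm3: float, reference_mm3: float) -> float:
    """Assumed accuracy measure: absolute volume error relative to a phantom reference."""
    return abs(measured_mm3 - reference_mm3) / reference_mm3


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(sensitivity(90, 10), specificity(80, 20), relative_volume_error(110.0, 100.0))
```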

[AI-58] Towards Robust Legal Reasoning: Harnessing Logical LLMs in Law

Link: https://arxiv.org/abs/2502.17638
Authors: Manuj Kant, Sareh Nabi, Manav Kant, Roland Scharrer, Megan Ma, Marzieh Nabi
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:Legal services rely heavily on text processing. While large language models (LLMs) show promise, their application in legal contexts demands higher accuracy, repeatability, and transparency. Logic programs, by encoding legal concepts as structured rules and facts, offer reliable automation, but require sophisticated text extraction. We propose a neuro-symbolic approach that integrates LLMs’ natural language understanding with logic-based reasoning to address these limitations. As a legal document case study, we applied neuro-symbolic AI to coverage-related queries in insurance contracts using both closed and open-source LLMs. While LLMs have improved in legal reasoning, they still lack the accuracy and consistency required for complex contract analysis. In our analysis, we tested three methodologies to evaluate whether a specific claim is covered under a contract: a vanilla LLM, an unguided approach that leverages LLMs to encode both the contract and the claim, and a guided approach that uses a framework for the LLM to encode the contract. We demonstrated the promising capabilities of LLM + Logic in the guided approach.

[AI-59] Hierarchical Imitation Learning of Team Behavior from Heterogeneous Demonstrations AAMAS2025

Link: https://arxiv.org/abs/2502.17618
Authors: Sangwon Seo, Vaibhav Unhelkar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Extended version of an identically-titled paper accepted at AAMAS 2025

Abstract:Successful collaboration requires team members to stay aligned, especially in complex sequential tasks. Team members must dynamically coordinate which subtasks to perform and in what order. However, real-world constraints like partial observability and limited communication bandwidth often lead to suboptimal collaboration. Even among expert teams, the same task can be executed in multiple ways. To develop multi-agent systems and human-AI teams for such tasks, we are interested in data-driven learning of multimodal team behaviors. Multi-Agent Imitation Learning (MAIL) provides a promising framework for data-driven learning of team behavior from demonstrations, but existing methods struggle with heterogeneous demonstrations, as they assume that all demonstrations originate from a single team policy. Hence, in this work, we introduce DTIL: a hierarchical MAIL algorithm designed to learn multimodal team behaviors in complex sequential tasks. DTIL represents each team member with a hierarchical policy and learns these policies from heterogeneous team demonstrations in a factored manner. By employing a distribution-matching approach, DTIL mitigates compounding errors and scales effectively to long horizons and continuous state representations. Experimental results show that DTIL outperforms MAIL baselines and accurately models team behavior across a variety of collaborative scenarios.

[AI-60] Flexible Counterfactual Explanations with Generative Models

Link: https://arxiv.org/abs/2502.17613
Authors: Stig Hellemans, Andres Algaba, Sam Verboven, Vincent Ginis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments: 28 pages, 13 figures

Abstract:Counterfactual explanations provide actionable insights to achieve desired outcomes by suggesting minimal changes to input features. However, existing methods rely on fixed sets of mutable features, which makes counterfactual explanations inflexible for users with heterogeneous real-world constraints. Here, we introduce Flexible Counterfactual Explanations, a framework incorporating counterfactual templates, which allows users to dynamically specify mutable features at inference time. In our implementation, we use Generative Adversarial Networks (FCEGAN), which align explanations with user-defined constraints without requiring model retraining or additional optimization. Furthermore, FCEGAN is designed for black-box scenarios, leveraging historical prediction datasets to generate explanations without direct access to model internals. Experiments across economic and healthcare datasets demonstrate that FCEGAN significantly improves counterfactual explanations’ validity compared to traditional benchmark methods. By integrating user-driven flexibility and black-box compatibility, counterfactual templates support personalized explanations tailored to user constraints.

[AI-61] Representation Engineering for Large-Language Models: Survey and Research Challenges

Link: https://arxiv.org/abs/2502.17601
Authors: Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, Carsten Maple
Subjects: Artificial Intelligence (cs.AI)

Abstract:Large-language models are capable of completing a variety of tasks, but remain unpredictable and intractable. Representation engineering seeks to resolve this problem through a new approach utilizing samples of contrasting inputs to detect and edit high-level representations of concepts such as honesty, harmfulness or power-seeking. We formalize the goals and methods of representation engineering to present a cohesive picture of work in this emerging discipline. We compare it with alternative approaches, such as mechanistic interpretability, prompt-engineering and fine-tuning. We outline risks such as performance decrease, compute time increases and steerability issues. We present a clear agenda for future research to build predictable, dynamic, safe and personalizable LLMs.

[AI-62] Intention Recognition in Real-Time Interactive Navigation Maps

Link: https://arxiv.org/abs/2502.17581
Authors: Peijie Zhao, Zunayed Arefin, Felipe Meneguzzi, Ramon Fraga Pereira
Subjects: Artificial Intelligence (cs.AI)

Abstract:In this demonstration, we develop IntentRec4Maps, a system to recognise users’ intentions in interactive maps for real-world navigation. IntentRec4Maps uses the Google Maps Platform as the real-world interactive map, and a very effective approach for recognising users’ intentions in real-time. We showcase the recognition process of IntentRec4Maps using two different Path-Planners and a Large Language Model (LLM). GitHub: this https URL

[AI-63] How Do Large Language Monkeys Get Their Power (Laws)?

Link: https://arxiv.org/abs/2502.17578
Authors: Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, Sanmi Koyejo
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent research across mathematical problem solving, proof assistant programming and multimodal jailbreaking documents a striking finding: when (multimodal) language models tackle a suite of tasks with multiple attempts per task, succeeding if any attempt is correct, the negative log of the average success rate scales as a power law in the number of attempts. In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that on each problem, the failure rate should fall exponentially with the number of attempts. We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating that per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy tailed, such that a small fraction of tasks with extremely low success probabilities collectively warp the aggregate success trend into a power law, even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power law scaling, and provides a simple method for forecasting the power law exponent with an order of magnitude lower relative error, or equivalently, $\sim$2-4 orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves with scaling inference compute and the development of scaling-predictable evaluations of (multimodal) language models.
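The puzzle in this abstract can be reproduced numerically. The sketch below is a toy illustration, not the authors' code: it assumes single-attempt success probabilities with density a * p**(a-1) near zero (a hypothetical heavy-tailed choice, with a = 0.5) and uses deterministic quantiles instead of random sampling. Each problem's failure rate (1 - p)**k is exactly exponential in the number of attempts k, yet the negative log of the average success rate falls roughly as k**(-a).

```python
import math


def failure_rate(p: float, k: int) -> float:
    """Probability that all k independent attempts fail on one problem."""
    return (1.0 - p) ** k


def heavy_tailed_probs(n: int, a: float) -> list:
    """Deterministic quantiles of an assumed distribution with density a * p**(a - 1)."""
    return [((i + 0.5) / n) ** (1.0 / a) for i in range(n)]


def neg_log_mean_success(probs: list, k: int) -> float:
    """Aggregate metric from the abstract: -log of the average pass-at-k."""
    mean_success = sum(1.0 - failure_rate(p, k) for p in probs) / len(probs)
    return -math.log(mean_success)


def loglog_slope(probs: list, k_lo: int, k_hi: int) -> float:
    """Slope of the aggregate metric on log-log axes between two attempt counts."""
    y_lo = neg_log_mean_success(probs, k_lo)
    y_hi = neg_log_mean_success(probs, k_hi)
    return (math.log(y_hi) - math.log(y_lo)) / (math.log(k_hi) - math.log(k_lo))


if __name__ == "__main__":
    probs = heavy_tailed_probs(20000, a=0.5)
    for k in (100, 1000, 10000):
        print(k, neg_log_mean_success(probs, k))
    print("log-log slope:", loglog_slope(probs, 100, 10000))
```

Under these assumptions the log-log slope comes out close to -a, so the exponent of the aggregate power law is set by how much probability mass sits near p = 0 rather than by any single problem.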

[AI-64] On the Vulnerability of Concept Erasure in Diffusion Models

Link: https://arxiv.org/abs/2502.17537
Authors: Lucas Beerens, Alex D. Richardson, Kaicheng Zhang, Dongdong Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:The proliferation of text-to-image diffusion models has raised significant privacy and security concerns, particularly regarding the generation of copyrighted or harmful images. To address these issues, research on machine unlearning has developed various concept erasure methods, which aim to remove the effect of unwanted data through post-hoc training. However, we show these erasure techniques are vulnerable, where images of supposedly erased concepts can still be generated using adversarially crafted prompts. We introduce RECORD, a coordinate-descent-based algorithm that discovers prompts capable of eliciting the generation of erased content. We demonstrate that RECORD significantly beats the attack success rate of current state-of-the-art attack methods. Furthermore, our findings reveal that models subjected to concept erasure are more susceptible to adversarial attacks than previously anticipated, highlighting the urgency for more robust unlearning approaches. We open source all our code at this https URL

[AI-65] Perceptual Noise-Masking with Music through Deep Spectral Envelope Shaping

Link: https://arxiv.org/abs/2502.17527
Authors: Clémentine Berger (IP Paris, IDS, S2A), Roland Badeau (IP Paris, IDS, S2A), Slim Essid (IP Paris, IDS, S2A)
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)

Abstract:People often listen to music in noisy environments, seeking to isolate themselves from ambient sounds. Indeed, a music signal can mask some of the noise’s frequency components due to the effect of simultaneous masking. In this article, we propose a neural network based on a psychoacoustic masking model, designed to enhance the music’s ability to mask ambient noise by reshaping its spectral envelope with predicted filter frequency responses. The model is trained with a perceptual loss function that balances two constraints: effectively masking the noise while preserving the original music mix and the user’s chosen listening level. We evaluate our approach on simulated data replicating a user’s experience of listening to music with headphones in a noisy environment. The results, based on defined objective metrics, demonstrate that our system improves the state of the art.

[AI-66] Spectral Theory for Edge Pruning in Asynchronous Recurrent Graph Neural Networks

Link: https://arxiv.org/abs/2502.17522
Authors: Nicolas Bessone
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graph Neural Networks (GNNs) have emerged as a powerful tool for learning on graph-structured data, finding applications in numerous domains including social network analysis and molecular biology. Within this broad category, Asynchronous Recurrent Graph Neural Networks (ARGNNs) stand out for their ability to capture complex dependencies in dynamic graphs, resembling living organisms’ intricate and adaptive nature. However, their complexity often leads to large and computationally expensive models. Therefore, pruning unnecessary edges becomes crucial for enhancing efficiency without significantly compromising performance. This paper presents a dynamic pruning method based on graph spectral theory, leveraging the imaginary component of the eigenvalues of the network graph’s Laplacian.

[AI-67] Ensemble RL through Classifier Models: Enhancing Risk-Return Trade-offs in Trading Strategies

Link: https://arxiv.org/abs/2502.17518
Authors: Zheli Xiong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP); Machine Learning (stat.ML)
Comments: 16 pages, 5 figures, 1 table

Abstract:This paper presents a comprehensive study on the use of ensemble Reinforcement Learning (RL) models in financial trading strategies, leveraging classifier models to enhance performance. By combining RL algorithms such as A2C, PPO, and SAC with traditional classifiers like Support Vector Machines (SVM), Decision Trees, and Logistic Regression, we investigate how different classifier groups can be integrated to improve risk-return trade-offs. The study evaluates the effectiveness of various ensemble methods, comparing them with individual RL models across key financial metrics, including Cumulative Returns, Sharpe Ratios (SR), Calmar Ratios, and Maximum Drawdown (MDD). Our results demonstrate that ensemble methods consistently outperform base models in terms of risk-adjusted returns, providing better management of drawdowns and overall stability. However, we identify the sensitivity of ensemble performance to the choice of variance threshold $\tau$, highlighting the importance of dynamic $\tau$ adjustment to achieve optimal performance. This study emphasizes the value of combining RL with classifiers for adaptive decision-making, with implications for financial trading, robotics, and other dynamic environments.
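For readers unfamiliar with the metrics listed in this abstract, a minimal sketch of the usual definitions follows. The annualization factor of 252 trading days, the zero risk-free rate, and the function names are assumptions for illustration; the paper may compute these quantities differently.

```python
import math
import statistics


def sharpe_ratio(returns, periods_per_year=252, risk_free_per_period=0.0):
    """Annualized mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free_per_period for r in returns]
    return math.sqrt(periods_per_year) * statistics.mean(excess) / statistics.stdev(excess)


def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def calmar_ratio(annual_return, mdd):
    """Annual return per unit of maximum drawdown."""
    return annual_return / mdd


if __name__ == "__main__":
    equity = [100.0, 120.0, 90.0, 110.0]  # hypothetical account values
    print("max drawdown:", max_drawdown(equity))
```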

[AI-68] A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models

Link: https://arxiv.org/abs/2502.17516
Authors: Zihao Lin, Samyadeep Basu, Mohammad Beigi, Varun Manjunatha, Ryan A. Rossi, Zichao Wang, Yufan Zhou, Sriram Balasubramanian, Arman Zarei, Keivan Rezaei, Ying Shen, Barry Menglong Yao, Zhiyang Xu, Qin Liu, Yuxiang Zhang, Yan Sun, Shilong Liu, Li Shen, Hongxuan Li, Soheil Feizi, Lifu Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 30 pages, 4 Figures, 10 Tables

Abstract:The rise of foundation models has transformed machine learning research, prompting efforts to uncover their inner workings and develop more efficient and reliable applications for better control. While significant progress has been made in interpreting Large Language Models (LLMs), multimodal foundation models (MMFMs) - such as contrastive vision-language models, generative vision-language models, and text-to-image models - pose unique interpretability challenges beyond unimodal frameworks. Despite initial studies, a substantial gap remains between the interpretability of LLMs and MMFMs. This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) understanding the mechanistic differences between unimodal language models and crossmodal systems. By systematically reviewing current MMFM analysis techniques, we propose a structured taxonomy of interpretability methods, compare insights across unimodal and multimodal architectures, and highlight critical research gaps.

[AI-69] Towards User-level Private Reinforcement Learning with Human Feedback

Link: https://arxiv.org/abs/2502.17515
Authors: Jiaming Zhang, Mingxi Lei, Meng Ding, Mengdi Li, Zihang Xiang, Difei Xu, Jinhui Xu, Di Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reinforcement Learning with Human Feedback (RLHF) has emerged as an influential technique, enabling the alignment of large language models (LLMs) with human preferences. Despite the promising potential of RLHF, how to protect user preference privacy has become a crucial issue. Most previous work has focused on using differential privacy (DP) to protect the privacy of individual data. However, it has concentrated primarily on item-level privacy protection and performs unsatisfactorily for user-level privacy, which is more common in RLHF. This study proposes a novel framework, AUP-RLHF, which integrates user-level label DP into RLHF. We first show that the classical random response algorithm, which achieves an acceptable performance in item-level privacy, leads to suboptimal utility in user-level settings. We then establish a lower bound for the user-level label DP-RLHF and develop the AUP-RLHF algorithm, which guarantees $(\varepsilon, \delta)$ user-level privacy and achieves an improved estimation error. Experimental results show that AUP-RLHF outperforms existing baseline methods in sentiment generation and summarization tasks, achieving a better privacy-utility trade-off.
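The item-level baseline mentioned in this abstract, classical randomized response on a binary preference label, can be sketched as follows. This is the textbook mechanism, not the AUP-RLHF algorithm; the function names are hypothetical. Each label is kept with probability e^ε / (1 + e^ε) and flipped otherwise, which gives ε-label DP for a single item. A user who contributes many labels is protected much more weakly, which is the gap the paper targets.

```python
import math
import random


def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true label under epsilon-label DP."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(label: int, epsilon: float, rng: random.Random) -> int:
    """Privatize one binary preference label (0 or 1) by randomly flipping it."""
    return label if rng.random() < keep_probability(epsilon) else 1 - label


def debias_mean(noisy_mean: float, epsilon: float) -> float:
    """Unbiased estimate of the true label mean from the mean of the noisy labels."""
    q = keep_probability(epsilon)
    return (noisy_mean - (1.0 - q)) / (2.0 * q - 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    true_labels = [1] * 700 + [0] * 300  # hypothetical preference labels
    noisy = [randomized_response(y, 1.0, rng) for y in true_labels]
    print("debiased mean:", debias_mean(sum(noisy) / len(noisy), 1.0))
```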

[AI-70] Int2Int: a framework for mathematics with transformers

Link: https://arxiv.org/abs/2502.17513
Authors: François Charton
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS)

Abstract:This paper documents Int2Int, an open source code base for using transformers on problems of mathematical research, with a focus on number theory and other problems involving integers. Int2Int is a complete PyTorch implementation of a transformer architecture, together with training and evaluation loops, and classes and functions to represent, generate and decode common mathematical objects. Ancillary code for data preparation, and Jupyter Notebooks for visualizing experimental results are also provided. This document presents the main features of Int2Int, serves as its user manual, and provides guidelines on how to extend it. Int2Int is released under the MIT licence, at this https URL.

[AI-71] C-3DPO: Constrained Controlled Classification for Direct Preference Optimization

Link: https://arxiv.org/abs/2502.17507
Authors: Kavosh Asadi, Julien Han, Xingzi Xu, Dominique Perrault-Joncas, Shoham Sabach, Karim Bouyarmane, Mohammad Ghavamzadeh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Direct preference optimization (DPO)-style algorithms have emerged as a promising approach for solving the alignment problem in AI. We present a novel perspective that formulates these algorithms as implicit classification algorithms. This classification framework enables us to recover many variants of DPO-style algorithms by choosing appropriate classification labels and loss functions. We then leverage this classification framework to demonstrate that the underlying problem solved in these algorithms is under-specified, making them susceptible to probability collapse of the winner-loser responses. We address this by proposing a set of constraints designed to control the movement of probability mass between the winner and loser in the reference and target policies. Our resulting algorithm, which we call Constrained Controlled Classification DPO (C-3DPO), has a meaningful RLHF interpretation. By hedging against probability collapse, C-3DPO provides practical improvements over vanilla DPO when aligning several large language models using standard preference datasets.
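The under-specification described in this abstract is visible in the standard DPO loss itself, which depends only on the difference between the winner's and loser's log-probability ratios. The sketch below shows the vanilla per-pair DPO loss (not C-3DPO and not its constraints), with hypothetical log-probabilities; lowering both responses' log-probabilities by the same amount leaves the loss unchanged, so the loss alone cannot rule out both probabilities collapsing.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Vanilla DPO loss for one (winner, loser) pair: -log sigmoid(beta * margin)."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


if __name__ == "__main__":
    # Hypothetical sequence log-probabilities under the policy and the reference.
    base = dpo_loss(-2.0, -3.0, -2.5, -2.5)
    # Same pair after the policy lowers BOTH responses by 5 nats.
    collapsed = dpo_loss(-7.0, -8.0, -2.5, -2.5)
    print(base, collapsed)
```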

[AI-72] RAG-Enhanced Collaborative LLM Agents for Drug Discovery

Link: https://arxiv.org/abs/2502.17506
Authors: Namkyeong Lee, Edward De Brouwer, Ehsan Hajiramezanali, Chanyoung Park, Gabriele Scalia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Machine Learning, Drug Discovery

Abstract:Recent advances in large language models (LLMs) have shown great potential to accelerate drug discovery. However, the specialized nature of biochemical data often necessitates costly domain-specific fine-tuning, posing critical challenges. First, it hinders the application of more flexible general-purpose LLMs in cutting-edge drug discovery tasks. More importantly, it impedes the rapid integration of the vast amounts of scientific data continuously generated through experiments and research. To investigate these challenges, we propose CLADD, a retrieval-augmented generation (RAG)-empowered agentic system tailored to drug discovery tasks. Through the collaboration of multiple LLM agents, CLADD dynamically retrieves information from biomedical knowledge bases, contextualizes query molecules, and integrates relevant evidence to generate responses – all without the need for domain-specific fine-tuning. Crucially, we tackle key obstacles in applying RAG workflows to biochemical data, including data heterogeneity, ambiguity, and multi-source integration. We demonstrate the flexibility and effectiveness of this framework across a variety of drug discovery tasks, showing that it outperforms general-purpose and domain-specific LLMs as well as traditional deep learning approaches.

[AI-73] CoKV: Optimizing KV Cache Allocation via Cooperative Game

Link: https://arxiv.org/abs/2502.17501
Authors: Qiheng Sun, Hongwei Zhang, Haocheng Xia, Jiayao Zhang, Jinfei Liu, Kui Ren
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) have achieved remarkable success in various aspects of human life. However, one of the major challenges in deploying these models is the substantial memory consumption required to store key-value pairs (KV), which imposes significant resource demands. Recent research has focused on KV cache budget allocation, with several approaches proposing head-level budget distribution by evaluating the importance of individual attention heads. These methods, however, assess the importance of heads independently, overlooking their cooperative contributions within the model, which may result in a deviation from their true impact on model performance. In light of this limitation, we propose CoKV, a novel method that models the cooperation between heads in model inference as a cooperative game. By evaluating the contribution of each head within the cooperative game, CoKV can allocate the cache budget more effectively. Extensive experiments show that CoKV achieves state-of-the-art performance on the LongBench benchmark using Llama-3-8B-Instruct and Mistral-7B models.
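To see why scoring heads independently can mislead, consider a toy cooperative game. The abstract does not say which solution concept CoKV uses, so the sketch below uses the Shapley value purely as a standard example of cooperative-game attribution; the three-head value function is hypothetical. Heads 0 and 1 are useful only together, so each looks worthless when evaluated alone, yet the Shapley value credits each with half of their joint contribution.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orders) for p, total in totals.items()}


def toy_head_value(coalition):
    """Hypothetical model score when only the heads in `coalition` keep their cache."""
    score = 0.0
    if {0, 1} <= coalition:  # heads 0 and 1 only help when both are present
        score += 1.0
    if 2 in coalition:       # head 2 helps a little on its own
        score += 0.2
    return score


if __name__ == "__main__":
    print("alone:", {h: toy_head_value(frozenset({h})) for h in (0, 1, 2)})
    print("shapley:", shapley_values([0, 1, 2], toy_head_value))
```

Exact enumeration costs factorial time in the number of players, so any practical use on real attention heads would need an approximation.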

[AI-74] Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm

Link: https://arxiv.org/abs/2502.17500
Authors: Andrzej Cichocki
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, preprint of Journal paper

Abstract:In this paper we propose and investigate a new class of Generalized Exponentiated Gradient (GEG) algorithms using Mirror Descent (MD) approaches, applying as a regularization function the Bregman divergence with a two-parameter deformation of the logarithm as a link function. This link function (referred to as the Euler logarithm) is associated with a wide class of generalized entropies. In order to derive novel GEG/MD updates, we estimate a generalized exponential function, which closely approximates the inverse of the Euler two-parameter logarithm. The shape and properties of the Euler logarithm and its inverse, the deformed exponential function, are tuned by two or even more hyperparameters. By learning these hyperparameters, we can adapt to the distribution of the training data and adjust them to achieve desired properties of gradient descent algorithms. The concept of generalized entropies and associated deformed logarithms provides deeper insight into novel gradient descent updates. The literature now contains over fifty mathematically well-defined entropic functionals and associated deformed logarithms, so it is impossible to investigate all of them in one research paper. Therefore, we focus here on a wide class of trace-form entropies and associated generalized logarithms. We applied the developed algorithms to Online Portfolio Selection (OPLS) in order to improve its performance and robustness.
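As background for the generalization described in this abstract, the classical exponentiated gradient step for online portfolio selection is sketched below. This is the ordinary EG update, which corresponds to the standard logarithm as link function; it is not the paper's Euler-logarithm variant, and the learning rate and price relatives are hypothetical.

```python
import math


def eg_update(weights, price_relatives, eta=0.05):
    """One classical exponentiated gradient step on the log-wealth objective.

    weights: current portfolio weights (non-negative, summing to 1).
    price_relatives: today's price divided by yesterday's price, per asset.
    """
    portfolio_return = sum(w * x for w, x in zip(weights, price_relatives))
    # Multiplicative update: the gradient of log(w . x) with respect to w_i is x_i / (w . x).
    unnormalized = [
        w * math.exp(eta * x / portfolio_return)
        for w, x in zip(weights, price_relatives)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]  # renormalize onto the simplex


if __name__ == "__main__":
    print(eg_update([0.5, 0.5], [1.1, 0.9]))
```

The exponential here is the inverse of the ordinary logarithm; the abstract's approach replaces it with a deformed exponential controlled by additional hyperparameters.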

[AI-75] SpikeRL: A Scalable and Energy-efficient Framework for Deep Spiking Reinforcement Learning

Link: https://arxiv.org/abs/2502.17496
Authors: Tokey Tahmid, Mark Gates, Piotr Luszczek, Catherine D. Schuman
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

Abstract: In this era of AI revolution, massive investments in large-scale data-driven AI systems demand high-performance computing, consuming tremendous energy and resources. This trend raises new challenges in optimizing sustainability without sacrificing scalability or performance. Among the energy-efficient alternatives to the traditional Von Neumann architecture, neuromorphic computing and its Spiking Neural Networks (SNNs) are a promising choice due to their inherent energy efficiency. However, in some real-world application scenarios such as complex continuous control tasks, SNNs often lack the performance optimizations that traditional artificial neural networks have. Researchers have addressed this by combining SNNs with Deep Reinforcement Learning (DeepRL), yet scalability remains unexplored. In this paper, we extend our previous work on SpikeRL, which is a scalable and energy-efficient framework for DeepRL-based SNNs for continuous control. In our initial implementation of the SpikeRL framework, we depended on the population encoding from the Population-coded Spiking Actor Network (PopSAN) method for our SNN model and implemented distributed training with Message Passing Interface (MPI) through mpi4py. We also further optimized model training by using mixed precision for parameter updates. In our new SpikeRL framework, we have implemented our own DeepRL-SNN component with population encoding, and distributed training with the PyTorch Distributed package with the NCCL backend, while still optimizing with mixed-precision training. Our new SpikeRL implementation is 4.26X faster and 2.25X more energy efficient than state-of-the-art DeepRL-SNN methods. Our proposed SpikeRL framework demonstrates a truly scalable and sustainable solution for complex continuous control tasks in real-world applications.

[AI-76] External Large Foundation Model: How to Efficiently Serve Trillions of Parameters for Online Ads Recommendation WWW

Link: https://arxiv.org/abs/2502.17494
Authors: Mingfu Liang, Xi Liu, Rong Jin, Boyang Liu, Qiuling Suo, Qinghai Zhou, Song Zhou, Laming Chen, Hua Zheng, Zhiyuan Li, Shali Jiang, Jiyan Yang, Xiaozhen Xia, Fan Yang, Yasmine Badr, Ellie Wen, Shuyu Xu, Hansey Chen, Zhengyu Zhang, Jade Nie, Chunzhi Yang, Zhichen Zeng, Weilin Zhang, Xingliang Huang, Qianru Li, Shiquan Wang, Evelyn Lyu, Wenjing Lu, Rui Zhang, Wenjun Wang, Jason Rudy, Mengyue Hang, Kai Wang, Yinbin Ma, Shuaiwen Wang, Sihan Zeng, Tongyi Tang, Xiaohan Wei, Longhao Jin, Jamey Zhang, Marcus Chen, Jiayi Zhang, Angie Huang, Chi Zhang, Zhengli Zhao, Jared Yang, Qiang Jin, Xian Chen, Amit Anand Amlesahwaram, Lexi Song, Liang Luo, Yuchen Hao, Nan Xiao, Yavuz Yetim, Luoshang Pan, Gaoxiang Liu, Yuxi Hu, Yuzhen Huang, Jackie Xu, Rich Zhu, Xin Zhang, Yiqun Liu, Hang Yin, Yuxin Chen, Buyun Zhang, Xiaoyi Liu, Sylvia Wang, Wenguang Mao, Zhijing Li, Qin Huang, Chonglin Sun, Shupin Mao, Jingzheng Qin, Peggy Yao, Jae-Woo Choi, Bin Gao, Ernest Wang, Lei Zhang, Wen-Yen Chen, Ted Lee, Jay Zha, Yi Meng, Alex Gong, Edison Gao, Alireza Vahdatpour, Yiping Han, Yantao Yao, Toshinari Kureha, Shuo Chang, Musharaf Sultan, John Bocharov, Sagar Chordia, Xiaorui Gan, Peng Sun, Rocky Liu, Bo Long, Wenlin Chen, Santanu Kolay, Huayu Li
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by the ACM Web Conference (WWW) 2025 Industrial Track as Oral Presentation

Abstract:Ads recommendation is a prominent service of online advertising systems and has been actively studied. Recent studies indicate that scaling-up and advanced design of the recommendation model can bring significant performance improvement. However, with a larger model scale, such prior studies have a significantly increasing gap from industry as they often neglect two fundamental challenges in industrial-scale applications. First, training and inference budgets are restricted for the model to be served, exceeding which may incur latency and impair user experience. Second, large-volume data arrive in a streaming mode with data distributions dynamically shifting, as new users/ads join and existing users/ads leave the system. We propose the External Large Foundation Model (ExFM) framework to address the overlooked challenges. Specifically, we develop external distillation and a data augmentation system (DAS) to control the computational cost of training/inference while maintaining high performance. We design the teacher in a way like a foundation model (FM) that can serve multiple students as vertical models (VMs) to amortize its building cost. We propose Auxiliary Head and Student Adapter to mitigate the data distribution gap between FM and VMs caused by the streaming data issue. Comprehensive experiments on internal industrial-scale applications and public datasets demonstrate significant performance gain by ExFM.

[AI-77] Pursuing Top Growth with Novel Loss Function

Link: https://arxiv.org/abs/2502.17493
Authors: Ruoyu Guo, Haochen Qiu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
Comments: 30 pages, 7 figures, GitHub repo: this https URL

Abstract:Making consistently profitable financial decisions in a continuously evolving and volatile stock market has always been a difficult task. Professionals from different disciplines have developed foundational theories to anticipate price movement and evaluate securities such as the famed Capital Asset Pricing Model (CAPM). In recent years, the role of artificial intelligence (AI) in asset pricing has been growing. Although the black-box nature of deep learning models lacks interpretability, they have continued to solidify their position in the financial industry. We aim to further enhance AI’s potential and utility by introducing a return-weighted loss function that will drive top growth while providing the ML models a limited amount of information. Using only publicly accessible stock data (open/close/high/low, trading volume, sector information) and several technical indicators constructed from them, we propose an efficient daily trading system that detects top growth opportunities. Our best models achieve 61.73% annual return on daily rebalancing with an annualized Sharpe Ratio of 1.18 over 1340 testing days from 2019 to 2024, and 37.61% annual return with an annualized Sharpe Ratio of 0.97 over 1360 testing days from 2005 to 2010. The main drivers for success, especially independent of any domain knowledge, are the novel return-weighted loss function, the integration of categorical and continuous data, and the ML model architecture. We also demonstrate the superiority of our novel loss function over traditional loss functions via several performance metrics and statistical evidence.

[AI-78] A generalized dual potential for inelastic Constitutive Artificial Neural Networks: A JAX implementation at finite strains

Link: https://arxiv.org/abs/2502.17490
Authors: Hagen Holthusen, Kevin Linka, Ellen Kuhl, Tim Brepols
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 56 pages, 19 figures, 3 tables

Abstract: We present a methodology for designing a generalized dual potential, or pseudo potential, for inelastic Constitutive Artificial Neural Networks (iCANNs). This potential, expressed in terms of stress invariants, inherently satisfies thermodynamic consistency for large deformations. In comparison to our previous work, the new potential captures a broader spectrum of material behaviors, including pressure-sensitive inelasticity. To this end, we revisit the underlying thermodynamic framework of iCANNs for finite strain inelasticity and derive conditions for constructing a convex, zero-valued, and non-negative dual potential. To embed these principles in a neural network, we detail the architecture’s design, ensuring a priori compliance with thermodynamics. To evaluate the proposed architecture, we study its performance and limitations in discovering visco-elastic material behavior, though the method is not limited to visco-elasticity. In this context, we investigate different aspects of the strategy for discovering inelastic materials. Our results indicate that the novel architecture robustly discovers interpretable models and parameters, while autonomously revealing the degree of inelasticity. The iCANN framework, implemented in JAX, is publicly accessible at this https URL.

[AI-79] User Intent to Use DeepSeek for Healthcare Purposes and their Trust in the Large Language Model: Multinational Survey Study

Link: https://arxiv.org/abs/2502.17487
Authors: Avishek Choudhury, Yeganeh Shahsavar, Hamid Shamszare
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Abstract:Large language models (LLMs) increasingly serve as interactive healthcare resources, yet user acceptance remains underexplored. This study examines how ease of use, perceived usefulness, trust, and risk perception interact to shape intentions to adopt DeepSeek, an emerging LLM-based platform, for healthcare purposes. A cross-sectional survey of 556 participants from India, the United Kingdom, and the United States was conducted to measure perceptions and usage patterns. Structural equation modeling assessed both direct and indirect effects, including potential quadratic relationships. Results revealed that trust plays a pivotal mediating role: ease of use exerts a significant indirect effect on usage intentions through trust, while perceived usefulness contributes to both trust development and direct adoption. By contrast, risk perception negatively affects usage intent, emphasizing the importance of robust data governance and transparency. Notably, significant non-linear paths were observed for ease of use and risk, indicating threshold or plateau effects. The measurement model demonstrated strong reliability and validity, supported by high composite reliabilities, average variance extracted, and discriminant validity measures. These findings extend technology acceptance and health informatics research by illuminating the multifaceted nature of user adoption in sensitive domains. Stakeholders should invest in trust-building strategies, user-centric design, and risk mitigation measures to encourage sustained and safe uptake of LLMs in healthcare. Future work can employ longitudinal designs or examine culture-specific variables to further clarify how user perceptions evolve over time and across different regulatory environments. Such insights are critical for harnessing AI to enhance outcomes.

[AI-80] AI Agentic workflows and Enterprise APIs: Adapting API architectures for the age of AI agents

Link: https://arxiv.org/abs/2502.17443
Authors: Vaibhav Tupe, Shrinath Thube
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:The rapid advancement of Generative AI has catalyzed the emergence of autonomous AI agents, presenting unprecedented challenges for enterprise computing infrastructures. Current enterprise API architectures are predominantly designed for human-driven, predefined interaction patterns, rendering them ill-equipped to support intelligent agents’ dynamic, goal-oriented behaviors. This research systematically examines the architectural adaptations for enterprise APIs to support AI agentic workflows effectively. Through a comprehensive analysis of existing API design paradigms, agent interaction models, and emerging technological constraints, the paper develops a strategic framework for API transformation. The study employs a mixed-method approach, combining theoretical modeling, comparative analysis, and exploratory design principles to address critical challenges in standardization, performance, and intelligent interaction. The proposed research contributes a conceptual model for next-generation enterprise APIs that can seamlessly integrate with autonomous AI agent ecosystems, offering significant implications for future enterprise computing architectures.

[AI-81] Thinking Before Running! Efficient Code Generation with Thorough Exploration and Optimal Refinement

Link: https://arxiv.org/abs/2502.17442
Authors: Xiaoqing Zhang, Yuhan Liu, Flood Sung, Xiuying Chen, Rui Yan
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 14 pages, 10 figures

Abstract:Code generation is crucial in software engineering for automating the coding process efficiently. While test-time computation methods show promise, they suffer from high latency due to multiple computation rounds. To overcome this, we introduce ThinkCoder, a framework that combines thorough exploration with optimal refinement. The exploration phase diversifies the solution space by searching for potential solutions, followed by a refinement phase that enhances precision. This approach allows us to select the best solution through careful consideration before taking action, avoiding excessive trial and error. To further minimize test-time computation overhead, we introduce preference-driven optimization with Reinforced Self-Training (ReST), which uses exploration trajectories from ThinkCoder to guide LLM’s evolution. By learning preferences, this approach improves LLM’s exploration efficiency, reducing computational costs while maintaining accuracy. ThinkCoder boosts the performance of multiple base LLMs, excelling on benchmarks like HumanEval and MBPP. Compared to SOTA models, it improves Pass@1 by 1.5% over MapCoder with just 21.7% of the computation cost. Against AgentCoder, ThinkCoder achieves a 0.6% higher Pass@1 after 2 rounds, outperforming AgentCoder’s 5 rounds. Additionally, ReST with success trajectories enhances efficiency, allowing models like LLaMA2-7B to achieve competitive results using only 20% of the computational resources. These results highlight the framework’s effectiveness and scalability.
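The Pass@1 figures quoted above are instances of the pass@k metric commonly reported on HumanEval and MBPP. The abstract does not say how the authors computed it. The sketch below shows the widely used unbiased estimator introduced with HumanEval (Chen et al., 2021), where `n` samples are drawn per problem and `c` of them pass the tests. The function names and the sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k draw contains a passing one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical benchmark with three problems, 10 samples each.
score = mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1)
```

For k = 1 the estimator reduces to the fraction of passing samples per problem, averaged over problems.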

[AI-82] Large Language Models as Realistic Microservice Trace Generators

Link: https://arxiv.org/abs/2502.17439
Authors: Donghyun Kim, Sriram Ravula, Taemin Ha, Alexandros G. Dimakis, Daehyeok Kim, Aditya Akella
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Operating Systems (cs.OS)

Abstract:Computer system workload traces, which record hardware or software events during application execution, are essential for understanding the behavior of complex systems and managing their processing and memory resources. However, obtaining real-world traces can be challenging due to the significant collection overheads in performance and privacy concerns that arise in proprietary systems. As a result, synthetic trace generation is considered a promising alternative to using traces collected in real-world production deployments. This paper proposes to train a large language model (LLM) to generate synthetic workload traces, specifically microservice call graphs. To capture complex and arbitrary hierarchical structures and implicit constraints in such traces, we fine-tune LLMs to generate each layer recursively, making call graph generation a sequence of easier steps. To further enforce learning constraints in traces and generate uncommon situations, we apply additional instruction tuning steps to align our model with the desired trace features. Our evaluation results show that our model can generate diverse realistic traces under various conditions and outperform existing methods in accuracy and validity. We show that our synthetically generated traces can effectively substitute real-world data in optimizing or tuning systems management tasks. We also show that our model can be adapted to perform key downstream trace-related tasks, specifically, predicting key trace features and infilling missing data given partial traces. Codes are available in this https URL.

[AI-83] Teleology-Driven Affective Computing: A Causal Framework for Sustained Well-Being

Link: https://arxiv.org/abs/2502.17172
Authors: Bin Yin, Chong-Yi Liu, Liya Fu, Jinkun Zhang
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Neurons and Cognition (q-bio.NC)
Comments: 24 pages, 7 figures

Abstract:Affective computing has made significant strides in emotion recognition and generation, yet current approaches mainly focus on short-term pattern recognition and lack a comprehensive framework to guide affective agents toward long-term human well-being. To address this, we propose a teleology-driven affective computing framework that unifies major emotion theories (basic emotion, appraisal, and constructivist approaches) under the premise that affect is an adaptive, goal-directed process that facilitates survival and development. Our framework emphasizes aligning agent responses with both personal/individual and group/collective well-being over extended timescales. We advocate for creating a “dataverse” of personal affective events, capturing the interplay between beliefs, goals, actions, and outcomes through real-world experience sampling and immersive virtual reality. By leveraging causal modeling, this “dataverse” enables AI systems to infer individuals’ unique affective concerns and provide tailored interventions for sustained well-being. Additionally, we introduce a meta-reinforcement learning paradigm to train agents in simulated environments, allowing them to adapt to evolving affective concerns and balance hierarchical goals - from immediate emotional needs to long-term self-actualization. This framework shifts the focus from statistical correlations to causal reasoning, enhancing agents’ ability to predict and respond proactively to emotional challenges, and offers a foundation for developing personalized, ethically aligned affective systems that promote meaningful human-AI interactions and societal well-being.

[AI-84] FLARE: A Framework for Stellar Flare Forecasting using Stellar Physical Properties and Historical Records

Link: https://arxiv.org/abs/2502.18218
Authors: Bingke Zhu, Xiaoxiao Wang, Minghui Jia, Yihan Tao, Xiao Kong, Ali Luo, Yingying Chen, Ming Tang, Jinqiao Wang
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)

Abstract:Stellar flare events are critical observational samples for astronomical research; however, recorded flare events remain limited. Stellar flare forecasting can provide additional flare event samples to support research efforts. Despite this potential, no specialized models for stellar flare forecasting have been proposed to date. In this paper, we present extensive experimental evidence demonstrating that both stellar physical properties and historical flare records are valuable inputs for flare forecasting tasks. We then introduce FLARE (Forecasting Light-curve-based Astronomical Records via features Ensemble), the first-of-its-kind large model specifically designed for stellar flare forecasting. FLARE integrates stellar physical properties and historical flare records through a novel Soft Prompt Module and Residual Record Fusion Module. Our experiments on the publicly available Kepler light curve dataset demonstrate that FLARE achieves superior performance compared to other methods across all evaluation metrics. Finally, we validate the forecast capability of our model through a comprehensive case study.

[AI-85] Uncertainty Quantification for LLM-Based Survey Simulations

Link: https://arxiv.org/abs/2502.17773
Authors: Chengpiao Huang, Yuhang Wu, Kaizheng Wang
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 30 pages, 6 figures, 10 tables

Abstract:We investigate the reliable use of simulated survey responses from large language models (LLMs) through the lens of uncertainty quantification. Our approach converts synthetic data into confidence sets for population parameters of human responses, addressing the distribution shift between the simulated and real populations. A key innovation lies in determining the optimal number of simulated responses: too many produce overly narrow confidence sets with poor coverage, while too few yield excessively loose estimates. To resolve this, our method adaptively selects the simulation sample size, ensuring valid average-case coverage guarantees. It is broadly applicable to any LLM, irrespective of its fidelity, and any procedure for constructing confidence sets. Additionally, the selected sample size quantifies the degree of misalignment between the LLM and the target human population. We illustrate our method on real datasets and LLMs.

[AI-86] Solving the Traveling Salesman Problem via Different Quantum Computing Architectures

Link: https://arxiv.org/abs/2502.17725
Authors: Venkat Padmasola, Zhaotong Li, Rupak Chatterjee, Wesley Dyk
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph)
Comments: 13 pages, 21 figures, 32 citations

Abstract:We study the application of emerging photonic and quantum computing architectures to solving the Traveling Salesman Problem (TSP), a well-known NP-hard optimization problem. We investigate several approaches: Simulated Annealing (SA), Quadratic Unconstrained Binary Optimization (QUBO-Ising) methods implemented on quantum annealers and Optical Coherent Ising Machines, as well as the Quantum Approximate Optimization Algorithm (QAOA) and the Quantum Phase Estimation (QPE) algorithm on gate-based quantum computers. QAOA and QPE were tested on the IBM Quantum platform. The QUBO-Ising method was explored using the D-Wave quantum annealer, which operates on superconducting Josephson junctions, and the QCI Dirac machine, a nonlinear optoelectronic Ising machine. Gate-based quantum computers demonstrated accurate results for small TSP instances in simulation. However, real quantum devices are hindered by noise and limited scalability. Circuit complexity grows with problem size, restricting performance to TSP instances with a maximum of 6 nodes. In contrast, Ising-based architectures show improved scalability for larger problem sizes. SQUID-based Ising machines can handle TSP instances with up to 12 nodes, while nonlinear optoelectronic Ising machines extend this capability to 18 nodes. Nevertheless, the solutions tend to be suboptimal due to hardware limitations and challenges in achieving ground state convergence as the problem size increases. Despite these limitations, Ising machines demonstrate significant time advantages over classical methods, making them a promising candidate for solving larger-scale TSPs efficiently. 
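To make the QUBO-Ising approach concrete, here is a minimal sketch of the textbook one-hot QUBO encoding of the TSP, in which a binary variable `x[i][t]` marks city `i` as visited at step `t`. The abstract does not specify which encoding the authors submitted to the annealers, so this formulation, the penalty weight, and the 3-city distance matrix are assumptions for illustration. Brute-force enumeration stands in for the annealing hardware.

```python
from itertools import product


def tsp_qubo_energy(x, dist, penalty):
    """Energy of a one-hot TSP encoding; x[i][t] == 1 means city i at step t."""
    n = len(dist)
    # Constraint: each city appears exactly once across all steps.
    city_term = sum((1 - sum(x[i][t] for t in range(n))) ** 2 for i in range(n))
    # Constraint: each step holds exactly one city.
    step_term = sum((1 - sum(x[i][t] for i in range(n))) ** 2 for t in range(n))
    # Objective: distance between cities at consecutive steps (cyclic tour).
    length = sum(
        dist[i][j] * x[i][t] * x[j][(t + 1) % n]
        for t in range(n)
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return penalty * (city_term + step_term) + length


def brute_force_minimum(dist, penalty):
    """Enumerate all 2**(n*n) assignments; feasible only for tiny n."""
    n = len(dist)
    best_energy, best_x = None, None
    for bits in product((0, 1), repeat=n * n):
        x = [list(bits[i * n:(i + 1) * n]) for i in range(n)]
        energy = tsp_qubo_energy(x, dist, penalty)
        if best_energy is None or energy < best_energy:
            best_energy, best_x = energy, x
    return best_energy, best_x


# Hypothetical symmetric distances between three cities.
dist = [[0, 2, 9], [2, 0, 6], [9, 6, 0]]
best_energy, best_x = brute_force_minimum(dist, penalty=20)
```

The encoding needs n² binary variables for n cities, so the search space grows as 2^(n²). This gives some intuition for why the reported hardware limits fall in the range of 6 to 18 nodes. The penalty must exceed the largest distance so that breaking a constraint never pays off.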

[AI-87] Effective Field Neural Network

Link: https://arxiv.org/abs/2502.17665
Authors: Xi Liu, Yujun Zhao, Chun Yu Wan, Yang Zhang, Junwei Liu
Categories: Computational Physics (physics.comp-ph); Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)

Abstract: In recent years, with the rapid development of machine learning, physicists have been exploring its new applications in solving or alleviating the curse of dimensionality in many-body problems. In order to accurately reflect the underlying physics of the problem, domain knowledge must be encoded into the machine learning algorithms. In this work, inspired by field theory, we propose a new set of machine learning models called effective field neural networks (EFNNs) that can automatically and efficiently capture important many-body interactions through multiple self-refining processes. Taking the classical 3-spin infinite-range model and the quantum double exchange model as case studies, we explicitly demonstrate that EFNNs significantly outperform fully-connected deep neural networks (DNNs) and the effective model. Furthermore, with the help of convolution operations, the EFNNs learned in a small system can be seamlessly used in a larger system without additional training and the relative errors even decrease, which further demonstrates the efficacy of EFNNs in representing core physical behaviors.

[AI-88] StatLLM: A Dataset for Evaluating the Performance of Large Language Models in Statistical Analysis

Link: https://arxiv.org/abs/2502.17657
Authors: Xinyi Song, Lina Lee, Kexin Xie, Xueying Liu, Xinwei Deng, Yili Hong
Categories: Applications (stat.AP); Artificial Intelligence (cs.AI)
Comments: 25 pages, 7 figures

Abstract:The coding capabilities of large language models (LLMs) have opened up new opportunities for automatic statistical analysis in machine learning and data science. However, before their widespread adoption, it is crucial to assess the accuracy of code generated by LLMs. A major challenge in this evaluation lies in the absence of a benchmark dataset for statistical code (e.g., SAS and R). To fill in this gap, this paper introduces StatLLM, an open-source dataset for evaluating the performance of LLMs in statistical analysis. The StatLLM dataset comprises three key components: statistical analysis tasks, LLM-generated SAS code, and human evaluation scores. The first component includes statistical analysis tasks spanning a variety of analyses and datasets, providing problem descriptions, dataset details, and human-verified SAS code. The second component features SAS code generated by ChatGPT 3.5, ChatGPT 4.0, and Llama 3.1 for those tasks. The third component contains evaluation scores from human experts in assessing the correctness, effectiveness, readability, executability, and output accuracy of the LLM-generated code. We also illustrate the unique potential of the established benchmark dataset for (1) evaluating and enhancing natural language processing metrics, (2) assessing and improving LLM performance in statistical coding, and (3) developing and testing of next-generation statistical software - advancements that are crucial for data science and machine learning research.

[AI-89] Theory-guided Pseudo-spectral Full Waveform Inversion via Deep Neural Networks

Link: https://arxiv.org/abs/2502.17624
Authors: Christopher Zerafa, Pauline Galea, Cristiana Sebu
Categories: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)
Comments: 26 pages, 23 figures, article paper

Abstract: Full-Waveform Inversion (FWI) seeks to achieve a high-resolution model of the subsurface through the application of multi-variate optimization to the seismic inverse problem. Although now a mature technology, FWI has limitations related to the choice of the appropriate solver for the forward problem in challenging environments requiring complex assumptions, and the very wide angle and multi-azimuth data necessary for full reconstruction are often not available. Deep Learning techniques have emerged as excellent optimization frameworks. Data-driven methods do not impose a wave propagation model and are not exposed to modelling errors. In contrast, deterministic models are governed by the laws of physics. Seismic FWI has recently started to be investigated as a Deep Learning framework. The focus has been on the time domain, while the pseudo-spectral domain has not yet been explored. However, classical FWI experienced major breakthroughs when pseudo-spectral approaches were employed. This work addresses the gap that exists in incorporating the pseudo-spectral approach within Deep Learning. This has been done by re-formulating the pseudo-spectral FWI problem as a Deep Learning algorithm for a theory-driven pseudo-spectral approach. A novel Recurrent Neural Network (RNN) framework is proposed. This is qualitatively assessed on synthetic data, applied to a two-dimensional Marmousi dataset and evaluated against deterministic and time-based approaches. Pseudo-spectral theory-guided FWI using RNN was shown to be more accurate than classical FWI, with only 0.05 error tolerance and 1.45% relative percentage error. It also provides more stable convergence, identifies faults better and has more low-frequency content than classical FWI. Moreover, RNN was better suited than classical FWI to edge detection in the shallow and deep sections due to cleaner receiver residuals.

[AI-90] Synergizing Deep Learning and Full-Waveform Inversion: Bridging Data-Driven and Theory-Guided Approaches for Enhanced Seismic Imaging

Link: https://arxiv.org/abs/2502.17585
Authors: Christopher Zerafa, Pauline Galea, Cristiana Sebu
Categories: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 20 pages, 14 images, literature review

Abstract:This review explores the integration of deep learning (DL) with full-waveform inversion (FWI) for enhanced seismic imaging and subsurface characterization. It covers FWI and DL fundamentals, geophysical applications (velocity estimation, deconvolution, tomography), and challenges (model complexity, data quality). The review also outlines future research directions, including hybrid, generative, and physics-informed models for improved accuracy, efficiency, and reliability in subsurface property estimation. The synergy between DL and FWI has the potential to transform geophysics, providing new insights into Earth’s subsurface.

[AI-91] Multimodal Bearing Fault Classification Under Variable Conditions: A 1D CNN with Transfer Learning

Link: https://arxiv.org/abs/2502.17524
Authors: Tasfiq E. Alam, Md Manjurul Ahsan, Shivakumar Raman
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Bearings play an integral role in ensuring the reliability and efficiency of rotating machinery - reducing friction and handling critical loads. Bearing failures that constitute up to 90% of mechanical faults highlight the imperative need for reliable condition monitoring and fault detection. This study proposes a multimodal bearing fault classification approach that relies on vibration and motor phase current signals within a one-dimensional convolutional neural network (1D CNN) framework. The method fuses features from multiple signals to enhance the accuracy of fault detection. Under the baseline condition (1,500 rpm, 0.7 Nm load torque, and 1,000 N radial force), the model reaches an accuracy of 96% with addition of L2 regularization. This represents a notable improvement of 2% compared to the non-regularized model. In addition, the model demonstrates robust performance across three distinct operating conditions by employing transfer learning (TL) strategies. Among the tested TL variants, the approach that preserves parameters up to the first max-pool layer and then adjusts subsequent layers achieves the highest performance. While this approach attains excellent accuracy across varied conditions, it requires more computational time due to its greater number of trainable parameters. To address resource constraints, less computationally intensive models offer feasible trade-offs, albeit at a slight accuracy cost. Overall, this multimodal 1D CNN framework with late fusion and TL strategies lays a foundation for more accurate, adaptable, and efficient bearing fault classification in industrial environments with variable operating conditions.

[AI-92] Attention-based UAV Trajectory Optimization for Wireless Power Transfer-assisted IoT Systems

Link: https://arxiv.org/abs/2502.17517
Authors: Li Dong, Feibo Jiang, Yubo Peng
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Abstract:Unmanned Aerial Vehicles (UAVs) in Wireless Power Transfer (WPT)-assisted Internet of Things (IoT) systems face the following challenges: limited resources and suboptimal trajectory planning. Reinforcement learning-based trajectory planning schemes face issues of low search efficiency and learning instability when optimizing large-scale systems. To address these issues, we present an Attention-based UAV Trajectory Optimization (AUTO) framework based on the graph transformer, which consists of an Attention Trajectory Optimization Model (ATOM) and a Trajectory lEarNing Method based on Actor-critic (TENMA). In ATOM, a graph encoder is used to calculate the self-attention characteristics of all IoTDs, and a trajectory decoder is developed to optimize the number and trajectories of UAVs. TENMA then trains the ATOM using an improved Actor-Critic method, in which the real reward of the system is applied as the baseline to reduce variances in the critic network. This method is suitable for high-quality and large-scale multi-UAV trajectory planning. Finally, we develop numerous experiments, including a hardware experiment in the field case, to verify the feasibility and efficiency of the AUTO framework.

[AI-93] Inverse Surrogate Model of a Soft X-Ray Spectrometer using Domain Adaptation

Link: https://arxiv.org/abs/2502.17505
Authors: Enrico Ahlers, Peter Feuer-Forson, Gregor Hartmann, Rolf Mitzner, Peter Baumgärtel, Jens Viefhaus
Categories: Instrumentation and Detectors (physics.ins-det); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
Comments:

Abstract:In this study, we present a method to create a robust inverse surrogate model for a soft X-ray spectrometer. During a beamtime at an electron storage ring, such as BESSY II, instrumentation and beamlines are required to be correctly aligned and calibrated for optimal experimental conditions. In order to automate these processes, machine learning methods can be developed and implemented, but in many cases these methods require the use of an inverse model which maps the output of the experiment, such as a detector image, to the parameters of the device. Due to limited experimental data, such models are often trained with simulated data, which creates the challenge of compensating for the inherent differences between simulation and experiment. In order to close this gap, we demonstrate the application of data augmentation and adversarial domain adaptation techniques, with which we can predict absolute coordinates for the automated alignment of our spectrometer. Bridging the simulation-experiment gap with minimal real-world data opens new avenues for automated experimentation using machine learning in scientific instrumentation.

[AI-94] Accuracy of Wearable ECG Parameter Calculation Method for Long QT and First-Degree A-V Block Detection: A Multi-Center Real-World Study with External Validations Compared to Standard ECG Machines and Cardiologist Assessments

Link: https://arxiv.org/abs/2502.17499
Authors: Sumei Fan, Deyun Zhang, Yue Wang, Shijia Geng, Kun Lu, Meng Sang, Weilun Xu, Haixue Wang, Qinghao Zhao, Chuandong Cheng, Peng Wang, Shenda Hong
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 37 pages, 8 figures, 6 tables

Abstract:In recent years, wearable devices have revolutionized cardiac monitoring by enabling continuous, non-invasive ECG recording in real-world settings. Despite these advances, the accuracy of ECG parameter calculations (PR interval, QRS interval, QT interval, etc.) from wearables remains to be rigorously validated against conventional ECG machines and expert clinician assessments. In this large-scale, multicenter study, we evaluated FeatureDB, a novel algorithm for automated computation of ECG parameters from wearable single-lead signals. Three diverse datasets were employed: the AHMU-FH dataset (n=88,874), the CSE dataset (n=106), and the HeartVoice-ECG-lite dataset (n=369) with annotations provided by two experienced cardiologists. FeatureDB demonstrates a statistically significant correlation with key parameters (PR interval, QRS duration, QT interval, and QTc) calculated by standard ECG machines and annotated by clinical doctors. Bland-Altman analysis confirms a high level of this http URL, FeatureDB exhibited robust diagnostic performance in detecting Long QT syndrome (LQT) and atrioventricular block interval abnormalities (AVBI), with excellent area under the ROC curve (LQT: 0.836, AVBI: 0.861), accuracy (LQT: 0.856, AVBI: 0.845), sensitivity (LQT: 0.815, AVBI: 0.877), and specificity (LQT: 0.856, AVBI: 0.845). This further validates its clinical reliability. These results validate the clinical applicability of FeatureDB for wearable ECG analysis and highlight its potential to bridge the gap between traditional diagnostic methods and emerging wearable this http URL, this study supports integrating wearable ECG devices into large-scale cardiovascular disease management and early intervention strategies, and it highlights the potential of wearable ECG technologies to deliver accurate, clinically relevant cardiac monitoring while advancing broader applications in cardiovascular care.
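
The abstract reports QTc values together with sensitivity and specificity for detecting abnormal intervals. As a rough illustration of those quantities, the sketch below computes a heart-rate-corrected QT with Bazett's formula and derives sensitivity and specificity from binary labels. The abstract does not say which correction formula FeatureDB uses, so Bazett's formula is an assumption here, and the function names are hypothetical.

```python
import math


def qtc_bazett(qt_ms: float, rr_s: float) -> float:
    """Corrected QT in ms via Bazett's formula: QTc = QT / sqrt(RR).

    Assumption: Bazett's correction is used only as a familiar example;
    the paper may rely on a different formula.
    """
    if rr_s <= 0:
        raise ValueError("RR interval must be positive (seconds)")
    return qt_ms / math.sqrt(rr_s)


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = abnormal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)
```

With an RR interval of exactly one second (60 bpm) the correction leaves QT unchanged, which is a quick sanity check on the units.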

[AI-95] Toward Foundational Model for Sleep Analysis Using a Multimodal Hybrid Self-Supervised Learning Framework

Link: https://arxiv.org/abs/2502.17481
Authors: Cheol-Hui Lee, Hakseung Kim, Byung C. Yoon, Dong-Joo Kim
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 5 figures

Abstract:Sleep is essential for maintaining human health and quality of life. Analyzing physiological signals during sleep is critical in assessing sleep quality and diagnosing sleep disorders. However, manual diagnoses by clinicians are time-intensive and subjective. Despite advances in deep learning that have enhanced automation, these approaches remain heavily dependent on large-scale labeled datasets. This study introduces SynthSleepNet, a multimodal hybrid self-supervised learning framework designed for analyzing polysomnography (PSG) data. SynthSleepNet effectively integrates masked prediction and contrastive learning to leverage complementary features across multiple modalities, including electroencephalogram (EEG), electrooculography (EOG), electromyography (EMG), and electrocardiogram (ECG). This approach enables the model to learn highly expressive representations of PSG data. Furthermore, a temporal context module based on Mamba was developed to efficiently capture contextual information across signals. SynthSleepNet achieved superior performance compared to state-of-the-art methods across three downstream tasks: sleep-stage classification, apnea detection, and hypopnea detection, with accuracies of 89.89%, 99.75%, and 89.60%, respectively. The model demonstrated robust performance in a semi-supervised learning environment with limited labels, achieving accuracies of 87.98%, 99.37%, and 77.52% in the same tasks. These results underscore the potential of the model as a foundational tool for the comprehensive analysis of PSG data. SynthSleepNet demonstrates comprehensively superior performance across multiple downstream tasks compared to other methodologies, making it expected to set a new standard for sleep disorder monitoring and diagnostic systems.

[AI-96] MC2SleepNet: Multi-modal Cross-masking with Contrastive Learning for Sleep Stage Classification

Link: https://arxiv.org/abs/2502.17470
Authors: Younghoon Na
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sleep profoundly affects our health, and sleep deficiency or disorders can cause physical and mental problems. Despite significant findings from previous studies, challenges persist in optimizing deep learning models, especially in multi-modal learning for high-accuracy sleep stage classification. Our research introduces MC2SleepNet (Multi-modal Cross-masking with Contrastive learning for Sleep stage classification Network). It aims to facilitate effective collaboration between Convolutional Neural Networks (CNNs) and Transformer architectures for multi-modal training with the help of contrastive learning and cross-masking. Raw single-channel EEG signals and the corresponding spectrogram data provide differently characterized modalities for multi-modal learning. MC2SleepNet achieves state-of-the-art performance, with accuracies of 84.6% on SleepEDF-78 and 88.6% on the Sleep Heart Health Study (SHHS). These results demonstrate the effective generalization of our proposed network across both small and large datasets.

[AI-97] PixleepFlow: A Pixel-Based Lifelog Framework for Predicting Sleep Quality and Stress Level

Link: https://arxiv.org/abs/2502.17469
Authors: Younghoon Na
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The analysis of lifelogs can yield valuable insights into an individual's daily life, particularly with regard to their health and well-being. The accurate assessment of quality of life is necessitated by the use of diverse sensors and precise synchronization. To rectify this issue, this study proposes the image-based sleep quality and stress level estimation flow (PixleepFlow). PixleepFlow employs a conversion methodology into composite image data to examine sleep patterns and their impact on overall health. Experiments were conducted using lifelog datasets to ascertain the optimal combination of data formats. In addition, we identified which sensor information has the greatest influence on the quality of life through Explainable Artificial Intelligence (XAI). As a result, PixleepFlow produced more significant results than various data formats. This study was part of a written-based competition, and the additional findings from the lifelog dataset are detailed in Section IV. More information about PixleepFlow can be found at this https URL.

[AI-98] The Case for Cleaner Biosignals: High-fidelity Neural Compressor Enables Transfer from Cleaner iEEG to Noisier EEG (ICLR 2025)

Link: https://arxiv.org/abs/2502.17462
Authors: Francesco Stefano Carzaniga, Gary Tom Hoppeler, Michael Hersche, Kaspar Anton Schindler, Abbas Rahimi
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at ICLR 2025, see this https URL. Code is available at this https URL

Abstract:All data modalities are not created equal, even when the signal they measure comes from the same source. In the case of the brain, two of the most important data modalities are the scalp electroencephalogram (EEG), and the intracranial electroencephalogram (iEEG). They are used by human experts, supported by deep learning (DL) models, to accomplish a variety of tasks, such as seizure detection and motor imagery classification. Although the differences between EEG and iEEG are well understood by human experts, the performance of DL models across these two modalities remains under-explored. To help characterize the importance of clean data on the performance of DL models, we propose BrainCodec, a high-fidelity EEG and iEEG neural compressor. We find that training BrainCodec on iEEG and then transferring to EEG yields higher reconstruction quality than training on EEG directly. In addition, we also find that training BrainCodec on both EEG and iEEG improves fidelity when reconstructing EEG. Our work indicates that data sources with higher SNR, such as iEEG, provide better performance across the board also in the medical time-series domain. BrainCodec also achieves up to a 64x compression on iEEG and EEG without a notable decrease in quality. BrainCodec markedly surpasses current state-of-the-art compression models both in final compression ratio and in reconstruction fidelity. We also evaluate the fidelity of the compressed signals objectively on a seizure detection and a motor imagery task performed by standard DL models. Here, we find that BrainCodec achieves a reconstruction fidelity high enough to ensure no performance degradation on the downstream tasks. Finally, we collect the subjective assessment of an expert neurologist, that confirms the high reconstruction quality of BrainCodec in a realistic scenario. The code is available at this https URL.

[AI-99] Finetuning and Quantization of EEG-Based Foundational BioSignal Models on ECG and PPG Data for Blood Pressure Estimation

Link: https://arxiv.org/abs/2502.17460
Authors: Bálint Tóth, Dominik Senti, Thorir Mar Ingolfsson, Jeffrey Zweidler, Alexandre Elsig, Luca Benini, Yawei Li
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 1 figure, 5 tables, preprint

Abstract:Blood pressure (BP) is a key indicator of cardiovascular health. As hypertension remains a global cause of morbidity and mortality, accurate, continuous, and non-invasive BP monitoring is therefore of paramount importance. Photoplethysmography (PPG) and electrocardiography (ECG) can potentially enable continuous BP monitoring, yet training accurate and robust machine learning (ML) models remains challenging due to variability in data quality and patient-specific factors. Recently, multiple research groups explored Electroencephalographic (EEG)–based foundation models and demonstrated their exceptional ability to learn rich temporal resolution. Considering the morphological similarities between different biosignals, the question arises of whether a model pre-trained on one modality can effectively be exploited to improve the accuracy of a different signal type. In this work, we take an initial step towards generalized biosignal foundation models by investigating whether model representations learned from abundant EEG data can effectively be transferred to ECG/PPG data solely with fine-tuning, without the need for large-scale additional pre-training, for the BP estimation task. Evaluations on the MIMIC-III and VitalDB datasets demonstrate that our approach achieves near state-of-the-art accuracy for diastolic BP (mean absolute error of 1.57 mmHg) and surpasses by 1.5x the accuracy of prior works for systolic BP (mean absolute error 2.72 mmHg). Additionally, we perform dynamic INT8 quantization, reducing the smallest model size by over 3.5x (from 13.73 MB down to 3.83 MB) while preserving performance, thereby enabling unobtrusive, real-time BP monitoring on resource-constrained wearable devices.
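
The abstract mentions INT8 quantization shrinking the smallest model from 13.73 MB to 3.83 MB (13.73 / 3.83 is roughly 3.58x). The sketch below illustrates the general idea of mapping floating-point weights to 8-bit integers with a single scale factor. It is a simplified symmetric per-tensor scheme written in plain Python, not the dynamic quantization pipeline the authors used, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127].

    Assumption: one shared scale per tensor; real toolchains often use
    per-channel scales and quantize activations at run time as well.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate floating-point weights."""
    return [q * scale for q in quantized]
```

Each weight is stored in one byte instead of four, and the reconstruction error per weight is bounded by about half the scale, which is why accuracy can often be preserved.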

[AI-100] MoEMba: A Mamba-based Mixture of Experts for High-Density EMG-based Hand Gesture Recognition

Link: https://arxiv.org/abs/2502.17457
Authors: Mehran Shabanpour, Kasra Rad, Sadaf Khademi, Arash Mohammadi
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:High-Density surface Electromyography (HD-sEMG) has emerged as a pivotal resource for Human-Computer Interaction (HCI), offering direct insights into muscle activities and motion intentions. However, a significant challenge in practical implementations of HD-sEMG-based models is the low accuracy of inter-session and inter-subject classification. Variability between sessions can reach up to 40% due to the inherent temporal variability of HD-sEMG signals. Targeting this challenge, the paper introduces the MoEMba framework, a novel approach leveraging Selective State Space Models (SSMs) to enhance HD-sEMG-based gesture recognition. The MoEMba framework captures temporal dependencies and cross-channel interactions through channel attention techniques. Furthermore, wavelet feature modulation is integrated to capture multi-scale temporal and spatial relations, improving signal representation. Experimental results on the CapgMyo HD-sEMG dataset demonstrate that MoEMba achieves a balanced accuracy of 56.9%, outperforming its state-of-the-art counterparts. The proposed framework's robustness to session-to-session variability and its efficient handling of high-dimensional multivariate time series data highlight its potential for advancing HD-sEMG-powered HCI systems.

[AI-101] Survey on Recent Progress of AI for Chemistry: Methods, Applications, and Opportunities

Link: https://arxiv.org/abs/2502.17456
Authors: Ding Hu, Pengxiang Hua, Zhen Huang
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages, 8 figures, 4 tables

Abstract:The development of artificial intelligence (AI) techniques has brought revolutionary changes across various realms. In particular, the use of AI-assisted methods to accelerate chemical research has become a popular and rapidly growing trend, leading to numerous groundbreaking works. In this paper, we provide a comprehensive review of current AI techniques in chemistry from a computational perspective, considering various aspects in the design of methods. We begin by discussing the characteristics of data from diverse sources, followed by an overview of various representation methods. Next, we review existing models for several topical tasks in the field, and conclude by highlighting some key challenges that warrant further attention.

[AI-102] Smart Sampling Strategies for Wireless Industrial Data Acquisition

Link: https://arxiv.org/abs/2502.17454
Authors: Marcos Soto (Universidad Loyola Andalucía)
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 17 pages, 11 figures

Abstract:In industrial environments, data acquisition accuracy is crucial for process control and optimization. Wireless telemetry has proven to be a valuable tool for improving efficiency in well-testing operations, enabling bidirectional communication and real-time control of downhole tools. However, high sampling frequencies present challenges in telemetry, including data storage, transmission, computational resource consumption, and battery life of wireless devices. This study explores how optimizing data acquisition strategies can reduce aliasing effects and systematic errors while improving sampling rates without compromising measurement accuracy. A reduction of 80% in sampling frequency was achieved without degrading measurement quality, demonstrating the potential for resource optimization in industrial environments.
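
The abstract ties sampling-rate reduction to aliasing. The sketch below shows the standard Nyquist reasoning behind that trade-off: a tone above half the sampling rate folds back to a lower apparent frequency. The numbers in the usage note (a 1,000 Hz starting rate and the chosen signal frequencies) are hypothetical and do not come from the paper, and the helper names are made up for illustration.

```python
def reduced_rate(f_sample_hz: float, reduction_pct: float) -> float:
    """Sampling rate after cutting it by the given percentage."""
    return f_sample_hz * (100.0 - reduction_pct) / 100.0


def is_alias_free(f_max_hz: float, f_sample_hz: float) -> bool:
    """Nyquist criterion: the sampling rate must exceed twice the highest frequency."""
    return f_sample_hz > 2.0 * f_max_hz


def aliased_frequency(f_signal_hz: float, f_sample_hz: float) -> float:
    """Apparent frequency of a pure tone after uniform sampling."""
    if f_sample_hz <= 0:
        raise ValueError("sampling rate must be positive")
    nearest_multiple = f_sample_hz * round(f_signal_hz / f_sample_hz)
    return abs(f_signal_hz - nearest_multiple)
```

For instance, cutting a hypothetical 1,000 Hz rate by 80% leaves 200 Hz, which is safe for content below 100 Hz, while a 130 Hz component would show up at 70 Hz.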

[AI-103] AirTag, You're It: Reverse Logistics and Last Mile Dynamics

Link: https://arxiv.org/abs/2502.17447
Authors: David Noever, Forrest McKee
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study addresses challenges in reverse logistics, a frequently overlooked but essential component of last-mile delivery, particularly in disaster relief scenarios where infrastructure disruptions demand adaptive solutions. While hub-and-spoke logistics networks excel at long-distance scalability, they often fail to optimize closely spaced spokes reliant on distant hubs, introducing inefficiencies in transit times and resource allocation. Using 20 Apple AirTags embedded in packages, this research provides empirical insights into logistical flows, capturing granular spatial and temporal data through Bluetooth LE (BLE) 5 trackers integrated with the Apple Find My network. These trackers demonstrated their value in monitoring dynamic cargo movements, enabling real-time adjustments in mobile hub placement and route optimization, particularly in disaster relief contexts like Hurricane Helene. A novel application of discrete event simulation (DES) further explored the saddle point in hub-spoke configurations, where excessive hub reliance clashes with diminishing spoke interaction demand. By coupling simulation results with empirical AirTag tracking, the study highlights the potential of BLE technology to refine reverse logistics, reduce delays, and improve operational flexibility in both routine and crisis-driven delivery networks.

[AI-104] DCentNet: Decentralized Multistage Biomedical Signal Classification using Early Exits

Link: https://arxiv.org/abs/2502.17446
Authors: Xiaolin Li, Binhua Huang, Barry Cardiff, Deepu John
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:DCentNet is a novel decentralized multistage signal classification approach designed for biomedical data from IoT wearable sensors, integrating early exit points (EEP) to enhance energy efficiency and processing speed. Unlike traditional centralized processing methods, which result in high energy consumption and latency, DCentNet partitions a single CNN model into multiple sub-networks using EEPs. By introducing encoder-decoder pairs at EEPs, the system compresses large feature maps before transmission, significantly reducing wireless data transfer and power usage. If an input is confidently classified at an EEP, processing stops early, optimizing efficiency. Initial sub-networks can be deployed on fog or edge devices to further minimize energy consumption. A genetic algorithm is used to optimize EEP placement, balancing performance and complexity. Experimental results on ECG classification show that with one EEP, DCentNet reduces wireless data transmission by 94.54% and complexity by 21%, while maintaining original accuracy and sensitivity. With two EEPs, sensitivity reaches 98.36%, accuracy 97.74%, wireless data transmission decreases by 91.86%, and complexity is reduced by 22%. Implemented on an ARM Cortex-M4 MCU, DCentNet achieves an average power saving of 73.6% compared to continuous wireless ECG transmission.
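
The early-exit mechanism described in the abstract can be sketched as a loop over sub-networks that stops as soon as one of them is confident enough. The code below is a minimal illustration under stated assumptions: each stage is a plain function returning features and class probabilities, the confidence test is a fixed threshold on the largest probability, and the toy stages and the 0.9 threshold are hypothetical. The encoder-decoder compression and genetic-algorithm placement from the paper are not modeled.

```python
def classify_with_early_exit(x, stages, threshold=0.9):
    """Run stages in order and stop at the first confident exit.

    Each stage maps features to (next_features, class_probabilities).
    Returns (predicted_label, index_of_stage_that_answered).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    features = x
    for idx, stage in enumerate(stages):
        features, probs = stage(features)
        is_last = idx == len(stages) - 1
        if is_last or max(probs) >= threshold:
            label = max(range(len(probs)), key=lambda i: probs[i])
            return label, idx


def toy_stage_a(features):
    # Hypothetical first sub-network: confident only for positive inputs.
    probs = [0.95, 0.05] if sum(features) > 0 else [0.5, 0.5]
    return features, probs


def toy_stage_b(features):
    # Hypothetical final sub-network: always produces an answer.
    return features, [0.2, 0.8]
```

When the first stage answers, nothing has to be transmitted to the later stages, which is where the savings in wireless traffic would come from in a split deployment.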

[AI-105] Interpretable Dual-Filter Fuzzy Neural Networks for Affective Brain-Computer Interfaces

Link: https://arxiv.org/abs/2502.17445
Authors: Xiaowei Jiang, Yanan Chen, Nikhil Ranjan Pal, Yu-Cheng Chang, Yunkai Yang, Thomas Do, Chin-Teng Lin
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Fuzzy logic provides a robust framework for enhancing explainability, particularly in domains requiring the interpretation of complex and ambiguous signals, such as brain-computer interface (BCI) systems. Despite significant advances in deep learning, interpreting human emotions remains a formidable challenge. In this work, we present iFuzzyAffectDuo, a novel computational model that integrates a dual-filter fuzzy neural network architecture for improved detection and interpretation of emotional states from neuroimaging data. The model introduces a new membership function (MF) based on the Laplace distribution, achieving superior accuracy and interpretability compared to traditional approaches. By refining the extraction of neural signals associated with specific emotions, iFuzzyAffectDuo offers a human-understandable framework that unravels the underlying decision-making processes. We validate our approach across three neuroimaging datasets using functional Near-Infrared Spectroscopy (fNIRS) and Electroencephalography (EEG), demonstrating its potential to advance affective computing. These findings open new pathways for understanding the neural basis of emotions and their application in enhancing human-computer interaction.
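
The abstract mentions a membership function based on the Laplace distribution without giving its form. One plausible shape, assumed here purely for illustration, is the unnormalized Laplace kernel exp(-|x - c| / b), shown next to the common Gaussian membership function for comparison. The parameter names `center`, `scale`, and `sigma` are hypothetical.

```python
import math


def laplace_mf(x: float, center: float, scale: float) -> float:
    """Membership degree in (0, 1] with a sharp peak and heavy tails.

    Assumption: this is an illustrative form, not necessarily the
    membership function defined in the paper.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return math.exp(-abs(x - center) / scale)


def gaussian_mf(x: float, center: float, sigma: float) -> float:
    """Standard Gaussian membership function, for comparison."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))
```

Both functions equal 1 at the center; far from it, the Laplace form decays more slowly, so distant inputs keep a small but non-negligible membership degree.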

Machine Learning

[LG-0] Allocating Variance to Maximize Expectation

Link: https://arxiv.org/abs/2502.18463
Authors: Renato Purita Paes Leme, Cliff Stein, Yifeng Teng, Pratik Worah
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:We design efficient approximation algorithms for maximizing the expectation of the supremum of families of Gaussian random variables. In particular, let $\mathrm{OPT} := \max_{\sigma_1,\cdots,\sigma_n} \mathbb{E}\left[\sum_{j=1}^{m} \max_{i\in S_j} X_i\right]$, where the $X_i$ are Gaussian, $S_j \subset [n]$, and $\sum_i \sigma_i^2 = 1$. Our theoretical results include: (1) a characterization of the optimal variance allocation, which concentrates on a small subset of variables as $|S_j|$ increases; (2) a polynomial time approximation scheme (PTAS) for computing $\mathrm{OPT}$ when $m=1$; and (3) an $O(\log n)$ approximation algorithm for computing $\mathrm{OPT}$ for general $m>1$. Such expectation maximization problems occur in diverse applications, ranging from utility maximization in auction markets to learning mixture models in quantitative genetics.
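
To make the objective in this abstract concrete, the sketch below estimates E[max_i X_i] by Monte Carlo for a given allocation of standard deviations with unit total variance. It covers only the single-set case (m = 1 with S_1 containing all variables) and assumes independent, zero-mean Gaussians, which the abstract does not state explicitly. It is not the authors' algorithm, and the function name and sample counts are hypothetical.

```python
import random


def expected_max(sigmas, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[max_i X_i] with X_i ~ N(0, sigma_i^2).

    Assumptions: independent, zero-mean variables and a variance budget
    sum(sigma_i^2) == 1, matching the constraint in the abstract.
    """
    if abs(sum(s * s for s in sigmas) - 1.0) > 1e-9:
        raise ValueError("variances must sum to 1")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += max(rng.gauss(0.0, s) for s in sigmas)
    return total / n_samples
```

A known closed form gives a check on the estimator: for two independent zero-mean Gaussians with equal standard deviation s, the expected maximum is s divided by the square root of pi. Comparing a few allocations, such as the budget split evenly over two versus eight variables, shows how the value depends on where the variance is placed.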

[LG-1] LLM-Based Design Pattern Detection

Link: https://arxiv.org/abs/2502.18458
Authors: Christian Schindler, Andreas Rausch
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: Submitted version, accepted at PATTERNS 2025

Abstract:Detecting design pattern instances in unfamiliar codebases remains a challenging yet essential task for improving software quality and maintainability. Traditional static analysis tools often struggle with the complexity, variability, and lack of explicit annotations that characterize real-world pattern implementations. In this paper, we present a novel approach leveraging Large Language Models to automatically identify design pattern instances across diverse codebases. Our method focuses on recognizing the roles classes play within the pattern instances. By providing clearer insights into software structure and intent, this research aims to support developers, improve comprehension, and streamline tasks such as refactoring, maintenance, and adherence to best practices.

[LG-2] Supervised Reward Inference

Link: https://arxiv.org/abs/2502.18447
Authors: Will Schwarzer, Jordan Schneider, Philip S. Thomas, Scott Niekum
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 4 figures

Abstract:Existing approaches to reward inference from behavior typically assume that humans provide demonstrations according to specific models of behavior. However, humans often indicate their goals through a wide range of behaviors, from actions that are suboptimal due to poor planning or execution to behaviors which are intended to communicate goals rather than achieve them. We propose that supervised learning offers a unified framework to infer reward functions from any class of behavior, and show that such an approach is asymptotically Bayes-optimal under mild assumptions. Experiments on simulated robotic manipulation tasks show that our method can efficiently infer rewards from a wide variety of arbitrarily suboptimal demonstrations.

[LG-3] Enhancing DNA Foundation Models to Address Masking Inefficiencies

Link: https://arxiv.org/abs/2502.18405
Authors: Monireh Safari, Pablo Millan Arias, Scott C. Lowe, Lila Kari, Angel X. Chang, Graham W. Taylor
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 5 figures

Abstract:Masked language modelling (MLM) as a pretraining objective has been widely adopted in genomic sequence modelling. While pretrained models can successfully serve as encoders for various downstream tasks, the distribution shift between pretraining and inference detrimentally impacts performance, as the pretraining task is to map [MASK] tokens to predictions, yet the [MASK] is absent during downstream applications. This means the encoder does not prioritize its encodings of non-[MASK] tokens, and expends parameters and compute on work only relevant to the MLM task, despite this being irrelevant at deployment time. In this work, we propose a modified encoder-decoder architecture based on the masked autoencoder framework, designed to address this inefficiency within a BERT-based transformer. We empirically show that the resulting mismatch is particularly detrimental in genomic pipelines where models are often used for feature extraction without fine-tuning. We evaluate our approach on the BIOSCAN-5M dataset, comprising over 2 million unique DNA barcodes. We achieve substantial performance gains in both closed-world and open-world classification tasks when compared against causal models and bidirectional architectures pretrained with MLM tasks.

[LG-4] The FFT Strikes Back: An Efficient Alternative to Self-Attention

Link: https://arxiv.org/abs/2502.18394
Authors: Jacob Fein-Ashley
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Conventional self-attention mechanisms incur quadratic complexity, limiting their scalability on long sequences. We introduce FFTNet, an adaptive spectral filtering framework that leverages the Fast Fourier Transform (FFT) to achieve global token mixing in $\mathcal{O}(n\log n)$ time. By transforming inputs into the frequency domain, FFTNet exploits the orthogonality and energy preservation guaranteed by Parseval's theorem to capture long-range dependencies efficiently. A learnable spectral filter and modReLU activation dynamically emphasize salient frequency components, providing a rigorous and adaptive alternative to traditional self-attention. Experiments on the Long Range Arena and ImageNet benchmarks validate our theoretical insights and demonstrate superior performance over fixed Fourier and standard attention models.
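
The core idea of mixing tokens globally through the frequency domain can be sketched with a small radix-2 FFT: transform the sequence, scale each frequency bin by a filter, and transform back. The code below is a simplified illustration on a sequence of scalars with a fixed filter. The learnable filter and modReLU activation mentioned in the abstract are not modeled, and `spectral_mix` is a hypothetical name rather than part of FFTNet.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT (unnormalized), O(n log n)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectral_mix(tokens, spectral_filter):
    """Mix a real sequence globally by filtering its spectrum.

    Assumption: one scalar per token and a fixed per-frequency filter;
    a trained model would learn the filter and work on embeddings.
    """
    if len(tokens) != len(spectral_filter):
        raise ValueError("filter must have one entry per frequency bin")
    spectrum = fft([complex(t) for t in tokens])
    filtered = [s * f for s, f in zip(spectrum, spectral_filter)]
    mixed = fft(filtered, inverse=True)
    return [v.real / len(tokens) for v in mixed]
```

An all-ones filter returns the input unchanged, while a filter that keeps only the zero-frequency bin replaces every token with the sequence mean, so every output position depends on every input position.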

[LG-5] Mechanistic PDE Networks for Discovery of Governing Equations

Link: https://arxiv.org/abs/2502.18377
Authors: Adeel Pervez, Efstratios Gavves, Francesco Locatello
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We present Mechanistic PDE Networks – a model for discovery of governing partial differential equations from data. Mechanistic PDE Networks represent spatiotemporal data as space-time dependent linear partial differential equations in neural network hidden representations. The represented PDEs are then solved and decoded for specific tasks. The learned PDE representations naturally express the spatiotemporal dynamics in data in neural network hidden space, enabling increased power for dynamical modeling. Solving the PDE representations in a compute and memory-efficient way, however, is a significant challenge. We develop a native, GPU-capable, parallel, sparse, and differentiable multigrid solver specialized for linear partial differential equations that acts as a module in Mechanistic PDE Networks. Leveraging the PDE solver, we propose a discovery architecture that can discover nonlinear PDEs in complex settings while also being robust to noise. We validate PDE discovery on a number of PDEs, including reaction-diffusion and Navier-Stokes equations.

[LG-6] WebGames: Challenging General-Purpose Web-Browsing AI Agents

Link: https://arxiv.org/abs/2502.18356
Authors: George Thomas, Alex J. Chan, Jikun Kang, Wenqi Wu, Filippos Christianos, Fraser Greenlee, Andy Toulis, Marvin Purtorab
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 50+ interactive challenges. These challenges are specifically crafted to be straightforward for humans while systematically testing the limitations of current AI systems across fundamental browser interactions, advanced input processing, cognitive tasks, workflow automation, and interactive entertainment. Our framework eliminates external dependencies through a hermetic testing environment, ensuring reproducible evaluation with verifiable ground-truth solutions. We evaluate leading vision-language models including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL against human performance. Results reveal a substantial capability gap, with the best AI system achieving only 43.1% success rate compared to human performance of 95.7%, highlighting fundamental limitations in current AI systems’ ability to handle common web interaction patterns that humans find intuitive. The benchmark is publicly available at this http URL, offering a lightweight, client-side implementation that facilitates rapid evaluation cycles. Through its modular architecture and standardized challenge specifications, WebGames provides a robust foundation for measuring progress in development of more capable web-browsing agents.

[LG-7] Graph Inference with Effective Resistance Queries

Link: https://arxiv.org/abs/2502.18350
Authors: Huck Bennett, Mitchell Black, Amir Nayyeri, Evelyn Warton
Categories: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract: The goal of graph inference is to design algorithms for learning properties of a hidden graph using queries to an oracle that returns information about the graph. Graph reconstruction, verification, and property testing are all types of graph inference. In this work, we study graph inference using an oracle that returns the effective resistance (ER) between a pair of vertices. Effective resistance is a distance originating from the study of electrical circuits with many applications. However, ER has received little attention from a graph inference perspective. Indeed, although it is known that an $n$-vertex graph can be uniquely reconstructed from all $\binom{n}{2}$ possible ER queries, little else is known. We address this gap with several new results, including: 1. $O(n)$-query algorithms for testing whether a graph is a tree; deciding whether two graphs are equal assuming one is a subgraph of the other; and testing whether a given vertex (or edge) is a cut vertex (or cut edge). 2. Property testing algorithms, including for testing whether a graph is vertex- or edge-biconnected. We also give a reduction to adapt property testing results from the bounded-degree model to our ER query model. This yields ER-query-based algorithms for testing $k$-connectivity, bipartiteness, planarity, and containment of a fixed subgraph. 3. Graph reconstruction algorithms, including an algorithm for reconstructing a graph from a low-width tree decomposition; a $\Theta(k^2)$-query, polynomial-time algorithm for recovering the adjacency matrix $A$ of a hidden graph, given $A$ with $k$ of its entries deleted; and a $k$-query, exponential-time algorithm for the same task. We also compare the power of ER queries and shortest path queries, which are closely related but better studied. Interestingly, we show that the two query models are incomparable in power.
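
For readers unfamiliar with effective resistance, the following sketch shows how an ER oracle could be simulated for a known, connected, unweighted graph using the Moore-Penrose pseudoinverse of the graph Laplacian. This is a standard textbook formula rather than any algorithm from the paper; the function name `effective_resistance` is hypothetical.

```python
import numpy as np

def effective_resistance(adj):
    """Return the matrix R with R[u, v] = effective resistance between u and v.

    adj: symmetric (n, n) adjacency matrix of a connected graph.
    Uses R[u, v] = L+[u, u] + L+[v, v] - 2 * L+[u, v], where L+ is the
    pseudoinverse of the Laplacian L = D - A.
    """
    adj = np.asarray(adj, dtype=float)
    laplacian = np.diag(adj.sum(axis=1)) - adj
    lap_pinv = np.linalg.pinv(laplacian)
    diag = np.diag(lap_pinv)
    return diag[:, None] + diag[None, :] - 2.0 * lap_pinv

# Usage: a path 0-1-2 behaves like resistors in series,
# while a triangle puts one edge in parallel with a two-edge path.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
r_path = effective_resistance(path)
r_triangle = effective_resistance(triangle)
```

In the path, every edge is a cut edge and its endpoints have resistance exactly 1; in the triangle, no edge is a cut edge and adjacent vertices have resistance 2/3. This illustrates why ER queries carry information about cut edges.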

[LG-8] Structural Alignment Improves Graph Test-Time Adaptation

Link: https://arxiv.org/abs/2502.18334
Authors: Hans Hao-Hsun Hsu, Shikun Liu, Han Zhao, Pan Li
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Graph-based learning has achieved remarkable success in domains ranging from recommendation to fraud detection and particle physics by effectively capturing underlying interaction patterns. However, it often struggles to generalize when distribution shifts occur, particularly those involving changes in network connectivity or interaction patterns. Existing approaches designed to mitigate such shifts typically require retraining with full access to source data, rendering them infeasible under strict computational or privacy constraints. To address this limitation, we propose a test-time structural alignment (TSA) algorithm for Graph Test-Time Adaptation (GTTA), a novel method that aligns graph structures during inference without revisiting the source domain. Built upon a theoretically grounded treatment of graph data distribution shifts, TSA integrates three key strategies: an uncertainty-aware neighborhood weighting that accommodates structure shifts, an adaptive balancing of self-node and neighborhood-aggregated representations driven by node representations’ signal-to-noise ratio, and a decision boundary refinement that corrects remaining label and feature shifts. Extensive experiments on synthetic and real-world datasets demonstrate that TSA can consistently outperform both non-graph TTA methods and state-of-the-art GTTA baselines.

[LG-9] Pretraining Frequency Predicts Compositional Generalization of CLIP on Real-World Tasks NEURIPS2024

Link: https://arxiv.org/abs/2502.18326
Authors: Thaddäus Wiedemer, Yash Sharma, Ameya Prabhu, Matthias Bethge, Wieland Brendel
Categories: Machine Learning (cs.LG)
Comments: NeurIPS 2024 Workshop on Compositional Learning: Perspectives, Methods, and Paths Forward

Click to view abstract

Abstract: We investigate the success conditions for compositional generalization of CLIP models on real-world data through performance prediction. Prior work shows that CLIP requires exponentially more pretraining data for linear performance gains on individual concepts. This sample-inefficient scaling could be mitigated if CLIP systematically understood new inputs as compositions of learned components, allowing rare observations to be mapped to common concepts. To explore CLIP’s compositional generalization ability, we filter retrieval corpora for samples with object combinations not present in the pretraining corpus. We show that CLIP’s performance on these samples can be accurately predicted from the pretraining frequencies of individual objects. Our findings demonstrate that CLIP learns to disentangle objects observed in its pretraining data and can recompose them straightforwardly. Additionally, we are the first to show how this ability scales with pretraining data. For data curation in practice, our results suggest that balancing object occurrences improves generalization, which should benefit CLIP’s efficiency and accuracy without scaling data volume.

[LG-10] Accelerated Training on Low-Power Edge Devices

Link: https://arxiv.org/abs/2502.18323
Authors: Mohamed Aboelenien Ahmed, Kilian Pfeiffer, Heba Khdr, Osama Abboud, Ramin Khalili, Jörg Henkel
Categories: Machine Learning (cs.LG); Operating Systems (cs.OS)
Comments:

Click to view abstract

Abstract: Training on edge devices poses several challenges as these devices are generally resource-constrained, especially in terms of power. State-of-the-art techniques at the device level reduce the GPU frequency to enforce power constraints, leading to a significant increase in training time. To accelerate training, we propose to jointly adjust the system and application parameters (in our case, the GPU frequency and the batch size of the training task) while adhering to the power constraints on devices. We introduce a novel cross-layer methodology that combines predictions of batch size efficiency and device profiling to achieve the desired optimization. Our evaluation on real hardware shows that our method outperforms the current baselines that depend on state-of-the-art techniques, reducing the training time by $2.4\times$ with results very close to optimal. Our measurements also indicate a substantial reduction in the overall energy used for the training process. These gains are achieved without reduction in the performance of the trained model.

[LG-11] Global-Decision-Focused Neural ODEs for Proactive Grid Resilience Management

Link: https://arxiv.org/abs/2502.18321
Authors: Shuyi Chen, Ferdinando Fioretto, Feng Qiu, Shixiang Zhu
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Extreme hazard events such as wildfires and hurricanes increasingly threaten power systems, causing widespread outages and disrupting critical services. Recently, predict-then-optimize approaches have gained traction in grid operations, where system functionality forecasts are first generated and then used as inputs for downstream decision-making. However, this two-stage method often results in a misalignment between prediction and optimization objectives, leading to suboptimal resource allocation. To address this, we propose predict-all-then-optimize-globally (PATOG), a framework that integrates outage prediction with globally optimized interventions. At its core, our global-decision-focused (GDF) neural ODE model captures outage dynamics while optimizing resilience strategies in a decision-aware manner. Unlike conventional methods, our approach ensures spatially and temporally coherent decision-making, improving both predictive accuracy and operational efficiency. Experiments on synthetic and real-world datasets demonstrate significant improvements in outage prediction consistency and grid resilience.

[LG-12] Bayesian Computation in Deep Learning

Link: https://arxiv.org/abs/2502.18300
Authors: Wenlong Chen, Bolian Li, Ruqi Zhang, Yingzhen Li
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 43 pages, 7 figures

Click to view abstract

Abstract: This review paper is intended for the 2nd edition of the Handbook of Markov chain Monte Carlo. We provide an introduction to approximate inference techniques as Bayesian computation methods applied to deep learning models. We organize the chapter by presenting popular computational methods for (1) Bayesian neural networks and (2) deep generative models, explaining their unique challenges in posterior inference as well as the solutions.

[LG-13] DeepCircuitX: A Comprehensive Repository-Level Dataset for RTL Code Understanding, Generation, and PPA Analysis

Link: https://arxiv.org/abs/2502.18297
Authors: Zeju Li, Changran Xu, Zhengyuan Shi, Zedong Peng, Yi Liu, Yunhao Zhou, Lingfeng Zhou, Chengyu Ma, Jianyuan Zhong, Xi Wang, Jieru Zhao, Zhufei Chu, Xiaoyan Yang, Qiang Xu
Categories: Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: 8 pages, 3 figures

Click to view abstract

Abstract:This paper introduces DeepCircuitX, a comprehensive repository-level dataset designed to advance RTL (Register Transfer Level) code understanding, generation, and power-performance-area (PPA) analysis. Unlike existing datasets that are limited to either file-level RTL code or physical layout data, DeepCircuitX provides a holistic, multilevel resource that spans repository, file, module, and block-level RTL code. This structure enables more nuanced training and evaluation of large language models (LLMs) for RTL-specific tasks. DeepCircuitX is enriched with Chain of Thought (CoT) annotations, offering detailed descriptions of functionality and structure at multiple levels. These annotations enhance its utility for a wide range of tasks, including RTL code understanding, generation, and completion. Additionally, the dataset includes synthesized netlists and PPA metrics, facilitating early-stage design exploration and enabling accurate PPA prediction directly from RTL code. We demonstrate the dataset’s effectiveness on various LLMs finetuned with our dataset and confirm the quality with human evaluations. Our results highlight DeepCircuitX as a critical resource for advancing RTL-focused machine learning applications in hardware design this http URL data is available at this https URL.

[LG-14] Neural Network Graph Similarity Computation Based on Graph Fusion

Link: https://arxiv.org/abs/2502.18291
Authors: Zenghui Chang, Yiqiao Zhang, Hong Cai Chen
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, 4 tables

Click to view abstract

Abstract: Graph similarity learning, crucial for tasks such as graph classification and similarity search, focuses on measuring the similarity between two graph-structured entities. The core challenge in this field is effectively managing the interactions between graphs. Traditional methods often entail separate, redundant computations for each graph pair, leading to unnecessary complexity. This paper revolutionizes the approach by introducing a parallel graph interaction method called graph fusion. By merging the node sequences of graph pairs into a single large graph, our method leverages a global attention mechanism to facilitate interaction computations and to harvest cross-graph insights. We further assess the similarity between graph pairs at two distinct levels (graph-level and node-level), introducing two innovative, yet straightforward, similarity computation algorithms. Extensive testing across five public datasets shows that our model not only outperforms leading baseline models in graph-to-graph classification and regression tasks but also sets a new benchmark for performance and efficiency. The code for this paper is open-source and available at this https URL

[LG-15] Causal AI-based Root Cause Identification: Research to Practice at Scale

Link: https://arxiv.org/abs/2502.18240
Authors: Saurabh Jha, Ameet Rahane, Laura Shwartz, Marc Palaci-Olgun, Frank Bagehorn, Jesus Rios, Dan Stingaciu, Ragu Kattinakere, Debasish Banerjee
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract: Modern applications are built as large, distributed systems spanning numerous modules, teams, and data centers. Despite robust engineering and recovery strategies, failures and performance issues remain inevitable, risking significant disruptions and affecting end users. Rapid and accurate root cause identification is therefore vital to ensure system reliability and maintain key service metrics. We have developed a novel causality-based Root Cause Identification (RCI) algorithm that emphasizes causation over correlation. This algorithm has been integrated into IBM Instana, bridging research to practice at scale, and is now in production use by enterprise customers. By leveraging “causal AI,” Instana stands apart from typical Application Performance Management (APM) tools, pinpointing issues in near real-time. This paper highlights Instana’s advanced failure diagnosis capabilities, discussing both the theoretical underpinnings and practical implementations of the RCI algorithm. Real-world examples illustrate how our causality-based approach enhances reliability and performance in today’s complex system landscapes.

[LG-16] Unveiling and Causalizing CoT: A Causal Pespective

Link: https://arxiv.org/abs/2502.18239
Authors: Jiarun Fu, Lizhong Ding, Hao Li, Pengqi Li, Qiuning Wei, Xu Chen
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract: Although Chain-of-Thought (CoT) has achieved remarkable success in enhancing the reasoning ability of large language models (LLMs), the mechanism of CoT remains a “black box”. Even if the correct answers can frequently be obtained, existing CoTs struggle to make the reasoning understandable to humans. In this paper, we unveil and causalize CoT from a causal perspective to ensure both correctness and understandability of all reasoning steps (to the best of our knowledge, the first such). We model causality of CoT via structural causal models (SCM) to unveil the reasoning mechanism of CoT. To measure the causality of CoT, we define the CoT Average Causal Effect (CACE) to test the causal relations between steps. For those steps without causality (wrong or unintelligible steps), we design a role-playing causal query algorithm to causalize these steps, resulting in a causalized CoT with all steps correct and understandable. Experimental results on both open-source and closed-source LLMs demonstrate that the causal errors common in steps are effectively corrected and the reasoning ability of LLMs is significantly improved.

[LG-17] Beyond the convexity assumption: Realistic tabular data generation under quantifier-free real linear constraints ICLR2025

Link: https://arxiv.org/abs/2502.18237
Authors: Mihaela Cătălina Stoian, Eleonora Giunchiglia
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICLR 2025

Click to view abstract

Abstract:Synthetic tabular data generation has traditionally been a challenging problem due to the high complexity of the underlying distributions that characterise this type of data. Despite recent advances in deep generative models (DGMs), existing methods often fail to produce realistic datapoints that are well-aligned with available background knowledge. In this paper, we address this limitation by introducing Disjunctive Refinement Layer (DRL), a novel layer designed to enforce the alignment of generated data with the background knowledge specified in user-defined constraints. DRL is the first method able to automatically make deep learning models inherently compliant with constraints as expressive as quantifier-free linear formulas, which can define non-convex and even disconnected spaces. Our experimental analysis shows that DRL not only guarantees constraint satisfaction but also improves efficacy in downstream tasks. Notably, when applied to DGMs that frequently violate constraints, DRL eliminates violations entirely. Further, it improves performance metrics by up to 21.4% in F1-score and 20.9% in Area Under the ROC Curve, thus demonstrating its practical impact on data generation.

[LG-18] Software implemented fault diagnosis of natural gas pumping unit based on feedforward neural network

Link: https://arxiv.org/abs/2502.18233
Authors: Mykola Kozlenko, Olena Zamikhovska, Leonid Zamikhovskyi
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 11 pages, 9 figures

Click to view abstract

Abstract: In recent years, more and more attention has been paid to the use of artificial neural networks (ANN) for diagnostics of gas pumping units (GPU). Usually, ANN training is carried out on models of GPU workflows, and generated sets of diagnostic data are used to simulate defect conditions. At the same time, the results obtained do not allow assessing the real state of the GPU. It is proposed to use the values of the characteristics of the acoustic and vibration processes of the GPU as the input data of the ANN. A descriptive statistical analysis of real vibration and acoustic processes generated by the operation of the GPU type GTK-25-i (Nuovo Pignone, Italy) has been carried out. The formation of packets of diagnostic signs arriving at the input of the ANN has been carried out. The diagnostic features are the five maximum amplitude components of the acoustic and vibration signals, as well as the value of the standard deviation for each sample. Diagnostic signs are calculated directly in the input pipeline of ANN data in real time for three technical states of the GPU. Using the frameworks TensorFlow, Keras, NumPy, pandas, in the Python 3 programming language, an architecture was developed for a deep fully connected feedforward ANN, training on the error backpropagation algorithm. The results of training and testing of the developed ANN are presented. During testing, it was found that the signal classification precision for the “nominal” state of all 1475 signal samples is 1.0000, for the “current” state, precision equals 0.9853, and for the “defective” state, precision is 0.9091. The use of the developed ANN makes it possible to classify the technical states of the GPU with an accuracy sufficient for practical use, which will prevent the occurrence of GPU failures. ANN can be used to diagnose GPU of any type and power.
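
The feature construction described in this abstract (five maximum amplitude components plus the standard deviation per sample) might look roughly like the sketch below. The abstract does not say how the amplitude components are obtained, so this sketch assumes they are the largest magnitudes of the real FFT spectrum; the function name `diagnostic_features` and the parameter `n_peaks` are hypothetical.

```python
import numpy as np

def diagnostic_features(signal, n_peaks=5):
    """Build a 6-element feature vector from one acoustic or vibration sample.

    Assumption: "amplitude components" means magnitudes of the one-sided
    FFT spectrum. Returns the n_peaks largest magnitudes (descending)
    followed by the standard deviation of the raw signal.
    """
    signal = np.asarray(signal, dtype=float)
    spectrum = np.abs(np.fft.rfft(signal))           # one-sided magnitude spectrum
    top_amplitudes = np.sort(spectrum)[-n_peaks:][::-1]
    return np.concatenate([top_amplitudes, [signal.std()]])

# Usage: a pure sine with 5 full cycles over 1000 samples.
t = np.arange(1000) / 1000.0
features = diagnostic_features(np.sin(2 * np.pi * 5 * t))
```

A vector like this would then be fed to the fully connected network; the network itself is omitted here because the abstract gives no layer sizes.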

[LG-19] Near-optimal Active Regression of Single-Index Models

Link: https://arxiv.org/abs/2502.18213
Authors: Yi Li, Wai Ming Tai
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract: The active regression problem of the single-index model is to solve $\min_x \lVert f(Ax)-b\rVert_p$, where $A$ is fully accessible and $b$ can only be accessed via entry queries, with the goal of minimizing the number of queries to the entries of $b$. When $f$ is Lipschitz, previous results only obtain constant-factor approximations. This work presents the first algorithm that provides a $(1+\varepsilon)$-approximation solution by querying $\tilde{O}(d^{\frac{p}{2}\vee 1}/\varepsilon^{p\vee 2})$ entries of $b$. This query complexity is also shown to be optimal up to logarithmic factors for $p\in [1,2]$ and the $\varepsilon$-dependence of $1/\varepsilon^p$ is shown to be optimal for $p>2$.

[LG-20] Graph Augmentation for Cross Graph Domain Generalization

Link: https://arxiv.org/abs/2502.18188
Authors: Guanzi Chen, Jiying Zhang, Yang Li
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract: Cross-graph node classification, utilizing the abundant labeled nodes from one graph to help classify unlabeled nodes in another graph, can be viewed as a domain generalization problem of graph neural networks (GNNs) due to the structure shift commonly appearing among various graphs. Nevertheless, current endeavors for cross-graph node classification mainly focus on model training. Data augmentation approaches, a simple and easy-to-implement domain generalization technique, remain under-explored. In this paper, we develop a new graph structure augmentation for the cross-graph domain generalization problem. Specifically, low-weight edge-dropping is applied to remove potential noise edges that may hinder the generalization ability of GNNs, stimulating the GNNs to capture the essential invariant information underlying different structures. Meanwhile, clustering-based edge-adding is proposed to generate invariant structures based on the node features from the same distribution. Consequently, with these augmentation techniques, the GNNs can maintain the domain invariant structure information that can improve the generalization ability. The experiments on out-of-distribution citation network datasets verify our method achieves state-of-the-art performance among conventional augmentations.

[LG-21] Sharper Concentration Inequalities for Multi-Graph Dependent Variables

Link: https://arxiv.org/abs/2502.18167
Authors: Xiao Shao, Guoqiang Wu
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 34 pages

Click to view abstract

Abstract: In multi-task learning (MTL) with each task involving graph-dependent data, generalization results of existing theoretical analyses yield a sub-optimal risk bound of $O(\frac{1}{\sqrt{n}})$, where $n$ is the number of training this http URL is attributed to the lack of a foundational sharper concentration inequality for multi-graph dependent random variables. To fill this gap, this paper proposes a new corresponding Bennett inequality, enabling the derivation of a sharper risk bound of $O(\frac{\log n}{n})$. Specifically, building on the proposed Bennett inequality, we propose a new corresponding Talagrand inequality for the empirical process and further develop an analytical framework of the local Rademacher complexity to enhance theoretical generalization analyses in MTL with multi-graph dependent data. Finally, we apply the theoretical advancements to applications such as Macro-AUC Optimization, demonstrating the superiority of our theoretical results over previous work, which is also corroborated by experimental results.

[LG-22] Actively Inferring Optimal Measurement Sequences

Link: https://arxiv.org/abs/2502.18142
Authors: Catherine F. Higham, Paul Henderson, Roderick Murray-Smith
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Measurement of a physical quantity such as light intensity is an integral part of many reconstruction and decision scenarios but can be costly in terms of acquisition time, invasion of or damage to the environment and storage. Data minimisation and compliance with data protection laws is also an important consideration. Where there are a range of measurements that can be made, some may be more informative and compliant with the overall measurement objective than others. We develop an active sequential inference algorithm that uses the low dimensional representational latent space from a variational autoencoder (VAE) to choose which measurement to make next. Our aim is to recover high dimensional data by making as few measurements as possible. We adapt the VAE encoder to map partial data measurements on to the latent space of the complete data. The algorithm draws samples from this latent space and uses the VAE decoder to generate data conditional on the partial measurements. Estimated measurements are made on the generated data and fed back through the partial VAE encoder to the latent space where they can be evaluated prior to making a measurement. Starting from no measurements and a normal prior on the latent space, we consider alternative strategies for choosing the next measurement and updating the predictive posterior prior for the next step. The algorithm is illustrated using the Fashion MNIST dataset and a novel convolutional Hadamard pattern measurement basis. We see that useful patterns are chosen within 10 steps, leading to the convergence of the guiding generative images. Compared with using stochastic variational inference to infer the parameters of the posterior distribution for each generated data point individually, the partial VAE framework can efficiently process batches of generated data and obtains superior results with minimal measurements.

[LG-23] Stackelberg Game Preference Optimization for Data-Efficient Alignment of Language Models

Link: https://arxiv.org/abs/2502.18099
Authors: Xu Chu, Zhixin Zhang, Tianyu Jia, Yujie Jin
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract: Aligning language models with human preferences is critical for real-world deployment, but existing methods often require large amounts of high-quality human annotations. Aiming at a data-efficient alignment method, we propose Stackelberg Game Preference Optimization (SGPO), a framework that models alignment as a two-player Stackelberg game, where a policy (leader) optimizes against a worst-case preference distribution (follower) within an $\epsilon$-Wasserstein ball, ensuring robustness to (self-)annotation noise and distribution shifts. SGPO guarantees $O(\epsilon)$-bounded regret, unlike Direct Preference Optimization (DPO), which suffers from linear regret growth in the distribution mismatch. We instantiate SGPO with the Stackelberg Self-Annotated Preference Optimization (SSAPO) algorithm, which iteratively self-annotates preferences and adversarially reweights synthetic annotated preferences. Using only 2K seed preferences, from the UltraFeedback dataset, i.e., 1/30 of human labels in the dataset, our method achieves 35.82% GPT-4 win-rate with Mistral-7B and 40.12% with Llama3-8B-Instruct within three rounds of SSAPO.

[LG-24] A Market for Accuracy: Classification under Competition

Link: https://arxiv.org/abs/2502.18052
Authors: Ohad Einav, Nir Rosenfeld
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: 26 pages

Click to view abstract

Abstract:Machine learning models play a key role for service providers looking to gain market share in consumer markets. However, traditional learning approaches do not take into account the existence of additional providers, who compete with each other for consumers. Our work aims to study learning in this market setting, as it affects providers, consumers, and the market itself. We begin by analyzing such markets through the lens of the learning objective, and show that accuracy cannot be the only consideration. We then propose a method for classification under competition, so that a learner can maximize market share in the presence of competitors. We show that our approach benefits the providers as well as the consumers, and find that the timing of market entry and model updates can be crucial. We display the effectiveness of our approach across a range of domains, from simple distributions to noisy datasets, and show that the market as a whole remains stable by converging quickly to an equilibrium.

[LG-25] Enhancing 5G O-RAN Communication Efficiency Through AI-Based Latency Forecasting

Link: https://arxiv.org/abs/2502.18046
Authors: Raúl Parada,Ebrahim Abu-Helalah,Jordi Serra,Anton Aguilar,Paolo Dini
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:The increasing complexity and dynamic nature of 5G open radio access networks (O-RAN) pose significant challenges to maintaining low latency, high throughput, and resource efficiency. While existing methods leverage machine learning for latency prediction and resource management, they often lack real-world scalability and hardware validation. This paper addresses these limitations by presenting an artificial intelligence-driven latency forecasting system integrated into a functional O-RAN prototype. The system uses a bidirectional long short-term memory model to predict latency in real time within a scalable, open-source framework built with FlexRIC. Experimental results demonstrate the model’s efficacy, achieving a loss metric below 0.04, thus validating its applicability in dynamic 5G environments.

[LG-26] Patient Trajectory Prediction: Integrating Clinical Notes with Transformers

Link: https://arxiv.org/abs/2502.18009
Authors: Sifal Klioui,Sana Sellami,Youssef Trardi
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Predicting disease trajectories from electronic health records (EHRs) is a complex task due to major challenges such as data non-stationarity, high granularity of medical codes, and integration of multimodal data. EHRs contain both structured data, such as diagnostic codes, and unstructured data, such as clinical notes, which hold essential information often overlooked. Current models, primarily based on structured data, struggle to capture the complete medical context of patients, resulting in a loss of valuable information. To address this issue, we propose an approach that integrates unstructured clinical notes into transformer-based deep learning models for sequential disease prediction. This integration enriches the representation of patients’ medical histories, thereby improving the accuracy of diagnosis predictions. Experiments on MIMIC-IV datasets demonstrate that the proposed approach outperforms traditional models relying solely on structured data.

[LG-27] A Perspective on Symbolic Machine Learning in Physical Sciences NEURIPS2024

Link: https://arxiv.org/abs/2502.17993
Authors: Nour Makke,Sanjay Chawla
Categories: Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); High Energy Physics - Theory (hep-th)
*Comments: Machine Learning and the Physical Sciences Workshop at NeurIPS 2024

Abstract:Machine learning is rapidly making its pathway across all of the natural sciences, including physical sciences. The rate at which ML is impacting non-scientific disciplines is incomparable to that in the physical sciences. This is partly due to the uninterpretable nature of deep neural networks. Symbolic machine learning stands as an equal and complementary partner to numerical machine learning in speeding up scientific discovery in physics. This perspective discusses the main differences between the ML and scientific approaches. It stresses the need to develop and apply symbolic machine learning to physics problems equally, in parallel to numerical machine learning, because of the dual nature of physics research.

[LG-28] Generalized Decision Focused Learning under Imprecise Uncertainty–Theoretical Study

Link: https://arxiv.org/abs/2502.17984
Authors: Keivan Shariatmadar,Neil Yorke-Smith,Ahmad Osman,Fabio Cuzzolin,Hans Hallez,David Moens
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
*Comments: 13 pages

Abstract:Decision Focused Learning has emerged as a critical paradigm for integrating machine learning with downstream optimisation. Despite its promise, existing methodologies predominantly rely on probabilistic models and focus narrowly on task objectives, overlooking the nuanced challenges posed by epistemic uncertainty, non-probabilistic modelling approaches, and the integration of uncertainty into optimisation constraints. This paper bridges these gaps by introducing innovative frameworks: (i) a non-probabilistic lens for epistemic uncertainty representation, leveraging intervals (the least informative uncertainty model), Contamination (hybrid model), and probability boxes (the most informative uncertainty model); (ii) methodologies to incorporate uncertainty into constraints, expanding Decision-Focused Learning’s utility in constrained environments; (iii) the adoption of Imprecise Decision Theory for ambiguity-rich decision-making contexts; and (iv) strategies for addressing sparse data challenges. Empirical evaluations on benchmark optimisation problems demonstrate the efficacy of these approaches in improving decision quality and robustness and dealing with said gaps.

[LG-29] Provable Performance Bounds for Digital Twin-driven Deep Reinforcement Learning in Wireless Networks: A Novel Digital-Twin Bisimulation Metric

Link: https://arxiv.org/abs/2502.17983
Authors: Zhenyu Tao,Wei Xu,Xiaohu You
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Digital twin (DT)-driven deep reinforcement learning (DRL) has emerged as a promising paradigm for wireless network optimization, offering a safe and efficient training environment for policy exploration. However, in theory existing methods cannot always guarantee real-world performance of DT-trained policies before actual deployment, due to the absence of a universal metric for assessing DT’s ability to support reliable DRL training transferable to physical networks. In this paper, we propose the DT bisimulation metric (DT-BSM), a novel metric based on the Wasserstein distance, to quantify the discrepancy between Markov decision processes (MDPs) in both the DT and the corresponding real-world wireless network environment. We prove that for any DT-trained policy, the sub-optimality of its performance (regret) in the real-world deployment is bounded by a weighted sum of the DT-BSM and its sub-optimality within the MDP in the DT. Then, a modified DT-BSM based on the total variation distance is also introduced to avoid the prohibitive calculation complexity of Wasserstein distance for large-scale wireless network scenarios. Further, to tackle the challenge of obtaining accurate transition probabilities of the MDP in real world for the DT-BSM calculation, we propose an empirical DT-BSM method based on statistical sampling. We prove that the empirical DT-BSM always converges to the desired theoretical one, and quantitatively establish the relationship between the required sample size and the target level of approximation accuracy. Numerical experiments validate this first theoretical finding on the provable and calculable performance bounds for DT-driven DRL.
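
The abstract mentions a total-variation variant of the metric that is estimated from sampled transitions. The sketch below is not the DT-BSM itself; it only illustrates the building block, assuming next-state samples for one fixed state-action pair are available from both the DT and the real network. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def empirical_distribution(samples: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Estimate a next-state distribution from observed samples."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    total = len(samples)
    return {state: count / total for state, count in counts.items()}


def total_variation(p: Dict[Hashable, float], q: Dict[Hashable, float]) -> float:
    """Total variation distance: half the L1 distance over the joint support."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)


# Hypothetical next-state samples for the same (state, action) pair.
dt_samples = ["a", "a", "b", "b"]
real_samples = ["a", "b", "b", "b"]
gap = total_variation(
    empirical_distribution(dt_samples), empirical_distribution(real_samples)
)
```

More samples tighten the estimate, which is the kind of sample-size versus accuracy relationship the abstract says the paper quantifies.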

[LG-30] XGBoost-Based Prediction of ICU Mortality in Sepsis-Associated Acute Kidney Injury Patients Using MIMIC-IV Database with Validation from eICU Database

Link: https://arxiv.org/abs/2502.17978
Authors: Shuheng Chen,Junyi Fan,Elham Pishgar,Kamiar Alaei,Greg Placencia,Maryam Pishgar
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Background: Sepsis-Associated Acute Kidney Injury (SA-AKI) leads to high mortality in intensive care. This study develops machine learning models using the Medical Information Mart for Intensive Care IV (MIMIC-IV) database to predict Intensive Care Unit (ICU) mortality in SA-AKI patients. External validation is conducted using the eICU Collaborative Research Database. Methods: For 9,474 identified SA-AKI patients in MIMIC-IV, key features like lab results, vital signs, and comorbidities were selected using Variance Inflation Factor (VIF), Recursive Feature Elimination (RFE), and expert input, narrowing to 24 predictive variables. An Extreme Gradient Boosting (XGBoost) model was built for in-hospital mortality prediction, with hyperparameters optimized using GridSearch. Model interpretability was enhanced with SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME). External validation was conducted using the eICU database. Results: The proposed XGBoost model achieved an internal Area Under the Receiver Operating Characteristic curve (AUROC) of 0.878 (95% Confidence Interval: 0.859-0.897). SHAP identified Sequential Organ Failure Assessment (SOFA), serum lactate, and respiratory rate as key mortality predictors. LIME highlighted serum lactate, Acute Physiology and Chronic Health Evaluation II (APACHE II) score, total urine output, and serum calcium as critical features. Conclusions: The integration of advanced techniques with the XGBoost algorithm yielded a highly accurate and interpretable model for predicting SA-AKI mortality across diverse populations. It supports early identification of high-risk patients, enhancing clinical decision-making in intensive care. Future work needs to focus on enhancing adaptability, versatility, and real-world applications. 

[LG-31] Model-Free Adversarial Purification via Coarse-To-Fine Tensor Network Representation

Link: https://arxiv.org/abs/2502.17972
Authors: Guang Lin,Duc Thien Nguyen,Zerui Tao,Konstantinos Slavakis,Toshihisa Tanaka,Qibin Zhao
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Deep neural networks are known to be vulnerable to well-designed adversarial attacks. Although numerous defense strategies have been proposed, many are tailored to the specific attacks or tasks and often fail to generalize across diverse scenarios. In this paper, we propose Tensor Network Purification (TNP), a novel model-free adversarial purification method by a specially designed tensor network decomposition algorithm. TNP depends neither on the pre-trained generative model nor the specific dataset, resulting in strong robustness across diverse adversarial scenarios. To this end, the key challenge lies in relaxing Gaussian-noise assumptions of classical decompositions and accommodating the unknown distribution of adversarial perturbations. Unlike the low-rank representation of classical decompositions, TNP aims to reconstruct the unobserved clean examples from an adversarial example. Specifically, TNP leverages progressive downsampling and introduces a novel adversarial optimization objective to address the challenge of minimizing reconstruction error but without inadvertently restoring adversarial perturbations. Extensive experiments conducted on CIFAR-10, CIFAR-100, and ImageNet demonstrate that our method generalizes effectively across various norm threats, attack types, and tasks, providing a versatile and promising adversarial purification technique.

[LG-32] Late Breaking Results: The Art of Beating the Odds with Predictor-Guided Random Design Space Exploration

Link: https://arxiv.org/abs/2502.17936
Authors: Felix Arnold,Maxence Bouvier,Ryan Amaudruz,Renzo Andri,Lukas Cavigelli
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*Comments: 2 pages, 3 figures, conference, this research manuscript is currently under review for publication in an IEEE conference

Abstract:This work introduces an innovative method for improving combinational digital circuits through random exploration in MIG-based synthesis. High-quality circuits are crucial for performance, power, and cost, making this a critical area of active research. Our approach incorporates next-state prediction and iterative selection, significantly accelerating the synthesis process. This novel method achieves up to 14x synthesis speedup and up to 20.94% better MIG minimization on the EPFL Combinational Benchmark Suite compared to state-of-the-art techniques. We further explore various predictor models and show that increased prediction accuracy does not guarantee an equivalent increase in synthesis quality of results or speedup, observing that randomness remains a desirable factor.

[LG-33] Techniques for Enhancing Memory Capacity of Reservoir Computing

Link: https://arxiv.org/abs/2502.17923
Authors: Atsuki Yokota,Ichiro Kawashima,Yohei Saito,Hakaru Tamukoh,Osamu Nomura,Takashi Morie
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Reservoir Computing (RC) is a bio-inspired machine learning framework, and various models have been proposed. RC is a well-suited model for time series data processing, but there is a trade-off between memory capacity and nonlinearity. In this study, we propose methods to improve the memory capacity of reservoir models by modifying their network configuration except for the inside of reservoirs. The Delay method retains past inputs by adding delay node chains to the input layer with the specified number of delay steps. To suppress the effect of input value increase due to the Delay method, we divide the input weights by the number of added delay steps. The Pass through method feeds input values directly to the output layer. The Clustering method divides the input and reservoir nodes into multiple parts and integrates them at the output layer. We applied these methods to an echo state network (ESN), a typical RC model, and the chaotic Boltzmann machine (CBM)-RC, which can be efficiently implemented in integrated circuits. We evaluated their performance on the NARMA task, and measured information processing capacity (IPC) to evaluate the trade-off between memory capacity and nonlinearity.
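
The Delay and Pass through methods described in the abstract can be pictured with a few lines of plain Python. This is only a sketch under assumptions: the helper names are hypothetical, the delay chain is zero-padded before the first input, and the reservoir itself is left out.

```python
from typing import List


def delay_chain(inputs: List[float], n_delay: int) -> List[List[float]]:
    """For each step t, return [u(t), u(t-1), ..., u(t-n_delay)], zero-padded."""
    rows = []
    for t in range(len(inputs)):
        rows.append(
            [inputs[t - k] if t - k >= 0 else 0.0 for k in range(n_delay + 1)]
        )
    return rows


def scale_input_weights(weights: List[float], n_delay: int) -> List[float]:
    """Divide input weights by the number of added delay steps (no-op if none)."""
    if n_delay == 0:
        return list(weights)
    return [w / n_delay for w in weights]


def pass_through_features(reservoir_state: List[float], current_input: float) -> List[float]:
    """Pass through: the readout sees the reservoir state plus the raw input."""
    return reservoir_state + [current_input]


delayed = delay_chain([1.0, 2.0, 3.0], n_delay=2)
scaled = scale_input_weights([1.0, 3.0], n_delay=2)
```

Each row of `delayed` is what the input layer would receive at that time step, so older inputs stay visible without relying on the reservoir's own memory.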

[LG-34] C-LoRA: Continual Low-Rank Adaptation for Pre-trained Models

Link: https://arxiv.org/abs/2502.17920
Authors: Xin Zhang,Liang Bai,Xian Yang,Jiye Liang
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Low-Rank Adaptation (LoRA) is an efficient fine-tuning method that has been extensively applied in areas such as natural language processing and computer vision. Existing LoRA fine-tuning approaches excel in static environments but struggle in dynamic learning due to reliance on multiple adapter modules, increasing overhead and complicating inference. We propose Continual Low-Rank Adaptation (C-LoRA), a novel extension of LoRA for continual learning. C-LoRA uses a learnable routing matrix to dynamically manage parameter updates across tasks, ensuring efficient reuse of learned subspaces while enforcing orthogonality to minimize interference and forgetting. Unlike existing approaches that require separate adapters for each task, C-LoRA enables an integrated approach for task adaptation, achieving both scalability and parameter efficiency in sequential learning scenarios. C-LoRA achieves state-of-the-art accuracy and parameter efficiency on benchmarks while providing theoretical insights into its routing matrix’s role in retaining and transferring knowledge, establishing a scalable framework for continual learning.

[LG-35] Batch normalization does not improve initialization

Link: https://arxiv.org/abs/2502.17913
Authors: Joris Dannemann,Gero Junike
Categories: Machine Learning (cs.LG); Probability (math.PR)
*Comments:

Abstract:Batch normalization is one of the most important regularization techniques for neural networks, significantly improving training by centering the layers of the neural network. There have been several attempts to provide a theoretical justification for batch normalization. Santurkar and Tsipras (2018) [How does batch normalization help optimization? Advances in neural information processing systems, 31] claim that batch normalization improves initialization. We provide a counterexample showing that this claim is not true, i.e., batch normalization does not improve initialization.

[LG-36] Neural Graph Matching Improves Retrieval Augmented Generation in Molecular Machine Learning

Link: https://arxiv.org/abs/2502.17874
Authors: Runzhong Wang,Rui-Xi Wang,Mrunali Manjrekar,Connor W. Coley
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*Comments:

Abstract:Molecular machine learning has gained popularity with the advancements of geometric deep learning. In parallel, retrieval-augmented generation has become a principled approach commonly used with language models. However, the optimal integration of retrieval augmentation into molecular machine learning remains unclear. Graph neural networks stand to benefit from clever matching to understand the structural alignment of retrieved molecules to a query molecule. Neural graph matching offers a compelling solution by explicitly modeling node and edge affinities between two structural graphs while employing a noise-robust, end-to-end neural network to learn affinity metrics. We apply this approach to mass spectrum simulation and introduce MARASON, a novel model that incorporates neural graph matching to enhance a fragmentation-based neural network. Experimental results highlight the effectiveness of our design, with MARASON achieving 28% top-1 accuracy, a substantial improvement over the non-retrieval state-of-the-art accuracy of 19%. Moreover, MARASON outperforms both naive retrieval-augmented generation methods and traditional graph matching approaches.

[LG-37] EEGM2: An Efficient Mamba-2-Based Self-Supervised Framework for Long-Sequence EEG Modeling

Link: https://arxiv.org/abs/2502.17873
Authors: Jiazhen Hong,Geoffrey Mackellar,Soheila Ghane
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: 10 pages, 7 figures

Abstract:Deep learning has achieved significant progress in the development of electroencephalogram (EEG) foundation models, with Transformer-based architectures excelling at capturing long-range dependencies. However, their quadratic computational complexity presents challenges in memory efficiency, training, and inference speed, limiting their scalability and generalizability as a foundation model. In this paper, we propose EEGM2, a self-supervised framework based on structured state space duality (SSD) that overcomes these limitations. EEGM2 introduces three key innovations: (1) a reconstruction-based framework that captures both local and global EEG features through Mamba-2 structured state space models, (2) a spatiotemporal-aware loss function that enhances robustness to noise and preserves spectral information, and (3) a multi-branch receptive field input embedding strategy that improves cross-subject generalization and stability for EEG sequences of varying lengths. In comparison to traditional pretraining methods, on raw EEG or latent representation spaces, EEGM2 shows superior performance on long-sequence tasks, where conventional models struggle. Our experimental results on six EEG datasets validate that EEGM2 not only achieves state-of-the-art cross-domain accuracy but also reduces computational overhead, making it a more efficient solution for deployment on resource-constrained BCI devices.

[LG-38] Mitigating Attrition: Data-Driven Approach Using Machine Learning and Data Engineering

Link: https://arxiv.org/abs/2502.17865
Authors: Naveen Edapurath Vijayan
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Software Engineering (cs.SE)
*Comments: 7 pages

Abstract:This paper presents a novel data-driven approach to mitigating employee attrition using machine learning and data engineering techniques. The proposed framework integrates data from various human resources systems and leverages advanced feature engineering to capture a comprehensive set of factors influencing attrition. The study outlines a robust modeling approach that addresses challenges such as imbalanced datasets, categorical data handling, and model interpretation. The methodology includes careful consideration of training and testing strategies, baseline model establishment, and the development of calibrated predictive models. The research emphasizes the importance of model interpretation using techniques like SHAP values to provide actionable insights for organizations. Key design choices in algorithm selection, hyperparameter tuning, and probability calibration are discussed. This approach enables organizations to proactively identify attrition risks and develop targeted retention strategies, ultimately redu

[LG-39] Armada: Memory-Efficient Distributed Training of Large-Scale Graph Neural Networks

Link: https://arxiv.org/abs/2502.17846
Authors: Roger Waleffe,Devesh Sarda,Jason Mohoney,Emmanouil-Vasileios Vlatakis-Gkaragkounis,Theodoros Rekatsinas,Shivaram Venkataraman
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:

Abstract:We study distributed training of Graph Neural Networks (GNNs) on billion-scale graphs that are partitioned across machines. Efficient training in this setting relies on min-edge-cut partitioning algorithms, which minimize cross-machine communication due to GNN neighborhood sampling. Yet, min-edge-cut partitioning over large graphs remains a challenge: State-of-the-art (SoTA) offline methods (e.g., METIS) are effective, but they require orders of magnitude more memory and runtime than GNN training itself, while computationally efficient algorithms (e.g., streaming greedy approaches) suffer from increased edge cuts. Thus, in this work we introduce Armada, a new end-to-end system for distributed GNN training whose key contribution is GREM, a novel min-edge-cut partitioning algorithm that can efficiently scale to large graphs. GREM builds on streaming greedy approaches with one key addition: prior vertex assignments are continuously refined during streaming, rather than frozen after an initial greedy selection. Our theoretical analysis and experimental results show that this refinement is critical to minimizing edge cuts and enables GREM to reach partition quality comparable to METIS but with 8-65x less memory and 8-46x faster. Given a partitioned graph, Armada leverages a new disaggregated architecture for distributed GNN training to further improve efficiency; we find that on common cloud machines, even with zero communication, GNN neighborhood sampling and feature loading bottleneck training. Disaggregation allows Armada to independently allocate resources for these operations and ensure that expensive GPUs remain saturated with computation. We evaluate Armada against SoTA systems for distributed GNN training and find that the disaggregated architecture leads to runtime improvements up to 4.5x and cost reductions up to 3.1x.
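
The idea of refining earlier vertex assignments while streaming can be sketched as follows. This is not GREM: it is a generic streaming greedy partitioner with a simple re-placement pass added for illustration, and the function names, capacity rule, and tie-breaking are all assumptions.

```python
from typing import Dict, List


def edge_cut(adj: Dict[int, List[int]], part: Dict[int, int]) -> int:
    """Count undirected edges whose endpoints sit in different partitions."""
    return sum(1 for u in adj for v in adj[u] if u < v and part[u] != part[v])


def greedy_choice(v, adj, part, sizes, capacity):
    """Pick the non-full partition holding most of v's already-placed neighbours."""
    best, best_key = None, None
    for p in range(len(sizes)):
        if sizes[p] >= capacity:
            continue
        score = sum(1 for n in adj[v] if part.get(n) == p)
        key = (score, -sizes[p], -p)  # ties: smaller partition, then lower id
        if best_key is None or key > best_key:
            best, best_key = p, key
    return best


def stream_partition(adj, order, k, capacity, refine=True):
    """Assign vertices in stream order; optionally re-place placed neighbours."""
    part: Dict[int, int] = {}
    sizes = [0] * k
    for v in order:
        p = greedy_choice(v, adj, part, sizes, capacity)
        part[v] = p
        sizes[p] += 1
        if refine:
            for n in adj[v]:
                if n in part:
                    # Take the neighbour out and let the greedy rule place it again.
                    sizes[part.pop(n)] -= 1
                    new_p = greedy_choice(n, adj, part, sizes, capacity)
                    part[n] = new_p
                    sizes[new_p] += 1
    return part


# Two triangles joined by the single edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
refined = stream_partition(adj, [0, 3, 1, 4, 2, 5], k=2, capacity=3)
```

With `refine=False` every assignment is frozen once made, which is the baseline behaviour the abstract contrasts against.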

[LG-40] LeanKAN: A Parameter-Lean Kolmogorov-Arnold Network Layer with Improved Memory Efficiency and Convergence Behavior

Link: https://arxiv.org/abs/2502.17844
Authors: Benjamin C. Koenig,Suyong Kim,Sili Deng
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*Comments: 15 pages, 5 figures, and 1 table

Abstract:The recently proposed Kolmogorov-Arnold network (KAN) is a promising alternative to multi-layer perceptrons (MLPs) for data-driven modeling. While original KAN layers were only capable of representing the addition operator, the recently-proposed MultKAN layer combines addition and multiplication subnodes in an effort to improve representation performance. Here, we find that MultKAN layers suffer from a few key drawbacks including limited applicability in output layers, bulky parameterizations with extraneous activations, and the inclusion of complex hyperparameters. To address these issues, we propose LeanKANs, a direct and modular replacement for MultKAN and traditional AddKAN layers. LeanKANs address these three drawbacks of MultKAN through general applicability as output layers, significantly reduced parameter counts for a given network structure, and a smaller set of hyperparameters. As a one-to-one layer replacement for standard AddKAN and MultKAN layers, LeanKAN is able to provide these benefits to traditional KAN learning problems as well as augmented KAN structures in which it serves as the backbone, such as KAN Ordinary Differential Equations (KAN-ODEs) or Deep Operator KANs (DeepOKAN). We demonstrate LeanKAN’s simplicity and efficiency in a series of demonstrations carried out across both a standard KAN toy problem and a KAN-ODE dynamical system modeling problem, where we find that its sparser parameterization and compact structure serve to increase its expressivity and learning capability, leading it to outperform similar and even much larger MultKANs in various tasks.

[LG-41] Task-Driven Semantic Quantization and Imitation Learning for Goal-Oriented Communications

Link: https://arxiv.org/abs/2502.17842
Authors: Yu-Chieh Chao,Yubei Chen,Weiwei Wang,Achintha Wijesinghe,Suchinthaka Wanninayaka,Songyang Zhang,Zhi Ding
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*Comments: Accepted for publication in 2025 International Conference on Communications (IEEE ICC); 6 pages, 4 figures

Abstract:Semantic communication marks a new paradigm shift from bit-wise data transmission to semantic information delivery for the purpose of bandwidth reduction. To more effectively carry out specialized downstream tasks at the receiver end, it is crucial to define the most critical semantic message in the data based on the task or goal-oriented features. In this work, we propose a novel goal-oriented communication (GO-COM) framework, namely Goal-Oriented Semantic Variational Autoencoder (GOS-VAE), by focusing on the extraction of the semantics vital to the downstream tasks. Specifically, we adopt a Vector Quantized Variational Autoencoder (VQ-VAE) to compress media data at the transmitter side. Instead of targeting the pixel-wise image data reconstruction, we measure the quality-of-service at the receiver end based on a pre-defined task-incentivized model. Moreover, to capture the relevant semantic features in the data reconstruction, imitation learning is adopted to measure the data regeneration quality in terms of goal-oriented semantics. Our experimental results demonstrate the power of imitation learning in characterizing goal-oriented semantics and bandwidth efficiency of our proposed GOS-VAE.
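
The bandwidth saving in a VQ-VAE style transmitter comes from sending codebook indices instead of raw vectors. The following is a minimal sketch of that quantization step only, with a hypothetical hand-written codebook; it is not the GOS-VAE model and involves no training.

```python
from typing import List, Sequence, Tuple


def quantize(vector: Sequence[float], codebook: List[Sequence[float]]) -> Tuple[int, Sequence[float]]:
    """Return the index and codeword of the nearest entry (squared Euclidean)."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


# Hypothetical 2-D codebook shared by transmitter and receiver.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
index, codeword = quantize([0.9, 1.2], codebook)
```

Only `index` needs to cross the channel; the receiver looks up the same codebook to recover `codeword`.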

[LG-42] Safe Multi-Agent Navigation guided by Goal-Conditioned Safe Reinforcement Learning

Link: https://arxiv.org/abs/2502.17813
Authors: Meng Feng,Viraj Parimi,Brian Williams
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments: Due to the limitation “The abstract field cannot be longer than 1,920 characters”, the abstract here is shorter than that in the PDF file

Abstract:Safe navigation is essential for autonomous systems operating in hazardous environments. Traditional planning methods excel at long-horizon tasks but rely on a predefined graph with fixed distance metrics. In contrast, safe Reinforcement Learning (RL) can learn complex behaviors without relying on manual heuristics but fails to solve long-horizon tasks, particularly in goal-conditioned and multi-agent scenarios. In this paper, we introduce a novel method that integrates the strengths of both planning and safe RL. Our method leverages goal-conditioned RL and safe RL to learn a goal-conditioned policy for navigation while concurrently estimating cumulative distance and safety levels using learned value functions via an automated self-training algorithm. By constructing a graph with states from the replay buffer, our method prunes unsafe edges and generates a waypoint-based plan that the agent follows until reaching its goal, effectively balancing faster and safer routes over extended distances. Utilizing this unified high-level graph and a shared low-level goal-conditioned safe RL policy, we extend this approach to address the multi-agent safe navigation problem. In particular, we leverage Conflict-Based Search (CBS) to create waypoint-based plans for multiple agents allowing for their safe navigation over extended horizons. This integration enhances the scalability of goal-conditioned safe RL in multi-agent scenarios, enabling efficient coordination among agents. Extensive benchmarking against state-of-the-art baselines demonstrates the effectiveness of our method in achieving distance goals safely for multiple agents in complex and hazardous environments. Our code will be released to support future research. 
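
The single-agent planning step described above (build a graph, prune unsafe edges, extract waypoints) can be sketched with a standard shortest-path search. This is an illustration under assumptions: the edge distances and safety costs are hand-written stand-ins for the learned value estimates, `max_cost` is a hypothetical threshold, and the multi-agent CBS layer is omitted.

```python
import heapq
from typing import Dict, List, Optional, Tuple


def plan_waypoints(
    edges: Dict[Tuple[str, str], Tuple[float, float]],
    start: str,
    goal: str,
    max_cost: float,
) -> Optional[List[str]]:
    """Drop edges whose safety cost exceeds max_cost, then run Dijkstra on distance."""
    graph: Dict[str, List[Tuple[str, float]]] = {}
    for (u, v), (dist, cost) in edges.items():
        if cost <= max_cost:  # prune unsafe edges
            graph.setdefault(u, []).append((v, dist))

    best = {start: 0.0}
    parent: Dict[str, str] = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            path = [u]
            while path[-1] != start:
                path.append(parent[path[-1]])
            return path[::-1]
        if d > best.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < best.get(v, float("inf")):
                best[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return None  # goal unreachable through safe edges


# (distance estimate, safety cost estimate) per directed edge.
edges = {
    ("s", "a"): (1.0, 0.9),
    ("a", "g"): (1.0, 0.1),
    ("s", "b"): (2.0, 0.1),
    ("b", "g"): (2.0, 0.2),
}
safe_plan = plan_waypoints(edges, "s", "g", max_cost=0.5)
```

Raising `max_cost` lets the planner use the shorter but riskier route through `a`, which is the distance versus safety trade-off the abstract refers to.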

[LG-43] PVBF: A Framework for Mitigating Parameter Variation Imbalance in Online Continual Learning

Link: https://arxiv.org/abs/2502.17794
Authors: Zelin Tao,Hao Deng,Mingqing Liu,Lijun Zhang,Shengjie Zhao
Categories: Machine Learning (cs.LG)
*Comments: 27 pages, 11 figures

Abstract:Online continual learning (OCL), which enables AI systems to adaptively learn from non-stationary data streams, is commonly achieved using experience replay (ER)-based methods that retain knowledge by replaying stored past samples during training. However, these methods face challenges of prediction bias, stemming from deviations in parameter update directions during task transitions. This paper identifies parameter variation imbalance as a critical factor contributing to prediction bias in ER-based OCL. Specifically, using the proposed parameter variation evaluation method, we highlight two types of imbalance: correlation-induced imbalance, where certain parameters are disproportionately updated across tasks, and layer-wise imbalance, where output layer parameters update faster than those in preceding layers. To mitigate the above imbalances, we propose the Parameter Variation Balancing Framework (PVBF), which incorporates: 1) a novel method to compute parameter correlations with previous tasks based on parameter variations, 2) an encourage-and-consolidate (EC) method utilizing parameter correlations to perform gradient adjustments across all parameters during training, 3) a dual-layer copy weights with reinit (D-CWR) strategy to slowly update output layer parameters for frequently occurring sample categories. Experiments on short and long task sequences demonstrate that PVBF significantly reduces prediction bias and improves OCL performance, achieving up to 47% higher accuracy compared to existing ER-based methods.

[LG-44] On-device edge learning for IoT data streams: a survey

Link: https://arxiv.org/abs/2502.17788
Authors: Afonso Lourenço,João Rodrigo,João Gama,Goreti Marreiros
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:This literature review explores continual learning methods for on-device training in the context of neural networks (NNs) and decision trees (DTs) for classification tasks on smart environments. We highlight key constraints, such as data architecture (batch vs. stream) and network capacity (cloud vs. edge), which impact TinyML algorithm design, due to the uncontrolled natural arrival of data streams. The survey details the challenges of deploying deep learners on resource-constrained edge devices, including catastrophic forgetting, data inefficiency, and the difficulty of handling IoT tabular data in open-world settings. While decision trees are more memory-efficient for on-device training, they are limited in expressiveness, requiring dynamic adaptations, like pruning and meta-learning, to handle complex patterns and concept drifts. We emphasize the importance of multi-criteria performance evaluation tailored to edge applications, which assess both output-based and internal representation metrics. The key challenge lies in integrating these building blocks into autonomous online systems, taking into account stability-plasticity trade-offs, forward-backward transfer, and model convergence.

[LG-45] Adaptive Nesterov Accelerated Distributional Deep Hedging for Efficient Volatility Risk Management

Link: https://arxiv.org/abs/2502.17777
Authors: Lei Zhao,Lin Cai,Wu-Sheng Lu
Subjects: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*Comments:

Abstract:In the field of financial derivatives trading, managing volatility risk is crucial for protecting investment portfolios from market changes. Traditional Vega hedging strategies, which often rely on basic and rule-based models, adapt poorly to rapidly changing market conditions. We introduce a new framework for dynamic Vega hedging, the Adaptive Nesterov Accelerated Distributional Deep Hedging (ANADDH), which combines distributional reinforcement learning with a tailored design based on adaptive Nesterov acceleration. This approach improves the learning process in complex financial environments by modeling the hedging efficiency distribution, providing a more accurate and responsive hedging strategy. The design of adaptive Nesterov acceleration refines gradient momentum adjustments, significantly enhancing the stability and speed of convergence of the model. Through empirical analysis and comparisons, our method demonstrates substantial performance gains over existing hedging techniques. Our results confirm that this innovative combination of distributional reinforcement learning with the proposed optimization techniques improves financial risk management and highlights the practical benefits of implementing advanced neural network architectures in the finance sector.

[LG-46] An Improved Privacy and Utility Analysis of Differentially Private SGD with Bounded Domain and Smooth Losses

Link: https://arxiv.org/abs/2502.17772
Authors: Hao Liang,Wanrong Zhang,Xinlei He,Kaishun He,Hong Xing
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*Comments: 18 pages, 2 figures, submitted for possible publication

Abstract:Differentially Private Stochastic Gradient Descent (DPSGD) is widely used to protect sensitive data during the training of machine learning models, but its privacy guarantees often come at the cost of model performance, largely due to the inherent challenge of accurately quantifying privacy loss. While recent efforts have strengthened privacy guarantees by focusing solely on the final output and bounded domain cases, they still impose restrictive assumptions, such as convexity and other parameter limitations, and often lack a thorough analysis of utility. In this paper, we provide rigorous privacy and utility characterization for DPSGD for smooth loss functions in both bounded and unbounded domains. We track the privacy loss over multiple iterations by exploiting the noisy smooth-reduction property and establish the utility analysis by leveraging the projection’s non-expansiveness and clipped SGD properties. In particular, we show that for DPSGD with a bounded domain, (i) the privacy loss can still converge without the convexity assumption, and (ii) a smaller bounded diameter can improve both privacy and utility simultaneously under certain conditions. Numerical experiments support these findings.
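
To make the ingredients named in the abstract above concrete (gradient clipping, Gaussian noise, and projection onto a bounded domain), here is a minimal sketch of one projected DP-SGD step in plain Python. It is a generic textbook-style step, not the paper's algorithm or its privacy accounting; the function names, the choice of an L2 ball as the bounded domain, and all parameter values are assumptions made for illustration.

```python
import math
import random


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def clip(v, max_norm):
    """Scale v so that its L2 norm is at most max_norm."""
    n = l2_norm(v)
    scale = min(1.0, max_norm / n) if n > 0 else 1.0
    return [x * scale for x in v]


def project_to_ball(w, radius):
    """Euclidean projection onto the L2 ball of the given radius (assumed domain)."""
    return clip(w, radius)


def dpsgd_step(w, per_example_grads, lr, clip_norm, noise_multiplier, radius, rng):
    """One step: clip per-example gradients, add noise to their sum, step, project."""
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    batch = len(clipped)
    noisy_mean = []
    for i in range(len(w)):
        total = sum(g[i] for g in clipped)
        # Gaussian noise with standard deviation noise_multiplier * clip_norm
        total += rng.gauss(0.0, noise_multiplier * clip_norm)
        noisy_mean.append(total / batch)
    w_new = [wi - lr * gi for wi, gi in zip(w, noisy_mean)]
    return project_to_ball(w_new, radius)


rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 1.0]]  # toy per-example gradients
w_next = dpsgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                    noise_multiplier=1.0, radius=0.5, rng=rng)
```

The projection keeps every iterate inside the ball, so the domain diameter here is `2 * radius`; this is the quantity that the abstract relates to both privacy and utility.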

[LG-47] Applications of deep reinforcement learning to urban transit network design

Link: https://arxiv.org/abs/2502.17758
Authors: Andrew Holliday
Subjects: Machine Learning (cs.LG)
*Comments: This is a copy of my PhD thesis, which was successfully defended at McGill University in December of 2024. arXiv admin note: text overlap with arXiv:2404.05894

Abstract:This thesis concerns the use of reinforcement learning to train neural networks to aid in the design of public transit networks. The Transit Network Design Problem (TNDP) is an optimization problem of considerable practical importance. Given a city with an existing road network and travel demands, the goal is to find a set of transit routes - each of which is a path through the graph - that collectively satisfy all demands, while minimizing a cost function that may depend both on passenger satisfaction and operating costs. The existing literature on this problem mainly considers metaheuristic optimization algorithms, such as genetic algorithms and ant-colony optimization. By contrast, we begin by taking a reinforcement learning approach, formulating the construction of a set of transit routes as a Markov Decision Process (MDP) and training a neural net policy to act as the agent in this MDP. We then show that, beyond using this policy to plan a transit network directly, it can be combined with existing metaheuristic algorithms, both to initialize the solution and to suggest promising moves at each step of a search through solution space. We find that such hybrid algorithms, which use a neural policy trained via reinforcement learning as a core component within a classical metaheuristic framework, can plan transit networks that are superior to those planned by either the neural policy or the metaheuristic algorithm. We demonstrate the utility of our approach by using it to redesign the transit network for the city of Laval, Quebec, and show that in simulation, the resulting transit network provides better service at lower cost than the existing transit network.

[LG-48] Robust and Efficient Deep Hedging via Linearized Objective Neural Network

Link: https://arxiv.org/abs/2502.17757
Authors: Lei Zhao,Lin Cai
Subjects: Machine Learning (cs.LG); Risk Management (q-fin.RM)
*Comments:

Abstract:Deep hedging represents a cutting-edge approach to risk management for financial derivatives by leveraging the power of deep learning. However, existing methods often face challenges related to computational inefficiency, sensitivity to noisy data, and optimization complexity, limiting their practical applicability in dynamic and volatile markets. To address these limitations, we propose Deep Hedging with Linearized-objective Neural Network (DHLNN), a robust and generalizable framework that enhances the training procedure of deep learning models. By integrating a periodic fixed-gradient optimization method with linearized training dynamics, DHLNN stabilizes the training process, accelerates convergence, and improves robustness to noisy financial data. The framework incorporates trajectory-wide optimization and Black-Scholes Delta anchoring, ensuring alignment with established financial theory while maintaining flexibility to adapt to real-world market conditions. Extensive experiments on synthetic and real market data validate the effectiveness of DHLNN, demonstrating its ability to achieve faster convergence, improved stability, and superior hedging performance across diverse market scenarios.

[LG-49] FinP: Fairness-in-Privacy in Federated Learning by Addressing Disparities in Privacy Risk

Link: https://arxiv.org/abs/2502.17748
Authors: Tianyu Zhao,Mahmoud Srewa,Salma Elmalaki
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments:

Abstract:Ensuring fairness in machine learning, particularly in human-centric applications, extends beyond algorithmic bias to encompass fairness in privacy, specifically the equitable distribution of privacy risk. This is critical in federated learning (FL), where decentralized data necessitates balanced privacy preservation across clients. We introduce FinP, a framework designed to achieve fairness in privacy by mitigating disproportionate exposure to source inference attacks (SIA). FinP employs a dual approach: (1) server-side adaptive aggregation to address unfairness in client contributions in the global model, and (2) client-side regularization to reduce client vulnerability. This comprehensive strategy targets both the symptoms and root causes of privacy unfairness. Evaluated on the Human Activity Recognition (HAR) and CIFAR-10 datasets, FinP demonstrates ~20% improvement in fairness in privacy on HAR with minimal impact on model utility, and effectively mitigates SIA risks on CIFAR-10, showcasing its ability to provide fairness in privacy in FL systems without compromising performance.

[LG-50] Toward 6-DOF Autonomous Underwater Vehicle Energy-Aware Position Control based on Deep Reinforcement Learning: Preliminary Results

Link: https://arxiv.org/abs/2502.17742
Authors: Gustavo Boré(1),Vicente Sufán(1),Sebastián Rodríguez-Martínez(2),Giancarlo Troni(2) ((1) Pontificia Universidad Católica de Chile, (2) Monterey Bay Aquarium Research Institute)
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: 6 pages, 5 figures, submitted to 2024 IEEE OES AUV Symposium

Abstract:The use of autonomous underwater vehicles (AUVs) for surveying, mapping, and inspecting unexplored underwater areas plays a crucial role, where maneuverability and power efficiency are key factors for extending the use of these platforms, making six degrees of freedom (6-DOF) holonomic platforms essential tools. Although Proportional-Integral-Derivative (PID) and Model Predictive Control controllers are widely used in these applications, they often require accurate system knowledge, struggle with repeatability when facing payload or configuration changes, and can be time-consuming to fine-tune. While more advanced methods based on Deep Reinforcement Learning (DRL) have been proposed, they are typically limited to operating in fewer degrees of freedom. This paper proposes a novel DRL-based approach for controlling holonomic 6-DOF AUVs using the Truncated Quantile Critics (TQC) algorithm, which does not require manual tuning and directly feeds commands to the thrusters without prior knowledge of their configuration. Furthermore, it incorporates power consumption directly into the reward function. Simulation results show that the TQC High-Performance method achieves better performance than a fine-tuned PID controller when reaching a goal point, while the TQC Energy-Aware method demonstrates slightly lower performance but consumes 30% less power on average.

[LG-51] Phoeni6: a Systematic Approach for Evaluating the Energy Consumption of Neural Networks

Link: https://arxiv.org/abs/2502.17734
Authors: Antônio Oliveira-Filho,Wellington Silva-de-Souza,Carlos Alberto Valderrama Sakuyama,Samuel Xavier-de-Souza
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
*Comments: The paper consists of 24 pages and 25 figures. It is currently under review at the journal Sustainable Computing: Informatics and Systems

Abstract:This paper presents Phoeni6, a systematic approach for assessing the energy consumption of neural networks while upholding the principles of fair comparison and reproducibility. Phoeni6 offers a comprehensive solution for managing energy-related data and configurations, ensuring portability, transparency, and coordination during evaluations. The methodology automates energy evaluations through containerized tools, robust database management, and versatile data models. In the first case study, the energy consumption of AlexNet and MobileNet was compared using raw and resized images. Results showed that MobileNet is up to 6.25% more energy-efficient for raw images and 2.32% for resized datasets, while maintaining competitive accuracy levels. In the second study, the impact of image file formats on energy consumption was evaluated. BMP images reduced energy usage by up to 30% compared to PNG, highlighting the influence of file formats on energy efficiency. These findings emphasize the importance of Phoeni6 in optimizing energy consumption for diverse neural network applications and establishing sustainable artificial intelligence practices.

[LG-52] Learning Backbones: Sparsifying Graphs through Zero Forcing for Effective Graph-Based Learning

Link: https://arxiv.org/abs/2502.17713
Authors: Obaid Ullah Ahmad,Anwar Said,Mudassir Shabbir,Xenofon Koutsoukos,Waseem Abbas
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*Comments: 13th International Conference on Complex Networks and their Applications

Abstract:This paper introduces a novel framework for graph sparsification that preserves the essential learning attributes of original graphs, improving computational efficiency and reducing complexity in learning algorithms. We refer to these sparse graphs as “learning backbones”. Our approach leverages the zero-forcing (ZF) phenomenon, a dynamic process on graphs with applications in network control. The key idea is to generate a tree from the original graph that retains critical dynamical properties. By correlating these properties with learning attributes, we construct effective learning backbones. We evaluate the performance of our ZF-based backbones in graph classification tasks across eight datasets and six baseline models. The results demonstrate that our method outperforms existing techniques. Additionally, we explore extensions using node distance metrics to further enhance the framework’s utility.
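
For readers unfamiliar with zero forcing, the sketch below simulates the standard color-change rule on an undirected graph: a colored vertex that has exactly one uncolored neighbor forces that neighbor to become colored. The recorded forcing edges form chains inside the original graph. This is only the textbook process, not the paper's backbone construction; the function name and the adjacency-dict representation are assumptions.

```python
def zero_forcing_closure(adjacency, initial_colored):
    """Apply the zero-forcing rule until no vertex can force another.

    adjacency: dict mapping each vertex to a list of its neighbors.
    Returns the final colored set and the list of (forcer, forced) edges.
    """
    colored = set(initial_colored)
    forces = []
    changed = True
    while changed:
        changed = False
        for u in sorted(colored):  # snapshot, so the set can grow safely
            uncolored = [v for v in adjacency[u] if v not in colored]
            if len(uncolored) == 1:  # u has exactly one uncolored neighbor
                v = uncolored[0]
                colored.add(v)
                forces.append((u, v))
                changed = True
    return colored, forces


# A path 0-1-2-3 is fully colored starting from one endpoint.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
path_colored, path_forces = zero_forcing_closure(path, {0})

# A star with center 0 is not: after the center is forced it has two uncolored leaves.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
star_colored, star_forces = zero_forcing_closure(star, {1})
```

An initial set whose closure covers every vertex is called a zero forcing set; the path example needs only one vertex, while the star needs more.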

[LG-53] Robust Federated Learning with Global Sensitivity Estimation for Financial Risk Management

Link: https://arxiv.org/abs/2502.17694
Authors: Lei Zhao,Lin Cai,Wu-Sheng Lu
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:

Abstract:In decentralized financial systems, robust and efficient Federated Learning (FL) is promising to handle diverse client environments and ensure resilience to systemic risks. We propose Federated Risk-Aware Learning with Central Sensitivity Estimation (FRAL-CSE), an innovative FL framework designed to enhance scalability, stability, and robustness in collaborative financial decision-making. The framework’s core innovation lies in a central acceleration mechanism, guided by a quadratic sensitivity-based approximation of global model dynamics. By leveraging local sensitivity information derived from robust risk measurements, FRAL-CSE performs a curvature-informed global update that efficiently incorporates second-order information without requiring repeated local re-evaluations, thereby enhancing training efficiency and improving optimization stability. Additionally, distortion risk measures are embedded into the training objectives to capture tail risks and ensure robustness against extreme scenarios. Extensive experiments validate the effectiveness of FRAL-CSE in accelerating convergence and improving resilience across heterogeneous datasets compared to state-of-the-art baselines.

[LG-54] Predictive Response Optimization: Using Reinforcement Learning to Fight Online Social Network Abuse USENIX-SECURITY2025

Link: https://arxiv.org/abs/2502.17693
Authors: Garrett Wilson,Geoffrey Goh,Yan Jiang,Ajay Gupta,Jiaxuan Wang,David Freeman,Francesco Dinuzzo
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
*Comments: To appear in USENIX Security 2025

Abstract:Detecting phishing, spam, fake accounts, data scraping, and other malicious activity in online social networks (OSNs) is a problem that has been studied for well over a decade, with a number of important results. Nearly all existing works on abuse detection have as their goal producing the best possible binary classifier; i.e., one that labels unseen examples as “benign” or “malicious” with high precision and recall. However, no prior published work considers what comes next: what does the service actually do after it detects abuse? In this paper, we argue that detection as described in previous work is not the goal of those who are fighting OSN abuse. Rather, we believe the goal to be selecting actions (e.g., ban the user, block the request, show a CAPTCHA, or “collect more evidence”) that optimize a tradeoff between harm caused by abuse and impact on benign users. With this framing, we see that enlarging the set of possible actions allows us to move the Pareto frontier in a way that is unattainable by simply tuning the threshold of a binary classifier. To demonstrate the potential of our approach, we present Predictive Response Optimization (PRO), a system based on reinforcement learning that utilizes available contextual information to predict future abuse and user-experience metrics conditioned on each possible action, and select actions that optimize a multi-dimensional tradeoff between abuse/harm and impact on user experience. We deployed versions of PRO targeted at stopping automated activity on Instagram and Facebook. In both cases our experiments showed that PRO outperforms a baseline classification system, reducing abuse volume by 59% and 4.5% (respectively) with no negative impact to users. We also present several case studies that demonstrate how PRO can quickly and automatically adapt to changes in business constraints, system behavior, and/or adversarial tactics. 

[LG-55] Architecting Digital Twins for Intelligent Transportation Systems

Link: https://arxiv.org/abs/2502.17646
Authors: Hiya Bhatt,Sahil,Karthik Vaidhyanathan,Rahul Biju,Deepak Gangadharan,Ramona Trestian,Purav Shah
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
*Comments:

Abstract:Modern transportation systems face growing challenges in managing traffic flow, ensuring safety, and maintaining operational efficiency amid dynamic traffic patterns. Addressing these challenges requires intelligent solutions capable of real-time monitoring, predictive analytics, and adaptive control. This paper proposes an architecture for DigIT, a Digital Twin (DT) platform for Intelligent Transportation Systems (ITS), designed to overcome the limitations of existing frameworks by offering a modular and scalable solution for traffic management. Built on a Domain Concept Model (DCM), the architecture systematically models key ITS components enabling seamless integration of predictive modeling and simulations. The architecture leverages machine learning models to forecast traffic patterns based on historical and real-time data. To adapt to evolving traffic patterns, the architecture incorporates adaptive Machine Learning Operations (MLOps), automating the deployment and lifecycle management of predictive models. Evaluation results highlight the effectiveness of the architecture in delivering accurate predictions and computational efficiency.

[LG-56] The Power of Graph Signal Processing for Chip Placement Acceleration

Link: https://arxiv.org/abs/2502.17632
Authors: Yiting Liu,Hai Zhou,Jia Wang,Fan Yang,Xuan Zeng,Li Shang
Subjects: Machine Learning (cs.LG)
*Comments: ICCAD’24 conference

Abstract:Placement is a critical task with high computation complexity in VLSI physical design. Modern analytical placers formulate the placement objective as a nonlinear optimization task, which suffers a long iteration time. To accelerate and enhance the placement process, recent studies have turned to deep learning-based approaches, particularly leveraging graph convolution networks (GCNs). However, learning-based placers require time- and data-consuming model training due to the complexity of circuit placement that involves large-scale cells and design-specific graph statistics. This paper proposes GiFt, a parameter-free technique for accelerating placement, rooted in graph signal processing. GiFt excels at capturing multi-resolution smooth signals of circuit graphs to generate optimized placement solutions without the need for time-consuming model training, and meanwhile significantly reduces the number of iterations required by analytical placers. Experimental results show that GiFt significantly improves placement efficiency, while achieving competitive or superior performance compared to state-of-the-art placers. In particular, compared to DREAMPlace, the recently proposed GPU-accelerated analytical placer, GF-Placer improves total runtime by over 45%.

[LG-57] Instance-Dependent Regret Bounds for Learning Two-Player Zero-Sum Games with Bandit Feedback

Link: https://arxiv.org/abs/2502.17625
Authors: Shinji Ito,Haipeng Luo,Taira Tsuchiya,Yue Wu
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*Comments:

Abstract:No-regret self-play learning dynamics have become one of the premier ways to solve large-scale games in practice. Accelerating their convergence via improving the regret of the players over the naive O(\sqrt{T}) bound after T rounds has been extensively studied in recent years, but almost all studies assume access to exact gradient feedback. We address the question of whether acceleration is possible under bandit feedback only and provide an affirmative answer for two-player zero-sum normal-form games. Specifically, we show that if both players apply the Tsallis-INF algorithm of Zimmert and Seldin (2018, arXiv:1807.07623), then their regret is at most O(c_1 \log T + \sqrt{c_2 T}), where c_1 and c_2 are game-dependent constants that characterize the difficulty of learning – c_1 resembles the complexity of learning a stochastic multi-armed bandit instance and depends inversely on some gap measures, while c_2 can be much smaller than the number of actions when the Nash equilibria have a small support or are close to the boundary. In particular, for the case when a pure strategy Nash equilibrium exists, c_2 becomes zero, leading to an optimal instance-dependent regret bound as we show. We additionally prove that in this case, our algorithm also enjoys last-iterate convergence and can identify the pure strategy Nash equilibrium with near-optimal sample complexity.
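
The self-play setting in the abstract above can be illustrated with a small sketch in which both players run a Tsallis-entropy follow-the-regularized-leader update under bandit feedback. The weights below minimize the estimated cumulative loss plus the regularizer -(4/eta) * sum(sqrt(w_i)) over the simplex, which gives w_i = 4 / (eta * (L_i - x))^2 for a normalizing constant x found here by bisection. The learning-rate schedule `eta = 1 / sqrt(t)`, the function names, and the toy game are assumptions for illustration; the cited papers should be consulted for the exact constants and estimators.

```python
import math
import random


def tsallis_inf_weights(cum_losses, eta, iters=100):
    """Solve sum_i 4 / (eta * (L_i - x))**2 = 1 for x below min(L) by bisection."""
    k = len(cum_losses)
    smallest = min(cum_losses)
    lo = smallest - 2.0 * math.sqrt(k) / eta  # weights sum to at most 1 here
    hi = smallest - 2.0 / eta                 # weights sum to at least 1 here
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(4.0 / (eta * (loss - mid)) ** 2 for loss in cum_losses)
        if total > 1.0:
            hi = mid
        else:
            lo = mid
    x = 0.5 * (lo + hi)
    raw = [4.0 / (eta * (loss - x)) ** 2 for loss in cum_losses]
    norm = sum(raw)
    return [r / norm for r in raw]  # renormalize away residual bisection error


def self_play(loss_matrix, rounds, seed=0):
    """Both players apply the update above using importance-weighted estimates."""
    rng = random.Random(seed)
    n_rows, n_cols = len(loss_matrix), len(loss_matrix[0])
    row_est, col_est = [0.0] * n_rows, [0.0] * n_cols
    p, q = [1.0 / n_rows] * n_rows, [1.0 / n_cols] * n_cols
    for t in range(1, rounds + 1):
        eta = 1.0 / math.sqrt(t)  # assumed schedule, for illustration only
        p = tsallis_inf_weights(row_est, eta)
        q = tsallis_inf_weights(col_est, eta)
        i = rng.choices(range(n_rows), weights=p)[0]
        j = rng.choices(range(n_cols), weights=q)[0]
        loss = loss_matrix[i][j]            # row player's loss, assumed in [0, 1]
        row_est[i] += loss / p[i]           # bandit feedback: only entry (i, j) is seen
        col_est[j] += (1.0 - loss) / q[j]   # zero-sum: column player's loss is 1 - loss
    return p, q


weights = tsallis_inf_weights([0.0, 1.0, 2.0], eta=1.0)
# Toy game in which row 1 dominates and (row 1, column 1) is a pure equilibrium.
p_final, q_final = self_play([[0.5, 0.8], [0.2, 0.6]], rounds=200)
```

Because each player only observes the loss of the entry actually played, the estimates divide by the sampling probability to stay unbiased.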

[LG-58] Provable Model-Parallel Distributed Principal Component Analysis with Parallel Deflation

Link: https://arxiv.org/abs/2502.17615
Authors: Fangshuo Liao,Wenyi Su,Anastasios Kyrillidis
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments: CPAL 2025

Abstract:We study a distributed Principal Component Analysis (PCA) framework where each worker targets a distinct eigenvector and refines its solution by updating from intermediate solutions provided by peers deemed as “superior”. Drawing intuition from the deflation method in centralized eigenvalue problems, our approach breaks the sequential dependency in the deflation steps and allows asynchronous updates of workers, while incurring only a small communication cost. To our knowledge, a gap in the literature – the theoretical underpinning of such distributed, dynamic interactions among workers – has remained unaddressed. This paper offers a theoretical analysis explaining why, how, and when these intermediate, hierarchical updates lead to practical and provable convergence in distributed environments. Despite being a theoretical work, our prototype implementation demonstrates that such a distributed PCA algorithm converges effectively and in a scalable way: through experiments, our proposed framework offers comparable performance to EigenGame-\mu, the state-of-the-art model-parallel PCA solver.
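
The abstract above builds on the classical deflation method, in which eigenvectors are found one after another and each found component is subtracted before the next is computed. The sketch below shows that sequential baseline with power iteration on a small symmetric matrix. It is the centralized procedure whose sequential dependency the paper aims to break, not the paper's parallel algorithm; the function names, the start vector, and the iteration count are assumptions.

```python
import math


def matvec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def power_iteration(a, iters=500):
    """Approximate the dominant eigenpair of a symmetric PSD matrix."""
    n = len(a)
    v = [1.0 / (i + 1) for i in range(n)]  # fixed, non-degenerate start vector
    for _ in range(iters):
        w = matvec(a, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, matvec(a, v)))  # Rayleigh quotient
    return eigenvalue, v


def deflate(a, eigenvalue, v):
    """Remove the found component: A <- A - lambda * v v^T."""
    n = len(a)
    return [[a[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]


def top_k_eigenpairs(a, k):
    """Sequential deflation: step i cannot start until step i - 1 has finished."""
    pairs = []
    for _ in range(k):
        lam, v = power_iteration(a)
        pairs.append((lam, v))
        a = deflate(a, lam, v)
    return pairs


cov = [[2.0, 1.0], [1.0, 2.0]]  # toy covariance with eigenvalues 3 and 1
(lam1, v1), (lam2, v2) = top_k_eigenpairs(cov, 2)
```

The loop in `top_k_eigenpairs` makes the dependency explicit: the second eigenvector is computed on a matrix that already requires the converged first eigenvector.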

[LG-59] Scalable Graph Condensation with Evolving Capabilities

Link: https://arxiv.org/abs/2502.17614
Authors: Shengbo Gong,Mohammad Hashemi,Juntong Ni,Carl Yang,Wei Jin
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*Comments: 16 pages, 6 figures

Abstract:Graph data has become a pivotal modality due to its unique ability to model relational datasets. However, real-world graph data continues to grow exponentially, resulting in a quadratic increase in the complexity of most graph algorithms as graph sizes expand. Although graph condensation (GC) methods have been proposed to address these scalability issues, existing approaches often treat the training set as static, overlooking the evolving nature of real-world graph data. This limitation leads to inefficiencies when condensing growing training sets. In this paper, we introduce GECC (Graph Evolving Clustering Condensation), a scalable graph condensation method designed to handle large-scale and evolving graph data. GECC employs a traceable and efficient approach by performing class-wise clustering on aggregated features. Furthermore, it can inherit previous condensation results as clustering centroids when the condensed graph expands, thereby attaining an evolving capability. This methodology is supported by robust theoretical foundations and demonstrates superior empirical performance. Comprehensive experiments show that GECC achieves better performance than most state-of-the-art graph condensation methods while delivering a speedup of around 1,000x on large datasets.

[LG-60] Learning Decentralized Swarms Using Rotation Equivariant Graph Neural Networks

Link: https://arxiv.org/abs/2502.17612
Authors: Taos Transue,Bao Wang
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:

Abstract:The orchestration of agents to optimize a collective objective without centralized control is challenging yet crucial for applications such as controlling autonomous fleets, and surveillance and reconnaissance using sensor networks. Decentralized controller design has been inspired by self-organization found in nature, with a prominent source of inspiration being flocking; however, decentralized controllers struggle to maintain flock cohesion. The graph neural network (GNN) architecture has emerged as an indispensable machine learning tool for developing decentralized controllers capable of maintaining flock cohesion, but they fail to exploit the symmetries present in flocking dynamics, hindering their generalizability. We enforce rotation equivariance and translation invariance symmetries in decentralized flocking GNN controllers and achieve comparable flocking control with 70% less training data and 75% fewer trainable weights than existing GNN controllers without these symmetries enforced. We also show that our symmetry-aware controller generalizes better than existing GNN controllers. Code and animations are available at this http URL.

[LG-61] VANPY: Voice Analysis Framework

Link: https://arxiv.org/abs/2502.17579
Authors: Gregory Koushnir,Michael Fire,Galit Fuhrmann Alpert,Dima Kagan
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Comments:

Abstract:Voice data is increasingly being used in modern digital communications, yet there is still a lack of comprehensive tools for automated voice analysis and characterization. To this end, we developed the VANPY (Voice Analysis in Python) framework for automated pre-processing, feature extraction, and classification of voice data. The VANPY is an open-source end-to-end comprehensive framework that was developed for the purpose of speaker characterization from voice data. The framework is designed with extensibility in mind, allowing for easy integration of new components and adaptation to various voice analysis applications. It currently incorporates over fifteen voice analysis components - including music/speech separation, voice activity detection, speaker embedding, vocal feature extraction, and various classification models. Four of the VANPY’s components were developed in-house and integrated into the framework to extend its speaker characterization capabilities: gender classification, emotion classification, age regression, and height regression. The models demonstrate robust performance across various datasets, although not surpassing state-of-the-art performance. As a proof of concept, we demonstrate the framework’s ability to extract speaker characteristics on a use-case challenge of analyzing character voices from the movie “Pulp Fiction.” The results illustrate the framework’s capability to extract multiple speaker characteristics, including gender, age, height, emotion type, and emotion intensity measured across three dimensions: arousal, dominance, and valence.

[LG-62] FedSV: Byzantine-Robust Federated Learning via Shapley Value

Link: https://arxiv.org/abs/2502.17526
Authors: Khaoula Otmani(AU, LIA),Rachid Elazouzi(LIA, CMU),Vincent Labatut(AU, LIA)
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT)
*Comments:

Abstract:In Federated Learning (FL), several clients jointly learn a machine learning model: each client maintains a local model for its local learning dataset, while a master server maintains a global model by aggregating the local models of the client devices. However, the repetitive communication between server and clients leaves room for attacks aimed at compromising the integrity of the global model, causing errors in its targeted predictions. In response to such threats on FL, various defense measures have been proposed in the literature. In this paper, we present a powerful defense against malicious clients in FL, called FedSV, using the Shapley Value (SV), which has been proposed recently to measure user contribution in FL by computing the marginal increase of average accuracy of the model due to the addition of local data of a user. Our approach makes the identification of malicious clients more robust, since during the learning phase, it estimates the contribution of each client according to the different groups to which the target client belongs. FedSV’s effectiveness is demonstrated by extensive experiments on MNIST datasets in a cross-silo context under various attacks.
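
The Shapley Value mentioned in the abstract above averages a client's marginal contribution over all orders in which clients could join a coalition. The sketch below computes it exactly for three clients with a made-up utility function standing in for validation accuracy. It is not the FedSV defense itself: the client names, `toy_utility`, and the rule of flagging clients with a negative value are all hypothetical, and exact enumeration is only feasible for a handful of clients.

```python
from itertools import permutations


def shapley_values(clients, utility):
    """Exact Shapley values by averaging marginal gains over all join orders."""
    values = {c: 0.0 for c in clients}
    orders = list(permutations(clients))
    for order in orders:
        coalition = frozenset()
        prev = utility(coalition)
        for c in order:
            coalition = coalition | {c}
            cur = utility(coalition)
            values[c] += cur - prev  # marginal contribution of c in this order
            prev = cur
    return {c: v / len(orders) for c, v in values.items()}


def toy_utility(coalition):
    """Hypothetical accuracy of a model aggregated from the given clients."""
    acc = 0.5 + 0.1 * sum(1 for c in coalition if c != "mallory")
    if "mallory" in coalition:
        acc -= 0.2  # a poisoned update lowers accuracy in this toy model
    return acc


scores = shapley_values(["alice", "bob", "mallory"], toy_utility)
suspects = [c for c, v in scores.items() if v < 0]  # illustrative flagging rule
```

In a real system `utility` would require training or evaluating a model per coalition, which is why practical schemes estimate these values rather than enumerate every order.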

[LG-63] UNCA: A Neutrosophic-Based Framework for Robust Clustering and Enhanced Data Interpretation

Link: https://arxiv.org/abs/2502.17523
Authors: D. Dhinakaran,S. Edwin Raja,S. Gopalakrishnan,D. Selvaraj,S. D. Lalitha
Subjects: Machine Learning (cs.LG)
*Comments: 17 pages, 8 Figures, 1 Table

Abstract:Accurately representing the complex linkages and inherent uncertainties included in huge datasets is still a major difficulty in the field of data clustering. We address these issues with our proposed Unified Neutrosophic Clustering Algorithm (UNCA), which combines a multifaceted strategy with Neutrosophic logic to improve clustering performance. UNCA starts with a full-fledged similarity examination via a \lambda-cutting matrix that filters meaningful relationships between each pair of data points. Then, we initialize centroids for Neutrosophic K-Means clustering, where the membership values are based on their degrees of truth, indeterminacy and falsity. The algorithm then integrates with a dynamic network visualization and MST (Minimum Spanning Tree) so that a visual interpretation of the relationships between the clusters can be clearly represented. UNCA employs Single-Valued Neutrosophic Sets (SVNSs) to refine cluster assignments, and after fuzzifying similarity measures, guarantees a precise clustering result. The final step involves solidifying the clustering results through defuzzification methods, offering definitive cluster assignments. According to the performance evaluation results, UNCA outperforms conventional approaches in several metrics: it achieved a Silhouette Score of 0.89 on the Iris Dataset, a Davies-Bouldin Index of 0.59 on the Wine Dataset, an Adjusted Rand Index (ARI) of 0.76 on the Digits Dataset, and a Normalized Mutual Information (NMI) of 0.80 on the Customer Segmentation Dataset. These results demonstrate how UNCA enhances interpretability and resilience in addition to improving clustering accuracy when contrasted with Fuzzy C-Means (FCM), Neutrosophic C-Means (NCM), as well as Kernel Neutrosophic C-Means (KNCM). This makes UNCA a useful tool for complex data processing tasks.

[LG-64] Learning multi-phase flow and transport in fractured porous media with auto-regressive and recurrent graph neural networks

Link: https://arxiv.org/abs/2502.17512
Authors: Mohammed Al Kobaisi,Wenjuan Zhang,Waleed Diab,Hadi Hajibeygi
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:

Abstract:In the past three decades, a wide array of computational methodologies and simulation frameworks has emerged to address the complexities of modeling multi-phase flow and transport processes in fractured porous media. The conformal mesh approaches which explicitly align the computational grid with fracture surfaces are considered by many to be the most accurate. However, such methods require excessive fine-scale meshing, rendering them impractical for large or complex fracture networks. In this work, we propose to learn the complex multi-phase flow and transport dynamics in fractured porous media with graph neural networks (GNN). GNNs are well suited for this task due to the unstructured topology of the computation grid resulting from the Embedded Discrete Fracture Model (EDFM) discretization. We propose two deep learning architectures, a GNN and a recurrent GNN. Both networks follow a two-stage training strategy: an autoregressive one step roll-out, followed by a fine-tuning step where the model is supervised using the whole ground-truth sequence. We demonstrate that the two-stage training approach is effective in mitigating error accumulation during autoregressive model rollouts in the testing phase. Our findings indicate that both GNNs generalize well to unseen fracture realizations, with comparable performance in forecasting saturation sequences, and slightly better performance for the recurrent GNN in predicting pressure sequences. While the second stage of training proved to be beneficial for the GNN model, its impact on the recurrent GNN model was less pronounced. Finally, the performance of both GNNs for temporal extrapolation is tested. The recurrent GNN significantly outperformed the GNN in terms of accuracy, thereby underscoring its superior capability in predicting long sequences.

[LG-65] Hard constraint learning approaches with trainable influence functions for evolutionary equations

Link: https://arxiv.org/abs/2502.17497
Authors: Yushi Zhang,Shuai Su,Yong Wang,Yanzhong Yao
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:This paper develops a novel deep learning approach for solving evolutionary equations, which integrates sequential learning strategies with an enhanced hard constraint strategy featuring trainable parameters, addressing the low computational accuracy of standard Physics-Informed Neural Networks (PINNs) in large temporal domains. Sequential learning strategies divide a large temporal domain into multiple subintervals and solve them one by one in a chronological order, which naturally respects the principle of causality and improves the stability of the PINN solution. The improved hard constraint strategy strictly ensures the continuity and smoothness of the PINN solution at time interval nodes, and at the same time passes the information from the previous interval to the next interval, which avoids the incorrect/trivial solution at the position far from the initial time. Furthermore, by investigating the requirements of different types of equations on hard constraints, we design a novel influence function with trainable parameters for hard constraints, which provides theoretical and technical support for the effective implementations of hard constraint strategies, and significantly improves the universality and computational accuracy of our method. In addition, an adaptive time-domain partitioning algorithm is proposed, which plays an important role in the application of the proposed method as well as in the improvement of computational efficiency and accuracy. Numerical experiments verify the performance of the method. The data and code accompanying this paper are available at this https URL.

[LG-66] Spatiotemporal Forecasting in Climate Data Using EOFs and Machine Learning Models: A Case Study in Chile

Link: https://arxiv.org/abs/2502.17495
Authors: Mauricio Herrera,Francisca Kleisinger,Andrés Wilsón
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Applications (stat.AP); Machine Learning (stat.ML)
Comments: 25 pages, 6 figures

Abstract:Effective resource management and environmental planning in regions with high climatic variability, such as Chile, demand advanced predictive tools. This study addresses this challenge by employing an innovative and computationally efficient hybrid methodology that integrates machine learning (ML) methods for time series forecasting with established statistical techniques. The spatiotemporal data undergo decomposition using time-dependent Empirical Orthogonal Functions (EOFs), denoted as (\phi_k(t)), and their corresponding spatial coefficients, (\alpha_k(s)), to reduce dimensionality. Wavelet analysis provides high-resolution time and frequency information from the (\phi_k(t)) functions, while neural networks forecast these functions within a medium-range horizon (h). By utilizing various ML models, particularly a Wavelet - ANN hybrid model, we forecast (\phi_k(t+h)) up to a time horizon (h), and subsequently reconstruct the spatiotemporal data using these extended EOFs. This methodology is applied to a grid of climate data covering the territory of Chile. It transitions from a high-dimensional multivariate spatiotemporal data forecasting problem to a low-dimensional univariate forecasting problem. Additionally, cluster analysis with Dynamic Time Warping for defining similarities between rainfall time series, along with spatial coherence and predictability assessments, has been instrumental in identifying geographic areas where model performance is enhanced. This approach also elucidates the reasons behind poor forecast performance in regions or clusters with low spatial coherence and predictability. By utilizing cluster medoids, the forecasting process becomes more practical and efficient. This compound approach significantly reduces computational complexity while generating forecasts of reasonable accuracy and utility.
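
The decomposition step described in this abstract can be illustrated with a plain singular value decomposition. The following is a minimal sketch, not the authors' pipeline: the function names, the synthetic field, and the choice of SVD on time-mean anomalies are all assumptions made for illustration. It splits a (time x space) array into temporal functions `phi` (playing the role of \phi_k(t)) and spatial coefficients `alpha` (playing the role of \alpha_k(s)), then rebuilds the field from them.

```python
import numpy as np

def eof_decompose(data, n_modes):
    """Split a (time x space) array into temporal functions and spatial patterns.

    Returns the time mean, phi (time x n_modes) and alpha (n_modes x space)
    such that data is approximately mean + phi @ alpha.
    """
    mean = data.mean(axis=0, keepdims=True)
    anomalies = data - mean
    # Thin SVD: anomalies = u @ diag(s) @ vt
    u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
    phi = u[:, :n_modes] * s[:n_modes]   # time-dependent functions
    alpha = vt[:n_modes, :]              # spatial coefficients
    return mean, phi, alpha

def reconstruct(mean, phi, alpha):
    """Rebuild the space-time field from the retained modes."""
    return mean + phi @ alpha

# Hypothetical synthetic field: two temporal signals mixed over 30 grid cells.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0 * np.pi, 120)
patterns = rng.standard_normal((2, 30))
field = 10.0 + np.outer(np.sin(t), patterns[0]) + np.outer(np.cos(2.0 * t), patterns[1])

mean, phi, alpha = eof_decompose(field, n_modes=2)
error = np.max(np.abs(field - reconstruct(mean, phi, alpha)))
print(f"max reconstruction error with 2 modes: {error:.2e}")
```

In a forecasting setting, each column of `phi` would be extended by a univariate forecaster and passed back through `reconstruct`, which is where the reduction from a multivariate to a univariate problem comes from.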

[LG-67] Rapid Parameter Inference with Uncertainty Quantification for a Radiological Plume Source Identification Problem

Link: https://arxiv.org/abs/2502.17492
Authors: Christopher Edwards,Ralph C Smith
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Abstract:In the event of a nuclear accident, or the detonation of a radiological dispersal device, quickly locating the source of the accident or blast is important for emergency response and environmental decontamination. At a specified time after a simulated instantaneous release of an aerosolized radioactive contaminant, measurements are recorded downwind from an array of radiation sensors. Neural networks are employed to infer the source release parameters in an accurate and rapid manner using sensor and mean wind speed data. We consider two neural network constructions that quantify the uncertainty of the predicted values; a categorical classification neural network and a Bayesian neural network. With the categorical classification neural network, we partition the spatial domain and treat each partition as a separate class for which we estimate the probability that it contains the true source location. In a Bayesian neural network, the weights and biases have a distribution rather than a single optimal value. With each evaluation, these distributions are sampled, yielding a different prediction with each evaluation. The trained Bayesian neural network is thus evaluated to construct posterior densities for the release parameters. Results are compared to Markov chain Monte Carlo (MCMC) results found using the Delayed Rejection Adaptive Metropolis Algorithm. The Bayesian neural network approach is generally much cheaper computationally than the MCMC approach as it relies on the computational cost of the neural network evaluation to generate posterior densities as opposed to the MCMC approach which depends on the computational expense of the transport and radiation detection models.

[LG-68] Renaissance of Literate Programming in the Era of LLMs: Enhancing LLM-Based Code Generation in Large-Scale Projects

Link: https://arxiv.org/abs/2502.17441
Authors: Wuyang Zhang,Yansong Li,Zeyu Dong,Yu Wu,Yingyao Zhou,Duolei Wang,Songsirou Xing,Chichun Zhou,Da Shen
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have helped programmers increase efficiency through code generation, comprehension, and repair. However, their application to large-scale projects remains challenging due to complex interdependencies and the extensive size of modern codebases. Although Knuth’s concept of Literate Programming (LP) combines code and natural language to convey logic and intent, its potential for enhancing relationships in large projects has not been fully explored. In this study, we introduce the idea of Interoperable LP (ILP), which leverages literate programming principles to enhance the development of both small-scale documents and large-scale projects with LLMs. We investigate how LLMs perform under ILP-style instructions for both document-oriented tasks and entire projects. Recognizing that many researchers rely on well-structured templates to guide LLMs, we propose a concise prompt engineering method to write LP documents so LLMs can better be involved in code generation. We also examine the capacity of various LLMs to generate Scheme and Python code on the RepoBench benchmark, illustrating the advantages of our approach. Our findings indicate that ILP with LLMs can enhance LLM-based code generation in large-scale project development.

[LG-69] GenAIOps for GenAI Model-Agility

Link: https://arxiv.org/abs/2502.17440
Authors: Ken Ueno,Makoto Kogo,Hiromi Kawatsu,Yohsuke Uchiumi,Michiaki Tatsubori
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, 2 tables

Abstract:AI-agility, with which an organization can be quickly adapted to its business priorities, is desired even for the development and operations of generative AI (GenAI) applications. In this paper, we discuss in particular so-called GenAI Model-agility, which we define as the readiness to be flexibly adapted to base foundation models as diverse as the model providers and versions. First, to handle issues specific to generative AI, we define a methodology of GenAI application development and operations, GenAIOps, to identify the problem of application quality degradation caused by changes to the underlying foundation models. We then study prompt tuning technologies, which look promising to address this problem, and discuss their effectiveness and limitations through case studies using existing tools.

[LG-70] Global law of conjugate kernel random matrices with heavy-tailed weights

Link: https://arxiv.org/abs/2502.18428
Authors: Alice Guionnet,Vanessa Piccolo
Categories: Probability (math.PR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 45 pages, 1 figure

Abstract:We study the asymptotic spectral behavior of the conjugate kernel random matrix YY^\top , where Y= f(WX) arises from a two-layer neural network model. We consider the setting where W and X are both random rectangular matrices with i.i.d. entries, where the entries of W follow a heavy-tailed distribution, while those of X have light tails. Our assumptions on W include a broad class of heavy-tailed distributions, such as symmetric \alpha -stable laws with \alpha \in (0,2) and sparse matrices with \mathcal{O}(1) nonzero entries per row. The activation function f , applied entrywise, is nonlinear, smooth, and odd. By computing the eigenvalue distribution of YY^\top through its moments, we show that heavy-tailed weights induce strong correlations between the entries of Y , leading to richer and fundamentally different spectral behavior compared to models with light-tailed weights.

[LG-71] Learning sparse generalized linear models with binary outcomes via iterative hard thresholding

Link: https://arxiv.org/abs/2502.18393
Authors: Namiko Matsumoto,Arya Mazumdar
Categories: Statistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:In statistics, generalized linear models (GLMs) are widely used for modeling data and can expressively capture potential nonlinear dependence of the model’s outcomes on its covariates. Within the broad family of GLMs, those with binary outcomes, which include logistic and probit regressions, are motivated by common tasks such as binary classification with (possibly) non-separable data. In addition, in modern machine learning and statistics, data is often high-dimensional yet has a low intrinsic dimension, making sparsity constraints in models another reasonable consideration. In this work, we propose to use and analyze an iterative hard thresholding (projected gradient descent on the ReLU loss) algorithm, called binary iterative hard thresholding (BIHT), for parameter estimation in sparse GLMs with binary outcomes. We establish that BIHT is statistically efficient and converges to the correct solution for parameter estimation in a general class of sparse binary GLMs. Unlike many other methods for learning GLMs, including maximum likelihood estimation, generalized approximate message passing, and GLM-tron (Kakade et al. 2011; Bahmani et al. 2016), BIHT does not require knowledge of the GLM’s link function, offering flexibility and generality in allowing the algorithm to learn arbitrary binary GLMs. As two applications, logistic and probit regression are additionally studied. In this regard, it is shown that in logistic regression, the algorithm is in fact statistically optimal in the sense that the order-wise sample complexity matches (up to logarithmic factors) the lower bound obtained previously. To the best of our knowledge, this is the first work achieving statistical optimality for logistic regression in all noise regimes with a computationally efficient algorithm. Moreover, for probit regression, our sample complexity is on the same order as that obtained for logistic regression.
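
To make the iteration concrete, here is a minimal sketch of a binary iterative hard thresholding loop in the generic form known from 1-bit compressed sensing: a gradient-like step on the sign mismatch, a projection onto k-sparse vectors, and a renormalization. The step size, the unit-norm projection, the iteration count, and the synthetic logistic data are assumptions chosen for illustration and are not taken from the paper. Note that the loop only uses the binary labels and never evaluates the link function.

```python
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

def biht(A, y, k, step=np.sqrt(2.0 * np.pi), iters=50):
    """Estimate a k-sparse unit-norm direction from binary labels y in {-1, +1}."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(iters):
        # Half the difference between observed labels and currently predicted signs.
        mismatch = 0.5 * (y - np.sign(A @ x))
        # Gradient-like step followed by projection onto k-sparse vectors.
        x = hard_threshold(x + (step / m) * (A.T @ mismatch), k)
        norm = np.linalg.norm(x)
        if norm > 0.0:
            x /= norm
    return x

# Hypothetical logistic-regression data with a 5-sparse true parameter direction.
rng = np.random.default_rng(0)
m, n, k = 2000, 50, 5
theta = np.zeros(n)
theta[:k] = 1.0 / np.sqrt(k)
A = rng.standard_normal((m, n))
prob_positive = 1.0 / (1.0 + np.exp(-5.0 * (A @ theta)))
y = np.where(rng.random(m) < prob_positive, 1.0, -1.0)

theta_hat = biht(A, y, k)
print("cosine similarity with the true direction:", float(theta_hat @ theta))
```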

[LG-72] Learning atomic forces from uncertainty-calibrated adversarial attacks

Link: https://arxiv.org/abs/2502.18314
Authors: Henrique Musseli Cezar,Tilmann Bodenstein,Henrik Andersen Sveinsson,Morten Ledum,Simen Reine,Sigbjørn Løland Bore
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments:

Abstract:Adversarial approaches, which intentionally challenge machine learning models by generating difficult examples, are increasingly being adopted to improve machine learning interatomic potentials (MLIPs). While already providing great practical value, little is known about the actual prediction errors of MLIPs on adversarial structures and whether these errors can be controlled. We propose the Calibrated Adversarial Geometry Optimization (CAGO) algorithm to discover adversarial structures with user-assigned errors. Through uncertainty calibration, the estimated uncertainty of MLIPs is unified with real errors. By performing geometry optimization for calibrated uncertainty, we reach adversarial structures with the user-assigned target MLIP prediction error. Integrating with active learning pipelines, we benchmark CAGO, demonstrating stable MLIPs that systematically converge structural, dynamical, and thermodynamical properties for liquid water and water adsorption in a metal-organic framework within only hundreds of training structures, where previously many thousands were typically required.

[LG-73] Exploring proteomic signatures in sepsis and non-infectious systemic inflammatory response syndrome

Link: https://arxiv.org/abs/2502.18305
Authors: Adolfo Ruiz-Sanmartín,Vicent Ribas,David Suñol,Luis Chiscano-Camón,Laura Martín,Iván Bajaña,Juliana Bastida,Nieves Larrosa,Juan José González,M Dolores Carrasco,Núria Canela,Ricard Ferrer,Juan Carlos Ruiz-Rodrígue
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments:

Abstract:Background: The search for new biomarkers that allow an early diagnosis in sepsis has become a necessity in medicine. The objective of this study is to identify potential protein biomarkers of differential expression between sepsis and non-infectious systemic inflammatory response syndrome (NISIRS). Methods: Prospective observational study of a cohort of septic patients activated by the Sepsis Code and patients admitted with NISIRS, during the period 2016-2017. A mass spectrometry-based approach was used to analyze the plasma proteins in the enrolled subjects. Subsequently, using recursive feature elimination (RFE) classification and cross-validation with a vector classifier, the association of these proteins with sepsis, as compared with NISIRS, was assessed. The protein-protein interaction network was analyzed with String software. Results: A total of 277 patients (141 with sepsis and 136 with NISIRS) were included. After performing RFE, 25 proteins in the study patient cohort showed statistical significance, with an accuracy of 0.960, specificity of 0.920, sensitivity of 0.973, and an AUC of 0.985. Of these, 14 proteins (vWF, PPBP, C5, C1RL, FCN3, SAA2, ORM1, ITIH3, GSN, C1QA, CA1, CFB, C3, LBP) have a greater relationship with sepsis while 11 proteins (FN1, IGFALS, SERPINA4, APOE, APOH, C6, SERPINA3, AHSG, LUM, ITIH2, SAA1) are more expressed in NISIRS.

[LG-74] Nested Expectations with Kernel Quadrature

Link: https://arxiv.org/abs/2502.18284
Authors: Zonghao Chen,Masha Naslidnyk,François-Xavier Briol
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:This paper considers the challenging computational task of estimating nested expectations. Existing algorithms, such as nested Monte Carlo or multilevel Monte Carlo, are known to be consistent but require a large number of samples at both inner and outer levels to converge. Instead, we propose a novel estimator consisting of nested kernel quadrature estimators and we prove that it has a faster convergence rate than all baseline methods when the integrands have sufficient smoothness. We then demonstrate empirically that our proposed method does indeed require fewer samples to estimate nested expectations on real-world applications including Bayesian optimisation, option pricing, and health economics.
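
For context, the nested Monte Carlo baseline mentioned in this abstract can be sketched in a few lines. This is the baseline only, not the proposed kernel quadrature estimator, and the toy problem, function names, and sample sizes are assumptions for illustration. The target is a nested expectation of the form E_theta[ f( E_x[ g(x, theta) | theta ] ) ]. In the toy case below the exact value is 1, while the estimator carries a bias of 1 / n_inner, which shows why many inner samples are needed.

```python
import random

def nested_monte_carlo(outer_sampler, inner_sampler, g, f, n_outer, n_inner, rng):
    """Estimate E_theta[ f( E_x[ g(x, theta) | theta ] ) ] with two sampling levels."""
    total = 0.0
    for _ in range(n_outer):
        theta = outer_sampler(rng)
        # Inner level: plain Monte Carlo estimate of the conditional expectation.
        inner_mean = sum(g(inner_sampler(theta, rng), theta) for _ in range(n_inner)) / n_inner
        total += f(inner_mean)
    return total / n_outer

# Toy problem: theta ~ N(0, 1), x | theta ~ N(theta, 1), g(x, theta) = x, f(u) = u^2.
# The inner expectation equals theta, so the exact nested expectation is E[theta^2] = 1.
rng = random.Random(0)
estimate = nested_monte_carlo(
    outer_sampler=lambda r: r.gauss(0.0, 1.0),
    inner_sampler=lambda theta, r: r.gauss(theta, 1.0),
    g=lambda x, theta: x,
    f=lambda u: u * u,
    n_outer=2000,
    n_inner=50,
    rng=rng,
)
print("nested Monte Carlo estimate (exact value is 1):", estimate)
```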

[LG-75] Near-Optimal Approximations for Bayesian Inference in Function Space

Link: https://arxiv.org/abs/2502.18279
Authors: Veit Wild,James Wu,Dino Sejdinovic,Jeremias Knoblauch
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 59 pages (26 pages main paper + 33 pages appendices); 6 figures

Abstract:We propose a scalable inference algorithm for Bayes posteriors defined on a reproducing kernel Hilbert space (RKHS). Given a likelihood function and a Gaussian random element representing the prior, the corresponding Bayes posterior measure \Pi_{\text{B}} can be obtained as the stationary distribution of an RKHS-valued Langevin diffusion. We approximate the infinite-dimensional Langevin diffusion via a projection onto the first M components of the Kosambi-Karhunen-Loève expansion. Exploiting the thus obtained approximate posterior for these M components, we perform inference for \Pi_{\text{B}} by relying on the law of total probability and a sufficiency assumption. The resulting method scales as O(M^3+JM^2) , where J is the number of samples produced from the posterior measure \Pi_{\text{B}} . Interestingly, the algorithm recovers the posterior arising from the sparse variational Gaussian process (SVGP) (see Titsias, 2009) as a special case, owed to the fact that the sufficiency assumption underlies both methods. However, whereas the SVGP is parametrically constrained to be a Gaussian process, our method is based on a non-parametric variational family \mathcal{P}(\mathbb{R}^M) consisting of all probability measures on \mathbb{R}^M . As a result, our method is provably close to the optimal M -dimensional variational approximation of the Bayes posterior \Pi_{\text{B}} in \mathcal{P}(\mathbb{R}^M) for convex and Lipschitz continuous negative log likelihoods, and coincides with SVGP for the special case of a Gaussian error likelihood.

[LG-76] Recurrent Neural Networks for Dynamic VWAP Execution: Adaptive Trading Strategies with Temporal Kolmogorov-Arnold Networks

Link: https://arxiv.org/abs/2502.18177
Authors: Remi Genet
Categories: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
Comments:

Abstract:The execution of Volume Weighted Average Price (VWAP) orders remains a critical challenge in modern financial markets, particularly as trading volumes and market complexity continue to increase. In my previous work arXiv:2502.13722, I introduced a novel deep learning approach that demonstrated significant improvements over traditional VWAP execution methods by directly optimizing the execution problem rather than relying on volume curve predictions. However, that model was static because it employed the fully linear approach described in arXiv:2410.21448, which is not designed for dynamic adjustment. This paper extends that foundation by developing a dynamic neural VWAP framework that adapts to evolving market conditions in real time. We introduce two key innovations: first, the integration of recurrent neural networks to capture complex temporal dependencies in market dynamics, and second, a sophisticated dynamic adjustment mechanism that continuously optimizes execution decisions based on market feedback. The empirical analysis, conducted across five major cryptocurrency markets, demonstrates that this dynamic approach achieves substantial improvements over both traditional methods and our previous static implementation, with execution performance gains of 10 to 15% in liquid markets and consistent outperformance across varying conditions. These results suggest that adaptive neural architectures can effectively address the challenges of modern VWAP execution while maintaining computational efficiency suitable for practical deployment.

[LG-77] Inverse Materials Design by Large Language Model-Assisted Generative Framework

Link: https://arxiv.org/abs/2502.18127
Authors: Yun Hao,Che Fan,Beilin Ye,Wenhao Lu,Zhen Lu,Peilin Zhao,Zhifeng Gao,Qingyao Wu,Yanhui Liu,Tongqi Wen
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:

Abstract:Deep generative models hold great promise for inverse materials design, yet their efficiency and accuracy remain constrained by data scarcity and model architecture. Here, we introduce AlloyGAN, a closed-loop framework that integrates Large Language Model (LLM)-assisted text mining with Conditional Generative Adversarial Networks (CGANs) to enhance data diversity and improve inverse design. Taking alloy discovery as a case study, AlloyGAN systematically refines material candidates through iterative screening and experimental validation. For metallic glasses, the framework predicts thermodynamic properties with discrepancies of less than 8% from experiments, demonstrating its robustness. By bridging generative AI with domain knowledge and validation workflows, AlloyGAN offers a scalable approach to accelerate the discovery of materials with tailored properties, paving the way for broader applications in materials science.

[LG-78] Controlling dynamics of stochastic systems with deep reinforcement learning

Link: https://arxiv.org/abs/2502.18111
Authors: Ruslan Mukhamadiarov
Categories: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 8 pages, 3 figures

Abstract:A properly designed controller can help improve the quality of experimental measurements or force a dynamical system to follow a completely new time-evolution path. Recent developments in deep reinforcement learning have made steep advances toward designing effective control schemes for fairly complex systems. However, a general simulation scheme that employs deep reinforcement learning for exerting control in stochastic systems is yet to be established. In this paper, we attempt to further bridge a gap between control theory and deep reinforcement learning by proposing a simulation algorithm that allows achieving control of the dynamics of stochastic systems through the use of trained artificial neural networks. Specifically, we use agent-based simulations where the neural network plays the role of the controller that drives local state-to-state transitions. We demonstrate the workflow and the effectiveness of the proposed control methods by considering the following two stochastic processes: particle coalescence on a lattice and a totally asymmetric exclusion process.

[LG-79] Golden Ratio Mixing of Real and Synthetic Data for Stabilizing Generative Model Training

Link: https://arxiv.org/abs/2502.18049
Authors: Hengzhi He,Shirong Xu,Guang Cheng
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Recent studies identified an intriguing phenomenon in recursive generative model training known as model collapse, where models trained on data generated by previous models exhibit severe performance degradation. Addressing this issue and developing more effective training strategies have become central challenges in generative model research. In this paper, we investigate this phenomenon theoretically within a novel framework, where generative models are iteratively trained on a combination of newly collected real data and synthetic data from the previous training step. To develop an optimal training strategy for integrating real and synthetic data, we evaluate the performance of a weighted training scheme in various scenarios, including Gaussian distribution estimation and linear regression. We theoretically characterize the impact of the mixing proportion and weighting scheme of synthetic data on the final model’s performance. Our key finding is that, across different settings, the optimal weighting scheme under different proportions of synthetic data asymptotically follows a unified expression, revealing a fundamental trade-off between leveraging synthetic data and generative model performance. Notably, in some cases, the optimal weight assigned to real data corresponds precisely to the reciprocal of the golden ratio. Finally, we validate our theoretical results on extensive simulated datasets and a real tabular dataset.
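
For reference, the reciprocal of the golden ratio mentioned above has a simple closed form. The abstract does not say which settings produce this weight, so the identity below only records the numerical value (the symbol \varphi is notation introduced here):

```latex
\varphi = \frac{1+\sqrt{5}}{2} \approx 1.618,
\qquad
\frac{1}{\varphi} = \frac{2}{1+\sqrt{5}} = \frac{\sqrt{5}-1}{2} = \varphi - 1 \approx 0.618 .
```

In those cases, roughly 61.8% of the weight would go to real data.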

[LG-80] Conformal Prediction Under Generalized Covariate Shift with Posterior Drift AISTATS2025

Link: https://arxiv.org/abs/2502.17744
Authors: Baozhen Wang,Xingye Qiao
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: Accepted to AISTATS 2025

Abstract:In many real applications of statistical learning, collecting sufficiently many training data is often expensive, time-consuming, or even unrealistic. In this case, a transfer learning approach, which aims to leverage knowledge from a related source domain to improve the learning performance in the target domain, is more beneficial. There have been many transfer learning methods developed under various distributional assumptions. In this article, we study a particular type of classification problem, called conformal prediction, under a new distributional assumption for transfer learning. Classifiers under the conformal prediction framework predict a set of plausible labels instead of one single label for each data instance, affording a more cautious and safer decision. We consider a generalization of the covariate shift with posterior drift setting for transfer learning. Under this setting, we propose a weighted conformal classifier that leverages both the source and target samples, with a coverage guarantee in the target domain. Theoretical studies demonstrate favorable asymptotic properties. Numerical studies further illustrate the usefulness of the proposed method.
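
To illustrate what "predicting a set of plausible labels" means, below is a minimal sketch of standard, unweighted split conformal classification. It is not the weighted classifier proposed in the paper, and the calibration scores, class probabilities, and function names are hypothetical. The nonconformity score is assumed to be one minus the predicted probability of a label, and a label enters the prediction set when its score does not exceed a quantile of the calibration scores.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, no finite threshold guarantees coverage,
    so every label must be included (threshold = infinity).
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]

def prediction_set(class_probs, threshold):
    """Keep every label whose score (1 - predicted probability) is within the threshold."""
    return {label for label, p in class_probs.items() if 1.0 - p <= threshold}

# Hypothetical calibration scores: 1 - probability assigned to the true label.
calibration_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9]
q_hat = conformal_threshold(calibration_scores, alpha=0.25)

# Hypothetical classifier output for one new instance.
labels = prediction_set({"a": 0.5, "b": 0.375, "c": 0.125}, q_hat)
print("threshold:", q_hat, "prediction set:", sorted(labels))
```

A weighted variant would replace the plain quantile with a quantile of a reweighted score distribution. How those weights are built under the generalized shift is the paper's contribution and is not reproduced here.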

[LG-81] Are GNNs doomed by the topology of their input graph?

Link: https://arxiv.org/abs/2502.17739
Authors: Amine Mohamed Aboussalah,Abdessalam Ed-dib
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable success in learning from graph-structured data. However, the influence of the input graph’s topology on GNN behavior remains poorly understood. In this work, we explore whether GNNs are inherently limited by the structure of their input graphs, focusing on how local topological features interact with the message-passing scheme to produce global phenomena such as oversmoothing or expressive representations. We introduce the concept of k -hop similarity and investigate whether locally similar neighborhoods lead to consistent node representations. This interaction can result in either effective learning or inevitable oversmoothing, depending on the inherent properties of the graph. Our empirical experiments validate these insights, highlighting the practical implications of graph topology on GNN performance.

[LG-82] A Fokker-Planck-Based Loss Function that Bridges Dynamics with Density Estimation ICML

Link: https://arxiv.org/abs/2502.17690
Authors: Zhixin Lu,Łukasz Kuśmierz,Stefan Mihalas
Categories: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG)
Comments: Under review by the ICML

Abstract:We have derived a novel loss function from the Fokker-Planck equation that links dynamical system models with their probability density functions, demonstrating its utility in model identification and density estimation. In the first application, we show that this loss function can enable the extraction of dynamical parameters from non-temporal datasets, including timestamp-free measurements from steady non-equilibrium systems such as noisy Lorenz systems and gene regulatory networks. In the second application, when coupled with a density estimator, this loss facilitates density estimation when the dynamic equations are known. For density estimation, we propose a density estimator that integrates a Gaussian Mixture Model with a normalizing flow model. It simultaneously estimates normalized density, energy, and score functions from both empirical data and dynamics. It is compatible with a variety of data-based training methodologies, including maximum likelihood and score matching. It features a latent space akin to a modern Hopfield network, where the inherent Hopfield energy effectively assigns low densities to sparsely populated data regions, addressing common challenges in neural density estimators. Additionally, this Hopfield-like energy enables direct and rapid data manipulation through the Concave-Convex Procedure (CCCP) rule, facilitating tasks such as denoising and clustering. Our work demonstrates a principled framework for leveraging the complex interdependencies between dynamics and density estimation, as illustrated through synthetic examples that clarify the underlying theoretical intuitions.

[LG-83] A stochastic smoothing framework for nonconvex-nonconcave min-sum-max problems with applications to Wasserstein distributionally robust optimization

Link: https://arxiv.org/abs/2502.17602
Authors: Wei Liu,Muhammad Khan,Gabriel Mancino-Ball,Yangyang Xu
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 35 pages

Abstract:Applications such as adversarially robust training and Wasserstein Distributionally Robust Optimization (WDRO) can be naturally formulated as min-sum-max optimization problems. While this formulation can be rewritten as an equivalent min-max problem, the summation of max terms introduces computational challenges, including increased complexity and memory demands, which must be addressed. These challenges are particularly evident in WDRO, where existing tractable algorithms often rely on restrictive assumptions on the objective function, limiting their applicability to state-of-the-art machine learning problems such as the training of deep neural networks. This study introduces a novel stochastic smoothing framework based on the log-sum-exp function, efficiently approximating the max operator in min-sum-max problems. By leveraging the Clarke regularity of the max operator, we develop an iterative smoothing algorithm that addresses these computational difficulties and guarantees almost sure convergence to a Clarke/directional stationary point. We further prove that the proposed algorithm finds an \epsilon -scaled Clarke stationary point of the original problem, with a worst-case iteration complexity of \widetilde{O}(\epsilon^{-3}) . Our numerical experiments demonstrate that our approach outperforms or is competitive with state-of-the-art methods in solving the newsvendor problem, deep learning regression, and adversarially robust deep learning. The results highlight that our method yields more accurate and robust solutions in these challenging problem settings.
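
The log-sum-exp smoothing of the max operator can be shown on its own. This is a minimal sketch of the smoothing function only, not the stochastic algorithm from the paper, and the parameter name `mu` and the sample values are assumptions. For n values, the smoothed quantity mu * log(sum(exp(v / mu))) lies between max(v) and max(v) + mu * log(n), so it tightens as mu shrinks.

```python
import math

def smoothed_max(values, mu):
    """Compute mu * log(sum(exp(v / mu))) stably by factoring out the largest value."""
    if mu <= 0.0:
        raise ValueError("mu must be positive")
    peak = max(values)
    total = sum(math.exp((v - peak) / mu) for v in values)
    return peak + mu * math.log(total)

values = [0.3, -1.2, 2.0, 1.9]
for mu in (1.0, 0.1, 0.01):
    approx = smoothed_max(values, mu)
    gap_bound = mu * math.log(len(values))
    print(f"mu={mu}: smoothed max={approx:.6f}, gap to true max <= {gap_bound:.6f}")
```

Unlike the max itself, the smoothed function is differentiable everywhere, which is what makes gradient-based updates possible.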

[LG-84] Multi-Year-to-Decadal Temperature Prediction using a Machine Learning Model-Analog Framework

Link: https://arxiv.org/abs/2502.17583
Authors: M. A. Fernandez, Elizabeth A. Barnes
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 14 pages, 10 figures (+ 8 supplemental figures)

Abstract:Multi-year-to-decadal climate prediction is a key tool in understanding the range of potential regional and global climate futures. Here, we present a framework that combines machine learning and analog forecasting for predictions on these timescales. A neural network is used to learn a mask, specific to a region and lead time, with global weights based on relative importance as precursors to the evolution of that prediction target. A library of mask-weighted model states, or potential analogs, is then compared to a single mask-weighted observational state. The known futures of the best-matching potential analogs serve as the prediction for the future of the observational state. We match and predict 2-meter temperature using the Berkeley Earth Surface Temperature dataset for observations, and a set of CMIP6 models as the analog library. We find improved performance over traditional analog methods and initialized decadal predictions.
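
The matching step described above can be sketched as a mask-weighted nearest-neighbour search. This is an illustrative toy under stated assumptions, not the authors' implementation: the mask is set by hand here (the paper learns it with a neural network), states are short flat lists, and the names `weighted_distance` and `analog_forecast` are hypothetical.

```python
def weighted_distance(state_a, state_b, mask):
    """Mask-weighted mean squared difference between two flattened states."""
    total_weight = sum(mask)
    return sum(w * (a - b) ** 2 for w, a, b in zip(mask, state_a, state_b)) / total_weight


def analog_forecast(library, observed_state, mask, n_analogs=2):
    """library holds (state, known_future) pairs; returns the mean future of the best analogs."""
    ranked = sorted(library, key=lambda pair: weighted_distance(pair[0], observed_state, mask))
    best = ranked[:n_analogs]
    return sum(future for _, future in best) / len(best)


# Hand-set mask (assumption): the second grid point is ignored entirely.
mask = [1.0, 0.0, 0.5]
library = [
    ([0.0, 9.0, 0.0], 1.0),
    ([0.1, -9.0, 0.1], 3.0),
    ([5.0, 0.0, 5.0], 10.0),
]
observed = [0.0, 0.0, 0.0]
print(analog_forecast(library, observed, mask))
```

Because the mask gives zero weight to the second grid point, the first two library states count as close matches even though they differ strongly there.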

[LG-85] Utilizing Machine Learning to Predict Host Stars and the Key Elemental Abundances of Small Planets

Link: https://arxiv.org/abs/2502.17563
Authors: Amílcar R. Torres-Quijano, Natalie R. Hinkel, Caleb H. Wheeler III, Patrick A. Young, Luan Ghezzi, Augusto P. Baldo
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
Comments: 22 pages, 9 figures, 3 tables, accepted to AJ

Abstract:Stars and their associated planets originate from the same cloud of gas and dust, making a star’s elemental composition a valuable indicator for indirectly studying planetary compositions. While the connection between a star’s iron (Fe) abundance and the presence of giant exoplanets is established (e.g. Gonzalez 1997; Fischer & Valenti 2005), the relationship with small planets remains unclear. The elements Mg, Si, and Fe are important in forming small planets. Employing machine learning algorithms like XGBoost, trained on the abundances (e.g., the Hypatia Catalog, Hinkel et al. 2014) of known exoplanet-hosting stars (NASA Exoplanet Archive), allows us to determine significant “features” (abundances or molar ratios) that may indicate the presence of small planets. We test on three groups of exoplanets: (a) all small, $R_P < 3.5\,R_\oplus$, (b) sub-Neptunes, $2.0\,R_\oplus < R_P < 3.5\,R_\oplus$, and (c) super-Earths, $1.0\,R_\oplus < R_P < 2.0\,R_\oplus$, each subdivided into 7 ensembles to test different combinations of features. We created a list of stars with $\geq 90\%$ probability of hosting small planets across all ensembles and experiments (“overlap stars”). We found abundance trends for stars hosting small planets, possibly indicating star-planet chemical interplay during formation. We also found that Na and V are key features regardless of planetary radii. We expect our results to underscore the importance of elements in exoplanet formation and machine learning’s role in target selection for future NASA missions, e.g., the James Webb Space Telescope (JWST), Nancy Grace Roman Space Telescope (NGRST), and Habitable Worlds Observatory (HWO), all of which are aimed at small planet detection.

[LG-86] Expressive equivalence of classical and quantum restricted Boltzmann machines

Link: https://arxiv.org/abs/2502.17562
Authors: Maria Demidik, Cenk Tüysüz, Nico Piatkowski, Michele Grossi, Karl Jansen
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures; supplementary material 6 pages, 1 figure

Abstract:Quantum computers offer the potential for efficiently sampling from complex probability distributions, attracting increasing interest in generative modeling within quantum machine learning. This surge in interest has driven the development of numerous generative quantum models, yet their trainability and scalability remain significant challenges. A notable example is a quantum restricted Boltzmann machine (QRBM), which is based on the Gibbs state of a parameterized non-commuting Hamiltonian. While QRBMs are expressive, their non-commuting Hamiltonians make gradient evaluation computationally demanding, even on fault-tolerant quantum computers. In this work, we propose a semi-quantum restricted Boltzmann machine (sqRBM), a model designed for classical data that mitigates the challenges associated with previous QRBM proposals. The sqRBM Hamiltonian is commuting in the visible subspace while remaining non-commuting in the hidden subspace. This structure allows us to derive closed-form expressions for both output probabilities and gradients. Leveraging these analytical results, we demonstrate that sqRBMs share a close relationship with classical restricted Boltzmann machines (RBM). Our theoretical analysis predicts that, to learn a given probability distribution, an RBM requires three times as many hidden units as an sqRBM, while both models have the same total number of parameters. We validate these findings through numerical simulations involving up to 100 units. Our results suggest that sqRBMs could enable practical quantum machine learning applications in the near future by significantly reducing quantum resource requirements.

[LG-87] CLEP-GAN: An Innovative Approach to Subject-Independent ECG Reconstruction from PPG Signals

Link: https://arxiv.org/abs/2502.17536
Authors: Xiaoyan Li, Shixin Xu, Faisal Habib, Neda Aminnejad, Arvind Gupta, Huaxiong Huang
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:This study addresses the challenge of reconstructing unseen ECG signals from PPG signals, a critical task for non-invasive cardiac monitoring. While numerous public ECG-PPG datasets are available, they lack the diversity seen in image datasets, and data collection processes often introduce noise, complicating ECG reconstruction from PPG even with advanced machine learning models. To tackle these challenges, we first introduce a novel synthetic ECG-PPG data generation technique using an ODE model to enhance training diversity. Next, we develop a novel subject-independent PPG-to-ECG reconstruction model that integrates contrastive learning, adversarial learning, and attention gating, achieving results comparable to or even surpassing existing approaches for unseen ECG reconstruction. Finally, we examine factors such as sex and age that impact reconstruction accuracy, emphasizing the importance of considering demographic diversity during model training and dataset augmentation.

[LG-88] A Machine Learning Approach for Design of Frequency Selective Surface based Radar Absorbing Material via Image Prediction

Link: https://arxiv.org/abs/2502.17534
Authors: Vijay Kumar Sutrakar, Anjana P K, Sajal Kesharwani, Siddharth Bisariya
Categories: Signal Processing (eess.SP); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:The paper presents an innovative methodology for designing frequency selective surface (FSS) based radar absorbing materials using machine learning (ML) techniques. In conventional electromagnetic design, the unit cell dimensions of the FSS are used as input and the absorption coefficient is then predicted for a given design. In this paper, the absorption coefficient is taken as the input to the ML model and an image of the FSS unit cell is predicted. This image is then used to generate the FSS unit cell parameters. Eleven different ML models are studied over a wide frequency band of 1 GHz to 30 GHz. Of these, six ML models, namely (a) Random Forest classification, (b) K-Neighbors classification, (c) Grid search regression, (d) Random Forest regression, (e) Decision tree classification, and (f) Decision tree regression, show training accuracy of more than 90%. The absorption coefficients with varying frequencies of these predicted images are subsequently evaluated using a commercial electromagnetic solver. The performance of these ML models is encouraging, and they can be used to accelerate the design and optimization of high-performance FSS-based radar absorbing materials for advanced electromagnetic applications in the future.

[LG-89] Multimodal Sleep Stage and Sleep Apnea Classification Using Vision Transformer: A Multitask Explainable Learning Approach

Link: https://arxiv.org/abs/2502.17486
Authors: Kianoosh Kazemi, Iman Azimi, Michelle Khine, Rami N. Khayat, Amir M. Rahmani, Pasi Liljeberg
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:Sleep is an essential component of human physiology, contributing significantly to overall health and quality of life. Accurate sleep staging and disorder detection are crucial for assessing sleep quality. Studies in the literature have proposed PSG-based approaches and machine-learning methods utilizing single-modality signals. However, existing methods often lack multimodal, multilabel frameworks and address sleep-stage and sleep-disorder classification separately. In this paper, we propose a 1D-Vision Transformer for simultaneous classification of sleep stages and sleep disorders. Our method exploits the sleep disorders’ correlation with specific sleep stage patterns and performs a simultaneous identification of a sleep stage and sleep disorder. The model is trained and tested using multimodal-multilabel sensory data (including photoplethysmogram, respiratory flow, and respiratory effort signals). The proposed method shows an overall accuracy (Cohen’s kappa) of 78% (0.66) for five-stage sleep classification and 74% (0.58) for sleep apnea classification. Moreover, we analyzed the encoder attention weights to clarify our models’ predictions and investigate the influence different features have on the models’ outputs. The results show that identified patterns, such as respiratory troughs and peaks, make a higher contribution to the final classification process.
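
The abstract reports Cohen's kappa alongside accuracy. As a reminder of how the two differ, here is a minimal sketch of the standard kappa formula, (observed agreement - chance agreement) / (1 - chance agreement). The toy labels are invented for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two label sequences, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability both sequences pick the same class independently.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical sleep-stage labels, for illustration only.
y_true = ["W", "W", "N1", "N2", "N2", "REM"]
y_pred = ["W", "N1", "N1", "N2", "N2", "N2"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, cohens_kappa(y_true, y_pred))
```

Kappa is lower than raw accuracy whenever some agreement would be expected by chance, which is why both numbers are usually reported together for imbalanced sleep-stage data.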

[LG-90] Urinary Tract Infection Detection in Digital Remote Monitoring: Strategies for Managing Participant-Specific Prediction Complexity

Link: https://arxiv.org/abs/2502.17484
Authors: Kexin Fan, Alexander Capstick, Ramin Nilforooshan, Payam Barnaghi
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:Urinary tract infections (UTIs) are a significant health concern, particularly for people living with dementia (PLWD), as they can lead to severe complications if not detected and treated early. This study builds on previous work that utilised machine learning (ML) to detect UTIs in PLWD by analysing in-home activity and physiological data collected through low-cost, passive sensors. The current research focuses on improving the performance of previous models, particularly by refining the Multilayer Perceptron (MLP), to better handle variations in home environments and improve sex fairness in predictions by making use of concepts from multitask learning. This study implemented three primary model designs (feature clustering, loss-dependent clustering, and participant ID embedding), which were compared against a baseline MLP model. The results demonstrated that the loss-dependent MLP achieved the most significant improvements, increasing validation precision from 48.92% to 72.60% and sensitivity from 27.44% to 70.52%, while also enhancing model fairness across sexes. These findings suggest that the refined models offer a more reliable and equitable approach to early UTI detection in PLWD, addressing participant-specific data variations and enabling clinicians to detect and screen for UTI risks more effectively, thereby facilitating earlier and more accurate treatment decisions.

[LG-91] ConSense: Continually Sensing Human Activity with WiFi via Growing and Picking

Link: https://arxiv.org/abs/2502.17483
Authors: Rong Li, Tao Deng, Siwei Feng, Mingjie Sun, Juncheng Jia
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:WiFi-based human activity recognition (HAR) holds significant application potential across various fields. To handle dynamic environments where new activities are continuously introduced, WiFi-based HAR systems must adapt by learning new concepts without forgetting previously learned ones. Furthermore, retaining knowledge from old activities by storing historical exemplars is impractical for WiFi-based HAR due to privacy concerns and the limited storage capacity of edge devices. In this work, we propose ConSense, a lightweight and fast-adapting exemplar-free class incremental learning framework for WiFi-based HAR. The framework leverages the transformer architecture and involves dynamic model expansion and selective retraining to preserve previously learned knowledge while integrating new information. Specifically, during incremental sessions, small-scale trainable parameters that are trained specifically on the data of each task are added in the multi-head self-attention layer. In addition, a selective retraining strategy is used that dynamically adjusts the weights in the multilayer perceptron based on the performance stability of neurons across tasks. Rather than training the entire model, the proposed strategies of dynamic model expansion and selective retraining reduce the overall computational load while balancing stability on previous tasks and plasticity on new tasks. Evaluation results on three public WiFi datasets demonstrate that ConSense not only outperforms several competitive approaches but also requires fewer parameters, highlighting its practical utility in class-incremental scenarios for HAR.

[LG-92] Multi-View Contrastive Network (MCNet) for Motor Imagery Classification

Link: https://arxiv.org/abs/2502.17482
Authors: Ziwei Wang, Siyang Li, Xiaoqing Chen, Wei Li, Dongrui Wu
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 9 pages, 7 figures

Abstract:Objective: An electroencephalography (EEG)-based brain-computer interface (BCI) serves as a direct communication pathway between the human brain and an external device. While supervised learning has been extensively explored for motor imagery (MI) EEG classification, small data quantity has been a key factor limiting the performance of deep feature learning. Methods: This paper proposes a knowledge-driven time-space-frequency based multi-view contrastive network (MCNet) for MI EEG decoding in BCIs. MCNet integrates knowledge from the time, space, and frequency domains into the training process through data augmentations from multiple views, fostering more discriminative feature learning of the characteristics of EEG data. We introduce a cross-view contrasting module to learn from different augmented views and a cross-model contrasting module to enhance the consistency of features extracted between knowledge-guided and data-driven models. Results: The combination of EEG data augmentation strategies was systematically investigated for more informative supervised contrastive learning. Experiments on four public MI datasets and three different architectures demonstrated that MCNet outperformed 10 existing approaches. Significance: Our approach can significantly boost EEG classification performance beyond designated networks, showcasing the potential to enhance the feature learning process for better EEG decoding.

[LG-93] Frequency-Aware Masked Autoencoders for Human Activity Recognition using Accelerometers

Link: https://arxiv.org/abs/2502.17477
Authors: Niels R. Lorenzen, Poul J. Jennum, Emmanuel Mignot, Andreas Brink-Kjaer
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 7 pages, 3 figures, submitted to 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)

Abstract:Wearable accelerometers are widely used for continuous monitoring of physical activity. Supervised machine learning and deep learning algorithms have long been used to extract meaningful activity information from raw accelerometry data, but progress has been hampered by the limited amount of publicly available labeled data. Exploiting large unlabeled datasets using self-supervised pretraining is a relatively new and underexplored approach in the field of human activity recognition (HAR). We used a time-series transformer masked autoencoder (MAE) approach to self-supervised pretraining and propose a novel spectrogram-based loss function named the log-scale mean magnitude (LMM) loss. We compared MAE models pretrained with LMM to one trained with the mean squared error (MSE) loss. We leveraged the large unlabeled UK Biobank accelerometry dataset (n = 109k) for pretraining and evaluated downstream HAR performance using a linear classifier in a smaller labeled dataset. We found that pretraining with the LMM loss improved performance compared to a model pretrained with the MSE loss, with balanced accuracies of 0.848 and 0.709, respectively. Further analysis revealed that better convergence of the LMM loss, but not the MSE loss, significantly correlated with improved downstream performance (r=-0.61, p=0.04 for balanced accuracy). Finally, we compared our MAE models to the state-of-the-art for HAR, also pretrained on the UK Biobank accelerometry data. Our LMM-pretrained models performed better when finetuned using a linear classifier and performed comparably when finetuned using an LSTM classifier, while MSE-pretrained models consistently underperformed. Our findings demonstrate that the LMM loss is a robust and effective method for pretraining MAE models on accelerometer data for HAR. Future work should explore optimizing loss function combinations and extending our approach to other tasks.

[LG-94] Fusion of ECG Foundation Model Embeddings to Improve Early Detection of Acute Coronary Syndromes

Link: https://arxiv.org/abs/2502.17476
Authors: Zeyuan Meng, Lovely Yeswanth Panchumarthi, Saurabh Kataria, Alex Fedorov, Jessica Zègre-Hemsey, Xiao Hu, Ran Xiao
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:Acute Coronary Syndrome (ACS) is a life-threatening cardiovascular condition where early and accurate diagnosis is critical for effective treatment and improved patient outcomes. This study explores the use of ECG foundation models, specifically ST-MEM and ECG-FM, to enhance ACS risk assessment using prehospital ECG data collected in ambulances. Both models leverage self-supervised learning (SSL), with ST-MEM using a reconstruction-based approach and ECG-FM employing contrastive learning, capturing unique spatial and temporal ECG features. We evaluate the performance of these models individually and through a fusion approach, where their embeddings are combined for enhanced prediction. Results demonstrate that both foundation models outperform a baseline ResNet-50 model, with the fusion-based approach achieving the highest performance (AUROC: 0.843 +/- 0.006, AUCPR: 0.674 +/- 0.012). These findings highlight the potential of ECG foundation models for early ACS detection and motivate further exploration of advanced fusion strategies to maximize complementary feature utilization.

[LG-95] CSSSTN: A Class-sensitive Subject-to-subject Semantic Style Transfer Network for EEG Classification in RSVP Tasks

Link: https://arxiv.org/abs/2502.17468
Authors: Ziyue Yang, Chengrui Chen, Yong Peng, Qiong Chen, Wanzeng Kong
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:The Rapid Serial Visual Presentation (RSVP) paradigm represents a promising application of electroencephalography (EEG) in Brain-Computer Interface (BCI) systems. However, cross-subject variability remains a critical challenge, particularly for BCI-illiterate users who struggle to effectively interact with these systems. To address this issue, we propose the Class-Sensitive Subject-to-Subject Semantic Style Transfer Network (CSSSTN), which incorporates a class-sensitive approach to align feature distributions between golden subjects (BCI experts) and target (BCI-illiterate) users on a class-by-class basis. Building on the SSSTN framework, CSSSTN incorporates three key components: (1) subject-specific classifier training, (2) a unique style loss to transfer class-discriminative features while preserving semantic information through a modified content loss, and (3) an ensemble approach to integrate predictions from both source and target domains. We evaluated CSSSTN using both a publicly available dataset and a self-collected dataset. Experimental results demonstrate that CSSSTN outperforms state-of-the-art methods, achieving mean balanced accuracy improvements of 6.4% on the Tsinghua dataset and 3.5% on the HDU dataset, with notable benefits for BCI-illiterate users. Ablation studies confirm the effectiveness of each component, particularly the class-sensitive transfer and the use of lower-layer features, which enhance transfer performance and mitigate negative transfer. Additionally, CSSSTN achieves competitive results with minimal target data, reducing calibration time and effort. These findings highlight the practical potential of CSSSTN for real-world BCI applications, offering a robust and scalable solution to improve the performance of BCI-illiterate users while minimizing reliance on extensive training data. Our code is available at this https URL.

[LG-96] Large Cognition Model: Towards Pretrained EEG Foundation Model

Link: https://arxiv.org/abs/2502.17464
Authors: Chi-Sheng Chen, Ying-Jung Chen, Aidan Hung-Wen Tsai
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Abstract:Electroencephalography (EEG) provides a non-invasive window into brain activity, offering valuable insights for neurological research, brain-computer interfaces (BCIs), and clinical diagnostics. However, the development of robust machine learning models for EEG analysis is hindered by the scarcity of large-scale, well-annotated datasets and the inherent variability of EEG signals across subjects and recording conditions. Inspired by the success of foundation models in natural language processing and computer vision, we propose the Large Cognition Model (LCM), a transformer-based foundation model designed to generalize across diverse EEG datasets and downstream tasks. Unlike traditional approaches, our proposed transformer-based architecture demonstrates strong generalization capabilities across datasets and tasks, even without pretraining, surpassing some existing EEG universal models on specific downstream applications. LCM leverages large-scale self-supervised learning techniques to capture universal EEG representations, enabling efficient fine-tuning for applications such as cognitive state decoding, disease classification, and neurofeedback systems. We introduce a novel architecture that integrates temporal and spectral attention mechanisms, optimizing the model’s ability to extract meaningful features from raw EEG signals. Extensive evaluations demonstrate that LCM outperforms state-of-the-art approaches across multiple EEG benchmarks, exhibiting strong cross-subject and cross-task generalization. Our findings highlight the potential of pretrained EEG foundation models to accelerate advancements in neuroscience, personalized medicine, and BCI technology.

[LG-97] SincPD: An Explainable Method based on Sinc Filters to Diagnose Parkinsons Disease Severity by Gait Cycle Analysis

Link: https://arxiv.org/abs/2502.17463
Authors: Armin Salimi-Badr, Mahan Veisi, Sadra Berangi
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:In this paper, we present SincPD, an explainable deep learning classifier based on adaptive sinc filters that diagnoses Parkinson’s disease (PD) and determines its severity by analyzing the gait cycle. Considering the effects of PD on the gait cycle of patients, the proposed method utilizes raw data in the form of vertical Ground Reaction Force (vGRF) measured by wearable sensors placed in the soles of subjects’ shoes. The proposed method consists of Sinc layers that model adaptive bandpass filters to extract important frequency bands in the gait cycles of patients and healthy subjects. By considering these frequencies, the reasons behind the classification of a person as a patient or as healthy can be explained. In this method, after some preprocessing, a large model equipped with many filters is first trained. Next, to prune the extra units and reach a more explainable and parsimonious structure, the extracted filters are clustered based on their cut-off frequencies using a centroid-based clustering approach. Afterward, the medoids of the extracted clusters are taken as the final filters. As a result, only 15 bandpass filters for each sensor are derived to classify patients and healthy subjects. Finally, the most effective filters and sensors are determined by comparing the energy of each filter’s response for patients and healthy subjects.
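
A sinc-parameterised bandpass filter of the kind this abstract refers to can be sketched as the difference of two windowed low-pass sinc kernels, so that each filter is described by just two cut-off frequencies. In a Sinc layer those cut-offs would be learned; here they are fixed, and the sample rate, band edges, kernel length and function names are all assumptions chosen for illustration.

```python
import cmath
import math


def sinc(x):
    """Normalised sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def sinc_bandpass(f_low, f_high, sample_rate, length=201):
    """Hamming-windowed sinc band-pass kernel; cut-offs in Hz, length must be odd."""
    f1, f2 = f_low / sample_rate, f_high / sample_rate  # cycles per sample
    half = length // 2
    kernel = []
    for n in range(-half, half + 1):
        # Ideal band-pass = low-pass at f2 minus low-pass at f1.
        ideal = 2 * f2 * sinc(2 * f2 * n) - 2 * f1 * sinc(2 * f1 * n)
        window = 0.54 + 0.46 * math.cos(2 * math.pi * n / (length - 1))
        kernel.append(ideal * window)
    return kernel


def gain(kernel, freq, sample_rate):
    """Magnitude of the kernel's frequency response at a single frequency in Hz."""
    half = len(kernel) // 2
    omega = 2 * math.pi * freq / sample_rate
    return abs(sum(h * cmath.exp(-1j * omega * (i - half)) for i, h in enumerate(kernel)))


# Hypothetical settings: 100 Hz sampling, 2-8 Hz pass band.
kernel = sinc_bandpass(2.0, 8.0, sample_rate=100.0)
print(gain(kernel, 5.0, 100.0), gain(kernel, 20.0, 100.0))
```

Because the whole kernel follows from the two cut-offs, a trained filter can be read directly as "this unit listens to the f_low to f_high band", which is the source of the explainability claimed above.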

[LG-98] Study on Downlink CSI compression: Are Neural Networks the Only Solution?

Link: https://arxiv.org/abs/2502.17459
Authors: K. Sai Praneeth, Anil Kumar Yerrapragada, Achyuth Sagireddi, Sai Prasad, Radha Krishna Ganti
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:Massive Multiple-Input Multiple-Output (MIMO) systems enable higher data rates in the downlink (DL) with spatial multiplexing achieved by forming narrow beams. The higher DL data rates are achieved by effective implementation of spatial multiplexing and beamforming, which is subject to the availability of DL channel state information (CSI) at the base station. For Frequency Division Duplexing (FDD) systems, the DL CSI has to be transmitted by the User Equipment (UE) to the gNB, and it constitutes a significant overhead that scales with the number of transmitter antennas and the granularity of the CSI. To address the overhead issue, AI/ML methods using auto-encoders have been investigated, where an encoder neural network model at the UE compresses the CSI and a decoder neural network model at the gNB reconstructs it. However, the use of AI/ML methods has a number of challenges related to (1) model complexity, (2) model generalization across channel scenarios and (3) inter-vendor compatibility of the two sides of the model. In this work, we investigate a more traditional dimensionality reduction method that uses Principal Component Analysis (PCA) and therefore does not suffer from the above challenges. Simulation results show that PCA-based CSI compression achieves reconstruction performance comparable to commonly used models based on deep neural networks.
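
The compress-then-reconstruct idea behind PCA-based feedback can be sketched in plain Python. This is a toy under loud assumptions: real CSI is complex-valued and high-dimensional, whereas the data here is synthetic, real-valued and exactly rank 2; the principal directions are found by simple power iteration, and none of the function names come from the paper.

```python
import random


def top_components(rows, k, iters=200):
    """Mean and top-k principal directions, via power iteration with deflation."""
    n, d = len(rows), len(rows[0])
    mu = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(r, mu)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)] for i in range(d)]
    rng = random.Random(0)
    comps = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        # Deflate: remove the direction just found before searching for the next one.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
        comps.append(v)
    return mu, comps


def compress(row, mu, comps):
    """Encoder side: project the centred vector onto the principal directions."""
    return [sum((x - m) * c for x, m, c in zip(row, mu, comp)) for comp in comps]


def reconstruct(code, mu, comps):
    """Decoder side: rebuild the vector from its k coefficients."""
    return [m + sum(z * comp[i] for z, comp in zip(code, comps)) for i, m in enumerate(mu)]


# Synthetic rank-2 data in 6 dimensions (assumption, stands in for CSI vectors).
rng = random.Random(1)
u, v = [1, 1, 1, 1, 0, 0], [0, 0, 0, 0, 1, -1]
rows = []
for _ in range(200):
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)
    rows.append([a * ui + b * vi for ui, vi in zip(u, v)])

mu, comps = top_components(rows, k=2)
code = compress(rows[0], mu, comps)
restored = reconstruct(code, mu, comps)
print(len(rows[0]), "->", len(code))
print(max(abs(x - y) for x, y in zip(rows[0], restored)))
```

Only the `k` coefficients need to be fed back, and both sides share the same linear basis, which is why this approach sidesteps the model-complexity and inter-vendor issues listed in the abstract.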

Information Retrieval

[IR-0] A Unified Bayesian Perspective for Conventional and Robust Adaptive Filters

Link: https://arxiv.org/abs/2502.18325
Authors: Leszek Szczecinski, Jacob Benesty, Eduardo Vinicius Kuhn
Categories: Information Retrieval (cs.IR); Statistics Theory (math.ST)

Abstract:In this work, we present a new perspective on the origin and interpretation of adaptive filters. By applying Bayesian principles of recursive inference from the state-space model and using a series of simplifications regarding the structure of the solution, we can present, in a unified framework, derivations of many adaptive filters which depend on the probabilistic model of the observational noise. In particular, under a Gaussian model, we obtain solutions well-known in the literature (such as LMS, NLMS, or the Kalman filter), while using non-Gaussian noise, we obtain new families of adaptive filters. Notably, under the assumption of Laplacian noise, we obtain a family of robust filters of which the signed-error algorithm is a well-known member, while other algorithms, derived effortlessly in the proposed framework, are entirely new. Numerical examples are shown to illustrate the properties and provide better insight into the performance of the derived adaptive filters.
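
For readers unfamiliar with the filters named in this abstract, here is a minimal sketch of the textbook NLMS update and the sign-error LMS variant, applied to a toy system identification problem. It shows the classical update rules only, not the paper's Bayesian derivation; the function name, step size, and the three-tap system are assumptions chosen for illustration.

```python
import random


def adaptive_filter(inputs, desired, n_taps, step=0.5, rule="nlms", eps=1e-8):
    """Run an adaptive FIR filter and return the final tap weights.

    rule="nlms": normalised LMS update.
    rule="sign": sign-error LMS update (uses only the sign of the error).
    """
    weights = [0.0] * n_taps
    buffer = [0.0] * n_taps  # most recent input first
    for x, d in zip(inputs, desired):
        buffer = [x] + buffer[:-1]
        error = d - sum(w * u for w, u in zip(weights, buffer))
        if rule == "nlms":
            norm = sum(u * u for u in buffer) + eps
            weights = [w + step * error * u / norm for w, u in zip(weights, buffer)]
        else:
            sign = (error > 0) - (error < 0)
            weights = [w + step * sign * u for w, u in zip(weights, buffer)]
    return weights


# Hypothetical noise-free system to identify.
rng = random.Random(0)
true_taps = [0.5, -0.3, 0.1]
inputs = [rng.gauss(0, 1) for _ in range(2000)]
desired = [
    sum(h * inputs[i - k] for k, h in enumerate(true_taps) if i - k >= 0)
    for i in range(len(inputs))
]
estimate = adaptive_filter(inputs, desired, n_taps=3)
print(estimate)
```

The sign-error rule reacts only to the direction of the error rather than its size, which is what makes it less sensitive to large outliers in the observation noise; it generally needs a much smaller step than NLMS.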

[IR-1] Data Voids and Warning Banners on Google Search

Link: https://arxiv.org/abs/2502.17542
Authors: Ronald E. Robertson, Evan M. Williams, Kathleen M. Carley, David Thiel
Categories: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Information Retrieval (cs.IR)

Abstract:The content moderation systems used by social media sites are a topic of widespread interest and research, but less is known about the use of similar systems by web search engines. For example, Google Search attempts to help its users navigate three distinct types of data voids–when the available search results are deemed low-quality, low-relevance, or rapidly-changing–by placing one of three corresponding warning banners at the top of the search page. Here we collected 1.4M unique search queries shared on social media to surface Google’s warning banners, examine when and why those banners were applied, and train deep learning models to identify data voids beyond Google’s classifications. Across three data collection waves (Oct 2023, Mar 2024, Sept 2024), we found that Google returned a warning banner for about 1% of our search queries, with substantial churn in the set of queries that received a banner across waves. The low-quality banners, which warn users that their results “may not have reliable information on this topic,” were especially rare, and their presence was associated with low-quality domains in the search results and conspiracy-related keywords in the search query. Low-quality banner presence was also inconsistent over short time spans, even when returning highly similar search results. In August 2024, low-quality banners stopped appearing on the SERPs we collected, but average search result quality remained largely unchanged, suggesting they may have been discontinued by Google. Using our deep learning models to analyze both queries and search results in context, we identify 29 to 58 times more low-quality data voids than there were low-quality banners, and find a similar number after the banners had disappeared. Our findings point to the need for greater transparency on search engines’ content moderation practices, especially around important events like elections.

