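This post lists the latest papers retrieved from Arxiv.org on 2025-02-10. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is fetched from Arxiv.org and updated automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.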

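Contents

Overview (2025-02-10)

481 papers were updated today, including:

  • Natural Language Processing: 111 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 156 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 90 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 194 papers (Machine Learning (cs.LG))

Natural Language Processing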

[NLP-0] Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient

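Quick read: This paper addresses the scalability and efficiency of large-scale machine learning models under memory constraints. Its key contribution is a set of joint scaling laws for dense and Mixture of Experts (MoE) models that account for the number of active parameters, dataset size, and the number of experts. Over 280 experiments validate the theoretical predictions and show that, under fixed memory and compute budgets, MoE models can use memory more efficiently than dense models, contradicting conventional wisdom.

Link: https://arxiv.org/abs/2502.05172
Authors: Jan Ludziejewski, Maciej Pióro, Jakub Krajewski, Maciej Stefaniak, Michał Krutul, Jan Małaśnicki, Marek Cygan, Piotr Sankowski, Kamil Adamczewski, Piotr Miłoś, Sebastian Jaszczur
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
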
Abstract:Mixture of Experts (MoE) architectures have significantly increased computational efficiency in both research and real-world applications of large-scale machine learning models. However, their scalability and efficiency under memory constraints remain relatively underexplored. In this work, we present joint scaling laws for dense and MoE models, incorporating key factors such as the number of active parameters, dataset size, and the number of experts. Our findings provide a principled framework for selecting the optimal MoE configuration under fixed memory and compute budgets. Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom. To derive and validate the theoretical predictions of our scaling laws, we conduct over 280 experiments with up to 2.7B active parameters and up to 5B total parameters. These results offer actionable insights for designing and deploying MoE models in practical large-scale training scenarios.

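[NLP-1] Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Quick read: This paper addresses how to scale the compute a language model spends on reasoning tasks. It proposes a novel architecture that scales test-time computation by reasoning implicitly in latent space: the model iterates a recurrent block and can be unrolled to arbitrary depth at test time, in contrast to mainstream approaches that scale compute by generating more tokens. The approach requires no specialized training data, works with small context windows, and can capture types of reasoning that are hard to express in words.

Link: https://arxiv.org/abs/2502.05171
Authors: Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: The model is available at this https URL. Code and data recipe can be found at this https URL
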

Abstract:We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.
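The idea of unrolling one shared block to a depth chosen at test time can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the names recurrent_block and unroll, the scalar weight, and the tanh update are hypothetical and are not the architecture from the paper. The only point is that the same parameters are reused at every iteration, so extra compute comes from more iterations rather than more tokens.

```python
import math


def recurrent_block(state, embedded, weight=0.5):
    """One application of a shared block (toy stand-in, not the paper's block)."""
    return [math.tanh(weight * s + e) for s, e in zip(state, embedded)]


def unroll(embedded, num_iterations):
    """Apply the same block num_iterations times to a latent state."""
    state = [0.0] * len(embedded)
    for _ in range(num_iterations):
        state = recurrent_block(state, embedded)
    return state


embedded_input = [0.3, -0.1]  # hypothetical embedded input
shallow = unroll(embedded_input, num_iterations=2)   # less test-time compute
deep = unroll(embedded_input, num_iterations=32)     # more compute, same weights
print(shallow, deep)
```

The depth is an argument of the forward pass rather than a property of the parameters, which is what lets compute be adjusted per query in this sketch.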

[NLP-2] NoLiMa: Long-Context Evaluation Beyond Literal Matching

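Quick read: This paper targets the evaluation of large language models (LLMs) on long contexts (128K to 1M tokens), where performance degrades significantly. Current methods such as the needle-in-a-haystack test share a limitation: models can exploit literal matches to simplify the task. The paper proposes the NoLiMa benchmark, whose carefully designed needle set minimizes lexical overlap between questions and the target information, forcing models to infer latent associations to locate the right information. The key point is that NoLiMa evaluates long-context ability more effectively when direct literal matches are unavailable.

Link: https://arxiv.org/abs/2502.05167
Authors: Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
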

Abstract:Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a “needle” (relevant information) from a “haystack” (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.

[NLP-3] DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails

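Quick read: This paper addresses the lack of guardrail models for detecting unsafe and illegal content in multilingual settings, which stems from the scarcity of open-source safety data in languages other than English. The key is a novel two-player reinforcement learning (RL) framework in which a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. The approach improves performance on multilingual safety tasks and substantially mitigates the data imbalance affecting lower-resource languages.

Link: https://arxiv.org/abs/2502.05163
Authors: Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, Bo Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 24 pages, 9 figures, 5 tables
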

Abstract:The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplored due to the scarcity of open-source safety data in other languages. To address this gap, we propose a novel two-player Reinforcement Learning (RL) framework, where a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. We theoretically formalize this interaction as a two-player game, proving convergence to a Nash equilibrium. Empirical evaluations show that our model DuoGuard outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks while being 4.5x faster at inference with a significantly smaller model (0.5B). We achieve substantial advancements in multilingual safety tasks, particularly in addressing the imbalance for lower-resource languages in a collected real dataset. Ablation studies emphasize the critical role of synthetic data generation in bridging the imbalance in open-source data between English and other languages. These findings establish a scalable and efficient approach to synthetic data generation, paving the way for improved multilingual guardrail models to enhance LLM safety. Code, model, and data will be open-sourced at this https URL.

[NLP-4] A Lightweight Method to Disrupt Memorized Sequences in LLM

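Quick read: This paper addresses the risk that large language models (LLMs) reproduce copyrighted content verbatim, which raises legal and ethical concerns. The key solution is TokenSwap, a lightweight post-hoc method that replaces the probabilities of grammar-related tokens with those from a small auxiliary model (e.g., DistilGPT-2). Experiments show that TokenSwap reduces well-known cases of memorized generation by up to 10x with little to no impact on downstream tasks, offering an accessible and effective option for users of real-world systems.

Link: https://arxiv.org/abs/2502.05159
Authors: Parjanya Prajakta Prashant, Kaustubh Ponkshe, Babak Salimi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 20 pages, 2 figures
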

Abstract:Large language models (LLMs) demonstrate impressive capabilities across many tasks yet risk reproducing copyrighted content verbatim, raising legal and ethical concerns. Although methods like differential privacy or neuron editing can reduce memorization, they typically require costly retraining or direct access to model weights and may degrade performance. To address these challenges, we propose TokenSwap, a lightweight, post-hoc approach that replaces the probabilities of grammar-related tokens with those from a small auxiliary model (e.g., DistilGPT-2). We run extensive experiments on commercial-grade models such as Pythia-6.9b and LLaMA-3-8b and demonstrate that our method effectively reduces well-known cases of memorized generation by up to 10x with little to no impact on downstream tasks. Our approach offers a uniquely accessible and effective solution to users of real-world systems.
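The probability replacement described in the abstract can be sketched on a single next-token distribution. This is a minimal illustration under stated assumptions, not the authors' implementation: the token set GRAMMAR_TOKENS, the function name token_swap, and the renormalization step are all hypothetical choices made here, and real distributions would come from the main and auxiliary models rather than hand-written dictionaries.

```python
# Hypothetical subset of grammar-related tokens; the paper's actual list is not given here.
GRAMMAR_TOKENS = {"the", "of", "and", "a", "to"}


def token_swap(main_probs, aux_probs, grammar_tokens=GRAMMAR_TOKENS):
    """Replace grammar-token probabilities with the auxiliary model's values.

    Renormalizing afterwards is an assumption of this sketch.
    """
    mixed = dict(main_probs)
    for token in grammar_tokens:
        if token in mixed and token in aux_probs:
            mixed[token] = aux_probs[token]
    total = sum(mixed.values())
    return {token: prob / total for token, prob in mixed.items()}


# Toy next-token distributions (made-up numbers).
main_model = {"the": 0.7, "cat": 0.2, "dog": 0.1}
small_model = {"the": 0.3, "cat": 0.4, "dog": 0.3}
swapped = token_swap(main_model, small_model)
print(swapped)
```

In this sketch the relative odds between non-grammar tokens are left to the main model, while the grammar token takes its mass from the small model.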

[NLP-5] Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation

Quick read: This survey examines how large multimodal language models are being applied to scientific research and the technological transformation they bring. It covers five areas: (1) literature search; (2) research idea generation and experimentation; (3) text-based content generation; (4) multimodal content generation (e.g., scientific figures and diagrams); and (5) AI-based automatic peer review. For each, it reviews the relevant datasets, methods, results, and evaluation, and notes limitations and directions for future research. Ethical concerns, including potential misuse (fake science, plagiarism, harms to research integrity), feature prominently. The authors hope the survey will serve both as a reference guide for newcomers and as a catalyst for new AI-based initiatives in "AI4Science".

Link: https://arxiv.org/abs/2502.05151
Authors: Steffen Eger, Yong Cao, Jennifer D’Souza, Andreas Geiger, Christian Greisinger, Stephanie Gross, Yufang Hou, Brigitte Krenn, Anne Lauscher, Yizhi Li, Chenghua Lin, Nafise Sadat Moosavi, Wei Zhao, Tristan Miller
Affiliations: University of Technology Nuremberg (UTN); University of Tübingen, Tübingen AI Center; TIB Leibniz Information Centre for Science and Technology; Austrian Research Institute for Artificial Intelligence; IT:U Interdisciplinary Transformation University Austria; University of Hamburg; University of Manchester; University of Sheffield; University of Aberdeen; University of Manitoba
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Work in progress. Will be updated soon


Abstract:With the advent of large multimodal language models, science is now at a threshold of an AI-based technological transformation. Recently, a plethora of new AI models and tools has been proposed, promising to empower researchers and academics worldwide to conduct their research more effectively and efficiently. This includes all aspects of the research cycle, especially (1) searching for relevant literature; (2) generating research ideas and conducting experimentation; generating (3) text-based and (4) multimodal content (e.g., scientific figures and diagrams); and (5) AI-based automatic peer review. In this survey, we provide an in-depth overview over these exciting recent developments, which promise to fundamentally alter the scientific research process for good. Our survey covers the five aspects outlined above, indicating relevant datasets, methods and results (including evaluation) as well as limitations and scope for future research. Ethical concerns regarding shortcomings of these tools and potential for misuse (fake science, plagiarism, harms to research integrity) take a particularly prominent place in our discussion. We hope that our survey will not only become a reference guide for newcomers to the field but also a catalyst for new AI-based initiatives in the area of “AI4Science”.

[NLP-6] CodeSCM: Causal Analysis for Multi-Modal Code Generation NAACL2025

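Quick read: This paper analyzes how different prompt modalities (natural language, code, and input-output examples) affect large language models (LLMs) in multi-modal code generation. It proposes CodeSCM, a Structural Causal Model (SCM) that introduces latent mediator variables to separate the code and natural-language semantics of a multi-modal code generation prompt, and applies causal mediation analysis to these mediators to quantify direct effects, which reveal the model's spurious leanings.

Link: https://arxiv.org/abs/2502.05150
Authors: Mukur Gupta, Noopur Bhatt, Suman Jana
Affiliations: Columbia University
Categories: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025
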

Abstract:In this paper, we propose CodeSCM, a Structural Causal Model (SCM) for analyzing multi-modal code generation using large language models (LLMs). By applying interventions to CodeSCM, we measure the causal effects of different prompt modalities, such as natural language, code, and input-output examples, on the model. CodeSCM introduces latent mediator variables to separate the code and natural language semantics of a multi-modal code generation prompt. Using the principles of Causal Mediation Analysis on these mediators we quantify direct effects representing the model’s spurious leanings. We find that, in addition to natural language instructions, input-output examples significantly influence code generation.

[NLP-7] An Annotated Reading of The Singer of Tales in the LLM Era

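Quick read: This paper explores the similarities and differences between the Parry-Lord oral-formulaic theory and large language models (LLMs) and generative artificial intelligence (AI). It rereads the mechanisms by which oral narrative poetry is learned, composed, and transmitted through the lens of LLMs and generative AI, and discusses the implications for society and AI policy.

Link: https://arxiv.org/abs/2502.05148
Authors: Kush R. Varshney
Affiliations: IBM Research – Thomas J. Watson Research Center
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:
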

Abstract:The Parry-Lord oral-formulaic theory was a breakthrough in understanding how oral narrative poetry is learned, composed, and transmitted by illiterate bards. In this paper, we provide an annotated reading of the mechanism underlying this theory from the lens of large language models (LLMs) and generative artificial intelligence (AI). We point out the similarities and differences between oral composition and LLM generation, and comment on the implications for society and AI policy.

[NLP-8] GiesKaNe: Bridging Past and Present in Grammatical Theory and Practical Application

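Quick read: This paper addresses the methodological complexity of compiling a comprehensive historical corpus in the GiesKaNe project. The key is to balance innovation with adherence to standards through complementary human expertise and machine-assisted processes, including machine-assisted classification of texts and the derivation of de facto standard annotations from existing annotations. The workflow shows that such a project can be implemented efficiently on existing research infrastructure, using only a simple spreadsheet rather than specialized annotation tools.

Link: https://arxiv.org/abs/2502.05113
Authors: Volker Emmrich
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
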

Abstract:This article explores the requirements for corpus compilation within the GiesKaNe project (University of Giessen and Kassel, Syntactic Basic Structures of New High German). The project is defined by three central characteristics: it is a reference corpus, a historical corpus, and a syntactically deeply annotated treebank. As a historical corpus, GiesKaNe aims to establish connections with both historical and contemporary corpora, ensuring its relevance across temporal and linguistic contexts. The compilation process strikes the balance between innovation and adherence to standards, addressing both internal project goals and the broader interests of the research community. The methodological complexity of such a project is managed through a complementary interplay of human expertise and machine-assisted processes. The article discusses foundational topics such as tokenization, normalization, sentence definition, tagging, parsing, and inter-annotator agreement, alongside advanced considerations. These include comparisons between grammatical models, annotation schemas, and established de facto annotation standards as well as the integration of human and machine collaboration. Notably, a novel method for machine-assisted classification of texts along the continuum of conceptual orality and literacy is proposed, offering new perspectives on text selection. Furthermore, the article introduces an approach to deriving de facto standard annotations from existing ones, mediating between standardization and innovation. In the course of describing the workflow the article demonstrates that even ambitious projects like GiesKaNe can be effectively implemented using existing research infrastructure, requiring no specialized annotation tools. Instead, it is shown that the workflow can be based on the strategic use of a simple spreadsheet and integrates the capabilities of the existing infrastructure.

[NLP-9] Flexible and Efficient Grammar-Constrained Decoding

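Quick read: This paper addresses the generation by large language models (LLMs) of structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Existing grammar-constrained decoding (GCD) algorithms take tens of minutes to preprocess common grammars, which limits their practical use. The paper presents a new GCD algorithm that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.

Link: https://arxiv.org/abs/2502.05111
Authors: Kanghee Park, Timothy Zhou, Loris D’Antoni
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
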

Abstract:Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that will provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can align with the tokens used by a given context-free grammar and compute token masks based on this information. Doing so efficiently is challenging and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm together with an implementation that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.
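The token-masking step that GCD relies on can be illustrated with a deliberately naive sketch. It is not the algorithm from the paper: the grammar is a toy balanced-parentheses language, the check is a brute-force scan with no offline preprocessing, and the names is_viable_prefix, token_mask, and constrained_argmax are hypothetical. It does show why multi-character subword tokens matter: a token such as "))" has to be checked as a whole against the current prefix.

```python
def is_viable_prefix(text):
    """True if text can still be extended to a balanced-parentheses string."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        else:
            return False  # character outside the toy grammar's alphabet
        if depth < 0:
            return False  # closed more parentheses than were opened
    return True


def token_mask(prefix, vocab):
    """One boolean per vocabulary token: may it follow prefix?"""
    return [is_viable_prefix(prefix + token) for token in vocab]


def constrained_argmax(prefix, vocab, scores):
    """Pick the highest-scoring token among those the mask allows."""
    allowed = [(score, token)
               for score, token, ok in zip(scores, vocab, token_mask(prefix, vocab))
               if ok]
    if not allowed:
        raise ValueError("no token keeps the output inside the grammar")
    return max(allowed)[1]


vocab = ["(", ")", "))", "()", "x"]   # toy subword vocabulary
scores = [0.1, 0.9, 0.5, 0.2, 0.8]    # made-up model scores
print(token_mask("(", vocab))
print(constrained_argmax("", vocab, scores))
```

Recomputing the mask by scanning every token at every step is what makes naive approaches slow on large vocabularies, which is the cost that offline preprocessing aims to remove.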

[NLP-10] Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs

Quick read: This paper examines how well multimodal large language models (MLLMs) interpret time and date. The key contribution is a structured dataset with two subsets, ClockQA and CalendarQA. ClockQA pairs clock-face images in various styles with time-related questions; CalendarQA pairs yearly calendar images with questions ranging from commonly known holidays to computationally derived dates. With this dataset, the study analyzes how MLLMs perform visual recognition, numerical reasoning, and temporal inference on time-related visual data. The results show that, despite recent advances, reliably understanding time remains a significant challenge for MLLMs.

Link: https://arxiv.org/abs/2502.05092
Authors: Rohit Saxena, Aryo Pradipta Gema, Pasquale Minervini
Affiliations: ILCC, School of Informatics, University of Edinburgh; Miniml.AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint


Abstract:Understanding time from visual representations is a fundamental cognitive skill, yet it remains a challenge for multimodal large language models (MLLMs). In this work, we investigate the capabilities of MLLMs in interpreting time and date through analogue clocks and yearly calendars. To facilitate this, we curated a structured dataset comprising two subsets: 1) ClockQA, which comprises various types of clock styles - standard, black-dial, no-second-hand, Roman numeral, and arrow-hand clocks - paired with time related questions; and 2) CalendarQA, which consists of yearly calendar images with questions ranging from commonly known dates (e.g., Christmas, New Year’s Day) to computationally derived ones (e.g., the 100th or 153rd day of the year). We aim to analyse how MLLMs can perform visual recognition, numerical reasoning, and temporal inference when presented with time-related visual data. Our evaluations show that despite recent advancements, reliably understanding time remains a significant challenge for MLLMs.
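The "computationally derived" calendar questions mentioned in the abstract reduce to simple date arithmetic, which makes a reference answer easy to compute. The sketch below uses Python's standard datetime module; the helper name nth_day_of_year and the choice of 2025 and 2024 as example years are assumptions for illustration, since the abstract does not say which years the calendar images show.

```python
from datetime import date, timedelta


def nth_day_of_year(year, n):
    """Return the calendar date of the n-th day of the given year (1-based)."""
    return date(year, 1, 1) + timedelta(days=n - 1)


# Example years are hypothetical; the abstract only names the 100th and 153rd day.
for day_number in (100, 153):
    print(day_number, nth_day_of_year(2025, day_number).isoformat())

# Leap years shift every date after February by one day.
print(nth_day_of_year(2024, 100).isoformat())
```

The leap-year case shows why such questions need numerical reasoning rather than recall: the same day number lands on a different date depending on the year.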

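[NLP-11] Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs

Quick read: This paper addresses the risk that, in federated learning (FL), large language models memorize training data so that other participants' data can be recovered through prefix prompting. The key finding is that low-rank adaptation (LoRA), a simple fine-tuning strategy, reduces memorization by up to a factor of 10. The effect is studied on a medical question-answering fine-tuning task. LoRA also reduces memorization in centralized learning, and it can be combined with other privacy-preserving techniques to further improve record-level privacy while maintaining performance.

Link: https://arxiv.org/abs/2502.05087
Authors: Thierry Bossy, Julien Vignoud, Tahseen Rabbani, Juan R. Troncoso Pastoriza, Martin Jaggi
Affiliations: Tune Insight SA; EPFL; Yale University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
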

Abstract:Federated learning (FL) is a popular paradigm for collaborative training which avoids direct data exposure between clients. However, data privacy issues still remain: FL-trained large language models are capable of memorizing and completing phrases and sentences contained in training data when given with their prefixes. Thus, it is possible for adversarial and honest-but-curious clients to recover training data of other participants simply through targeted prompting. In this work, we demonstrate that a popular and simple fine-tuning strategy, low-rank adaptation (LoRA), reduces memorization during FL up to a factor of 10. We study this effect by performing a medical question-answering fine-tuning task and injecting multiple replicas of out-of-distribution sensitive sequences drawn from an external clinical dataset. We observe a reduction in memorization for a wide variety of Llama 2 and 3 models, and find that LoRA can reduce memorization in centralized learning as well. Furthermore, we show that LoRA can be combined with other privacy-preserving techniques such as gradient clipping and Gaussian noising, secure aggregation, and Goldfish loss to further improve record-level privacy while maintaining performance.
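For readers unfamiliar with LoRA, the standard formulation keeps the pretrained weight W frozen and learns a low-rank update, so the effective weight is W + (alpha / r) * B A, with B initialized to zero. The sketch below shows only that arithmetic in plain Python. The helper names, matrix sizes, and numbers are illustrative assumptions, and nothing here reproduces the paper's federated setup or its memorization measurements.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * B @ A; W stays frozen, only B and A are trained."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy sizes (assumed): W is 2x3, rank r = 1, so B is 2x1 and A is 1x3.
w = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
a = [[1.0, 0.0, -1.0]]
b_init = [[0.0], [0.0]]      # zero init: the adapted model starts identical to W
b_trained = [[1.0], [2.0]]   # made-up values standing in for a trained B

print(lora_effective_weight(w, b_init, a, alpha=2, rank=1))
print(lora_effective_weight(w, b_trained, a, alpha=2, rank=1))
```

With d_out x d_in = 2 x 3 and r = 1 the saving is negligible (5 trainable values instead of 6), but the count r * (d_in + d_out) grows far more slowly than d_in * d_out at realistic layer sizes.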

[NLP-12] ChallengeMe: An Adversarial Learning-enabled Text Summarization Framework

Quick read: This paper addresses hallucination and lack of specificity in content generated by large language models (LLMs) for vertical domain tasks. The key solution is ChallengeMe, an adversarial learning-based prompt framework with three cascaded steps: generation prompts, evaluation prompts, and feedback optimization. The authors also design seven core optimization dimensions and set a threshold for adversarial learning.

Link: https://arxiv.org/abs/2502.05084
Authors: Xiaoyu Deng, Ye Zhang, Tianmin Guo, Yongzhe Zhang, Zhengjian Kang, Hang Yang
Affiliations: Fordham University; University of Pittsburgh; New York University; California Institute of Technology; New York University; University of Miami; University of Maryland
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:The astonishing performance of large language models (LLMs) and their remarkable achievements in production and daily life have led to their widespread application in collaborative tasks. However, current large models face challenges such as hallucination and lack of specificity in content generation in vertical domain tasks. Inspired by the contrast and classification mechanisms in human cognitive processes, this paper constructs an adversarial learning-based prompt framework named ChallengeMe, which includes three cascaded solutions: generation prompts, evaluation prompts, and feedback optimization. In this process, we designed seven core optimization dimensions and set the threshold for adversarial learning. The results of mixed case studies on the text summarization task show that the proposed framework can generate more accurate and fluent text summaries compared to the current advanced mainstream LLMs.

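[NLP-13] Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures

Quick read: This paper addresses the dependence of large language model (LLM) reasoning performance on the prompting strategy and model scale. The key solution is Adaptive Graph of Thoughts (AGoT), a dynamic, graph-based inference framework that enhances LLM reasoning solely at test time. Unlike fixed-step methods such as Chain of Thought (CoT) or Tree of Thoughts (ToT), AGoT recursively decomposes complex queries into structured subproblems, forming a dynamic directed acyclic graph (DAG) of interdependent reasoning steps. Because it expands only those subproblems that need further analysis, it unifies the strengths of the chain, tree, and graph paradigms and allocates computation where it is most needed.

Link: https://arxiv.org/abs/2502.05078
Authors: Tushar Pandey, Ara Ghukasyan, Oktay Goktas, Santosh Kumar Radha
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
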

Abstract:Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet their performance is highly dependent on the prompting strategy and model scale. While reinforcement learning and fine-tuning have been deployed to boost reasoning, these approaches incur substantial computational and data overhead. In this work, we introduce Adaptive Graph of Thoughts (AGoT), a dynamic, graph-based inference framework that enhances LLM reasoning solely at test time. Rather than relying on fixed-step methods like Chain of Thought (CoT) or Tree of Thoughts (ToT), AGoT recursively decomposes complex queries into structured subproblems, forming a dynamic directed acyclic graph (DAG) of interdependent reasoning steps. By selectively expanding only those subproblems that require further analysis, AGoT unifies the strengths of chain, tree, and graph paradigms into a cohesive framework that allocates computation where it is most needed. We validate our approach on diverse benchmarks spanning multi-hop retrieval, scientific reasoning, and mathematical problem-solving, achieving up to 46.2% improvement on scientific reasoning tasks (GPQA) - comparable to gains achieved through computationally intensive reinforcement learning approaches and outperforming state-of-the-art iterative approaches. These results suggest that dynamic decomposition and structured recursion offer a scalable, cost-effective alternative to post-training modifications, paving the way for more robust, general-purpose reasoning in LLMs.

[NLP-14] Paying Attention to Facts: Quantifying the Knowledge Capacity of Attention Layers

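Quick read: This paper investigates, from a linear-algebraic perspective, the ability of single-layer attention-only transformers to memorize facts contained in databases. It associates each database with a 3-tensor, proposes the rank of this tensor as a measure of database size, and bounds the rank in terms of properties of the database. It also defines a 3-tensor corresponding to an attention layer and shows empirically, on a dataset of toy models and random databases, how its rank relates to the database rank. By highlighting the roles of the value-output and query-key weights and the effects of argmax and softmax on rank, the work sheds light on the "additive motif" of factual recall in transformers and suggests a way to increase layer capacity without increasing the number of parameters.

Link: https://arxiv.org/abs/2502.05076
Authors: Liang Ze Wong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
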

Abstract:In this paper, we investigate the ability of single-layer attention-only transformers (i.e. attention layers) to memorize facts contained in databases from a linear-algebraic perspective. We associate with each database a 3-tensor, propose the rank of this tensor as a measure of the size of the database, and provide bounds on the rank in terms of properties of the database. We also define a 3-tensor corresponding to an attention layer, and empirically demonstrate the relationship between its rank and database rank on a dataset of toy models and random databases. By highlighting the roles played by the value-output and query-key weights, and the effects of argmax and softmax on rank, our results shed light on the ‘additive motif’ of factual recall in transformers, while also suggesting a way of increasing layer capacity without increasing the number of parameters.

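[NLP-15] nvAgent: Automated Data Visualization from Natural Language via Collaborative Agent Workflow

Quick read: This paper addresses the difficulty large language models (LLMs) have with complex Natural Language to Visualization (NL2Vis) queries that require reasoning across multiple tables. The key solution is nvAgent, a collaborative agent workflow comprising a processor agent for database processing and context filtering, a composer agent for planning visualization generation, and a validator agent for code translation and output verification. It improves performance in both single-table and multi-table scenarios.

Link: https://arxiv.org/abs/2502.05036
Authors: Geliang Ouyang, Jingyao Chen, Zhihe Nie, Yi Gui, Yao Wan, Hongyu Zhang, Dongping Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
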

Abstract:Natural Language to Visualization (NL2Vis) seeks to convert natural-language descriptions into visual representations of given tables, empowering users to derive insights from large-scale data. Recent advancements in Large Language Models (LLMs) show promise in automating code generation to transform tabular data into accessible visualizations. However, they often struggle with complex queries that require reasoning across multiple tables. To address this limitation, we propose a collaborative agent workflow, termed nvAgent, for NL2Vis. Specifically, nvAgent comprises three agents: a processor agent for database processing and context filtering, a composer agent for planning visualization generation, and a validator agent for code translation and output verification. Comprehensive evaluations on the new VisEval benchmark demonstrate that nvAgent consistently surpasses state-of-the-art baselines, achieving a 7.88% improvement in single-table and a 9.23% improvement in multi-table scenarios. Qualitative analyses further highlight that nvAgent maintains nearly a 20% performance margin over previous models, underscoring its capacity to produce high-quality visual representations from complex, heterogeneous data sources.

[NLP-16] Aligning Black-box Language Models with Human Judgments NAACL2025

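Quick read: This paper addresses the mismatch between large language models (LLMs) used as automated judges and human evaluators. The key solution is a simple yet effective framework that learns a linear mapping between the LLM's outputs and human judgments, without retraining or fine-tuning the LLM. It achieves over 142% average improvement in agreement, works in zero-shot and few-shot settings, and enables smaller LLMs to approach the performance of larger models.

Link: https://arxiv.org/abs/2502.04997
Authors: Gerrit J. J. van den Burg, Gen Suzuki, Wei Liu, Murat Sensoy
Affiliations: Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication at NAACL 2025 (Findings)
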

Abstract:Large language models (LLMs) are increasingly used as automated judges to evaluate recommendation systems, search engines, and other subjective tasks, where relying on human evaluators can be costly, time-consuming, and unscalable. LLMs offer an efficient solution for continuous, automated evaluation. However, since the systems that are built and improved with these judgments are ultimately designed for human use, it is crucial that LLM judgments align closely with human evaluators to ensure such systems remain human-centered. On the other hand, aligning LLM judgments with human evaluators is challenging due to individual variability and biases in human judgments. We propose a simple yet effective framework to align LLM judgments with individual human evaluators or their aggregated judgments, without retraining or fine-tuning the LLM. Our approach learns a linear mapping between the LLM’s outputs and human judgments, achieving over 142% average improvement in agreement across 29 tasks with only a small number of calibration examples used for training. Notably, our method works in zero-shot and few-shot settings, exceeds inter-human agreement on four out of six tasks, and enables smaller LLMs to achieve performance comparable to that of larger models.
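The abstract does not specify the form of the linear mapping, so the following is only one plausible reading, offered as a sketch: a one-dimensional least-squares fit from an LLM's numeric score to a human score, estimated on a handful of calibration examples. The function names and the calibration numbers are hypothetical, and the paper's actual inputs (for example, output probabilities over labels) may well differ.

```python
def fit_linear_map(llm_scores, human_scores):
    """Ordinary least squares for human ~ slope * llm + intercept."""
    n = len(llm_scores)
    mean_x = sum(llm_scores) / n
    mean_y = sum(human_scores) / n
    var_x = sum((x - mean_x) ** 2 for x in llm_scores)
    if var_x == 0:
        raise ValueError("calibration LLM scores must not all be identical")
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(llm_scores, human_scores))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def calibrate(llm_score, slope, intercept):
    """Map a new LLM score onto the human rating scale."""
    return slope * llm_score + intercept


# Made-up calibration set: the LLM judge is systematically harsher than the human.
llm_calibration = [1.0, 2.0, 3.0, 4.0]
human_calibration = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_linear_map(llm_calibration, human_calibration)
print(slope, intercept, calibrate(2.5, slope, intercept))
```

Because only the mapping is fitted, the LLM itself can remain a black box, which matches the abstract's point that no retraining or fine-tuning is needed.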

[NLP-17] CoCoA: A Generalized Approach to Uncertainty Quantification by Integrating Confidence and Consistency of LLM Outputs

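Quick read: This paper addresses the fact that uncertainty quantification (UQ) methods for large language models (LLMs) underperform on certain tasks. The key is a new way of synthesizing model confidence and output consistency, yielding a family of efficient and robust UQ methods. It builds on newly identified characteristics of LLMs as probabilistic models and shows sizable improvements over state-of-the-art UQ approaches on question answering, abstractive summarization, and machine translation.

Link: https://arxiv.org/abs/2502.04964
Authors: Roman Vashurin, Maiya Goloburda, Preslav Nakov, Artem Shelmanov, Maxim Panov
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computation and Language (cs.CL)
Comments:
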

Abstract:Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches, with two major types being particularly prominent: information-based, which focus on model confidence expressed as token probabilities, and consistency-based, which assess the semantic relationship between multiple outputs generated using repeated sampling. Several recent methods have combined these two approaches and shown impressive performance in various applications. However, they sometimes fail to outperform much simpler baseline methods. Our investigation reveals distinctive characteristics of LLMs as probabilistic models, which help to explain why these UQ methods underperform in certain tasks. Based on these findings, we propose a new way of synthesizing model confidence and output consistency that leads to a family of efficient and robust UQ methods. We evaluate our approach across a variety of tasks such as question answering, abstractive summarization, and machine translation, demonstrating sizable improvements over state-of-the-art UQ approaches.

[NLP-18] Commonality and Individuality! Integrating Humor Commonality with Speaker Individuality for Humor Recognition NAACL2025

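【Quick Read】: This paper addresses two main problems in humor recognition: existing methods focus on only one aspect of humor commonality and ignore the multifaceted nature of humor, and they overlook the critical role of speaker individuality in fully understanding humor expressions. To address these, the paper proposes a new model, the Commonality and Individuality Incorporated Network for Humor Recognition (CIHR). The key idea of CIHR is to improve humor recognition by integrating multifaceted humor commonalities with each speaker's distinctive individuality, realized through a Humor Commonality Analysis module, a Speaker Individuality Extraction module, and Static and Dynamic Fusion modules that combine the two.

Link: https://arxiv.org/abs/2502.04960
Authors: Haohao Zhu,Junyu Lu,Zeyuan Zeng,Zewen Bai,Xiaokun Zhang,Liang Yang,Hongfei Lin
Affiliations: School of Computer Science and Technology, Dalian University of Technology, China
Categories: Computation and Language (cs.CL)
Comments: Accepted by NAACL 2025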

Abstract:Humor recognition aims to identify whether a specific speaker’s text is humorous. Current methods for humor recognition mainly suffer from two limitations: (1) they solely focus on one aspect of humor commonalities, ignoring the multifaceted nature of humor; and (2) they typically overlook the critical role of speaker individuality, which is essential for a comprehensive understanding of humor expressions. To bridge these gaps, we introduce the Commonality and Individuality Incorporated Network for Humor Recognition (CIHR), a novel model designed to enhance humor recognition by integrating multifaceted humor commonalities with the distinctive individuality of speakers. The CIHR features a Humor Commonality Analysis module that explores various perspectives of multifaceted humor commonality within user texts, and a Speaker Individuality Extraction module that captures both static and dynamic aspects of a speaker’s profile to accurately model their distinctive individuality. Additionally, Static and Dynamic Fusion modules are introduced to effectively incorporate the humor commonality with speaker’s individuality in the humor recognition process. Extensive experiments demonstrate the effectiveness of CIHR, underscoring the importance of concurrently addressing both multifaceted humor commonality and distinctive speaker individuality in humor recognition.

[NLP-19] SSMLoRA: Enhancing Low-Rank Adaptation with State Space Model NAACL2025

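【Quick Read】: This paper addresses inefficient parameter use when fine-tuning large language models. As model sizes grow, updating all model parameters becomes impractical. The paper proposes SSMLoRA (State Space Model Low-Rank Adaptation), an extension of LoRA that uses a State Space Model (SSM) to interconnect low-rank matrices and thereby use parameters more efficiently. The key point is that SSMLoRA not only maps inputs to a low-rank space for better feature extraction but also reuses computations from the previous low-rank space, reducing the number of inserted parameters while maintaining performance.

Link: https://arxiv.org/abs/2502.04958
Authors: Jiayang Yu,Yihang Zhang,Bin Wang,Peiqin Lin,Yongkang Liu,Shi Feng
Affiliations: Northeastern University, China; CIS, LMU Munich, Germany; Munich Center for Machine Learning (MCML), Germany
Categories: Computation and Language (cs.CL)
Comments: Has been accepted by NAACL 2025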

Abstract:Fine-tuning is a key approach for adapting language models to specific downstream tasks, but updating all model parameters becomes impractical as model sizes increase. Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), address this challenge by introducing additional adaptation parameters into pre-trained weight matrices. However, LoRA’s performance varies across different insertion points within the model, highlighting potential parameter inefficiency due to unnecessary insertions. To this end, we propose SSMLoRA (State Space Model Low-Rank Adaptation), an extension of LoRA that incorporates a State Space Model (SSM) to interconnect low-rank matrices. SSMLoRA ensures that performance is maintained even with sparser insertions. SSMLoRA allows the model to not only map inputs to a low-rank space for better feature extraction but also leverage the computations from the previous low-rank space. Our method achieves comparable performance to LoRA on the General Language Understanding Evaluation (GLUE) benchmark while using only half the parameters. Additionally, due to its structure, SSMLoRA shows promise in handling tasks with longer input sequences. You can find our code here: this https URL.
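
The idea of reusing the previous low-rank space can be illustrated with a small sketch. `lora_forward` is the standard LoRA update (a frozen weight plus a low-rank product); `linked_lora_forward` is a hypothetical stand-in for the SSM linkage that simply carries a decaying state between insertion points. The recurrence, the `decay` parameter, and all names are assumptions for illustration, not the paper's parameterization.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Standard LoRA: y = W x + scale * B (A x).

    A has shape (r, d_in) and B has shape (d_out, r).
    Returns the output and the low-rank activation A x.
    """
    base = matvec(W, x)
    low_rank = matvec(A, x)       # project into the rank-r space
    update = matvec(B, low_rank)  # project back to the output space
    return [b + scale * u for b, u in zip(base, update)], low_rank


def linked_lora_forward(W, A, B, x, prev_state, decay=0.5, scale=1.0):
    """Hypothetical linkage: mix the previous low-rank state into this one."""
    base = matvec(W, x)
    state = [decay * h + z for h, z in zip(prev_state, matvec(A, x))]
    update = matvec(B, state)
    return [b + scale * u for b, u in zip(base, update)], state
```

Passing the returned `state` of one insertion point as `prev_state` of the next is what lets later layers benefit from earlier low-rank computations in this sketch.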

[NLP-20] Claim Extraction for Fact-Checking: Data Models and Automated Metrics

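【Quick Read】: This paper tackles Claim Extraction using one-to-many text generation, comparing large language models (LLMs), small summarization models fine-tuned for the task, and a previous named entity recognition (NER)-centric baseline, QACG. The key contribution is an evaluation framework comprising metrics such as Atomicity, Fluency, Decontextualization, and Faithfulness, validated against human grading: the model ranking on F_fact, the hardest metric, did not change, and the framework approximates human grading very closely. The paper also releases the FEVERFact dataset, containing 17K atomic factual claims extracted from 4K contextualised Wikipedia sentences.

Link: https://arxiv.org/abs/2502.04955
Authors: Herbert Ullrich,Tomáš Mlynář,Jan Drchal
Affiliations: AI Center @ CTU FEE
Categories: Computation and Language (cs.CL)
Comments: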

Abstract:In this paper, we explore the problem of Claim Extraction using one-to-many text generation methods, comparing LLMs, small summarization models finetuned for the task, and a previous NER-centric baseline QACG. As the current publications on Claim Extraction, Fact Extraction, Claim Generation and Check-worthy Claim Detection are quite scattered in their means and terminology, we compile their common objectives, releasing the FEVERFact dataset, with 17K atomic factual claims extracted from 4K contextualised Wikipedia sentences, adapted from the original FEVER. We compile the known objectives into an Evaluation framework of: Atomicity, Fluency, Decontextualization, Faithfulness checked for each generated claim separately, and Focus and Coverage measured against the full set of predicted claims for a single input. For each metric, we implement a scale using a reduction to an already-explored NLP task. We validate our metrics against human grading of generic claims, to see that the model ranking on F_fact, our hardest metric, did not change and the evaluation framework approximates human grading very closely in terms of F_1 and RMSE.
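
The abstract reports agreement with human grading in terms of F_1 and RMSE. A minimal sketch of both measures follows; the function names and the idea of comparing per-claim automatic scores with human grades are illustrative assumptions about how such a validation could be set up, not the paper's evaluation code.

```python
import math


def rmse(predicted, human):
    """Root mean squared error between automatic scores and human grades."""
    squared = sum((p - h) ** 2 for p, h in zip(predicted, human))
    return math.sqrt(squared / len(human))


def f1_score(predicted, human):
    """F_1 for binary decisions, treating human grades as the reference."""
    tp = sum(1 for p, h in zip(predicted, human) if p and h)
    fp = sum(1 for p, h in zip(predicted, human) if p and not h)
    fn = sum(1 for p, h in zip(predicted, human) if not p and h)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

RMSE suits graded scales, while F_1 suits yes/no judgments such as whether a claim is atomic.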

[NLP-21] Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance

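【Quick Read】: This paper addresses the weak automatic speech recognition (ASR) performance for low-resource languages such as Frisian and its dialects. The key solution is to fine-tune a self-supervised learning (SSL)-based model with multilingual (Frisian, Dutch, English, and German) fine-tuning data and an auxiliary language identification task, which improves Frisian ASR performance. The study also finds that performance on dialectal speech drops substantially, and that this effect is moderated by the elicitation approach used to collect the dialectal data. The paper advises against relying solely on standard-language data for ASR evaluation, particularly for languages with substantial dialectal variation.

Link: https://arxiv.org/abs/2502.04883
Authors: Reihaneh Amooie,Wietse de Vries,Yun Hao,Jelske Dijkstra,Matt Coler,Martijn Wieling
Affiliations: University of Groningen
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: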

Abstract:Automatic Speech Recognition (ASR) performance for low-resource languages is still far behind that of higher-resource languages such as English, due to a lack of sufficient labeled data. State-of-the-art methods deploy self-supervised transfer learning where a model pre-trained on large amounts of data is fine-tuned using little labeled data in a target low-resource language. In this paper, we present and examine a method for fine-tuning an SSL-based model in order to improve the performance for Frisian and its regional dialects (Clay Frisian, Wood Frisian, and South Frisian). We show that Frisian ASR performance can be improved by using multilingual (Frisian, Dutch, English and German) fine-tuning data and an auxiliary language identification task. In addition, our findings show that performance on dialectal speech suffers substantially, and, importantly, that this effect is moderated by the elicitation approach used to collect the dialectal data. Our findings also particularly suggest that relying solely on standard language data for ASR evaluation may underestimate real-world performance, particularly in languages with substantial dialectal variation.

[NLP-22] pytopicgram: A library for data extraction and topic modeling from Telegram channels

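【Quick Read】: This paper addresses the problem of collecting, organizing, and analyzing large volumes of public communication on Telegram. The key solution is pytopicgram, a Python library that offers easy message retrieval, detailed channel information, engagement metrics, and topic identification using advanced modeling techniques. By simplifying data extraction and analysis, it helps users understand how content spreads and how audiences interact.

Link: https://arxiv.org/abs/2502.04882
Authors: J. Gómez-Romero,J. Cantón Correa,R. Pérez Mercado,F. Prados Abad,M. Molina-Solana,W. Fajardo
Affiliations: Universidad de Granada, Spain
Categories: Computation and Language (cs.CL)
Comments: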

Abstract:Telegram is a popular platform for public communication, generating large amounts of messages through its channels. pytopicgram is a Python library that helps researchers collect, organize, and analyze these Telegram messages. The library offers key features such as easy message retrieval, detailed channel information, engagement metrics, and topic identification using advanced modeling techniques. By simplifying data extraction and analysis, pytopicgram allows users to understand how content spreads and how audiences interact on Telegram. This paper describes the design, main features, and practical uses of pytopicgram, showcasing its effectiveness for studying public conversations on Telegram.

[NLP-23] Enhancing Disinformation Detection with Explainable AI and Named Entity Replacement

【Quick Read】: This paper addresses the challenge of automatically detecting disinformation in natural language processing. The key solution applies a post-hoc explainability method (SHAP, SHapley Additive exPlanations) to identify spurious elements with high impact on classification models, and proposes removing non-informative elements (e.g., URLs and emoticons) and pseudo-anonymizing named entities before training, in order to reduce model bias and improve generalization. This improves the performance of a disinformation classification method on external test data by 65.78% on average, without a significant decrease in internal test performance.

Link: https://arxiv.org/abs/2502.04863
Authors: Santiago González-Silot,Andrés Montoro-Montarroso,Eugenio Martínez Cámara,Juan Gómez-Romero
Affiliations: SINAI research group; Advanced Studies Center in ICT (CEATIC); Universidad de Jaén, Spain; Department of Technologies and Information Systems; Universidad de Castilla-La Mancha, Spain; Department of Computer Science and Artificial Intelligence; Universidad de Granada, Spain
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The automatic detection of disinformation presents a significant challenge in the field of natural language processing. This task addresses a multifaceted societal and communication issue, which needs approaches that extend beyond the identification of general linguistic patterns through data-driven algorithms. In this research work, we hypothesise that text classification methods are not able to capture the nuances of disinformation and they often ground their decision in superfluous features. Hence, we apply a post-hoc explainability method (SHAP, SHapley Additive exPlanations) to identify spurious elements with high impact on the classification models. Our findings show that non-informative elements (e.g., URLs and emoticons) should be removed and named entities (e.g., Rwanda) should be pseudo-anonymized before training to avoid model bias and increase generalization capabilities. We evaluate this methodology on an internal dataset and an external dataset, before and after applying extended data preprocessing and named entity replacement. The results show that our proposal improves the performance of a disinformation classification method on external test data by 65.78% on average, without a significant decrease in internal test performance.
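
The preprocessing described above (dropping URLs and emoticons, then replacing named entities with placeholders) can be sketched as follows. The regular expressions, the placeholder format such as `LOC_1`, and the assumption that entities arrive as (surface form, type) pairs from a separate NER step are all illustrative choices, not the authors' pipeline.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Simple western-style emoticons such as :) ;-( :D, delimited by whitespace.
EMOTICON_RE = re.compile(r"(?<!\S)[:;][\-o']?[)(\]\[DPp/\\|](?!\S)")


def strip_noninformative(text):
    """Remove URLs and emoticons, then collapse repeated whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def pseudonymize(text, entities):
    """Replace each entity with a stable placeholder like LOC_1.

    `entities` is a list of (surface form, type) pairs, assumed to come
    from a named entity recognizer that is not shown here.
    """
    mapping, counters = {}, {}
    for surface, etype in entities:
        if surface not in mapping:
            counters[etype] = counters.get(etype, 0) + 1
            mapping[surface] = f"{etype}_{counters[etype]}"
    # Replace longer surface forms first so substrings do not clash.
    for surface in sorted(mapping, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(surface) + r"\b")
        text = pattern.sub(lambda _match: mapping[surface], text)
    return text
```

Using the same placeholder for every mention of an entity keeps coreference intact while hiding the specific name the classifier might otherwise latch onto.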

[NLP-24] Lightweight Operations for Visual Speech Recognition

【Quick Read】: This paper addresses the difficulty of deploying Visual Speech Recognition (VSR) on resource-constrained devices. Because video data is high-dimensional, conventional VSR systems need powerful hardware and incur high computational costs. The key contribution is the development of lightweight VSR architectures: by leveraging efficient operation design paradigms, the paper builds compact yet powerful models with reduced resource requirements and minimal accuracy loss.

Link: https://arxiv.org/abs/2502.04834
Authors: Iason Ioannis Panagos,Giorgos Sfikas,Christophoros Nikou
Affiliations: Department of Computer Science & Engineering, University of Ioannina; Department of Surveying & Geoinformatics Engineering, University of West Attica
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages (double column format), 7 figures

Abstract:Visual speech recognition (VSR), which decodes spoken words from video data, offers significant benefits, particularly when audio is unavailable. However, the high dimensionality of video data leads to prohibitive computational costs that demand powerful hardware, limiting VSR deployment on resource-constrained devices. This work addresses this limitation by developing lightweight VSR architectures. Leveraging efficient operation design paradigms, we create compact yet powerful models with reduced resource requirements and minimal accuracy loss. We train and evaluate our models on a large-scale public dataset for recognition of words from video sequences, demonstrating their effectiveness for practical applications. We also conduct an extensive array of ablative experiments to thoroughly analyze the size and complexity of each model. Code and trained models will be made publicly available.

[NLP-25] Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks ACL

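【Quick Read】: This paper addresses the lack of annotated free-text explanations in many existing datasets, which makes it hard to train models for explainable predictions. The key question is how to use existing explanation datasets for self-rationalization and how such models perform on out-of-distribution (OOD) data. By fine-tuning T5-Large and OLMo-7B and assessing the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods, the study finds that a few annotated examples effectively adapt models for OOD explanation generation; that the fine-tuning data source has a larger impact on OOD performance than the sample selection strategy; and that models with higher label prediction accuracy tend to produce better explanations.

Link: https://arxiv.org/abs/2502.04797
Authors: Jing Yang,Max Glockner,Anderson Rocha,Iryna Gurevych
Affiliations: Artificial Intelligence Lab., Recod.ai, Institute of Computing, University of Campinas, Brazil; UKP Lab, Department of Computer Science, Technical University of Darmstadt, Germany
Categories: Computation and Language (cs.CL)
Comments: Accepted at TACL; pre-MIT Press publication version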

Abstract:Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data, making it challenging to train models for explainable predictions. To address this, we investigate how to use existing explanation datasets for self-rationalization and evaluate models’ out-of-distribution (OOD) performance. We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods. The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization. For the generated explanation evaluation, we conduct a human study on 13 selected models and study its correlation with the Acceptability score (T5-11B) and three other LLM-based reference-free metrics. Human evaluation shows that the Acceptability score correlates most strongly with human judgments, demonstrating its effectiveness in evaluating free-text explanations. Our findings reveal: 1) few annotated examples effectively adapt models for OOD explanation generation; 2) compared to sample selection strategies, fine-tuning data source has a larger impact on OOD performance; and 3) models with higher label prediction accuracy tend to produce better explanations, as reflected by higher Acceptability scores.

[NLP-26] Developmentally-plausible Working Memory Shapes a Critical Period for Language Acquisition

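【Quick Read】: This paper addresses the marked difference between large language models and humans in the efficiency of language acquisition. The key idea is to constrain working memory in the early stages of training and relax the constraint exponentially as learning progresses, mimicking the developmental characteristics of human language acquisition during the critical period. In targeted syntactic evaluation, the method outperforms conventional models without memory constraints or with static memory constraints.

Link: https://arxiv.org/abs/2502.04795
Authors: Masato Mita,Ryo Yoshida,Yohei Oseki
Affiliations: The University of Tokyo; CyberAgent
Categories: Computation and Language (cs.CL)
Comments: 13 pages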

Abstract:Large language models exhibit general linguistic abilities but significantly differ from humans in their efficiency of language acquisition. This study proposes a method for integrating the developmental characteristics of working memory during the critical period, a stage when human language acquisition is particularly efficient, into language models. The proposed method introduces a mechanism that initially constrains working memory during the early stages of training and gradually relaxes this constraint in an exponential manner as learning progresses. Targeted syntactic evaluation shows that the proposed method outperforms conventional models without memory constraints or with static memory constraints. These findings not only provide new directions for designing data-efficient language models but also offer indirect evidence supporting the underlying mechanisms of the critical period hypothesis in human language acquisition.
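
A constraint that starts tight and relaxes exponentially can be sketched as a schedule over training steps. The abstract does not say how the constraint is imposed, so modeling it as a sliding attention window, along with the `start` and `end` sizes and the function names, is an assumption made only for illustration.

```python
def memory_window(step, total_steps, start=4, end=512):
    """Window size that grows exponentially from `start` to `end` tokens."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return round(start * (end / start) ** fraction)


def attention_mask(seq_len, window):
    """Causal mask where each position sees at most `window` tokens,
    counting itself (True means the position may be attended to)."""
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

A static-constraint baseline corresponds to calling `attention_mask` with a fixed window, and an unconstrained baseline to a window of at least `seq_len`.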

[NLP-27] S2-MAD: Breaking the Token Barrier to Enhance Multi-Agent Debate Efficiency

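【Quick Read】: This paper addresses the token cost of Multi-agent Debate (MAD), an approach used to improve large language models (LLMs) on complex arithmetic and logical reasoning tasks. The key solution is a novel sparsification strategy that reduces token costs within MAD. By minimizing ineffective information exchanges and unproductive discussions among agents, the method improves the overall efficiency of the debate process, cutting token costs by up to 94.5% while keeping performance degradation below 2.0%.

Link: https://arxiv.org/abs/2502.04790
Authors: Yuting Zeng,Weizhe Huang,Lei Jiang,Tongxuan Liu,Xitai Jin,Chen Tianying Tiana,Jing Li,Xiaohua Xu
Affiliations: University of Science and Technology of China; JD.com; Harbin Institute of Technology; National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 5 figures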

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across various natural language processing (NLP) scenarios, but they still face challenges when handling complex arithmetic and logical reasoning tasks. While Chain-Of-Thought (CoT) reasoning, self-consistency (SC) and self-correction strategies have attempted to guide models in sequential, multi-step reasoning, Multi-agent Debate (MAD) has emerged as a viable approach for enhancing the reasoning capabilities of LLMs. By increasing both the number of agents and the frequency of debates, the performance of LLMs improves significantly. However, this strategy results in a significant increase in token costs, presenting a barrier to scalability. To address this challenge, we introduce a novel sparsification strategy designed to reduce token costs within MAD. This approach minimizes ineffective exchanges of information and unproductive discussions among agents, thereby enhancing the overall efficiency of the debate process. We conduct comparative experiments on multiple datasets across various models, demonstrating that our approach significantly reduces the token costs in MAD. Specifically, compared to MAD, our approach achieves an impressive reduction of up to 94.5% in token costs while maintaining performance degradation below 2.0%.
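
As a rough illustration of what sparsifying a debate could mean, the sketch below drops peer messages that only repeat an agent's current answer and computes the resulting relative token saving. The filtering rule, the message format, and the function names are hypothetical; the abstract does not describe the actual selection mechanism.

```python
def select_messages(own_answer, peer_messages):
    """Keep only peer messages whose answer differs from the agent's own
    (a hypothetical rule for skipping redundant exchanges)."""
    return [m for m in peer_messages if m["answer"] != own_answer]


def token_reduction(dense_tokens, sparse_tokens):
    """Relative token saving of the sparse debate, as a percentage."""
    return 100.0 * (dense_tokens - sparse_tokens) / dense_tokens
```

For example, a debate that would have consumed 2000 tokens but uses 110 after filtering corresponds to a 94.5% reduction under this definition.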

[NLP-28] Probing Internal Representations of Multi-Word Verbs in Large Language Models

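【Quick Read】: This paper investigates how transformer-based large language models (LLMs) internally represent verb-particle combinations (multi-word verbs), in particular how they capture lexical and syntactic properties at different neural network layers. It analyzes the layer-wise representations of the BERT architecture for phrasal verbs (e.g., ‘give up’) and prepositional verbs (e.g., ‘look at’). The key method is to train probing classifiers to classify these two constructions at both the word and sentence levels, and to run a data separability test using the Generalized Discrimination Value (GDV). The model's middle layers achieve the highest classification accuracy; although GDV results show weak linear separability between the two verb types, the probing classifiers still reach high accuracy, suggesting that these linguistic categories may be separated non-linearly.

Link: https://arxiv.org/abs/2502.04789
Authors: Hassane Kissane,Achim Schilling,Patrick Krauss
Affiliations: Chair of English Philology and Linguistics, University Erlangen-Nuremberg; Cognitive Computational Neuroscience Group, University Erlangen-Nuremberg; Neuroscience Lab, University Hospital Erlangen
Categories: Computation and Language (cs.CL)
Comments: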

Abstract:This study investigates the internal representations of verb-particle combinations, called multi-word verbs, within transformer-based large language models (LLMs), specifically examining how these models capture lexical and syntactic properties at different neural network layers. Using the BERT architecture, we analyze the representations of its layers for two different verb-particle constructions: phrasal verbs like ‘give up’ and prepositional verbs like ‘look at’. Our methodology includes training probing classifiers on the internal representations to classify these categories at both word and sentence levels. The results indicate that the model’s middle layers achieve the highest classification accuracies. To further analyze the nature of these distinctions, we conduct a data separability test using the Generalized Discrimination Value (GDV). While GDV results show weak linear separability between the two verb types, probing classifiers still achieve high accuracy, suggesting that representations of these linguistic categories may be non-linearly separable. This aligns with previous research indicating that linguistic distinctions in neural networks are not always encoded in a linearly separable manner. These findings computationally support usage-based claims on the representation of verb-particle constructions and highlight the complex interaction between neural network architectures and linguistic structures.
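
A probing classifier is a small supervised model trained on frozen representations to test what they encode. The sketch below uses a plain logistic-regression probe in pure Python; in a real setup the feature vectors would be hidden states taken from one BERT layer, and the probe would be retrained per layer to compare accuracies. The probe type, hyperparameters, and names are assumptions, since the abstract does not specify the classifier.

```python
import math


def train_logistic_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for k in range(dim):
                grad_w[k] += (p - y) * x[k]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Share of examples the probe classifies correctly."""
    correct = 0
    for x, y in zip(features, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += (1 if score > 0 else 0) == y
    return correct / len(labels)
```

Here label 0 could stand for phrasal verbs and 1 for prepositional verbs; accuracy should be measured on held-out examples rather than the training set.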

[NLP-29] SeDi-Instruct: Enhancing Alignment of Language Models through Self-Directed Instruction Generation

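【Quick Read】: This paper addresses the difficulty of obtaining high-quality instruction data for instruction tuning. The key solution is a new data generation framework, Self-Direct Instruction generation (SeDi-Instruct). Through diversity-based filtering and iterative feedback task generation, SeDi-Instruct maintains model accuracy without excessively discarding low-quality instructions, which lowers cost. It also integrates instruction generation with training tasks, using information obtained during training to create high-quality instruction sets, improving model accuracy while reducing data generation costs.

Link: https://arxiv.org/abs/2502.04774
Authors: Jungwoo Kim,Minsang Kim,Sungjin Lee
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 12 figures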

Abstract:The rapid evolution of Large Language Models (LLMs) has enabled the industry to develop various AI-based services. Instruction tuning is considered essential in adapting foundation models for target domains to provide high-quality services to customers. A key challenge in instruction tuning is obtaining high-quality instruction data. Self-Instruct, which automatically generates instruction data using ChatGPT APIs, alleviates the data scarcity problem. To improve the quality of instruction data, Self-Instruct discards many of the instructions generated from ChatGPT, even though it is inefficient in terms of cost owing to many useless API calls. To generate high-quality instruction data at a low cost, we propose a novel data generation framework, Self-Direct Instruction generation (SeDi-Instruct), which employs diversity-based filtering and iterative feedback task generation. Diversity-based filtering maintains model accuracy without excessively discarding low-quality generated instructions by enhancing the diversity of instructions in a batch. This reduces the cost of synthesizing instruction data. The iterative feedback task generation integrates instruction generation and training tasks and utilizes information obtained during the training to create high-quality instruction sets. Our results show that SeDi-Instruct enhances the accuracy of AI models by 5.2%, compared with traditional methods, while reducing data generation costs by 36%.
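
To give a feel for selecting instructions by batch diversity, here is a generic greedy filter that only admits an instruction if it is not too similar to those already in the batch. This is not the SeDi-Instruct algorithm, whose details are not in the abstract; the token-overlap similarity, the `max_overlap` threshold, and the function names are illustrative assumptions.

```python
def token_overlap(a, b):
    """Jaccard overlap between the word sets of two instructions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def diverse_batch(candidates, batch_size, max_overlap=0.5):
    """Greedily build a batch whose members overlap at most `max_overlap`."""
    batch = []
    for text in candidates:
        if all(token_overlap(text, kept) <= max_overlap for kept in batch):
            batch.append(text)
        if len(batch) == batch_size:
            break
    return batch
```

A filter of this kind judges an instruction relative to its batch rather than in isolation, which is the general property the abstract attributes to diversity-based filtering.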

[NLP-30] ELITE: Enhanced Language-Image Toxicity Evaluation for Safety

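【Quick Read】: This paper addresses the vulnerability of Vision Language Models (VLMs) to malicious prompts that induce harmful outputs. Existing VLM safety benchmarks rely mainly on automated evaluation methods, which struggle to detect implicit harmful content or produce inaccurate evaluations. As a result, the paper argues, existing benchmarks have low levels of harmfulness, ambiguous data, and limited diversity in image-text pair combinations. To address this, the paper proposes the ELITE benchmark, underpinned by an enhanced evaluation method, the ELITE evaluator, which explicitly incorporates a toxicity score to assess harmfulness in multimodal contexts. The key step is to use the ELITE evaluator to filter out ambiguous and low-quality image-text pairs and to generate diverse combinations of safe and unsafe image-text pairs. Experiments show that the ELITE evaluator aligns better with human evaluations than prior automated methods, improving benchmark quality and diversity.

Link: https://arxiv.org/abs/2502.04757
Authors: Wonjun Lee,Doehyeon Lee,Eugene Choi,Sangyoon Yu,Ashkan Yousefpour,Haon Park,Bumsub Ham,Suhyun Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: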

Abstract:Current Vision Language Models (VLMs) remain vulnerable to malicious prompts that induce harmful outputs. Existing safety benchmarks for VLMs primarily rely on automated evaluation methods, but these methods struggle to detect implicit harmful content or produce inaccurate evaluations. Therefore, we found that existing benchmarks have low levels of harmfulness, ambiguous data, and limited diversity in image-text pair combinations. To address these issues, we propose the ELITE benchmark, a high-quality safety evaluation benchmark for VLMs, underpinned by our enhanced evaluation method, the ELITE evaluator. The ELITE evaluator explicitly incorporates a toxicity score to accurately assess harmfulness in multimodal contexts, where VLMs often provide specific, convincing, but unharmful descriptions of images. We filter out ambiguous and low-quality image-text pairs from existing benchmarks using the ELITE evaluator and generate diverse combinations of safe and unsafe image-text pairs. Our experiments demonstrate that the ELITE evaluator achieves superior alignment with human evaluations compared to prior automated methods, and the ELITE benchmark offers enhanced benchmark quality and diversity. By introducing ELITE, we pave the way for safer, more robust VLMs, contributing essential tools for evaluating and mitigating safety risks in real-world applications.

[NLP-31] Concept Navigation and Classification via Open Source Large Language Model Processing

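【Quick Read】: This paper addresses the detection and classification of latent constructs, including frames, narratives, and topics, from textual data. The key solution is a hybrid approach that combines automated summarization with human-in-the-loop validation, using iterative sampling and expert refinement to support methodological robustness and conceptual precision.

Link: https://arxiv.org/abs/2502.04756
Authors: Maël Kubli
Affiliations: University of Zurich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 35 pages, 1 figure, 7 tables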

Abstract:This paper presents a novel methodological framework for detecting and classifying latent constructs, including frames, narratives, and topics, from textual data using Open-Source Large Language Models (LLMs). The proposed hybrid approach combines automated summarization with human-in-the-loop validation to enhance the accuracy and interpretability of construct identification. By employing iterative sampling coupled with expert refinement, the framework guarantees methodological robustness and ensures conceptual precision. Applied to diverse data sets, including AI policy debates, newspaper articles on encryption, and the 20 Newsgroups data set, this approach demonstrates its versatility in systematically analyzing complex political discourses, media framing, and topic classification tasks.

[NLP-32] Holistically Guided Monte Carlo Tree Search for Intricate Information Seeking

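【Quick Read】: This paper addresses the challenge of intricate information seeking over vast, heterogeneous information, where traditional search methods struggle to balance localized precision with global understanding. The key solution is an LLM-based search assistant built on a new paradigm, Holistically Guided Monte Carlo Tree Search (HG-MCTS). The method reformulates the task as a progressive information collection process with a knowledge memory, and combines an adaptive checklist with multi-perspective reward modeling inside MCTS. The adaptive checklist provides explicit sub-goals that guide the search toward comprehensive coverage of complex queries; the multi-perspective reward modeling supplies exploration and retrieval rewards, plus progress feedback that refines the checklist as the search proceeds. This balances localized tree expansion with global guidance, reduces redundancy in search paths, and ensures that all crucial aspects of a query are addressed.

Link: https://arxiv.org/abs/2502.04751
Authors: Ruiyang Ren,Yuhao Wang,Junyi Li,Jinhao Jiang,Wayne Xin Zhao,Wenjie Wang,Tat-Seng Chua
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; National University of Singapore; University of Science and Technology of China
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: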

Abstract:In the era of vast digital information, the sheer volume and heterogeneity of available information present significant challenges for intricate information seeking. Users frequently face multistep web search tasks that involve navigating vast and varied data sources. This complexity demands that every step remain comprehensive, accurate, and relevant. However, traditional search methods often struggle to balance the need for localized precision with the broader context required for holistic understanding, leaving critical facets of intricate queries underexplored. In this paper, we introduce an LLM-based search assistant that adopts a new information seeking paradigm with holistically guided Monte Carlo tree search (HG-MCTS). We reformulate the task as a progressive information collection process with a knowledge memory and unite an adaptive checklist with multi-perspective reward modeling in MCTS. The adaptive checklist provides explicit sub-goals to guide the MCTS process toward comprehensive coverage of complex user queries. Simultaneously, our multi-perspective reward modeling offers both exploration and retrieval rewards, along with progress feedback that tracks completed and remaining sub-goals, refining the checklist as the tree search progresses. By striking a balance between localized tree expansion and global guidance, HG-MCTS reduces redundancy in search paths and ensures that all crucial aspects of an intricate query are properly addressed. Extensive experiments on real-world intricate information seeking tasks demonstrate that HG-MCTS acquires thorough knowledge collections and delivers more accurate final responses compared with existing baselines.
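
For readers unfamiliar with MCTS, the sketch below shows the standard UCT selection rule together with a simple measure of checklist coverage. It illustrates the two ingredients the abstract names (tree search and sub-goal tracking) but does not reproduce how HG-MCTS combines them; the node format, the exploration constant `c`, and the function names are assumptions.

```python
import math


def uct_score(total_reward, visits, parent_visits, c=1.4):
    """Standard UCT: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def checklist_progress(checklist, covered):
    """Fraction of checklist sub-goals already covered."""
    return sum(1 for item in checklist if item in covered) / len(checklist)


def select_child(children, parent_visits):
    """Pick the child node with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["reward"], ch["visits"], parent_visits),
    )
```

In a guided variant, a progress signal like `checklist_progress` could feed into each node's reward so that branches covering new sub-goals are preferred.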

[NLP-33] The “negative end” of change in grammar: terminology, concepts and causes

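【Quick Read】: This paper discusses the terminology, concepts, and causes associated with the “negative end” of change in grammar. Its key contribution is an analysis of the possible causes for constructions to show a gradual or rapid, but very consistent, decrease in frequency of use over time, until they disappear or survive only as residual or fossilised forms.

Link: https://arxiv.org/abs/2502.04729
Authors: Karolina Rudnicka
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 10 pages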

Abstract:The topic of the “negative end” of change is, in contrast to the fields of innovation and emergence, largely under-researched. Yet, it has lately started to gain increasing attention from language scholars worldwide. The main focus of this article is threefold, namely to discuss the i) terminology; ii) concepts and iii) causes associated with the “negative end” of change in grammar. The article starts with an overview of research conducted on the topic. It then moves to situating phenomena referred to as loss, decline or obsolescence among processes of language change, before elaborating on the terminology and concepts behind it. The last part looks at possible causes for constructions to display a (gradual or rapid, but very consistent) decrease in the frequency of use over time, which continues until the construction disappears or there are only residual or fossilised forms left. Keywords: loss, obsolescence, decline, competition, higher

[NLP-34] Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics?

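【Quick Read】: This paper addresses the multidimensional challenge of evaluating Text Style Transfer (TST) outputs, which requires assessing style transfer accuracy, content preservation, and naturalness. The key approach is to examine existing and novel automatic metrics on two subtasks, sentiment transfer and detoxification, in a multilingual setting covering English, Hindi, and Bengali. Through meta-evaluation by correlation with human judgments, the paper shows how effective these metrics are individually and in ensembles, and it explores the potential of Large Language Models (LLMs) as TST evaluation tools. The findings indicate that certain advanced NLP metrics and experimental hybrid techniques give more accurate, consistent, and reproducible results than existing TST metrics.

Link: https://arxiv.org/abs/2502.04718
Authors: Sourabrata Mukherjee,Atul Kr. Ojha,John P. McCrae,Ondrej Dusek
Affiliations: Charles University, Faculty of Mathematics and Physics, Prague, Czechia; Insight Research Ireland Centre for Data Analytics, DSI, University of Galway, Ireland
Categories: Computation and Language (cs.CL)
Comments: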

Abstract:Text Style Transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Evaluating TST outputs is a multidimensional challenge, requiring the assessment of style transfer accuracy, content preservation, and naturalness. Using human evaluation is ideal but costly, as in other natural language processing (NLP) tasks. However, automatic metrics for TST have not received as much attention as metrics for, e.g., machine translation or summarization. In this paper, we examine a set of both existing and novel metrics from broader NLP tasks for TST evaluation, focusing on two popular subtasks (sentiment transfer and detoxification) in a multilingual context comprising English, Hindi, and Bengali. By conducting meta-evaluation through correlation with human judgments, we demonstrate the effectiveness of these metrics when used individually and in ensembles. Additionally, we investigate the potential of Large Language Models (LLMs) as tools for TST evaluation. Our findings highlight that certain advanced NLP metrics and experimental hybrid techniques provide better insights than existing TST metrics for delivering more accurate, consistent, and reproducible TST evaluations.
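
Meta-evaluation by correlation means scoring the same outputs with an automatic metric and with human annotators, then checking how closely the two sets of scores agree. A minimal pure-Python sketch of Pearson and Spearman correlation follows; which coefficient the paper uses is not stated in the abstract, so treating both as candidates is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(metric_scores), ranks(human_scores))
```

Spearman only cares about ordering, which makes it a common choice when metric scores and human ratings live on different scales.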

[NLP-35] Enhancing Impression Change Prediction in Speed Dating Simulations Based on Speakers Personalities

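【Quick Read】: This paper addresses how to select utterances that improve the impression between speakers in simulated text dialogues. The key contribution is a method that predicts whether a given utterance improves the dialogue partner's impression of the speaker, taking the personalities of both parties into account. Incorporating personality improves the accuracy of utterance selection and yields simulated dialogues that are received more favorably.

Link: https://arxiv.org/abs/2502.04706
Authors: Kazuya Matsuo,Yoko Ishii,Atsushi Otsuka,Ryo Ishii,Hiroaki Sugiyama,Masahiro Mizukami,Tsunehiro Arimoto,Narichika Nomoto,Yoshihide Sato,Tetsuya Yamaguchi
Affiliations: NTT Corporation
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: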

点击查看摘要

Abstract:This paper focuses on simulating text dialogues in which impressions between speakers improve during speed dating. This simulation involves selecting an utterance from multiple candidates generated by a text generation model that replicates a specific speaker’s utterances, aiming to improve the impression of the speaker. Accurately selecting an utterance that improves the impression is crucial for the simulation. We believe that whether an utterance improves a dialogue partner’s impression of the speaker may depend on the personalities of both parties. However, recent methods for utterance selection do not consider the impression per utterance or the personalities. To address this, we propose a method that predicts whether an utterance improves a partner’s impression of the speaker, considering the personalities. The evaluation results showed that personalities are useful in predicting impression changes per utterance. Furthermore, we conducted a human evaluation of simulated dialogues using our method. The results showed that it could simulate dialogues more favorably received than those selected without considering personalities.

[NLP-36] ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning

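[Quick Read]: This paper addresses the limited reasoning ability of large language models (LLMs) on multiple-choice question-answering (QA) tasks. The key solution is ARR, a prompting method that explicitly incorporates three steps: analyzing the intent of the question, retrieving relevant information, and reasoning step by step. Experiments show that ARR consistently improves over the baseline and outperforms zero-shot Chain-of-Thought (CoT) prompting, and that it remains effective and robust across model sizes, LLM series, and generation settings.

Link: https://arxiv.org/abs/2502.04689
Authors: Yuwei Yin, Giuseppe Carenini
Affiliations: University of British Columbia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages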

Abstract:Large language models (LLMs) achieve remarkable performance on challenging benchmarks that are often structured as multiple-choice question-answering (QA) tasks. Zero-shot Chain-of-Thought (CoT) prompting enhances reasoning in LLMs but provides only vague and generic guidance (“think step by step”). This paper introduces ARR, an intuitive and effective zero-shot prompting method that explicitly incorporates three key steps in QA solving: analyzing the intent of the question, retrieving relevant information, and reasoning step by step. Comprehensive experiments across diverse and challenging QA tasks demonstrate that ARR consistently improves the Baseline (without ARR prompting) and outperforms CoT. Ablation and case studies further validate the positive contributions of each component: analyzing, retrieving, and reasoning. Notably, intent analysis plays a vital role in ARR. Additionally, extensive evaluations across various model sizes, LLM series, and generation settings solidify the effectiveness, robustness, and generalizability of ARR.
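
The difference between the baseline, zero-shot CoT, and ARR-style prompting comes down to the trigger sentence appended to the question. The sketch below illustrates this under assumptions: the CoT phrase is the one quoted in the abstract, while `ARR_TRIGGER` is only a paraphrase of the three steps named there (the paper's exact wording may differ), and `build_prompt` and the prompt layout are hypothetical.

```python
# Trigger quoted in the abstract for zero-shot CoT.
COT_TRIGGER = "Let's think step by step."

# Assumption: a paraphrase of the three ARR steps (analyze, retrieve, reason);
# not necessarily the exact sentence used in the paper.
ARR_TRIGGER = (
    "Let's analyze the intent of the question, find relevant information, "
    "and answer the question with step-by-step reasoning."
)


def build_prompt(question, options, trigger=""):
    """Format a multiple-choice question and append an optional trigger sentence."""
    lines = [f"Question: {question}"]
    for label, text in zip("ABCDEFGH", options):
        lines.append(f"({label}) {text}")
    # With an empty trigger this is the plain baseline prompt.
    lines.append(f"Answer: {trigger}".rstrip())
    return "\n".join(lines)


question = "Which planet is known as the Red Planet?"
options = ["Venus", "Mars", "Jupiter"]

baseline_prompt = build_prompt(question, options)
cot_prompt = build_prompt(question, options, COT_TRIGGER)
arr_prompt = build_prompt(question, options, ARR_TRIGGER)
```

Each prompt would then be sent to the same model, so any accuracy difference can be attributed to the trigger sentence alone.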

[NLP-37] M-IFEval: Multilingual Instruction-Following Evaluation

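[Quick Read]: This paper addresses the limitation that existing evaluation benchmarks contain only English instructions, which restricts the ability to assess large language models (LLMs) in other languages. To address this, the paper proposes the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, which extends evaluation to French, Japanese, and Spanish and includes both general and language-specific instructions. Applying the benchmark shows that LLM performance varies widely across languages and instruction types, underscoring the importance of multilingual benchmarks in diverse cultural contexts.

Link: https://arxiv.org/abs/2502.04688
Authors: Antoine Dussolle, Andrea Cardeña Díaz, Shota Sato, Peter Devine
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: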

Abstract:Instruction following is a core capability of modern Large language models (LLMs), making evaluating this capability essential to understanding these models. The Instruction Following Evaluation (IFEval) benchmark from the literature does this using objective criteria, offering a measure of LLM performance without subjective AI or human judgement. However, it only includes English instructions, limiting its ability to assess LLMs in other languages. We propose the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, expanding the evaluation to French, Japanese, and Spanish, with both general and language-specific instructions. Applying this benchmark to 8 state-of-the-art LLMs, we find that benchmark performance across languages and instruction types can vary widely, underscoring the importance of a multilingual benchmark for evaluating LLMs in a diverse cultural context.

[NLP-38] AdParaphrase: Paraphrase Dataset for Analyzing Linguistic Features toward Generating Attractive Ad Texts NAACL2025

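[Quick Read]: This paper addresses the question of which linguistic choices make ad texts attractive to potential customers, and in particular which linguistic features influence human preferences. Understanding these features is difficult because human preferences are complex and because publicly available ad text datasets that include human preferences are lacking. The key is the AdParaphrase dataset, which contains pairs of ad texts that are semantically equivalent but differ in wording and style, together with human preferences for them, enabling preference analysis focused on differences in linguistic features. The analysis finds that preferred ad texts have higher fluency, longer length, more nouns, and use of bracket symbols. An ad text generation model that considers these findings significantly improves the attractiveness of a given text.

Link: https://arxiv.org/abs/2502.04674
Authors: Soichiro Murakami, Peinan Zhang, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura
Affiliations: CyberAgent, Inc.; Nara Institute of Science and Technology; Institute of Science Tokyo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to NAACL2025 Findings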

Abstract:Effective linguistic choices that attract potential customers play crucial roles in advertising success. This study aims to explore the linguistic features of ad texts that influence human preferences. Although the creation of attractive ad texts is an active area of research, progress in understanding the specific linguistic features that affect attractiveness is hindered by several obstacles. First, human preferences are complex and influenced by multiple factors, including their content, such as brand names, and their linguistic styles, making analysis challenging. Second, publicly available ad text datasets that include human preferences are lacking, such as ad performance metrics and human feedback, which reflect people’s interests. To address these problems, we present AdParaphrase, a paraphrase dataset that contains human preferences for pairs of ad texts that are semantically equivalent but differ in terms of wording and style. This dataset allows for preference analysis that focuses on the differences in linguistic features. Our analysis revealed that ad texts preferred by human judges have higher fluency, longer length, more nouns, and use of bracket symbols. Furthermore, we demonstrate that an ad text-generation model that considers these findings significantly improves the attractiveness of a given text. The dataset is publicly available at: this https URL.

[NLP-39] Before It's Too Late: A State Space Model for the Early Prediction of Misinformation and Disinformation Engagement WWW2025

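[Quick Read]: This paper addresses the erosion of social and democratic cohesion by conspiracies and information campaigns that emerge rapidly on social platforms. The key solution is IC-Mamba, a novel state space model that forecasts social media engagement by modeling interval-censored data with integrated temporal embeddings. IC-Mamba accurately predicts engagement patterns within the first 15 to 30 minutes after posting (RMSE between 0.118 and 0.143) and achieves a 4.72% improvement over the state of the art across multiple engagement metrics (likes, shares, comments, and emojis). This enables earlier identification of potentially problematic content and provides crucial lead time for designing and implementing countermeasures.

Link: https://arxiv.org/abs/2502.04655
Authors: Lin Tian, Emily Booth, Francesco Bailo, Julian Droogan, Marian-Andrei Rizoiu
Affiliations: University of Technology Sydney; The University of Sydney; Macquarie University
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 5 figures, 10 tables, Accepted by the Web Conference 2025 (WWW2025)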

Abstract:In today’s digital age, conspiracies and information campaigns can emerge rapidly and erode social and democratic cohesion. While recent deep learning approaches have made progress in modeling engagement through language and propagation models, they struggle with irregularly sampled data and early trajectory assessment. We present IC-Mamba, a novel state space model that forecasts social media engagement by modeling interval-censored data with integrated temporal embeddings. Our model excels at predicting engagement patterns within the crucial first 15-30 minutes of posting (RMSE 0.118-0.143), enabling rapid assessment of content reach. By incorporating interval-censored modeling into the state space framework, IC-Mamba captures fine-grained temporal dynamics of engagement growth, achieving a 4.72% improvement over state-of-the-art across multiple engagement metrics (likes, shares, comments, and emojis). Our experiments demonstrate IC-Mamba’s effectiveness in forecasting both post-level dynamics and broader narrative patterns (F1 0.508-0.751 for narrative-level predictions). The model maintains strong predictive performance across extended time horizons, successfully forecasting opinion-level engagement up to 28 days ahead using observation windows of 3-10 days. These capabilities enable earlier identification of potentially problematic content, providing crucial lead time for designing and implementing countermeasures. Code is available at: this https URL. An interactive dashboard demonstrating our results is available at: this https URL.

[NLP-40] Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research

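[Quick Read]: This paper addresses the limited reasoning ability of large language models (LLMs) when solving complex problems. The key solution is a framework called Agentic Reasoning, which enhances LLM reasoning by integrating external tool-using agents: web search, code execution, and structured reasoning-context memory. In particular, the Mind Map agent constructs a structured knowledge graph to track logical relationships, improving deductive reasoning. This dynamic interaction improves both reasoning accuracy and decision-making.

Link: https://arxiv.org/abs/2502.04644
Authors: Junde Wu, Jiayuan Zhu, Yuyuan Liu
Affiliations: University of Oxford
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: work in progress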

Abstract:We introduce Agentic Reasoning, a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents. Unlike conventional LLM-based reasoning approaches, which rely solely on internal inference, Agentic Reasoning dynamically engages web search, code execution, and structured reasoning-context memory to solve complex problems requiring deep research and multi-step logical deduction. Our framework introduces the Mind Map agent, which constructs a structured knowledge graph to track logical relationships, improving deductive reasoning. Additionally, the integration of web-search and coding agents enables real-time retrieval and computational analysis, enhancing reasoning accuracy and decision-making. Evaluations on PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks demonstrate that our approach significantly outperforms existing models, including leading retrieval-augmented generation (RAG) systems and closed-source LLMs. Moreover, our results indicate that agentic reasoning improves expert-level knowledge synthesis, test-time scalability, and structured problem-solving. The code is at: this https URL.

[NLP-41] Confidence Elicitation: A New Attack Vector for Large Language Models ICLR2025

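[Quick Read]: This paper addresses the robustness of large language models (LLMs) under black-box attacks. The key is a new paradigm that guides attacks by eliciting confidence from the model. The study shows that the confidence elicited from current LLMs is calibrated and not hallucinated, so minimizing the elicited confidence increases the likelihood of misclassification. Experiments show that, compared with existing hard-label black-box attack methods, the proposed paradigm achieves state-of-the-art results on three datasets.

Link: https://arxiv.org/abs/2502.04643
Authors: Brian Formento, Chuan Sheng Foo, See-Kiong Ng
Affiliations: Institute of Data Science, National University of Singapore; Institute for Infocomm Research, A*STAR; Centre for Frontier AI Research, A*STAR
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Published in ICLR 2025. The code is publicly available at this https URL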

Abstract:A fundamental issue in deep learning has been adversarial robustness. As these systems have scaled, such issues have persisted. Currently, large language models (LLMs) with billions of parameters suffer from adversarial attacks just like their earlier, smaller counterparts. However, the threat models have changed. Previously, having gray-box access, where input embeddings or output logits/probabilities were visible to the user, might have been reasonable. However, with the introduction of closed-source models, no information about the model is available apart from the generated output. This means that current black-box attacks can only utilize the final prediction to detect if an attack is successful. In this work, we investigate and demonstrate the potential of attack guidance, akin to using output probabilities, while having only black-box access in a classification setting. This is achieved through the ability to elicit confidence from the model. We empirically show that the elicited confidence is calibrated and not hallucinated for current LLMs. By minimizing the elicited confidence, we can therefore increase the likelihood of misclassification. Our new proposed paradigm demonstrates promising state-of-the-art results on three datasets across two models (LLaMA-3-8B-Instruct and Mistral-7B-Instruct-V0.3) when comparing our technique to existing hard-label black-box attack methods that introduce word-level substitutions.

[NLP-42] Phonetic Reconstruction of the Consonant System of Middle Chinese via Mixed Integer Optimization ACL

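[Quick Read]: This paper addresses the phonetic reconstruction of the consonant system of Middle Chinese. The key is to cast the problem as a Mixed Integer Programming problem, which can automatically explore homophonic information from ancient rhyme dictionaries and phonetic information from modern Chinese dialects, enabling more effective phonetic reconstruction.

Link: https://arxiv.org/abs/2502.04625
Authors: Weiwei Sun, Xiaoxi Luo
Affiliations: Department of Computer Science and Technology, University of Cambridge; Yuanpei College, Peking University
Categories: Computation and Language (cs.CL)
Comments: accepted by TACL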

Abstract:This paper is concerned with phonetic reconstruction of the consonant system of Middle Chinese. We propose to cast the problem as a Mixed Integer Programming problem, which is able to automatically explore homophonic information from ancient rhyme dictionaries and phonetic information from modern Chinese dialects, the descendants of Middle Chinese. Numerical evaluation on a wide range of synthetic and real data demonstrates the effectiveness and robustness of the new method. We apply the method to information from Guangyun and 20 modern Chinese dialects to obtain a new phonetic reconstruction result. A linguistically-motivated discussion of this result is also provided.

[NLP-43] Extracting and Understanding the Superficial Knowledge in Alignment

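[Quick Read]: This paper investigates whether the alignment of large language models (LLMs) with human values and preferences is predominantly superficial. The key solution is a method to extract and isolate superficial knowledge from aligned models; comparing models augmented only with superficial knowledge to fully aligned models quantifies the superficial portion of alignment. The study finds that superficial knowledge makes up a significant portion of alignment in safety and detoxification tasks, while tasks requiring reasoning and contextual understanding still rely on deeper knowledge. The paper also demonstrates two practical advantages of isolated superficial knowledge: it can be transferred between models, and it is recoverable.

Link: https://arxiv.org/abs/2502.04602
Authors: Runjin Chen, Gabriel Jacob Perin, Xuxi Chen, Xilun Chen, Yan Han, Nina S. T. Hirata, Junyuan Hong, Bhavya Kailkhura
Affiliations: The University of Texas at Austin; University of São Paulo; LinkedIn; Lawrence Livermore National Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: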

Abstract:Alignment of large language models (LLMs) with human values and preferences, often achieved through fine-tuning based on human feedback, is essential for ensuring safe and responsible AI behaviors. However, the process typically requires substantial data and computation resources. Recent studies have revealed that alignment might be attainable at lower costs through simpler methods, such as in-context learning. This leads to the question: Is alignment predominantly superficial? In this paper, we delve into this question and provide a quantitative analysis. We formalize the concept of superficial knowledge, defining it as knowledge that can be easily acquired through token restyling, without affecting the model’s ability to capture underlying causal relationships between tokens. We propose a method to extract and isolate superficial knowledge from aligned models, focusing on the shallow modifications to the final token selection process. By comparing models augmented only with superficial knowledge to fully aligned models, we quantify the superficial portion of alignment. Our findings reveal that while superficial knowledge constitutes a significant portion of alignment, particularly in safety and detoxification tasks, it is not the whole story. Tasks requiring reasoning and contextual understanding still rely on deeper knowledge. Additionally, we demonstrate two practical advantages of isolated superficial knowledge: (1) it can be transferred between models, enabling efficient offsite alignment of larger models using extracted superficial knowledge from smaller models, and (2) it is recoverable, allowing for the restoration of alignment in compromised models without sacrificing performance.

[NLP-44] Position-aware Automatic Circuit Discovery

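[Quick Read]: This paper addresses the inability of existing circuit discovery methods to capture position-sensitive interactions or mechanisms, which stems from their assumption that circuits are position-invariant. The key solution has two parts: first, extending edge attribution patching, a gradient-based method, to differentiate between token positions; second, introducing the concept of a dataset schema, which defines token spans with similar semantics and enables position-aware circuit discovery in datasets with variable-length examples. The authors also develop an automated pipeline for generating and applying these schemas, enabling fully automated discovery of position-sensitive circuits. Compared with prior work, the approach yields a better trade-off between circuit size and faithfulness.

Link: https://arxiv.org/abs/2502.04577
Authors: Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov
Affiliations: Technion – Israel Institute of Technology; Northeastern University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: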

Abstract:A widely used strategy to discover and understand language model mechanisms is circuit analysis. A circuit is a minimal subgraph of a model’s computation graph that executes a specific task. We identify a gap in existing circuit discovery methods: they assume circuits are position-invariant, treating model components as equally relevant across input positions. This limits their ability to capture cross-positional interactions or mechanisms that vary across positions. To address this gap, we propose two improvements to incorporate positionality into circuits, even on tasks containing variable-length examples. First, we extend edge attribution patching, a gradient-based method for circuit discovery, to differentiate between token positions. Second, we introduce the concept of a dataset schema, which defines token spans with similar semantics across examples, enabling position-aware circuit discovery in datasets with variable length examples. We additionally develop an automated pipeline for schema generation and application using large language models. Our approach enables fully automated discovery of position-sensitive circuits, yielding better trade-offs between circuit size and faithfulness compared to prior work.

[NLP-45] Self-Regulation and Requesting Interventions

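[Quick Read]: This paper addresses how large language model (LLM) agents that lack metacognitive abilities can decide when to request help under a limited intervention budget. The key solution is an offline framework that combines LLM-based process reward models (PRMs) with tabular reinforcement learning to train a "helper" policy that determines the best time to intervene. The approach significantly reduces costly intervention calls during training, enhances robustness to off-policy data, and avoids the inefficiencies of deep reinforcement learning.

Link: https://arxiv.org/abs/2502.04576
Authors: So Yeon Min, Yue Wu, Jimin Sun, Max Kaufmann, Fahim Tajwar, Yonatan Bisk, Ruslan Salakhutdinov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: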

Abstract:Human intelligence involves metacognitive abilities like self-regulation, recognizing limitations, and seeking assistance only when needed. While LLM Agents excel in many domains, they often lack this awareness. Overconfident agents risk catastrophic failures, while those that seek help excessively hinder efficiency. A key challenge is enabling agents with a limited intervention budget $C$ to decide when to request assistance. In this paper, we propose an offline framework that trains a “helper” policy to request interventions, such as more powerful models or test-time compute, by combining LLM-based process reward models (PRMs) with tabular reinforcement learning. Using state transitions collected offline, we score optimal intervention timing with PRMs and train the helper model on these labeled trajectories. This offline approach significantly reduces costly intervention calls during training. Furthermore, the integration of PRMs with tabular RL enhances robustness to off-policy data while avoiding the inefficiencies of deep RL. We empirically find that our method delivers optimal helper behavior.
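
The budgeted decision of when to ask for help can be illustrated with a small value table over (step, remaining budget). The sketch below is only a toy under strong assumptions and is not the paper's method: it uses exact backward induction with known per-step success probabilities, which stand in for the PRM scores and learned estimates a tabular RL approach would use, and the names `plan_interventions`, `rollout`, `p_self`, and `p_help` are hypothetical.

```python
def plan_interventions(p_self, p_help, budget):
    """Fill a table value[t][b]: best success probability from step t with b requests left.

    p_self[t]: assumed probability that step t succeeds without help.
    p_help[t]: assumed probability that step t succeeds with an intervention.
    The task is assumed to succeed only if every step succeeds.
    """
    n = len(p_self)
    value = [[1.0] * (budget + 1) for _ in range(n + 1)]
    ask = [[False] * (budget + 1) for _ in range(n)]
    for t in range(n - 1, -1, -1):          # backward over steps
        for b in range(budget + 1):
            alone = p_self[t] * value[t + 1][b]
            # Requesting help consumes one unit of budget; impossible when b == 0.
            helped = p_help[t] * value[t + 1][b - 1] if b > 0 else -1.0
            ask[t][b] = helped > alone
            value[t][b] = max(alone, helped)
    return value, ask


def rollout(ask, budget):
    """Follow the table greedily and return the steps at which help is requested."""
    requested, remaining = [], budget
    for t in range(len(ask)):
        if ask[t][remaining]:
            requested.append(t)
            remaining -= 1
    return requested


# Toy, made-up probabilities for a four-step task with a budget of two requests.
p_self = [0.9, 0.5, 0.95, 0.6]
p_help = [0.95, 0.9, 0.97, 0.9]
value, ask = plan_interventions(p_self, p_help, budget=2)
steps_with_help = rollout(ask, budget=2)   # steps where help pays off most
```

In this toy setting the table spends the budget on the two steps where help raises the success probability the most, rather than on the first steps encountered.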

[NLP-46] My LLM might Mimic AAE – But When Should it? NAACL2025

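[Quick Read]: This paper examines how effectively large language models (LLMs) represent African American English (AAE) and how Black Americans perceive these technologies. Through a survey of 104 Black Americans and annotation of LLM-produced AAE text by 228 Black Americans, the study finds that Black Americans favor choice and autonomy in determining when AAE is appropriate in LLM output. They tend to prefer that LLMs default to Mainstream U.S. English in formal settings and show greater interest in AAE output in less formal settings. Notably, when LLMs were appropriately prompted and given in-context examples, participants judged the authenticity of the AAE output to be on par with transcripts of Black American speech.

Link: https://arxiv.org/abs/2502.04564
Authors: Sandra C. Sandoval, Christabel Acquaye, Kwesi Cobbina, Mohammad Nayeem Teli, Hal Daumé III
Affiliations: University of Maryland
Categories: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025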

Abstract:We examine the representation of African American English (AAE) in large language models (LLMs), exploring (a) the perceptions Black Americans have of how effective these technologies are at producing authentic AAE, and (b) in what contexts Black Americans find this desirable. Through both a survey of Black Americans (n = 104) and annotation of LLM-produced AAE by Black Americans (n = 228), we find that Black Americans favor choice and autonomy in determining when AAE is appropriate in LLM output. They tend to prefer that LLMs default to communicating in Mainstream U.S. English in formal settings, with greater interest in AAE production in less formal settings. When LLMs were appropriately prompted and provided in-context examples, our participants found their outputs to have a level of AAE authenticity on par with transcripts of Black American speech. Select code and data for our project can be found here: this https URL

[NLP-47] TruthFlow: Truthful LLM Generation via Representation Flow Correction

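[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in consistently generating truthful responses. The key solution is TruthFlow, a new method that uses the Flow Matching technique for query-specific truthful representation correction. Specifically, TruthFlow first uses a flow model to learn query-specific correction vectors that move representations from hallucinated to truthful states, and then generates these correction vectors during inference to enhance the truthfulness of LLM outputs.

Link: https://arxiv.org/abs/2502.04556
Authors: Hanyu Wang, Bochuan Cao, Yuanpu Cao, Jinghui Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: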

Abstract:Large language models (LLMs) are known to struggle with consistently generating truthful responses. While various representation intervention techniques have been proposed, these methods typically apply a universal representation correction vector to all input queries, limiting their effectiveness against diverse queries in practice. In this study, we introduce TruthFlow, a novel method that leverages the Flow Matching technique for query-specific truthful representation correction. Specifically, TruthFlow first uses a flow model to learn query-specific correction vectors that transition representations from hallucinated to truthful states. Then, during inference, the trained flow model generates these correction vectors to enhance the truthfulness of LLM outputs. Experimental results demonstrate that TruthFlow significantly improves performance on open-ended generation tasks across various advanced LLMs evaluated on TruthfulQA. Moreover, the trained TruthFlow model exhibits strong transferability, performing effectively on other unseen hallucination benchmarks.

[NLP-48] Contextual Gradient Flow Modeling for Large Language Model Generalization in Multi-Scale Feature Spaces

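[Quick Read]: This paper addresses the problem that existing optimization methods for training large-scale neural architectures rely on uniform gradient propagation mechanisms that fail to align with hierarchical linguistic structures, limiting generalization across diverse language distributions. The key solution is a structured gradient refinement framework that uses multi-scale contextual adjustments and dynamic weighting strategies to improve parameter adaptation, reduce gradient oscillations, and align representation learning with broader linguistic dependencies rather than isolated token-level relationships. According to the paper, the approach improves optimization efficiency and training stability and strengthens robustness in long-range dependency retention and cross-domain adaptation.

Link: https://arxiv.org/abs/2502.04548
Authors: Daphne Quillington, Kingsley Fairbrother, Xavier Tattershall, Irin Kabakum
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: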

Abstract:Optimization methodologies for training large-scale neural architectures often rely on uniform gradient propagation mechanisms that fail to align with hierarchical linguistic structures, limiting their capacity to generalize across diverse language distributions. A structured gradient refinement framework was introduced to incorporate multi-scale contextual adjustments, improving parameter adaptation through dynamic weighting strategies that enhanced representation coherence. Empirical evaluations demonstrated that structured propagation mechanisms contributed to reductions in gradient oscillations, resulting in more stable training dynamics and improved optimization efficiency. The comparative performance assessment indicated that models incorporating hierarchical propagation strategies exhibited greater robustness in long-range dependency retention and cross-domain adaptation. The hierarchical adjustment of weight updates provided an alternative to conventional backpropagation, reducing sensitivity to initialization conditions while improving overall convergence efficiency. The experimental results confirmed that structured gradient propagation influenced representation learning trajectories, aligning parameter updates with broader linguistic dependencies rather than isolated token-level relationships. Statistical evaluations indicated that structured optimization strategies mitigated overfitting while preserving adaptability across heterogeneous text distributions. The findings established that structured gradient propagation provided an empirically validated framework for refining hierarchical representation learning, supporting more effective integration of linguistic dependencies into optimization dynamics.

[NLP-49] Multilingual Non-Autoregressive Machine Translation without Knowledge Distillation AACL2023

[Quick Read]: This paper addresses how to improve the efficiency of non-autoregressive models for multilingual neural machine translation (MNMT) while avoiding the expensive knowledge distillation (KD) process. The key solution is M-DAT, an approach that builds on recent advances in the directed acyclic Transformer (DAT), which does not require KD, and further introduces pivot back-translation (PivotBT) to improve generalization to unseen translation directions.

Link: https://arxiv.org/abs/2502.04537
Authors: Chenyang Huang, Fei Huang, Zaixiang Zheng, Osmar R. Zaïane, Hao Zhou, Lili Mou
Affiliations: Dept. of Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta; Institute for AI Industry Research (AIR), Tsinghua University; ByteDance Research; Damo Academy, Alibaba
Categories: Computation and Language (cs.CL)
Comments: In Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023

Abstract:Multilingual neural machine translation (MNMT) aims at using one single model for multiple translation directions. Recent work applies non-autoregressive Transformers to improve the efficiency of MNMT, but requires expensive knowledge distillation (KD) processes. To this end, we propose an M-DAT approach to non-autoregressive multilingual machine translation. Our system leverages the recent advance of the directed acyclic Transformer (DAT), which does not require KD. We further propose a pivot back-translation (PivotBT) approach to improve the generalization to unseen translation directions. Experiments show that our M-DAT achieves state-of-the-art performance in non-autoregressive MNMT.

[NLP-50] A Decoding Algorithm for Length-Control Summarization Based on Directed Acyclic Transformers EMNLP2024

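[Quick Read]: This paper addresses the problem of satisfying length limits in length-control summarization, that is, how to strictly control summary length while maintaining quality. The key solution is a novel length-control decoding algorithm based on the Directed Acyclic Transformer (DAT), which allows multiple plausible sequence fragments and predicts a path to connect them, together with a Sequence Maximum a Posteriori (SeqMAP) decoding algorithm that marginalizes over possible paths to find the most probable summary within the length budget. The algorithm is based on beam search, which further allows a reranker to improve performance.

Link: https://arxiv.org/abs/2502.04535
Authors: Chenyang Huang, Hao Zhou, Cameron Jen, Kangjie Zheng, Osmar R. Zaïane, Lili Mou
Affiliations: Dept. of Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta; Institute for AI Industry Research (AIR), Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: Findings of the Association for Computational Linguistics: EMNLP 2024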

Abstract:Length-control summarization aims to condense long texts into a short one within a certain length limit. Previous approaches often use autoregressive (AR) models and treat the length requirement as a soft constraint, which may not always be satisfied. In this study, we propose a novel length-control decoding algorithm based on the Directed Acyclic Transformer (DAT). Our approach allows for multiple plausible sequence fragments and predicts a path to connect them. In addition, we propose a Sequence Maximum a Posteriori (SeqMAP) decoding algorithm that marginalizes different possible paths and finds the most probable summary satisfying the length budget. Our algorithm is based on beam search, which further facilitates a reranker for performance improvement. Experimental results on the Gigaword and DUC2004 datasets demonstrate our state-of-the-art performance for length-control summarization.

[NLP-51] Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection

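[Quick Read]: This paper addresses the difficulty of distinguishing human-written text from AI-generated text as large language models (LLMs) advance. Existing AI-text detectors typically classify with a fixed global threshold (e.g., $\theta = 0.5$), which cannot account for distributional variation across subgroups. The paper points out that a fixed threshold produces more false positive errors on shorter human-written text than on longer text and, among long texts, more positive classifications for neurotic writing styles than for open ones. These discrepancies can lead to unfair misclassification that disproportionately affects certain groups.

To address this, the paper proposes FairOPT, an algorithm for group-specific threshold optimization in AI-generated content classifiers. By partitioning data into subgroups based on attributes such as text length and writing style and learning a decision threshold for each subgroup, FairOPT balances performance and fairness metrics within each subgroup. In experiments with four AI text classifiers on three datasets, FairOPT improves the overall F1 score and reduces the discrepancy in balanced error rate (BER) across subgroups. The framework paves the way for more robust and fair classification criteria in AI-generated output detection.

Link: https://arxiv.org/abs/2502.04528
Authors: Minseok Jung, Cynthia Fuertes Panizo, Liam Dugan, May Fung, Pin-Yu Chen, Paul Pu Liang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: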

Abstract:The advancement of large language models (LLMs) has made it difficult to differentiate human-written text from AI-generated text. Several AI-text detectors have been developed in response, which typically utilize a fixed global threshold (e.g., $\theta = 0.5$) to classify machine-generated text. However, we find that one universal threshold can fail to account for subgroup-specific distributional variations. For example, when using a fixed threshold, detectors make more false positive errors on shorter human-written text than longer, and more positive classifications on neurotic writing styles than open among long text. These discrepancies can lead to misclassification that disproportionately affects certain groups. We address this critical limitation by introducing FairOPT, an algorithm for group-specific threshold optimization in AI-generated content classifiers. Our approach partitions data into subgroups based on attributes (e.g., text length and writing style) and learns decision thresholds for each group, which enables careful balancing of performance and fairness metrics within each subgroup. In experiments with four AI text classifiers on three datasets, FairOPT enhances overall F1 score and decreases balanced error rate (BER) discrepancy across subgroups. Our framework paves the way for more robust and fair classification criteria in AI-generated output detection.
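
The idea of replacing one global threshold with a threshold per subgroup can be sketched as follows. This is a simplified illustration under assumptions, not FairOPT itself: it picks each group's threshold by maximizing F1 on a small grid and ignores the fairness term the paper also balances, and the function names, the grid, and the toy records are hypothetical.

```python
def f1_at(scores, labels, threshold):
    """F1 score when predicting 'AI-generated' (1) for scores >= threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_group_thresholds(records, grid):
    """Learn one threshold per group from (group, detector_score, label) records."""
    by_group = {}
    for group, score, label in records:
        scores, labels = by_group.setdefault(group, ([], []))
        scores.append(score)
        labels.append(label)
    # For each group, keep the first grid value that maximizes F1.
    return {
        group: max(grid, key=lambda t: f1_at(scores, labels, t))
        for group, (scores, labels) in by_group.items()
    }


# Toy, made-up detector scores; label 1 = AI-generated, 0 = human-written.
records = [
    ("short", 0.60, 0), ("short", 0.55, 0), ("short", 0.80, 1), ("short", 0.90, 1),
    ("long", 0.20, 0), ("long", 0.30, 0), ("long", 0.45, 1), ("long", 0.70, 1),
]
grid = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
thresholds = fit_group_thresholds(records, grid)
```

In this toy data, human-written short texts receive scores above 0.5, so a global threshold of 0.5 flags them as AI-generated, while the per-group thresholds separate both groups cleanly.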

[NLP-52] Linear Correlation in LM's Compositional Generalization and Hallucination

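[Quick Read]: This paper investigates the phenomenon of linear correlation that arises in language models (LMs) during knowledge composition and the linear transformation underlying it. The key finding is that LMs contain a linear transformation that maps the next-token prediction logits for one prompt to those for a related prompt, mirroring the linearity of human knowledge composition. The results show that this linear transformation is resilient to large-scale fine-tuning and generalizes updated knowledge when it is aligned with real-world relationships, but causes hallucinations when it deviates from them. The paper also shows that such linear correlations can be learned with a single feedforward network and pre-trained vocabulary representations, indicating that LM generalization relies heavily on the latter.

Link: https://arxiv.org/abs/2502.04520
Authors: Letian Peng, Chenyang An, Shibo Hao, Chengyu Dong, Jingbo Shang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: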

Abstract:The generalization of language models (LMs) is undergoing active debates, contrasting their potential for general intelligence with their struggles with basic knowledge composition (e.g., reverse/transition curse). This paper uncovers the phenomenon of linear correlations in LMs during knowledge composition. For explanation, there exists a linear transformation between certain related knowledge that maps the next token prediction logits from one prompt to another, e.g., “X lives in the city of” $\rightarrow$ “X lives in the country of” for every given X. This mirrors the linearity in human knowledge composition, such as Paris $\rightarrow$ France. Our findings indicate that the linear transformation is resilient to large-scale fine-tuning, generalizing updated knowledge when aligned with real-world relationships, but causing hallucinations when it deviates. Empirical results suggest that linear correlation can serve as a potential identifier of LM’s generalization. Finally, we show such linear correlations can be learned with a single feedforward network and pre-trained vocabulary representations, indicating LM generalization heavily relies on the latter.
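
A heavily simplified view of the linear map between logits is a least-squares fit for a single pair of tokens. The sketch below is an illustration under assumptions, not the paper's procedure: it reduces the full logit-to-logit transformation to one weight and one bias for a single (source token, target token) pair, and the function `fit_line` and the synthetic logit values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ≈ w * x + b with scalar x and y."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Synthetic values (illustration only). For several subjects X:
#   source_logits[i]: logit of "Paris" after "X lives in the city of"
#   target_logits[i]: logit of "France" after "X lives in the country of"
source_logits = [1.0, 2.0, 3.0, 4.0]
target_logits = [2.5, 4.5, 6.5, 8.5]

w, b = fit_line(source_logits, target_logits)

# Predict the target logit for a new subject from its source logit.
predicted_target = w * 5.0 + b
```

A close fit across many subjects X would be the kind of linear correlation the abstract describes; in the paper's setting the map acts on full logit vectors rather than on a single scalar pair.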

[NLP-53] Towards Cost-Effective Reward Guided Text Generation

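[Quick read]: This paper addresses two problems of reward-guided text generation (RGTG): significant test-time overhead during inference and sub-optimal choices on partial sequences. Its key contribution is a novel reward model architecture, trained with a Bradley-Terry loss, that prefers the optimal expansion of a sequence with a single reward-model call at each generation step. Scores for all candidate tokens are produced simultaneously, which makes inference efficient. Theoretical analysis shows that prior techniques prefer sub-optimal sequences compared with this method during inference. Empirically, the proposed reward model speeds up inference, needs fewer reward-model calls, and performs competitively with previous RGTG and offline RLHF methods.

Link: https://arxiv.org/abs/2502.04517
Authors: Ahmad Rashid, Ruotian Wu, Rongqi Fan, Hongliang Li, Agustinus Kristiadi, Pascal Poupart
Affiliation: unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: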

Abstract:Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback (RLHF). RGTG methods can align baseline language models to human preferences without further training like in standard RLHF methods. However, they rely on a reward model to score each candidate token generated by the language model at inference, incurring significant test-time overhead. Additionally, the reward model is usually only trained to score full sequences, which can lead to sub-optimal choices for partial sequences. In this work, we present a novel reward model architecture that is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a single call to the reward model at each step of the generation process. That is, a score for all possible candidate tokens is generated simultaneously, leading to efficient inference. We theoretically analyze various RGTG reward models and demonstrate that prior techniques prefer sub-optimal sequences compared to our method during inference. Empirically, our reward model leads to significantly faster inference than other RGTG methods. It requires fewer calls to the reward model and performs competitively compared to previous RGTG and offline RLHF methods.
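
The two ingredients named in the abstract can be sketched in a few lines: the standard Bradley-Terry loss on a preferred and a rejected score, and a decoding step that combines language-model log-probabilities with reward scores for all candidates at once. The combination rule (log-probability plus `beta` times reward) and the function names are assumptions for illustration; the paper's architecture and exact decoding rule are not reproduced here.

```python
import math

def bradley_terry_loss(score_preferred, score_rejected):
    """-log sigmoid(r_w - r_l), computed in a numerically stable form."""
    diff = score_preferred - score_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def pick_next_token(lm_logprobs, reward_scores, beta=1.0):
    """Choose a token from per-token scores returned by ONE reward-model call.

    lm_logprobs and reward_scores are dicts keyed by candidate token
    (hypothetical interface; beta weights the reward against the LM).
    """
    combined = {tok: lp + beta * reward_scores[tok] for tok, lp in lm_logprobs.items()}
    return max(combined, key=combined.get)
```

Because the reward scores arrive as one dictionary covering every candidate, the loop over tokens involves no further reward-model calls, which is where the inference savings come from.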

[NLP-54] Beyond Sample-Level Feedback: Using Reference-Level Feedback to Guide Data Synthesis

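[Quick read]: This paper addresses the challenge of maintaining consistent quality in synthetic data generation. Existing approaches typically operate at the sample level, generating and applying feedback for each response individually. The paper instead proposes Reference-Level Feedback, which collects feedback from high-quality reference samples in carefully curated seed data and uses it to capture rich signals of desirable characteristics that can be propagated to newly synthesized data. The resulting REFED dataset (10K instruction-response pairs) substantially improves model performance, with strong results on AlpacaEval 2.0 and Arena-Hard. The approach also outperforms traditional sample-level feedback methods with significantly fewer feedback collections and improves performance across different model architectures.

Link: https://arxiv.org/abs/2502.04511
Authors: Shuhaib Mehri, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür
Affiliation: unknown
Categories: Computation and Language (cs.CL)
Comments: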

Abstract:LLMs demonstrate remarkable capabilities in following natural language instructions, largely due to instruction-tuning on high-quality datasets. While synthetic data generation has emerged as a scalable approach for creating such datasets, maintaining consistent quality standards remains challenging. Recent approaches incorporate feedback to improve data quality, but typically operate at the sample level, generating and applying feedback for each response individually. In this work, we propose Reference-Level Feedback, a novel methodology that instead collects feedback based on high-quality reference samples from carefully curated seed data. We use this feedback to capture rich signals of desirable characteristics that can be propagated to newly synthesized data. We present REFED, a dataset of 10K instruction-response pairs synthesized using such feedback. We demonstrate the effectiveness of our approach by showing that Llama-3.1-8B-Instruct finetuned on REFED achieves state-of-the-art performance among similar-sized SFT-based models on AlpacaEval 2.0 and strong results on Arena-Hard. Through extensive experiments, we show that our approach consistently outperforms traditional sample-level feedback methods with significantly fewer feedback collections and improves performance across different model architectures.

[NLP-55] Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems

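[Quick read]: This paper addresses the design and optimization of multi-LLM systems. It proposes Heterogeneous Swarms, an algorithm that designs such systems by jointly optimizing model roles and weights. A multi-LLM system is represented as a directed acyclic graph (DAG) of LLMs that generate collaboratively through topological message passing. The algorithm alternates two iterative steps, role-step and weight-step. Role-step learns a DAG that specifies the flow of inputs and outputs between LLMs, optimizing the adjacency matrices with particle swarm optimization (PSO) against a utility function such as task accuracy. Weight-step quantifies the contribution of each LLM with the JFK-score and then optimizes model weights with PSO. Experiments show that Heterogeneous Swarms outperforms 15 baselines by 18.5% on average across 12 tasks.

Link: https://arxiv.org/abs/2502.04510
Authors: Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister
Affiliation: unknown
Categories: Computation and Language (cs.CL)
Comments: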

Abstract:We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by jointly optimizing model roles and weights. We represent multi-LLM systems as directed acyclic graphs (DAGs) of LLMs with topological message passing for collaborative generation. Given a pool of LLM experts and a utility function, Heterogeneous Swarms employs two iterative steps: role-step and weight-step. For role-step, we interpret model roles as learning a DAG that specifies the flow of inputs and outputs between LLMs. Starting from a swarm of random continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs in topological order, evaluate on the utility function (e.g. accuracy on a task), and optimize the adjacency matrices with particle swarm optimization based on the utility score. For weight-step, we assess the contribution of individual LLMs in the multi-LLM systems and optimize model weights with swarm intelligence. We propose JFK-score to quantify the individual contribution of each LLM in the best-found DAG of the role-step, then optimize model weights with particle swarm optimization based on the JFK-score. Experiments demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based baselines by 18.5% on average across 12 tasks. Further analysis reveals that Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles and substantial collaborative gains, and benefits from the diversity of language models.
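
The role-step pipeline (continuous adjacency matrix, discrete DAG, calls in topological order) can be sketched as follows. The decoding rule used here, which keeps an edge i → j only when i < j and the entry exceeds a threshold, is a simplifying assumption that guarantees acyclicity; it is not the decoding procedure from the paper, and `call_model` is a hypothetical stand-in for an LLM call. The particle swarm update itself is omitted.

```python
from graphlib import TopologicalSorter

def decode_dag(adj, threshold=0.5):
    """Map a continuous adjacency matrix to {node: set of predecessor nodes}.

    Assumed rule: keep edge i -> j only if i < j and adj[i][j] > threshold,
    so no cycle can form.
    """
    n = len(adj)
    return {j: {i for i in range(j) if adj[i][j] > threshold} for j in range(n)}

def run_in_topological_order(predecessors, call_model, query):
    """Call each model after all of its predecessors, passing their outputs along."""
    outputs = {}
    for node in TopologicalSorter(predecessors).static_order():
        upstream = [outputs[p] for p in sorted(predecessors[node])]
        outputs[node] = call_model(node, query, upstream)
    return outputs
```

A utility function (for example, task accuracy of the final node's output) would then score each decoded DAG, and that score is what the swarm optimizer uses to move the continuous matrices.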

[NLP-56] When One LLM Drools, Multi-LLM Collaboration Rules

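[Quick read]: This position paper addresses the limitations of relying on a single large language model (LLM) in complex, contextualized, and subjective scenarios. It argues for multi-LLM collaboration, with methods organized by level of access and information exchange (from API-level to weight-level), to better represent real-world data distributions, heterogeneous skills, and pluralistic populations, and thereby to improve the reliability and inclusiveness of the resulting systems.

Link: https://arxiv.org/abs/2502.04506
Authors: Shangbin Feng, Wenxuan Ding, Alisa Liu, Zifeng Wang, Weijia Shi, Yike Wang, Zejiang Shen, Xiaochuang Han, Hunter Lang, Chen-Yu Lee, Tomas Pfister, Yejin Choi, Yulia Tsvetkov
Affiliation: unknown
Categories: Computation and Language (cs.CL)
Comments: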

Abstract:This position paper argues that in many realistic (i.e., complex, contextualized, subjective) scenarios, one LLM is not enough to produce a reliable output. We challenge the status quo of relying solely on a single general-purpose LLM and argue for multi-LLM collaboration to better represent the extensive diversity of data, skills, and people. We first posit that a single LLM underrepresents real-world data distributions, heterogeneous skills, and pluralistic populations, and that such representation gaps cannot be trivially patched by further training a single LLM. We then organize existing multi-LLM collaboration methods into a hierarchy, based on the level of access and information exchange, ranging from API-level, text-level, logit-level, to weight-level collaboration. Based on these methods, we highlight how multi-LLM collaboration addresses challenges that a single LLM struggles with, such as reliability, democratization, and pluralism. Finally, we identify the limitations of existing multi-LLM methods and motivate future work. We envision multi-LLM collaboration as an essential path toward compositional intelligence and collaborative AI development.

[NLP-57] ULPT: Prompt Tuning with Ultra-Low-Dimensional Optimization

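[Quick read]: This paper addresses the high cost of fine-tuning large language models (LLMs) caused by their size. It proposes Ultra-Low-dimensional Prompt Tuning (ULPT), which optimizes prompts in a low-dimensional space (e.g., 2D) and up-projects them with a random but frozen matrix. Learnable shift and scale embeddings are introduced to enhance alignment. ULPT drastically reduces the number of trainable parameters (the 2D setting uses only 2% of the parameters of vanilla prompt tuning) while retaining most of the performance across 21 NLP tasks.

Link: https://arxiv.org/abs/2502.04501
Authors: Zijun Wu, Yongchang Hao, Lili Mou
Affiliation: Dept. Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta; Canada CIFAR AI Chair, Amii
Categories: Computation and Language (cs.CL)
Comments: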

Abstract:Large language models achieve state-of-the-art performance but are costly to fine-tune due to their size. Parameter-efficient fine-tuning methods, such as prompt tuning, address this by reducing trainable parameters while maintaining strong performance. However, prior methods tie prompt embeddings to the model’s dimensionality, which may not scale well with larger LLMs and more customized LLMs. In this paper, we propose Ultra-Low-dimensional Prompt Tuning (ULPT), which optimizes prompts in a low-dimensional space (e.g., 2D) and uses a random but frozen matrix for the up-projection. To enhance alignment, we introduce learnable shift and scale embeddings. ULPT drastically reduces the trainable parameters (e.g., the 2D setting uses only 2% of the parameters of vanilla prompt tuning) while retaining most of the performance across 21 NLP tasks. Our theoretical analysis shows that random projections can capture high-rank structures effectively, and experimental results demonstrate ULPT’s competitive performance over existing parameter-efficient methods.
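
A minimal sketch of the up-projection and of the resulting parameter count is given below. It assumes that the shift and scale embeddings are length-d vectors shared across prompt positions, and the sizes used in the usage note (100 prompt tokens, model dimension 768, rank 2) are hypothetical; the paper's actual shapes and configuration may differ.

```python
import random

def make_frozen_projection(r, d, seed=0):
    """Random r x d matrix, generated once and never trained."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(r)]

def up_project(z, proj, shift, scale):
    """Map an n x r trainable prompt z to n x d prompt embeddings.

    proj is the frozen r x d matrix; shift and scale are trainable
    length-d vectors (assumed to be shared across prompt positions).
    """
    r, d = len(proj), len(shift)
    return [
        [scale[j] * sum(z_i[k] * proj[k][j] for k in range(r)) + shift[j] for j in range(d)]
        for z_i in z
    ]

def ulpt_trainable_params(n, r, d):
    return n * r + 2 * d  # low-dimensional prompt + shift + scale

def vanilla_trainable_params(n, d):
    return n * d  # one full-width embedding per prompt token
```

Under these assumed sizes, `ulpt_trainable_params(100, 2, 768)` is 1,736 against 76,800 for vanilla prompt tuning, about 2.3%, which is the same order as the roughly 2% figure quoted in the abstract.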

[NLP-58] Revisiting Intermediate-Layer Matching in Knowledge Distillation: Layer-Selection Strategy Doesn't Matter (Much)

[Quick read]: This paper investigates whether the choice of which teacher layers to match matters in intermediate-layer matching for knowledge distillation (KD). It finds that even seemingly nonsensical layer-selection strategies, such as matching the teacher's layers in reverse order, still yield surprisingly good student performance. The paper interprets this phenomenon by examining the angles between teacher layers as viewed from the student's perspective.

Link: https://arxiv.org/abs/2502.04499
Authors: Zony Yu, Yuqiao Wen, Lili Mou
Affiliation: Dept. Computing Science, University of Alberta; Canada CIFAR AI Chair
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Knowledge distillation (KD) is a popular method of transferring knowledge from a large “teacher” model to a small “student” model. KD can be divided into two categories: prediction matching and intermediate-layer matching. We explore an intriguing phenomenon: layer-selection strategy does not matter (much) in intermediate-layer matching. In this paper, we show that seemingly nonsensical matching strategies such as matching the teacher’s layers in reverse still result in surprisingly good student performance. We provide an interpretation for this phenomenon by examining the angles between teacher layers viewed from the student’s perspective.
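
To make "layer-selection strategy" concrete, the sketch below builds a forward and a reversed student-to-teacher layer mapping and evaluates a matching loss over the mapped pairs. Mean squared error between hidden vectors of equal width is an assumed distance for illustration; the paper's exact loss, and any projection between student and teacher widths, are not reproduced here.

```python
def forward_mapping(num_student, num_teacher):
    """Match student layer s to evenly spaced teacher layers, in order.

    Assumes num_teacher is a multiple of num_student.
    """
    stride = num_teacher // num_student
    return [(s, (s + 1) * stride - 1) for s in range(num_student)]

def reverse_mapping(num_student, num_teacher):
    """Same teacher layers as forward_mapping, assigned in reverse order."""
    teacher_layers = [t for _, t in forward_mapping(num_student, num_teacher)]
    return list(zip(range(num_student), reversed(teacher_layers)))

def matching_loss(student_hidden, teacher_hidden, mapping):
    """Sum over matched pairs of the mean squared error between hidden vectors."""
    total = 0.0
    for s, t in mapping:
        hs, ht = student_hidden[s], teacher_hidden[t]
        total += sum((a - b) ** 2 for a, b in zip(hs, ht)) / len(hs)
    return total
```

For a 3-layer student and a 12-layer teacher, the forward mapping pairs student layers 0, 1, 2 with teacher layers 3, 7, 11, and the reverse mapping pairs them with 11, 7, 3. The paper's claim is that training with either mapping gives similar student quality.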

[NLP-59] Verifiable Format Control for Large Language Model Generations NAACL2025

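[Quick read]: This paper addresses the difficulty that small LLMs (about 7B parameters) have in following fine-grained formats such as JSON, which seriously hinders their application. Existing methods mostly benchmark general instruction following and overlook how to improve the format-following ability of small models. They also tend to rely on evaluation by advanced LLMs (e.g., GPT-4), which introduces intrinsic bias and is costly because of API calls. The key contribution is VFF, a fully verifiable format-following dataset, whose verifiability is used to synthesize large amounts of data for progressively training small LLMs and improving their format-following ability.

Link: https://arxiv.org/abs/2502.04498
Authors: Zhaoyang Wang, Jinqi Jiang, Huichi Zhou, Wenhao Zheng, Xuchao Zhang, Chetan Bansal, Huaxiu Yao
Affiliation: University of North Carolina at Chapel Hill; Imperial College London; Microsoft Research
Categories: Computation and Language (cs.CL)
Comments: To appear at Findings of NAACL 2025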

Abstract:Recent Large Language Models (LLMs) have demonstrated satisfying general instruction following ability. However, small LLMs with about 7B parameters still struggle with fine-grained format following (e.g., JSON format), which seriously hinders the advancement of their applications. Most existing methods focus on benchmarking general instruction following while overlooking how to improve the specific format following ability for small LLMs. Besides, these methods often rely on evaluations based on advanced LLMs (e.g., GPT-4), which can introduce the intrinsic bias of LLMs and be costly due to the API calls. In this paper, we first curate a fully verifiable format following dataset VFF. In contrast to existing works, which often adopt external LLMs for instruction-following validation, every sample of VFF can be easily validated with a Python function. Further, we propose to leverage this verifiable feature to synthesize massive data for progressively training small LLMs, in order to improve their format following abilities. Experimental results highlight the prevalent limitations in the format following capabilities of 7B level open-source LLMs and demonstrate the effectiveness of our method in enhancing this essential ability.
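
The abstract notes that every VFF sample can be validated with a Python function. The checker below is a hypothetical example of that kind of function, written for a made-up instruction ("answer with a JSON object containing these keys"); it is not taken from the dataset.

```python
import json

def follows_json_format(response, required_keys):
    """Return True if response parses as a JSON object containing all required keys."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)
```

A check like this needs no judge model, so it can label large volumes of synthesized responses as compliant or non-compliant at negligible cost.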

[NLP-60] Multi-Agent Reinforcement Learning with Focal Diversity Optimization

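[Quick read]: This paper addresses collaboration and selection among LLM-based agents in multi-agent reinforcement learning (MARL). It proposes MARL-Focal, which has three core characteristics: an agent-fusion framework that encourages multiple LLM-based agents to collaborate; a focal-diversity optimized agent selection algorithm that chooses a small subset of agents according to how well they complement one another; and a conflict-resolution method that detects inconsistency among the agents' outputs and fuses them. MARL-Focal is cost-efficient and adversarially robust, and it shows performance gains and stronger robustness across five benchmarks.

Link: https://arxiv.org/abs/2502.04492
Authors: Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Zachary Yahn, Ling Liu
Affiliation: unknown
Categories: Computation and Language (cs.CL)
Comments: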

Abstract:The advancement of Large Language Models (LLMs) and their finetuning strategies has triggered the renewed interests in multi-agent reinforcement learning. In this paper, we introduce a focal diversity-optimized multi-agent reinforcement learning approach, coined as MARL-Focal, with three unique characteristics. First, we develop an agent-fusion framework for encouraging multiple LLM based agents to collaborate in producing the final inference output for each LLM query. Second, we develop a focal-diversity optimized agent selection algorithm that can choose a small subset of the available agents based on how well they can complement one another to generate the query output. Finally, we design a conflict-resolution method to detect output inconsistency among multiple agents and produce our MARL-Focal output through reward-aware and policy-adaptive inference fusion. Extensive evaluations on five benchmarks show that MARL-Focal is cost-efficient and adversarial-robust. Our multi-agent fusion model achieves performance improvement of 5.51% compared to the best individual LLM-agent and offers stronger robustness over the TruthfulQA benchmark. Code is available at this https URL

[NLP-61] Building A Unified AI-centric Language System: analysis framework and future work

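[Quick read]: This paper addresses several limitations of natural language as a medium for AI, namely gender bias, morphological irregularities, and contextual ambiguities, and examines how they are exacerbated in current Transformer architectures. It proposes a unified AI-centric language system that translates diverse natural language inputs into a concise, computationally efficient, AI-friendly language, enabling more efficient model training and inference with a smaller memory footprint. The approach draws on emergent artificial communication systems and constructed languages such as Esperanto and Lojban, and aims at a clearer, fairer, and better-performing universal interchange format for AI-to-AI and human-to-AI interactions.

Link: https://arxiv.org/abs/2502.04488
Authors: Edward Hong Wang, Cynthia Xin Wen
Affiliation: Harvard University; The University of Sydney
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: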

Abstract:Recent advancements in large language models have demonstrated that extended inference through techniques can markedly improve performance, yet these gains come with increased computational costs and the propagation of inherent biases found in natural languages. This paper explores the design of a unified AI-centric language system that addresses these challenges by offering a more concise, unambiguous, and computationally efficient alternative to traditional human languages. We analyze the limitations of natural language such as gender bias, morphological irregularities, and contextual ambiguities and examine how these issues are exacerbated within current Transformer architectures, where redundant attention heads and token inefficiencies prevail. Drawing on insights from emergent artificial communication systems and constructed languages like Esperanto and Lojban, we propose a framework that translates diverse natural language inputs into a streamlined AI-friendly language, enabling more efficient model training and inference while reducing memory footprints. Finally, we outline a pathway for empirical validation through controlled experiments, paving the way for a universal interchange format that could revolutionize AI-to-AI and human-to-AI interactions by enhancing clarity, fairness, and overall performance.

[NLP-62] Active Task Disambiguation with LLMs

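[Quick read]: This paper addresses the ambiguously specified tasks that large language models (LLMs) frequently encounter in real-world interactions. It formally defines task ambiguity and frames task disambiguation through the lens of Bayesian Experimental Design. By generating targeted questions that elicit additional task specifications, an LLM agent progressively narrows the space of viable solutions and reduces the risk of unsatisfactory outputs. Selecting questions that maximize information gain shifts the load from implicit to explicit reasoning about the solution space and makes task disambiguation more effective.

Link: https://arxiv.org/abs/2502.04485
Authors: Katarzyna Kobalczyk, Nicolas Astorga, Tennison Liu, Mihaela van der Schaar
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: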

Abstract:Despite the impressive performance of large language models (LLMs) across various benchmarks, their ability to address ambiguously specified problems–frequent in real-world interactions–remains underexplored. To address this gap, we introduce a formal definition of task ambiguity and frame the problem of task disambiguation through the lens of Bayesian Experimental Design. By posing clarifying questions, LLM agents can acquire additional task specifications, progressively narrowing the space of viable solutions and reducing the risk of generating unsatisfactory outputs. Yet, generating effective clarifying questions requires LLM agents to engage in a form of meta-cognitive reasoning, an ability LLMs may presently lack. Our proposed approach of active task disambiguation enables LLM agents to generate targeted questions maximizing the information gain. Effectively, this approach shifts the load from implicit to explicit reasoning about the space of viable solutions. Empirical results demonstrate that this form of question selection leads to more effective task disambiguation in comparison to approaches relying on reasoning solely within the space of questions.
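
Question selection by information gain can be sketched under strong simplifying assumptions: a finite set of equally likely candidate solutions and questions with deterministic answers. In the paper the candidates would be sampled from an LLM; here `hypotheses` and `questions` are hypothetical inputs supplied by the caller.

```python
import math
from collections import Counter

def expected_information_gain(hypotheses, answer_fn):
    """Expected entropy reduction (in bits) from asking one question.

    hypotheses: equally likely candidate solutions (uniform prior assumed).
    answer_fn(h): the answer the user would give if h were the intended solution.
    """
    n = len(hypotheses)
    prior_entropy = math.log2(n)
    counts = Counter(answer_fn(h) for h in hypotheses)
    # After answer a, the posterior is uniform over the counts[a] consistent hypotheses.
    expected_posterior_entropy = sum((c / n) * math.log2(c) for c in counts.values())
    return prior_entropy - expected_posterior_entropy

def best_question(hypotheses, questions):
    """Pick the question name whose answer is expected to be most informative."""
    return max(questions, key=lambda name: expected_information_gain(hypotheses, questions[name]))
```

A question that splits the candidates evenly gains the most, while one whose answer is the same for every candidate gains nothing, which matches the intuition that a good clarifying question rules out as many solutions as possible.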

[NLP-63] Training Language Models to Reason Efficiently

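[Quick read]: This paper addresses the high deployment cost of large reasoning models, whose long chains of thought emerged as an alternative to the diminishing returns of scaling model size and training data. Its key solution is to use reinforcement learning (RL) to train large reasoning models to allocate inference-time compute dynamically according to task complexity. The method incentivizes models to cut unnecessary computation while maintaining accuracy, which yields substantial efficiency gains, and a single hyperparameter controls the level of reasoning efficiency. Experiments show significantly lower inference cost while preserving most of the accuracy.

Link: https://arxiv.org/abs/2502.04463
Authors: Daman Arora, Andrea Zanette
Affiliation: unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: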

Abstract:Scaling model size and training data has led to great advances in the performance of Large Language Models (LLMs). However, the diminishing returns of this approach necessitate alternative methods to improve model capabilities, particularly in tasks requiring advanced reasoning. Large reasoning models, which leverage long chain-of-thoughts, bring unprecedented breakthroughs in problem-solving capabilities but at a substantial deployment cost associated to longer generations. Reducing inference costs is crucial for the economic feasibility, user experience, and environmental sustainability of these models. In this work, we propose to train large reasoning models to reason efficiently. More precisely, we use reinforcement learning (RL) to train reasoning models to dynamically allocate inference-time compute based on task complexity. Our method incentivizes models to minimize unnecessary computational overhead while maintaining accuracy, thereby achieving substantial efficiency gains. It enables the derivation of a family of reasoning models with varying efficiency levels, controlled via a single hyperparameter. Experiments on two open-weight large reasoning models demonstrate significant reductions in inference cost while preserving most of the accuracy.
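
One way to picture a reward that trades accuracy against generation length with a single knob is sketched below. The functional form, the name `alpha`, and the normalization by `max_tokens` are all assumptions made for illustration; the abstract does not give the paper's actual reward, so this should not be read as the authors' formulation.

```python
def efficiency_reward(is_correct, num_tokens, max_tokens, alpha=0.2):
    """Hypothetical reward: correct answers earn more when they are shorter.

    alpha in [0, 1) is the single efficiency knob; alpha = 0 ignores length.
    Incorrect answers always receive 0, so accuracy dominates length.
    """
    if not is_correct:
        return 0.0
    length_fraction = min(num_tokens, max_tokens) / max_tokens
    return 1.0 - alpha * length_fraction
```

Sweeping `alpha` would then give a family of policies, from length-agnostic (0) to strongly length-averse (close to 1), in the spirit of the "family of reasoning models" mentioned in the abstract.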

[NLP-64] “In order that” – a data driven study of symptoms and causes of obsolescence

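[Quick read]: This paper studies grammatical obsolescence, focusing on the purpose subordinator "in order that", whose frequency of use has been decreasing steadily since the beginning of the twentieth century. It takes a data-driven approach that combines philological analysis with statistical methods to examine the possible symptoms of obsolescence and its different causes. The key finding is the role of two higher-order processes: an externally motivated one, driven by the drastic socio-cultural changes of the 19th and 20th centuries, and an internally motivated one, instantiated by the rise of the to-infinitive. Together these give a comprehensive account of the decline of "in order that".

Link: https://arxiv.org/abs/2502.04457
Authors: Karolina Rudnicka
Affiliation: unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 10 pages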

Abstract:The paper is an empirical case study of grammatical obsolescence in progress. The main studied variable is the purpose subordinator in order that, which is shown to be steadily decreasing in the frequency of use starting from the beginning of the twentieth century. This work applies a data-driven approach for the investigation and description of obsolescence, recently developed by Rudnicka (2019). The methodology combines philological analysis with statistical methods used on data acquired from mega-corpora. Moving from the description of possible symptoms of obsolescence to different causes for it, the paper aims at presenting a comprehensive account of the studied phenomenon. Interestingly, a very significant role in the decline of in order that can be ascribed to the so-called higher-order processes, understood as processes influencing the constructional level from above. Two kinds of higher-order processes are shown to play an important role, namely i) an externally-motivated higher-order process exemplified by the drastic socio-cultural changes of the 19th and 20th centuries; ii) an internally-motivated higher-order process instantiated by the rise of the to-infinitive (rise of infinite clauses).

[NLP-65] Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization

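[Quick read]: This paper addresses the inaccurate responses that small language models (SLMs) deployed on edge devices produce for complex queries. It studies uncertainty-based SLM routing, in which high-stakes queries are offloaded to a stronger large language model (LLM) when the SLM's response has low confidence, to make responses more reliable. The strategy has to balance efficiency against efficacy, and generalizing it to new datasets remains a challenge. The paper builds calibration data to improve the generalization of routing strategies under different uncertainty quantification methods and thereby improve this balance.

Link: https://arxiv.org/abs/2502.04428
Authors: Yu-Neng Chuang, Leisheng Yu, Guanchu Wang, Lizhe Zhang, Zirui Liu, Xuanting Cai, Yang Sui, Vladimir Braverman, Xia Hu
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: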

Abstract:Large language models (LLMs) are increasingly deployed and democratized on edge devices. To improve the efficiency of on-device deployment, small language models (SLMs) are often adopted due to their efficient decoding latency and reduced energy consumption. However, these SLMs often generate inaccurate responses when handling complex queries. One promising solution is uncertainty-based SLM routing, which offloads high-stakes queries to stronger LLMs when the SLM produces low-confidence responses. This follows the principle of “If you lack confidence, seek stronger support” to enhance reliability. Relying on more powerful LLMs is effective but increases invocation costs. Therefore, striking a routing balance between efficiency and efficacy remains a critical challenge. Additionally, efficiently generalizing the routing strategy to new datasets remains under-explored. In this paper, we conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs over 1500+ settings. Our findings highlight: First, uncertainty-correctness alignment in different uncertainty quantification (UQ) methods significantly impacts routing performance. Second, uncertainty distributions depend more on the specific SLM and the chosen UQ method than on downstream data. Building on these insights, we propose a calibration data construction instruction pipeline and open-source a constructed hold-out set to enhance routing generalization on new downstream scenarios. The experimental results indicate calibration data effectively bootstraps routing performance without any new data.
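
The routing rule itself is simple to sketch: answer locally when the small model is confident, otherwise escalate. The interfaces `slm` and `llm`, the confidence scale, and the default threshold of 0.7 are hypothetical; how the confidence is computed (the UQ method) is the part the paper benchmarks and is left abstract here.

```python
def route(query, slm, llm, threshold=0.7):
    """Uncertainty-based routing between an on-device SLM and a stronger LLM.

    slm(query) -> (answer, confidence in [0, 1]); llm(query) -> answer.
    Returns the chosen answer and which model produced it.
    """
    answer, confidence = slm(query)
    if confidence >= threshold:
        return answer, "slm"
    # Low confidence: pay the extra invocation cost for the stronger model.
    return llm(query), "llm"
```

Raising the threshold sends more traffic to the LLM (higher cost, usually higher accuracy), so the threshold is the knob that sets the efficiency-efficacy balance discussed in the abstract.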

[NLP-66] Decoding AI Judgment: How LLMs Assess News Credibility and Bias

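[Quick read]: This paper investigates the internal process by which large language models (LLMs) assess news credibility. It analyzes the mechanisms and linguistic features the models rely on when judging credibility, and introduces a framework for evaluating how they refine their assessments dynamically by retrieving external information, querying other models, and adapting their responses.

Link: https://arxiv.org/abs/2502.04426
Authors: Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli, Walter Quattrociocchi
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: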

Abstract:Large Language Models (LLMs) are increasingly used to assess news credibility, yet little is known about how they make these judgments. While prior research has examined political bias in LLM outputs or their potential for automated fact-checking, their internal evaluation processes remain largely unexamined. Understanding how LLMs assess credibility provides insights into AI behavior and how credibility is structured and applied in large-scale language models. This study benchmarks the reliability and political classifications of state-of-the-art LLMs - Gemini 1.5 Flash (Google), GPT-4o mini (OpenAI), and LLaMA 3.1 (Meta) - against structured, expert-driven rating systems such as NewsGuard and Media Bias Fact Check. Beyond assessing classification performance, we analyze the linguistic markers that shape LLM decisions, identifying which words and concepts drive their evaluations. We uncover patterns in how LLMs associate credibility with specific linguistic features by examining keyword frequency, contextual determinants, and rank distributions. Beyond static classification, we introduce a framework in which LLMs refine their credibility assessments by retrieving external information, querying other models, and adapting their responses. This allows us to investigate whether their assessments reflect structured reasoning or rely primarily on prior learned associations.

[NLP-67] EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models

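[Quick read]: This paper addresses the inability of existing static, text-based, or text-image benchmarks to adequately evaluate the emotional intelligence (EI) of multimodal large language models (MLLMs). It builds EmoBench-M, a new benchmark that evaluates the EI of MLLMs along three dimensions: foundational emotion recognition, conversational emotion understanding, and socially complex emotion analysis. Evaluation on EmoBench-M reveals a significant performance gap between MLLMs, both open-source and closed-source, and humans, underscoring the need to advance their EI capabilities. All benchmark resources are publicly released.

Link: https://arxiv.org/abs/2502.04424
Authors: He Hu, Yucheng Zhou, Lianzhong You, Hongbo Xu, Qianning Wang, Zheng Lian, Fei Richard Yu, Fei Ma, Laizhong Cui
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: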

Abstract:With the integration of multimodal large language models (MLLMs) into robotic systems and various AI applications, embedding emotional intelligence (EI) capabilities into these models is essential for enabling robots to effectively address human emotional needs and interact seamlessly in real-world scenarios. Existing static, text-based, or text-image benchmarks overlook the multimodal complexities of real-world interactions and fail to capture the dynamic, multimodal nature of emotional expressions, making them inadequate for evaluating MLLMs’ EI. Based on established psychological theories of EI, we build EmoBench-M, a novel benchmark designed to evaluate the EI capability of MLLMs across 13 evaluation scenarios from three key dimensions: foundational emotion recognition, conversational emotion understanding, and socially complex emotion analysis. Evaluations of both open-source and closed-source MLLMs on EmoBench-M reveal a significant performance gap between them and humans, highlighting the need to further advance their EI capabilities. All benchmark resources, including code and datasets, are publicly available at this https URL.

[NLP-68] Primary Care Diagnoses as a Reliable Predictor for Orthopedic Surgical Interventions

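[Quick read]: This paper addresses inefficiencies in referral workflows, in particular misaligned referrals and delays that lead to suboptimal patient outcomes and higher healthcare costs. It predicts procedural needs from primary care diagnostic entries using machine learning models built on Base General Embeddings (BGE) for semantic extraction, with noise tolerance experiments and oversampling to handle class imbalance. The selected model achieves a ROC-AUC of 0.874 and a Matthews correlation coefficient (MCC) of 0.540 and effectively distinguishes patients who require surgical intervention. A threshold sensitivity analysis identifies an optimal decision threshold of 0.30 that balances precision and recall to maximize referral efficiency. In the predictive modeling analysis, the procedure rate rises from 11.27% to 60.1%, with significant implications for operational efficiency and healthcare revenue.

Link: https://arxiv.org/abs/2502.04423
Authors: Khushboo Verma, Alan Michels, Ergi Gumusaneli, Shilpa Chitnis, Smita Sinha Kumar, Christopher Thompson, Lena Esmail, Guruprasath Srinivasan, Chandini Panchada, Sushovan Guha, Satwant Kumar
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: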

Abstract:Referral workflow inefficiencies, including misaligned referrals and delays, contribute to suboptimal patient outcomes and higher healthcare costs. In this study, we investigated the possibility of predicting procedural needs based on primary care diagnostic entries, thereby improving referral accuracy, streamlining workflows, and providing better care to patients. A de-identified dataset of 2,086 orthopedic referrals from the University of Texas Health at Tyler was analyzed using machine learning models built on Base General Embeddings (BGE) for semantic extraction. To ensure real-world applicability, noise tolerance experiments were conducted, and oversampling techniques were employed to mitigate class imbalance. The selected optimum and parsimonious embedding model demonstrated high predictive accuracy (ROC-AUC: 0.874, Matthews Correlation Coefficient (MCC): 0.540), effectively distinguishing patients requiring surgical intervention. Dimensionality reduction techniques confirmed the model’s ability to capture meaningful clinical relationships. A threshold sensitivity analysis identified an optimal decision threshold (0.30) to balance precision and recall, maximizing referral efficiency. In the predictive modeling analysis, the procedure rate increased from 11.27% to an optimal 60.1%, representing a 433% improvement with significant implications for operational efficiency and healthcare revenue. The results of our study demonstrate that referral optimization can enhance primary and surgical care integration. Through this approach, precise and timely predictions of procedural requirements can be made, thereby minimizing delays, improving surgical planning, and reducing administrative burdens. In addition, the findings highlight the potential of clinical decision support as a scalable solution for improving patient outcomes and the efficiency of the healthcare system. 
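
Two of the quantities in the abstract are easy to check or compute directly. The sketch below verifies the stated 433% figure as the relative change from 11.27% to 60.1%, and gives the textbook definition of the Matthews correlation coefficient from confusion-matrix counts. The function names are illustrative, and no data from the study is reproduced.

```python
import math

def relative_improvement(before, after):
    """Relative change, e.g. 0.5 means a 50% improvement."""
    return (after - before) / before

def matthews_corrcoef(tp, fp, fn, tn):
    """MCC from confusion-matrix counts; defined as 0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Here `relative_improvement(11.27, 60.1)` is about 4.33, which agrees with the 433% improvement reported in the abstract.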

[NLP-69] KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

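[Quick read]: This paper addresses three open issues of KV cache quantization, a technique that improves the inference throughput and latency of large language models (LLMs) in long-context and large-batch scenarios: overlooking layer-wise sensitivity to KV cache quantization, the high overhead of online fine-grained decision-making, and low flexibility across different LLMs and constraints. It proposes KVTuner, a framework that uses multi-objective optimization to adaptively search for hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache quantization, and applies the offline-searched configurations directly during online inference. To reduce the cost of offline calibration, it prunes intra-layer KV precision pairs and clusters layers to shrink the search space.

Link: https://arxiv.org/abs/2502.04420
Authors: Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: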

Abstract:KV cache quantization can improve Large Language Models (LLMs) inference throughput and latency in long contexts and large batch-size scenarios while preserving LLMs effectiveness. However, current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. Therefore, we thoroughly analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why key cache is more important than value cache for quantization error reduction. We further propose a simple yet effective framework KVTuner to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline searched configurations during online inference. To reduce the computational cost of offline calibration, we utilize the intra-layer KV precision pair pruning and inter-layer clustering to reduce the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 4.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 38.3% compared with KV8 quantization over various context lengths.
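
Layer-wise mixed precision can be pictured as a list of (key bits, value bits) pairs, one per layer. The sketch below computes an average bit width for such a configuration and includes a toy uniform quantizer to show how error shrinks with more bits. Averaging key and value bits over layers is an assumed reading of figures such as "3.25-bit", and the quantizer is a generic min-max scheme, not the kernels or search procedure from the paper.

```python
def average_kv_bits(layer_precisions):
    """Average bits per cached element for a list of (key_bits, value_bits) per layer."""
    total = sum(k_bits + v_bits for k_bits, v_bits in layer_precisions)
    return total / (2 * len(layer_precisions))

def quantize_dequantize(values, bits):
    """Uniform min-max quantization of a list of floats, followed by dequantization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + round((v - lo) / step) * step for v in values]

def max_abs_error(values, bits):
    restored = quantize_dequantize(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))
```

For instance, a hypothetical four-layer configuration `[(4, 4), (4, 2), (4, 2), (4, 2)]` averages 3.25 bits while keeping keys at higher precision than values, in line with the abstract's observation that the key cache matters more for quantization error.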

[NLP-70] Understanding and Mitigating the Bias Inheritance in LLM-based Data Augmentation on Downstream Tasks

【Quick read】: This paper addresses the bias that large language models (LLMs) pass on when they generate synthetic data. Its core is to fine-tune LLMs in order to systematically understand and analyze how different types of bias manifest in synthetic data, and to propose three mitigation strategies: token-based, mask-based, and loss-based. How well each strategy reduces bias inheritance depends on the task and the bias type.

Link: https://arxiv.org/abs/2502.04419
Authors: Miaomiao Li, Hao Chen, Yang Wang, Tingyuan Zhu, Weijia Zhang, Kaijie Zhu, Kam-Fai Wong, Jindong Wang
Affiliations: The Chinese University of Hong Kong; Carnegie Mellon University; Institute of Science Tokyo; University of Illinois Urbana-Champaign; UC Santa Barbara; William & Mary
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Technical report; 31 pages

Abstract:Generating synthetic datasets via large language models (LLMs) themselves has emerged as a promising approach to improve LLM performance. However, LLMs inherently reflect biases present in their training data, leading to a critical challenge: when these models generate synthetic data for training, they may propagate and amplify their inherent biases that can significantly impact model fairness and robustness on downstream tasks–a phenomenon we term bias inheritance. This work presents the first systematic investigation in understanding, analyzing, and mitigating bias inheritance. We study this problem by fine-tuning LLMs with a combined dataset consisting of original and LLM-augmented data, where bias ratio represents the proportion of augmented data. Through systematic experiments across 10 classification and generation tasks, we analyze how 6 different types of biases manifest at varying bias ratios. Our results reveal that bias inheritance has nuanced effects on downstream tasks, influencing both classification tasks and generation tasks differently. Then, our analysis identifies three key misalignment factors: misalignment of values, group data, and data distributions. Based on these insights, we propose three mitigation strategies: token-based, mask-based, and loss-based approaches. Experiments demonstrate that these strategies also work differently on various tasks and bias, indicating the substantial challenges to fully mitigate bias inheritance. We hope this work can provide valuable insights to the research of LLM data augmentation.
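
The abstract defines the bias ratio as the proportion of augmented data in the combined fine-tuning set. A minimal Python sketch of building such a mixture follows; the function name `mix_with_bias_ratio`, the fixed `total` size, and the sampling scheme are assumptions for illustration, not the authors' pipeline.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_with_bias_ratio(
    original: Sequence[T],
    augmented: Sequence[T],
    bias_ratio: float,
    total: int,
    seed: int = 0,
) -> List[T]:
    """Sample `total` examples so that `bias_ratio` of them are LLM-augmented."""
    if not 0.0 <= bias_ratio <= 1.0:
        raise ValueError("bias_ratio must lie in [0, 1]")
    n_aug = round(total * bias_ratio)
    n_orig = total - n_aug
    if n_aug > len(augmented) or n_orig > len(original):
        raise ValueError("not enough examples for the requested mixture")
    rng = random.Random(seed)  # seeded so the mixture is reproducible
    mixed = rng.sample(list(original), n_orig) + rng.sample(list(augmented), n_aug)
    rng.shuffle(mixed)
    return mixed
```

Sweeping `bias_ratio` from 0 to 1 with `total` held fixed gives a series of training sets that differ only in how much augmented data they contain.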

[NLP-71] MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot

【Quick read】: This paper addresses the insufficient diagnostic accuracy and specificity of existing heuristic-based retrieval-augmented generation (RAG) models in the medical domain, particularly for diseases with similar manifestations. The key solution is MedRAG, which enhances RAG with knowledge graph (KG)-elicited reasoning. MedRAG systematically constructs a four-tier hierarchical diagnostic KG that captures the critical diagnostic differences between diseases, dynamically integrates these differences with similar cases retrieved from an electronic health record (EHR) database, and reasons over them within a large language model. This yields more accurate and specific decision support, and the system proactively proposes follow-up questions to improve personalized medical decision-making.

Link: https://arxiv.org/abs/2502.04413
Authors: Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
Affiliations: LILY Research Centre, Nanyang Technological University, Singapore; Tan Tock Seng Hospital, Singapore; Woodlands Health, Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Retrieval-augmented generation (RAG) is a well-suited technique for retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a key module of the healthcare copilot, helping reduce misdiagnosis for healthcare practitioners and patients. However, the diagnostic accuracy and specificity of existing heuristic-based RAG models used in the medical domain are inadequate, particularly for diseases with similar manifestations. This paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited reasoning for the medical domain that retrieves diagnosis and treatment recommendations based on manifestations. MedRAG systematically constructs a comprehensive four-tier hierarchical diagnostic KG encompassing critical diagnostic differences of various diseases. These differences are dynamically integrated with similar EHRs retrieved from an EHR database, and reasoned within a large language model. This process enables more accurate and specific decision support, while also proactively providing follow-up questions to enhance personalized medical decision-making. MedRAG is evaluated on both a public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD) collected from Tan Tock Seng Hospital, and its performance is compared against various existing RAG methods. Experimental results show that, leveraging the information integration and relational abilities of the KG, our MedRAG provides more specific diagnostic insights and outperforms state-of-the-art models in reducing misdiagnosis rates. Our code will be available at this https URL

[NLP-72] Decoder-Only LLMs are Better Controllers for Diffusion Models

【Quick read】: This paper addresses the limited semantic understanding that text-to-image generation models show when interpreting text prompts. The key solution is to borrow the semantic understanding of large language models (LLMs) and to design a simple adapter module that makes diffusion models compatible with the decoder-only architecture, improving the quality and reliability of text-to-image generation.

Link: https://arxiv.org/abs/2502.04412
Authors: Ziyi Dong, Yao Xiao, Pengxu Wei, Liang Lin
Affiliations: Sun Yat-sen University; Pengcheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Groundbreaking advancements in text-to-image generation have recently been achieved with the emergence of diffusion models. These models exhibit a remarkable ability to generate highly artistic and intricately detailed images based on textual prompts. However, obtaining desired generation outcomes often necessitates repetitive trials of manipulating text prompts just like casting spells on a magic mirror, and the reason behind that is the limited capability of semantic understanding inherent in current image generation models. Specifically, existing diffusion models encode the text prompt input with a pre-trained encoder structure, which is usually trained on a limited number of image-caption pairs. The state-of-the-art large language models (LLMs) based on the decoder-only structure have shown a powerful semantic understanding capability as their architectures are more suitable for training on very large-scale unlabeled data. In this work, we propose to enhance text-to-image diffusion models by borrowing the strength of semantic understanding from large language models, and devise a simple yet effective adapter to allow the diffusion models to be compatible with the decoder-only structure. Meanwhile, we also provide a supporting theoretical analysis with various architectures (e.g., encoder-only, encoder-decoder, and decoder-only), and conduct extensive empirical evaluations to verify its effectiveness. The experimental results show that the enhanced models with our adapter module are superior to the state-of-the-art models in terms of text-to-image generation quality and reliability.

[NLP-73] Mediator: Memory-efficient LLM Merging with Less Parameter Conflicts and Uncertainty Based Routing

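【Quick read】: This paper addresses the performance degradation caused by parameter conflicts when merging large language models (LLMs) fine-tuned on different tasks. The key solution treats layers according to their degree of parameter conflict: layers with little conflict are averaged, while layers with significant conflict use task-level expert routing. To further reduce storage cost, multiple fine-tuned experts are decoupled into one dense expert and several sparse experts. For out-of-distribution samples, the method selects and merges appropriate experts based on the task uncertainty of the input. Experiments on LLaMA and Qwen models of various scales show significant performance gains at lower system cost than existing methods.

Link: https://arxiv.org/abs/2502.04411
Authors: Kunfeng Lai, Zhenheng Tang, Xinglin Pan, Peijie Dong, Xiang Liu, Haolan Chen, Li Shen, Bo Li, Xiaowen Chu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: work in progress. arXiv admin note: text overlap with arXiv:2405.09673 by other authors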

Abstract:Model merging aggregates Large Language Models (LLMs) finetuned on different tasks into a stronger one. However, parameter conflicts between models lead to performance degradation in averaging. While model routing addresses this issue by selecting individual models during inference, it imposes excessive storage and compute costs, and fails to leverage the common knowledge from different models. In this work, we observe that different layers exhibit varying levels of parameter conflicts. Building on this insight, we average layers with minimal parameter conflicts and use a novel task-level expert routing for layers with significant conflicts. To further reduce storage costs, inspired by task arithmetic sparsity, we decouple multiple fine-tuned experts into a dense expert and several sparse experts. Considering the out-of-distribution samples, we select and merge appropriate experts based on the task uncertainty of the input data. We conduct extensive experiments on both LLaMA and Qwen with varying parameter scales, and evaluate on real-world reasoning tasks. Results demonstrate that our method consistently achieves significant performance improvements while requiring less system cost compared to existing methods.
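
The layer-wise decision described above (average where conflict is low, route where it is high) can be sketched in a few lines of Python. The conflict measure used here, the fraction of positions where task vectors disagree in sign, is an assumption chosen for illustration; the abstract does not specify the paper's metric. The names `sign_conflict_rate` and `plan_merge` and the threshold value are likewise hypothetical.

```python
from typing import Dict, List, Union


def sign_conflict_rate(task_vectors: List[List[float]]) -> float:
    """Fraction of parameter positions where non-zero task vectors disagree in sign."""
    n_params = len(task_vectors[0])
    conflicts = 0
    for idx in range(n_params):
        signs = {vec[idx] > 0 for vec in task_vectors if vec[idx] != 0}
        if len(signs) > 1:
            conflicts += 1
    return conflicts / n_params


def plan_merge(
    layers: Dict[str, List[List[float]]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Dict[str, Union[List[float], str]]:
    """Average low-conflict layers; mark high-conflict layers for routing."""
    plan: Dict[str, Union[List[float], str]] = {}
    for name, task_vectors in layers.items():
        if sign_conflict_rate(task_vectors) <= threshold:
            n = len(task_vectors)
            # Element-wise mean across the fine-tuned models.
            plan[name] = [sum(column) / n for column in zip(*task_vectors)]
        else:
            plan[name] = "route"
    return plan
```

In this toy form a layer whose task vectors agree in sign everywhere is merged into a single averaged vector, while a layer whose vectors point in opposite directions is left for a router to resolve at inference time.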

[NLP-74] FAS: Fast ANN-SNN Conversion for Spiking Large Language Models

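【Quick read】: This paper addresses the performance degradation and high computational cost of existing methods for creating spiking large language models, namely direct training and ANN-SNN conversion. The key solution is FAS, a fast ANN-SNN conversion strategy with two stages: the first applies full-parameter fine-tuning to a pre-trained model, so no training from scratch is needed; the second introduces a coarse-to-fine calibration method to reduce conversion error and improve accuracy.

Link: https://arxiv.org/abs/2502.04405
Authors: Long Chen, Xiaotian Song, Andy Song, BaDong Chen, Jiancheng Lv, Yanan Sun
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: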

Abstract:Spiking Large Language Models have been shown as a good alternative to LLMs in various scenarios. Existing methods for creating Spiking LLMs, i.e., direct training and ANN-SNN conversion, often suffer from performance degradation and relatively high computational costs. To address these issues, we propose a novel Fast ANN-SNN conversion strategy (FAS) that transforms LLMs into spiking LLMs in two stages. The first stage employs a full-parameter fine-tuning of pre-trained models, so it does not need any direct training from scratch. The second stage introduces a coarse-to-fine calibration method to reduce conversion errors and improve accuracy. Our experiments on both language and vision-language tasks across four different scales of LLMs demonstrate that FAS can achieve state-of-the-art performance yet with significantly reduced inference latency and computational costs. For example, FAS only takes 8 timesteps to achieve an accuracy of 3% higher than that of the OPT-7B model, while reducing energy consumption by 96.63%.

[NLP-75] Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models

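【Quick read】: This paper addresses two obstacles on the path from large language models (LLMs) to Level 2 AGI Reasoners: inefficient overthinking and overreliance on auxiliary reward models. The key is a self-backtracking mechanism that lets LLMs decide autonomously when and where to backtrack during both training and inference, improving reasoning ability and efficiency. Through self-improvement, the mechanism turns slow thinking into fast thinking; experiments show a performance gain of over 40 percent compared with optimal-path supervised fine-tuning.

Link: https://arxiv.org/abs/2502.04404
Authors: Xiao-Wen Yang, Xuan-Yi Zhu, Wen-Da Wei, Ding-Chu Zhang, Jie-Jing Shao, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This is a preprint under review, 15 pages, 13 figures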

Abstract:The integration of slow-thinking mechanisms into large language models (LLMs) offers a promising way toward achieving Level 2 AGI Reasoners, as exemplified by systems like OpenAI’s o1. However, several significant challenges remain, including inefficient overthinking and an overreliance on auxiliary reward models. We point out that these limitations stem from LLMs’ inability to internalize the search process, a key component of effective reasoning. A critical step toward addressing this issue is enabling LLMs to autonomously determine when and where to backtrack, a fundamental operation in traditional search algorithms. To this end, we propose a self-backtracking mechanism that equips LLMs with the ability to backtrack during both training and inference. This mechanism not only enhances reasoning ability but also efficiency by transforming slow-thinking processes into fast-thinking through self-improvement. Empirical evaluations demonstrate that our proposal significantly enhances the reasoning capabilities of LLMs, achieving a performance gain of over 40 percent compared to the optimal-path supervised fine-tuning method. We believe this study introduces a novel and promising pathway for developing more advanced and robust Reasoners.

[NLP-76] Multimodal Medical Code Tokenizer

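【Quick read】: This paper addresses the fact that existing tokenizers for electronic health records (EHRs) treat medical codes as isolated textual tokens. The key solution is MedTok, a multimodal medical code tokenizer that uses each code's text description and relational context. MedTok processes text with a language model encoder and encodes relational structure with a graph encoder, then quantizes both modalities into a unified token space that preserves modality-specific and cross-modality information. It improves performance on a range of EHR-based tasks, most of all drug recommendation, showing MedTok's potential as a unified tokenizer for medical codes.

Link: https://arxiv.org/abs/2502.04397
Authors: Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: conference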

Abstract:Foundation models trained on patient electronic health records (EHRs) require tokenizing medical data into sequences of discrete vocabulary items. Existing tokenizers treat medical codes from EHRs as isolated textual tokens. However, each medical code is defined by its textual description, its position in ontological hierarchies, and its relationships to other codes, such as disease co-occurrences and drug-treatment associations. Medical vocabularies contain more than 600,000 codes with critical information for clinical reasoning. We introduce MedTok, a multimodal medical code tokenizer that uses the text descriptions and relational context of codes. MedTok processes text using a language model encoder and encodes the relational structure with a graph encoder. It then quantizes both modalities into a unified token space, preserving modality-specific and cross-modality information. We integrate MedTok into five EHR models and evaluate it on operational and clinical tasks across in-patient and out-patient datasets, including outcome prediction, diagnosis classification, drug recommendation, and risk stratification. Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate using MedTok tokenizer with medical QA systems. Our results demonstrate the potential of MedTok as a unified tokenizer for medical codes, improving tokenization for medical foundation models.

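[NLP-77] DECT: Harnessing LLM-assisted Fine-Grained Linguistic Knowledge and Label-Switched and Label-Preserved Data Generation for Diagnosis of Alzheimer's Disease

【Quick read】: This paper addresses the difficulty of detecting language impairment for early diagnosis of Alzheimer's Disease (AD). Patient-interviewer dialogues are often mixed with ambiguous, noisy, and irrelevant information, which makes AD detection hard; the limited availability of AD speech samples and the variability of speaking styles add further challenges. The paper proposes DECT, which uses large language models (LLMs) for fine-grained linguistic analysis and for label-switched and label-preserved data generation. The key is to use the summarizing ability of LLMs to identify and distill key cognitive-linguistic information from noisy speech transcripts, their inherent linguistic knowledge to extract linguistic markers from unstructured and heterogeneous audio transcripts, and their compositional ability to generate AD speech transcripts with diverse linguistic patterns, which overcomes data scarcity and makes AD detection models more robust.

Link: https://arxiv.org/abs/2502.04394
Authors: Tingyu Mo, Jacqueline C. K. Lam, Victor O.K. Li, Lawrence Y. L. Cheung
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: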

Abstract:Alzheimer’s Disease (AD) is an irreversible neurodegenerative disease affecting 50 million people worldwide. Low-cost, accurate identification of key markers of AD is crucial for timely diagnosis and intervention. Language impairment is one of the earliest signs of cognitive decline, which can be used to discriminate AD patients from normal control individuals. Patient-interviewer dialogues may be used to detect such impairments, but they are often mixed with ambiguous, noisy, and irrelevant information, making the AD detection task difficult. Moreover, the limited availability of AD speech samples and variability in their speech styles pose significant challenges in developing robust speech-based AD detection models. To address these challenges, we propose DECT, a novel speech-based domain-specific approach leveraging large language models (LLMs) for fine-grained linguistic analysis and label-switched label-preserved data generation. Our study presents four novelties: We harness the summarizing capabilities of LLMs to identify and distill key Cognitive-Linguistic information from noisy speech transcripts, effectively filtering irrelevant information. We leverage the inherent linguistic knowledge of LLMs to extract linguistic markers from unstructured and heterogeneous audio transcripts. We exploit the compositional ability of LLMs to generate AD speech transcripts consisting of diverse linguistic patterns to overcome the speech data scarcity challenge and enhance the robustness of AD detection models. We use the augmented AD textual speech transcript dataset and a more fine-grained representation of AD textual speech transcript data to fine-tune the AD detection model. The results have shown that DECT demonstrates superior model performance with an 11% improvement in AD detection accuracy on the datasets from DementiaBank compared to the baselines.

[NLP-78] Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents

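【Quick read】: This paper addresses the challenge of deploying large language models (LLMs) on resource-limited local devices. The key solution is Division-of-Thoughts (DoT), a collaborative reasoning framework that exploits the synergy between locally deployed smaller-scale language models (SLMs) and cloud-based LLMs: it decomposes a user query into smaller sub-tasks and uses task scheduling to analyze the dependencies between sub-tasks so they can be reasoned about in parallel. DoT also uses a Plug-and-Play Adapter to assign a suitable model according to sub-task difficulty, and a self-reinforced training method to improve task allocation. Together these substantially reduce LLM cost while maintaining competitive reasoning accuracy.

Link: https://arxiv.org/abs/2502.04392
Authors: Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu
Affiliations: Tsinghua University; BNRist, Tsinghua University; Department of Electronic Engineering; Emory University; Department of Quantitative Theory & Methods
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: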

Abstract:The rapid expansion of web content has made on-device AI assistants indispensable for helping users manage the increasing complexity of online tasks. The emergent reasoning ability in large language models offers a promising path for next-generation on-device AI agents. However, deploying full-scale Large Language Models (LLMs) on resource-limited local devices is challenging. In this paper, we propose Division-of-Thoughts (DoT), a collaborative reasoning framework leveraging the synergy between locally deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT leverages a Task Decomposer to elicit the inherent planning abilities in language models to decompose user queries into smaller sub-tasks, which allows hybrid language models to fully exploit their respective strengths. Besides, DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks and create a dependency graph, facilitating parallel reasoning of sub-tasks and the identification of key steps. To allocate the appropriate model based on the difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an additional task head attached to the SLM that does not alter the SLM’s parameters. To boost the adapter’s task allocation capability, we propose a self-reinforced training method that relies solely on task execution feedback. Extensive experiments on various benchmarks demonstrate that our DoT significantly reduces LLM costs while maintaining competitive reasoning accuracy. Specifically, DoT reduces the average reasoning time and API costs by 66.12% and 83.57%, while achieving comparable reasoning accuracy with the best baseline methods.
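
To make the scheduling idea concrete, the sketch below groups sub-tasks into batches that could run in parallel once their dependencies are met, using the standard-library `graphlib.TopologicalSorter` (Python 3.9+). It is only an illustration of dependency-graph scheduling, not DoT's Task Scheduler; the function name `parallel_batches` and the sub-task names in the usage note are invented for this example.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def parallel_batches(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group sub-tasks into batches whose members have no unmet dependencies.

    `dependencies` maps each sub-task to the set of sub-tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # unlocks sub-tasks that depended on this batch
    return batches
```

For a query decomposed into `parse`, two independent lookups that both need `parse`, and a final `answer` that needs both lookups, the function returns three batches, with the two lookups sharing the middle one.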

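[NLP-79] In Praise of Stubbornness: The Case for Cognitive-Dissonance-Aware Knowledge Updates in LLMs

【Quick read】: This paper addresses catastrophic forgetting when large language models (LLMs) continually update their knowledge. The key is a cognitively inspired investigation paradigm with two core components: (1) Dissonance and Familiarity Awareness, which analyzes model behavior to classify information as novel, familiar, or dissonant; and (2) Targeted Network Updates, which track neural activity to identify frequently used (stubborn) and rarely used (plastic) neurons. The study finds that dissonance can be detected from simple activation and gradient features, that non-dissonant updates largely preserve prior knowledge, and that dissonant updates are destructive to the model's knowledge base. This reveals a fundamental limitation in how neural networks handle contradictions and motivates knowledge-updating methods that better mirror human cognitive mechanisms.

Link: https://arxiv.org/abs/2502.04390
Authors: Simone Clemente, Zied Ben Houidi, Alexis Huet, Dario Rossi, Giulio Franzese, Pietro Michiardi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: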

Abstract:Despite remarkable capabilities, large language models (LLMs) struggle to continually update their knowledge without catastrophic forgetting. In contrast, humans effortlessly integrate new information, detect conflicts with existing beliefs, and selectively update their mental models. This paper introduces a cognitive-inspired investigation paradigm to study continual knowledge updating in LLMs. We implement two key components inspired by human cognition: (1) Dissonance and Familiarity Awareness, analyzing model behavior to classify information as novel, familiar, or dissonant; and (2) Targeted Network Updates, which track neural activity to identify frequently used (stubborn) and rarely used (plastic) neurons. Through carefully designed experiments in controlled settings, we uncover a number of empirical findings demonstrating the potential of this approach. First, dissonance detection is feasible using simple activation and gradient features, suggesting potential for cognitive-inspired training. Second, we find that non-dissonant updates largely preserve prior knowledge regardless of targeting strategy, revealing inherent robustness in LLM knowledge integration. Most critically, we discover that dissonant updates prove catastrophically destructive to the model’s knowledge base, indiscriminately affecting even information unrelated to the current updates. This suggests fundamental limitations in how neural networks handle contradictions and motivates the need for new approaches to knowledge updating that better mirror human cognitive mechanisms.

[NLP-80] FedP²EFT: Federated Learning to Personalize Parameter Efficient Fine-Tuning for Multilingual LLMs

【Quick read】: This paper addresses how to personalize parameter-efficient fine-tuning (PEFT) of multilingual large language models (LLMs) within federated learning (FL) to improve per-client performance. The key is FedP²EFT, which collaboratively learns the optimal personalized PEFT structure for each client through Bayesian sparse rank selection, avoiding overfitting in low-data regimes and largely outperforming existing personalized fine-tuning methods.

Link: https://arxiv.org/abs/2502.04387
Authors: Royson Lee, Minyoung Kim, Fady Rezk, Rui Li, Stylianos I. Venieris, Timothy Hospedales
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Federated learning (FL) has enabled the training of multilingual large language models (LLMs) on diverse and decentralized multilingual data, especially on low-resource languages. To improve client-specific performance, personalization via the use of parameter-efficient fine-tuning (PEFT) modules such as LoRA is common. This involves a personalization strategy (PS), such as the design of the PEFT adapter structures (e.g., in which layers to add LoRAs and what ranks) and choice of hyperparameters (e.g., learning rates) for fine-tuning. Instead of manual PS configuration, we propose FedP²EFT, a federated learning-to-personalize method for multilingual LLMs in cross-device FL settings. Unlike most existing PEFT structure selection methods, which are prone to overfitting low-data regimes, FedP²EFT collaboratively learns the optimal personalized PEFT structure for each client via Bayesian sparse rank selection. Evaluations on both simulated and real-world multilingual FL benchmarks demonstrate that FedP²EFT largely outperforms existing personalized fine-tuning methods, while complementing a range of existing FL methods.

[NLP-81] Enhancing Reasoning to Adapt Large Language Models for Domain-Specific Applications NEURIPS2024

【Quick read】: This paper addresses the limited adaptability of large language models (LLMs) to domain-specific applications, in particular the challenges of spatial reasoning and applying domain knowledge in tasks such as semiconductor layout design. The key solution is SOLOMON, a neuro-inspired LLM reasoning network architecture that uses Prompt Engineering and In-Context Learning to adapt general-purpose foundation models quickly to specialized tasks. Experiments show that SOLOMON instances significantly outperform their baseline LLMs and perform comparably to the state-of-the-art reasoning model o1-preview.

Link: https://arxiv.org/abs/2502.04384
Authors: Bo Wen, Xin Zhang
Affiliations: IBM T. J. Watson Research Center, Yorktown Heights, NY, USA; MIT-IBM Watson AI Lab, Cambridge, MA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: NeurIPS 2024 Workshop AFM (Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning)

Abstract:This paper presents SOLOMON, a novel Neuro-inspired Large Language Model (LLM) Reasoning Network architecture that enhances the adaptability of foundation models for domain-specific applications. Through a case study in semiconductor layout design, we demonstrate how SOLOMON enables swift adaptation of general-purpose LLMs to specialized tasks by leveraging Prompt Engineering and In-Context Learning techniques. Our experiments reveal the challenges LLMs face in spatial reasoning and applying domain knowledge to practical problems. Results show that SOLOMON instances significantly outperform their baseline LLM counterparts and achieve performance comparable to state-of-the-art reasoning model, o1-preview. We discuss future research directions for developing more adaptive AI systems that can continually learn, adapt, and evolve in response to new information and changing requirements.

[NLP-82] Sparse Autoencoders for Hypothesis Generation

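【Quick read】: This paper addresses how to hypothesize interpretable relationships between text data (such as headlines) and a target variable (such as clicks). The key solution is HypotheSAEs, which trains a sparse autoencoder to produce interpretable features of the data, selects the features that predict the target variable, and then uses a large language model (LLM) to generate a natural-language interpretation of each feature, yielding hypotheses about what predicts the target. The method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (about twice as many significant findings), while requiring 1-2 orders of magnitude less compute than recent LLM-based methods.

Link: https://arxiv.org/abs/2502.04382
Authors: Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, Emma Pierson
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: First two authors contributed equally; working paper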

Abstract:We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., “mentions being surprised or shocked”) using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
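
Step (2) of the pipeline, selecting features that predict the target, can be illustrated with a simple correlation ranking. The abstract does not say which selection criterion HypotheSAEs uses, so the absolute Pearson correlation below is an assumption made for this sketch, as are the names `pearson` and `select_features` and the layout of the activation matrix.

```python
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def select_features(
    activations: Sequence[Sequence[float]],
    target: Sequence[float],
    k: int,
) -> List[int]:
    """Return indices of the k features most correlated with the target.

    activations[i][j] is the activation of feature j on example i.
    Ties are broken by the lower feature index.
    """
    n_features = len(activations[0])
    scored = []
    for j in range(n_features):
        column = [row[j] for row in activations]
        scored.append((abs(pearson(column, target)), j))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scored[:k]]
```

Each selected index would then be passed to the interpretation step, where an LLM describes what the feature responds to; that step is outside the scope of this sketch.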

[NLP-83] Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning

【Quick read】: This paper examines the limitations of large language models (LLMs) in clinical reasoning, especially in open-ended clinical scenarios. To probe potential LLM failure modes in clinical problem-solving, the authors introduce the Medical Abstraction and Reasoning Corpus (M-ARC). M-ARC uses scenarios designed to exploit the Einstellung effect, the fixation of thought arising from prior experience, to expose the inductive bias of LLMs toward inflexible pattern matching from training data rather than flexible reasoning. Evaluated on M-ARC, LLMs perform poorly on tasks that require commonsense medical reasoning, are prone to hallucination, and are overconfident despite limited accuracy.

Link: https://arxiv.org/abs/2502.04381
Authors: Jonathan Kim, Anna Podlasek, Kie Shidara, Feng Liu, Ahmed Alaa, Danilo Bernardo
Affiliations: Stanford University; University of Dundee; University of California, San Francisco; Stevens Institute of Technology; University of California Berkeley
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 figures

Abstract:Large Language Models (LLMs) have attained human-level accuracy on medical question-answer (QA) benchmarks. However, their limitations in navigating open-ended clinical scenarios have recently been shown, raising concerns about the robustness and generalizability of LLM reasoning across diverse, real-world medical tasks. To probe potential LLM failure modes in clinical problem-solving, we present the medical abstraction and reasoning corpus (M-ARC). M-ARC assesses clinical reasoning through scenarios designed to exploit the Einstellung effect – the fixation of thought arising from prior experience, targeting LLM inductive biases toward inflexible pattern matching from their training data rather than engaging in flexible reasoning. We find that LLMs, including current state-of-the-art o1 and Gemini models, perform poorly compared to physicians on M-ARC, often demonstrating lack of commonsense medical reasoning and a propensity to hallucinate. In addition, uncertainty estimation analyses indicate that LLMs exhibit overconfidence in their answers, despite their limited accuracy. The failure modes revealed by M-ARC in LLM medical reasoning underscore the need to exercise caution when deploying these models in clinical settings.

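[NLP-84] Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data

【Quick read】: This paper addresses two problems in fine-tuning large language models (LLMs): existing methods struggle with data whose domain labels are missing, imprecise, or non-normalized, and data-selection methods have difficulty balancing multi-domain performance. The paper proposes a new method that gives the LLM a dual identity: an output model that cognitively probes and selects data based on a diversity reward, and an input model that is tuned with the selected data. The key is to construct contrastive data pools empirically and derive explanations for diversity theoretically, which notably improves performance on domain-undetermined data and on a series of foundational downstream tasks.

Link: https://arxiv.org/abs/2502.04380
Authors: Zhenqing Ling, Daoyuan Chen, Liuyi Yao, Yaliang Li, Ying Shen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 26 pages, 15 figures, 11 tables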

Abstract:Fine-tuning large language models (LLMs) using diverse datasets is crucial for enhancing their overall performance across various domains. In practical scenarios, existing methods based on modeling the mixture proportions of data composition often struggle with data whose domain labels are missing, imprecise or non-normalized, while methods based on data selection usually encounter difficulties in balancing multi-domain performance. To address these challenges, in this paper, we study the role of data diversity in enhancing the overall abilities of LLMs by empirically constructing contrastive data pools and theoretically deriving explanations for both inter- and intra-diversity. Building upon the insights gained, we propose a new method that gives the LLM a dual identity: an output model to cognitively probe and select data based on diversity reward, as well as an input model to be tuned with the selected data. Extensive experiments show that the proposed method notably boosts performance across domain-undetermined data and a series of foundational downstream tasks when applied to various advanced LLMs. We release our code and hope this study can shed light on the understanding of data diversity and advance feedback-driven data-model co-development for LLMs.

[NLP-85] Can Large Language Models Capture Video Game Engagement?

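【Quick read】: This paper investigates whether pretrained large language models (LLMs) can successfully detect human affect in videos. It comprehensively evaluates the ability of LLMs to annotate and predict continuous affect annotations from multimodal input, testing in particular whether they can identify changes of in-game engagement in 80 minutes of gameplay footage from 20 first-person shooter games in the GameVibe corpus. The key is a systematic analysis, across more than 2,400 experiments, of how LLM architecture, model size, input modality, prompting strategy, and ground-truth processing method affect engagement prediction. The results expose the limitations and potential strengths of LLMs in capturing continuous experience annotations provided by humans and point to directions for automated emotion labelling with LLMs.

Link: https://arxiv.org/abs/2502.04379
Authors: David Melhart, Matthew Barthet, Georgios N. Yannakakis
Affiliations: Institute of Digital Games, University of Malta
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: This work has been submitted to the IEEE for possible publication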

Abstract:Can out-of-the-box pretrained Large Language Models (LLMs) detect human affect successfully when observing a video? To address this question, for the first time, we evaluate comprehensively the capacity of popular LLMs to annotate and successfully predict continuous affect annotations of videos when prompted by a sequence of text and video frames in a multimodal fashion. Particularly in this paper, we test LLMs’ ability to correctly label changes of in-game engagement in 80 minutes of annotated videogame footage from 20 first-person shooter games of the GameVibe corpus. We run over 2,400 experiments to investigate the impact of LLM architecture, model size, input modality, prompting strategy, and ground truth processing method on engagement prediction. Our findings suggest that while LLMs rightfully claim human-like performance across multiple domains, they generally fall behind capturing continuous experience annotations provided by humans. We examine some of the underlying causes for the relatively poor overall performance, highlight the cases where LLMs exceed expectations, and draw a roadmap for the further exploration of automated emotion labelling via LLMs.

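[NLP-86] MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf

【Quick read】: This paper explores whether Large Language Models (LLMs) can act as effective delegates for participants in meetings. The authors build a prototype LLM-powered meeting delegate system and create a comprehensive benchmark from real meeting transcripts to evaluate it. Different LLMs strike different balances between active and cautious engagement strategies, and overall about 60% of responses address at least one key point from the ground truth. Remaining work includes reducing irrelevant or repetitive content and improving tolerance for the transcription errors common in real-world settings.

Link: https://arxiv.org/abs/2502.04376
Authors: Lingxiang Hu, Shurun Yuan, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
Affiliations: Northeastern University, China; Peking University, China; Microsoft
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: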

Abstract:In contemporary workplaces, meetings are essential for exchanging ideas and ensuring team alignment but often face challenges such as time consumption, scheduling conflicts, and inefficient participation. Recent advancements in Large Language Models (LLMs) have demonstrated their strong capabilities in natural language generation and reasoning, prompting the question: can LLMs effectively delegate participants in meetings? To explore this, we develop a prototype LLM-powered meeting delegate system and create a comprehensive benchmark using real meeting transcripts. Our evaluation reveals that GPT-4/4o maintain balanced performance between active and cautious engagement strategies. In contrast, Gemini 1.5 Pro tends to be more cautious, while Gemini 1.5 Flash and Llama3-8B/70B display more active tendencies. Overall, about 60% of responses address at least one key point from the ground-truth. However, improvements are needed to reduce irrelevant or repetitive content and enhance tolerance for transcription errors commonly found in real-world settings. Additionally, we implement the system in practical settings and collect real-world feedback from demos. Our findings underscore the potential and challenges of utilizing LLMs as meeting delegates, offering valuable insights into their practical application for alleviating the burden of meetings.

[NLP-87] An Analysis for Reasoning Bias of Language Models with Small Initialization

【Quick read】: This paper investigates how the parameter initialization scale affects the training behavior and task preferences of Large Language Models (LLMs). Smaller initialization scales push models to favor reasoning tasks, whereas larger scales lead to a preference for memorization tasks. The authors validate this reasoning bias on real datasets and carefully designed anchor functions, then analyze the initial training dynamics to show that the embedding space and self-attention mechanisms play pivotal roles in shaping these learning biases. A theoretical framework based on training dynamics explains the phenomena, and experiments on real-world language tasks corroborate it.

Link: https://arxiv.org/abs/2502.04375
Authors: Junjie Yao, Zhongwang Zhang, Zhi-Qin John Xu
Affiliations: Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University; School of Mathematical Sciences, Shanghai Jiao Tong University; School of Artificial Intelligence, Shanghai Jiao Tong University; Key Laboratory of Marine Intelligent Equipment and System, Ministry of Education, P.R. China; Center for LLM, Institute for Advanced Algorithms Research, Shanghai
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 30 pages, 14 figures

Abstract:Transformer-based Large Language Models (LLMs) have revolutionized Natural Language Processing by demonstrating exceptional performance across diverse tasks. This study investigates the impact of the parameter initialization scale on the training behavior and task preferences of LLMs. We discover that smaller initialization scales encourage models to favor reasoning tasks, whereas larger initialization scales lead to a preference for memorization tasks. We validate this reasoning bias via real datasets and meticulously designed anchor functions. Further analysis of initial training dynamics suggests that specific model components, particularly the embedding space and self-attention mechanisms, play pivotal roles in shaping these learning biases. We provide a theoretical framework from the perspective of model training dynamics to explain these phenomena. Additionally, experiments on real-world language tasks corroborate our theoretical insights. This work enhances our understanding of how initialization strategies influence LLM performance on reasoning tasks and offers valuable guidelines for training models.

[NLP-88] Mining Unstructured Medical Texts With Conformal Active Learning

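【Quick read】: This paper addresses the extraction of relevant data from Electronic Health Records (EHRs) to identify symptoms and automate epidemiological surveillance. It proposes a flexible and efficient framework for mining unstructured text that significantly reduces the need for extensive manual labeling. Experiments show strong performance with as few as 200 manually labeled texts, even on complex classification problems, and simple lightweight models achieve results competitive with resource-intensive deep learning models. The approach speeds up processing and preserves patient privacy, because the data can be processed on weaker on-site hardware.

Link: https://arxiv.org/abs/2502.04372
Authors: Juliano Genari, Guilherme Tegoni Goedert
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: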

Abstract:The extraction of relevant data from Electronic Health Records (EHRs) is crucial to identifying symptoms and automating epidemiological surveillance processes. By harnessing the vast amount of unstructured text in EHRs, we can detect patterns that indicate the onset of disease outbreaks, enabling faster, more targeted public health responses. Our proposed framework provides a flexible and efficient solution for mining data from unstructured texts, significantly reducing the need for extensive manual labeling by specialists. Experiments show that our framework achieves strong performance with as few as 200 manually labeled texts, even for complex classification problems. Additionally, our approach can function with simple lightweight models, achieving competitive and occasionally even better results compared to more resource-intensive deep learning models. This capability not only accelerates processing times but also preserves patient privacy, as the data can be processed on weaker on-site hardware rather than being transferred to external systems. Our methodology, therefore, offers a practical, scalable, and privacy-conscious approach to real-time epidemiological monitoring, equipping health institutions to respond rapidly and effectively to emerging health threats.

[NLP-89] PerPO: Perceptual Preference Optimization via Discriminative Rewarding

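【Quick read】: This paper targets the visual discrimination challenges of Multimodal Large Language Models (MLLMs). It proposes Perceptual Preference Optimization (PerPO), which uses discriminative rewarding to gather diverse negative samples and then applies listwise preference optimization to rank them. The key idea is to use the reward as a quantitative margin for ranking, which bridges generative preference optimization and discriminative empirical risk minimization. PerPO significantly improves the visual discrimination of MLLMs while maintaining their generative strengths and ensuring consistent performance across visual tasks.

Link: https://arxiv.org/abs/2502.04371
Authors: Zining Zhu, Liang Zhao, Kangheng Lin, Jinze Yang, En Yu, Chenglong Liu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: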

Abstract:This paper presents Perceptual Preference Optimization (PerPO), a perception alignment method aimed at addressing the visual discrimination challenges in generative pre-trained multimodal large language models (MLLMs). To align MLLMs with human visual perception process, PerPO employs discriminative rewarding to gather diverse negative samples, followed by listwise preference optimization to rank them. By utilizing the reward as a quantitative margin for ranking, our method effectively bridges generative preference optimization and discriminative empirical risk minimization. PerPO significantly enhances MLLMs’ visual discrimination capabilities while maintaining their generative strengths, mitigates image-unconditional reward hacking, and ensures consistent performance across visual tasks. This work marks a crucial step towards more perceptually aligned and versatile MLLMs. We also hope that PerPO will encourage the community to rethink MLLM alignment strategies.

[NLP-90] DreamDPO: Aligning Text-to-3D Generation with Human Preferences via Direct Preference Optimization

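【Quick read】: This paper addresses the difficulty existing text-to-3D generation methods have in aligning with human preferences, which limits their applicability and flexibility. It proposes DreamDPO, an optimization-based framework that integrates human preferences into the 3D generation process through direct preference optimization. By using pairwise comparisons to reflect preferences, DreamDPO reduces reliance on precise pointwise quality evaluations and enables fine-grained controllability through preference-guided optimization.

Link: https://arxiv.org/abs/2502.04370
Authors: Zhenglin Zhou, Xiaobo Xia, Fan Ma, Hehe Fan, Yi Yang, Tat-Seng Chua
Affiliations: Zhejiang University; National University of Singapore; Yale University
Categories: Computation and Language (cs.CL); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 20 pages, 12 figures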

Abstract:Text-to-3D generation automates 3D content creation from textual descriptions, which offers transformative potential across various fields. However, existing methods often struggle to align generated content with human preferences, limiting their applicability and flexibility. To address these limitations, in this paper, we propose DreamDPO, an optimization-based framework that integrates human preferences into the 3D generation process, through direct preference optimization. Practically, DreamDPO first constructs pairwise examples, then compares their alignment with human preferences using reward or large multimodal models, and lastly optimizes the 3D representation with a preference-driven loss function. By leveraging pairwise comparison to reflect preferences, DreamDPO reduces reliance on precise pointwise quality evaluations while enabling fine-grained controllability through preference-guided optimization. Experiments demonstrate that DreamDPO achieves competitive results, and provides higher-quality and more controllable 3D content compared to existing methods. The code and models will be open-sourced.

[NLP-91] Contrastive Token-level Explanations for Graph-based Rumour Detection

【Quick read】: This paper addresses the spread of rumours on social media, focusing on the lack of transparency in Graph Neural Network (GNN)-based rumour detection, which makes predictions difficult to interpret. It proposes Contrastive Token Layerwise Relevance Propagation (CT-LRP), a framework that improves the explainability of GNN-based rumour detection by providing fine-grained, interpretable token-level explanations.

Link: https://arxiv.org/abs/2502.04366
Authors: Daniel Wai Kit Chin, Roy Ka-Wei Lee
Affiliations: Singapore University of Technology and Design
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:The widespread use of social media has accelerated the dissemination of information, but it has also facilitated the spread of harmful rumours, which can disrupt economies, influence political outcomes, and exacerbate public health crises, such as the COVID-19 pandemic. While Graph Neural Network (GNN)-based approaches have shown significant promise in automated rumour detection, they often lack transparency, making their predictions difficult to interpret. Existing graph explainability techniques fall short in addressing the unique challenges posed by the dependencies among feature dimensions in high-dimensional text embeddings used in GNN-based models. In this paper, we introduce Contrastive Token Layerwise Relevance Propagation (CT-LRP), a novel framework designed to enhance the explainability of GNN-based rumour detection. CT-LRP extends current graph explainability methods by providing token-level explanations that offer greater granularity and interpretability. We evaluate the effectiveness of CT-LRP across multiple GNN models trained on three publicly available rumour detection datasets, demonstrating that it consistently produces high-fidelity, meaningful explanations, paving the way for more robust and trustworthy rumour detection systems.

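[NLP-92] LLMs can be easily Confused by Instructional Distractions

【Quick read】: This paper studies how large language models (LLMs) behave under instructional distraction: confusion that arises when the input text itself resembles an instruction, even when the prompt explicitly distinguishes the task instruction from the input. It introduces DIM-Bench, a benchmark that categorizes real-world instances of instructional distraction and evaluates LLMs across four instruction tasks (rewriting, proofreading, translation, and style transfer) and five input tasks (reasoning, code generation, mathematical reasoning, bias detection, and question answering). Experiments show that even the most advanced LLMs are susceptible to instructional distraction and often fail to follow user intent accurately.

Link: https://arxiv.org/abs/2502.04362
Authors: Yerin Hwang, Yongil Kim, Jahyun Koo, Taegwan Kang, Hyunkyung Bae, Kyomin Jung
Affiliations: IPAI, Seoul National University; LG AI Research; Dept. of ECE, Seoul National University; SNU-LG AI Research Center
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages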

Abstract:Despite the fact that large language models (LLMs) show exceptional skill in instruction following tasks, this strength can turn into a vulnerability when the models are required to disregard certain instructions. Instruction-following tasks typically involve a clear task description and input text containing the target data to be processed. However, when the input itself resembles an instruction, confusion may arise, even if there is explicit prompting to distinguish between the task instruction and the input. We refer to this phenomenon as instructional distraction. In this paper, we introduce a novel benchmark, named DIM-Bench, specifically designed to assess LLMs’ performance under instructional distraction. The benchmark categorizes real-world instances of instructional distraction and evaluates LLMs across four instruction tasks: rewriting, proofreading, translation, and style transfer – alongside five input tasks: reasoning, code generation, mathematical reasoning, bias detection, and question answering. Our experimental results reveal that even the most advanced LLMs are susceptible to instructional distraction, often failing to accurately follow user intent in such cases.

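[NLP-93] MARAGE: Transferable Multi-Model Adversarial Attack for Retrieval-Augmented Generation Data Extraction

【Quick read】: This paper addresses the risk of extraction attacks against Retrieval-Augmented Generation (RAG) systems whose external data stores are built from private resources. Existing extraction attacks often rely on manually crafted prompts, which limits their effectiveness. The proposed MARAGE framework optimizes an adversarial string that, when appended to user queries, causes the target RAG system to output the retrieved data verbatim. It uses a continuous optimization scheme that integrates gradients from multiple models with different architectures to improve transferability to unseen models, and a strategy that emphasizes the initial tokens of the target RAG data to improve generalizability. MARAGE consistently outperforms manual and optimization-based baselines across multiple LLMs and RAG datasets while remaining transferable to previously unseen models.

Link: https://arxiv.org/abs/2502.04360
Authors: Xiao Hu, Eric Liu, Weizhou Wang, Xiangyu Guo, David Lie
Affiliations: University of Toronto
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: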

Abstract:Retrieval-Augmented Generation (RAG) offers a solution to mitigate hallucinations in Large Language Models (LLMs) by grounding their outputs to knowledge retrieved from external sources. The use of private resources and data in constructing these external data stores can expose them to risks of extraction attacks, in which attackers attempt to steal data from these private databases. Existing RAG extraction attacks often rely on manually crafted prompts, which limit their effectiveness. In this paper, we introduce a framework called MARAGE for optimizing an adversarial string that, when appended to user queries submitted to a target RAG system, causes outputs containing the retrieved RAG data verbatim. MARAGE leverages a continuous optimization scheme that integrates gradients from multiple models with different architectures simultaneously to enhance the transferability of the optimized string to unseen models. Additionally, we propose a strategy that emphasizes the initial tokens in the target RAG data, further improving the attack’s generalizability. Evaluations show that MARAGE consistently outperforms both manual and optimization-based baselines across multiple LLMs and RAG datasets, while maintaining robust transferability to previously unseen models. Moreover, we conduct probing tasks to shed light on the reasons why MARAGE is more effective compared to the baselines and to analyze the impact of our approach on the model’s internal state.

[NLP-94] Exploring Spatial Language Grounding Through Referring Expressions

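【Quick read】: This paper addresses the difficulty Vision-language models (VLMs) have with spatial reasoning. The authors propose the Referring Expression Comprehension task as a platform for evaluating the spatial reasoning of VLMs. The task allows a deeper analysis of strengths and weaknesses in three situations: ambiguity in object detection, complex spatial expressions, and expressions with negation. The results show that relative model behavior depends on the underlying model and on the category of spatial semantics (topological, directional, proximal, etc.).

Link: https://arxiv.org/abs/2502.04359
Authors: Akshar Tumu, Parisa Kordjamshidi
Affiliations: University of California San Diego; Michigan State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: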

Abstract:Spatial Reasoning is an important component of human cognition and is an area in which the latest Vision-language models (VLMs) show signs of difficulty. The current analysis works use image captioning tasks and visual question answering. In this work, we propose using the Referring Expression Comprehension task instead as a platform for the evaluation of spatial reasoning by VLMs. This platform provides the opportunity for a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation (‘not’). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at hand, the relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.

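[NLP-95] Position: Scaling LLM Agents Requires Asymptotic Analysis with LLM Primitives

【Quick read】: This position paper asks how close to optimal the role-based decompositions used in systems of Large Language Model (LLM) agents are when hard problems are split into subproblems. It argues for asymptotic analysis to reason about the efficiency of such decomposed systems: treating the LLM forward pass as the atomic unit of computational cost separates the inner workings of a particular LLM from the inherent efficiency of how a set of LLMs is orchestrated to solve a problem. In short, instead of anthropomorphizing LLMs, asymptotic analysis with LLM primitives should be used to reason about and develop more effective decompositions of large problems.

Link: https://arxiv.org/abs/2502.04358
Authors: Elliot Meyerson, Xin Qiu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 12 pages including references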

Abstract:Decomposing hard problems into subproblems often makes them easier and more efficient to solve. With large language models (LLMs) crossing critical reliability thresholds for a growing slate of capabilities, there is an increasing effort to decompose systems into sets of LLM-based agents, each of whom can be delegated sub-tasks. However, this decomposition (even when automated) is often intuitive, e.g., based on how a human might assign roles to members of a human team. How close are these role decompositions to optimal? This position paper argues that asymptotic analysis with LLM primitives is needed to reason about the efficiency of such decomposed systems, and that insights from such analysis will unlock opportunities for scaling them. By treating the LLM forward pass as the atomic unit of computational cost, one can separate out the (often opaque) inner workings of a particular LLM from the inherent efficiency of how a set of LLMs are orchestrated to solve hard problems. In other words, if we want to scale the deployment of LLMs to the limit, instead of anthropomorphizing LLMs, asymptotic analysis with LLM primitives should be used to reason about and develop more powerful decompositions of large problems into LLM agents.
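
The accounting idea in this abstract can be illustrated with a toy cost model. The sketch below assumes a hypothetical task in which n partial results must be merged and every merge costs exactly one LLM call (a coarser unit than the single forward pass the abstract names). The function names and the merge task are illustrative and do not come from the paper.

```python
def chain_cost(n_items):
    """Sequential agent: fold items one at a time into a running result."""
    calls = max(n_items - 1, 0)
    return calls, calls            # (total calls, longest dependency chain)


def tree_cost(n_items):
    """Tree of agents: merge items pairwise, level by level."""
    calls, depth, remaining = 0, 0, n_items
    while remaining > 1:
        merges = remaining // 2    # merges within one level can run in parallel
        calls += merges
        remaining -= merges
        depth += 1
    return calls, depth


for n in (8, 1024):
    print(n, chain_cost(n), tree_cost(n))
```

Under these assumptions both layouts spend n - 1 calls in total, but the longest chain of dependent calls grows linearly for the sequential layout and logarithmically for the tree, which is the kind of distinction an asymptotic analysis makes explicit.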

[NLP-96] Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs

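【Quick read】: This paper addresses the challenges of applying Reinforcement Learning from Human Feedback (RLHF) to large language models (LLMs) in broader domains such as chatbots and content generation. For reward models these challenges include computationally heavy training, costly evaluation, and the resulting poor reproducibility. The authors advocate embedding-based input for reward model research, which improves reproducibility, reduces computational demands on hardware, improves training stability, and significantly cuts training and evaluation costs, enabling fair and efficient comparisons in this research area.

Link: https://arxiv.org/abs/2502.04357
Authors: Hao Sun, Yunyi Shen, Jean-Francois Ton, Mihaela van der Schaar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: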

Abstract:Large Language Models (LLMs) have made substantial strides in structured tasks through Reinforcement Learning (RL), demonstrating proficiency in mathematical reasoning and code generation. However, applying RL in broader domains like chatbots and content generation – through the process known as Reinforcement Learning from Human Feedback (RLHF) – presents unique challenges. Reward models in RLHF are critical, acting as proxies that evaluate the alignment of LLM outputs with human intent. Despite advancements, the development of reward models is hindered by challenges such as computationally heavy training, costly evaluation, and therefore poor reproducibility. We advocate for using embedding-based input in reward model research as an accelerated solution to those challenges. By leveraging embeddings for reward modeling, we can enhance reproducibility, reduce computational demands on hardware, improve training stability, and significantly reduce training and evaluation costs, hence facilitating fair and efficient comparisons in this active research area. We then show a case study of reproducing existing reward model ensemble research using embedding-based reward models. We discuss future avenues for research, aiming to contribute to safer and more effective LLM deployments.
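
To make the embedding-based setup concrete, here is a minimal sketch of a reward head trained on precomputed embeddings with a Bradley-Terry pairwise loss on a CPU. It is an illustration under stated assumptions, not the authors' code: the embeddings are synthetic stand-ins for cached LLM embeddings, the reward is a plain linear function, and `train_reward_head` is a hypothetical helper name.

```python
import numpy as np


def train_reward_head(emb_chosen, emb_rejected, lr=0.5, steps=500):
    """Fit a linear reward r(x) = w @ x by gradient descent on the
    Bradley-Terry loss -log sigmoid(r(chosen) - r(rejected))."""
    diff = emb_chosen - emb_rejected              # (n_pairs, dim)
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))     # P(chosen is preferred)
        grad = -((1.0 - p)[:, None] * diff).mean(axis=0)
        w -= lr * grad
    return w


# Synthetic stand-in for cached embeddings (assumption: real experiments
# would load embeddings computed once by a frozen LLM).
rng = np.random.default_rng(0)
dim, n_pairs = 16, 400
true_w = rng.normal(size=dim)
a = rng.normal(size=(n_pairs, dim))
b = rng.normal(size=(n_pairs, dim))
prefer_a = (a @ true_w) > (b @ true_w)
chosen = np.where(prefer_a[:, None], a, b)
rejected = np.where(prefer_a[:, None], b, a)

w = train_reward_head(chosen, rejected)
accuracy = float(((chosen - rejected) @ w > 0).mean())
print(f"pairwise training accuracy: {accuracy:.2f}")
```

Because the expensive forward passes happen once when the embeddings are cached, each training run in a setup like this only touches small arrays, which is what makes repeated, reproducible comparisons cheap.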

[NLP-97] Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription

【Quick read】: This paper examines the prospects of open-source Large Language Models (LLMs) and AI foundation models (AIFMs) for developing healthcare applications. It presents a comprehensive survey of state-of-the-art open-source healthcare LLMs and AIFMs and introduces a taxonomy that categorizes their utility across healthcare tasks. A case study on personalized prescriptions compares open-source and proprietary models with and without Retrieval-Augmented Generation (RAG). The findings suggest that open LLMs, although less refined, can achieve performance comparable to proprietary models when paired with grounding techniques such as RAG.

Link: https://arxiv.org/abs/2502.04356
Authors: Mahdi Alkaeed, Sofiat Abioye, Adnan Qayyum, Yosra Magdi Mekki, Ilhem Berrou, Mohamad Abdallah, Ala Al-Fuqaha, Muhammad Bilal, Junaid Qadir
Affiliations: Department of Computer Science and Engineering, College of Engineering, Qatar University; Birmingham City University; College of Science and Engineering, Hamad Bin Khalifa University; College of Medicine, Qatar University; University of the West of England (UWE)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In response to the success of proprietary Large Language Models (LLMs) such as OpenAI’s GPT-4, there is a growing interest in developing open, non-proprietary LLMs and AI foundation models (AIFMs) for transparent use in academic, scientific, and non-commercial applications. Despite their inability to match the refined functionalities of their proprietary counterparts, open models hold immense potential to revolutionize healthcare applications. In this paper, we examine the prospects of open-source LLMs and AIFMs for developing healthcare applications and make two key contributions. Firstly, we present a comprehensive survey of the current state-of-the-art open-source healthcare LLMs and AIFMs and introduce a taxonomy of these open AIFMs, categorizing their utility across various healthcare tasks. Secondly, to evaluate the general-purpose applications of open LLMs in healthcare, we present a case study on personalized prescriptions. This task is particularly significant due to its critical role in delivering tailored, patient-specific medications that can greatly improve treatment outcomes. In addition, we compare the performance of open-source models with proprietary models in settings with and without Retrieval-Augmented Generation (RAG). Our findings suggest that, although less refined, open LLMs can achieve performance comparable to proprietary models when paired with grounding techniques such as RAG. Furthermore, to highlight the clinical significance of LLMs-empowered personalized prescriptions, we perform subjective assessment through an expert clinician. We also elaborate on ethical considerations and potential risks associated with the misuse of powerful LLMs and AIFMs, highlighting the need for a cautious and responsible implementation in healthcare.

[NLP-98] LLM-ProS: Analyzing Large Language Models Performance in Competitive Problem Solving

【Quick read】: This paper evaluates state-of-the-art Large Language Models (LLMs) on International Collegiate Programming Contest (ICPC) problems. It introduces an evaluation technique, LLM-ProS, and benchmarks the models on a curated dataset of 166 World Finals problems from 2011 to 2024. Five models (GPT-4o, Mistral Large, Llama-3.1-405B, and the o1 family, consisting of o1-mini and o1-preview) are assessed on correctness, resource utilization, and response calibration to measure reasoning, accuracy, and efficiency. The study also investigates the impact of training methodologies, dataset contamination, and chain-of-thought reasoning on model performance, offering insights for optimizing LLMs on algorithmic tasks.

Link: https://arxiv.org/abs/2502.04355
Authors: Md Sifat Hossain, Anika Tabassum, Md. Fahim Arefin, Tarannum Shaila Zaman
Affiliations: Department of Computer Science and Engineering, University of Dhaka; Department of Information Systems, University of Maryland, Baltimore County
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be published in LLM4Code 2025 workshop proceedings

Abstract:The rapid advancement of large language models has opened new avenues for automating complex problem-solving tasks such as algorithmic coding and competitive programming. This paper introduces a novel evaluation technique, LLM-ProS, to assess the performance of state-of-the-art LLMs on International Collegiate Programming Contest (ICPC) problems. Using a curated dataset of 166 World Finals problems from 2011 to 2024, we benchmark the models’ reasoning, accuracy, and efficiency. We evaluate the five models-GPT-4o, Mistral Large, Llama-3.1-405B, and the o1 family, consisting of o1-mini and o1-preview, across critical metrics like correctness, resource utilization, and response calibration. Our results reveal significant differences in the models’ abilities to generalize, adapt, and solve novel problems. We also investigated the impact of training methodologies, dataset contamination, and chain-of-thought reasoning on model performance. The findings provide new insights into optimizing LLMs for algorithmic tasks, highlighting both strengths and limitations of current models.

[NLP-99] Reviving The Classics: Active Reward Modeling in Large Language Model Alignment

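【Quick read】: This paper addresses how to select the most informative pairs to annotate when human preference labels are scarce and costly, in the context of reinforcement learning from human feedback (RLHF) and large language model alignment. It proposes Fisher information-based selection strategies that adapt classical experimental design theory and apply it to the final linear layer of deep neural network reward models. The selected comparisons balance exploration of the representation space with informative comparisons between pairs that have moderate reward differences, improving labeling efficiency and model performance.

Link: https://arxiv.org/abs/2502.04354
Authors: Yunyi Shen, Hao Sun, Jean-François Ton
Affiliations: Massachusetts Institute of Technology; University of Cambridge; ByteDance Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: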

Abstract:Building neural reward models from human preferences is a pivotal component in reinforcement learning from human feedback (RLHF) and large language model alignment research. Given the scarcity and high cost of human annotation, how to select the most informative pairs to annotate is an essential yet challenging open problem. In this work, we highlight the insight that an ideal comparison dataset for reward modeling should balance exploration of the representation space and make informative comparisons between pairs with moderate reward differences. Technically, challenges arise in quantifying the two objectives and efficiently prioritizing the comparisons to be annotated. To address this, we propose the Fisher information-based selection strategies, adapt theories from the classical experimental design literature, and apply them to the final linear layer of the deep neural network-based reward modeling tasks. Empirically, our method demonstrates remarkable performance, high computational efficiency, and stability compared to other selection methods from deep learning and classical statistical literature across multiple open-source LLMs and datasets. Further ablation studies reveal that incorporating cross-prompt comparisons in active reward modeling significantly enhances labeling efficiency, shedding light on the potential for improved annotation strategies in RLHF.
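
As a rough illustration of Fisher information-based selection on a final linear layer, the sketch below greedily picks comparison pairs that maximise the log-determinant of the Bradley-Terry Fisher information (a classical D-optimal design criterion). This is an assumption-laden example rather than the paper's exact strategy: the features are random, `greedy_d_optimal` is a hypothetical helper, and the ridge term is added only to keep the matrix invertible.

```python
import numpy as np


def greedy_d_optimal(diffs, w, k, ridge=1e-3):
    """Greedily choose k pairs maximising log det of the Fisher information.

    diffs: (n, d) array; each row is phi(response_a) - phi(response_b)
    w:     (d,) current estimate of the last-layer reward weights
    """
    p = 1.0 / (1.0 + np.exp(-(diffs @ w)))
    weights = p * (1.0 - p)            # small when the outcome is near certain
    info = ridge * np.eye(diffs.shape[1])
    available = np.ones(len(diffs), dtype=bool)
    chosen = []
    for _ in range(k):
        inv = np.linalg.inv(info)
        # Matrix determinant lemma: adding weight * x x^T multiplies
        # det(info) by 1 + weight * x^T inv x.
        quad = np.einsum("ij,jk,ik->i", diffs, inv, diffs)
        gain = np.where(available, np.log1p(weights * quad), -np.inf)
        best = int(np.argmax(gain))
        chosen.append(best)
        available[best] = False
        info += weights[best] * np.outer(diffs[best], diffs[best])
    return chosen, info


rng = np.random.default_rng(0)
diffs = rng.normal(size=(200, 8))      # hypothetical last-layer feature differences
w = 0.5 * rng.normal(size=8)
chosen, info = greedy_d_optimal(diffs, w, k=10)
print("pairs to annotate:", chosen)
```

The weight p(1 - p) shrinks for pairs whose outcome the current model already predicts with confidence, while the quadratic form rewards directions of the feature space that are not yet covered, so the criterion trades off the two objectives the abstract describes.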

[NLP-100] CognArtive: Large Language Models for Automating Art Analysis and Decoding Aesthetic Elements

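【Quick read】: This paper addresses how Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) can be used to automatically analyze the technical and expressive features of artworks and to rapidly examine how patterns across large numbers of artworks evolve over time. The authors automate a formal art analysis framework that processes large volumes of art data and reveals emerging patterns in visual elements, composition, and techniques across periods.

Link: https://arxiv.org/abs/2502.04353
Authors: Afshin Khadangi, Amir Sartipi, Igor Tchappi, Gilbert Fridgen
Affiliations: SnT, University of Luxembourg
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: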

Abstract:Art, as a universal language, can be interpreted in diverse ways, with artworks embodying profound meanings and nuances. The advent of Large Language Models (LLMs) and the availability of Multimodal Large Language Models (MLLMs) raise the question of how these transformative models can be used to assess and interpret the artistic elements of artworks. While research has been conducted in this domain, to the best of our knowledge, a deep and detailed understanding of the technical and expressive features of artworks using LLMs has not been explored. In this study, we investigate the automation of a formal art analysis framework to analyze a high-throughput number of artworks rapidly and examine how their patterns evolve over time. We explore how LLMs can decode artistic expressions, visual elements, composition, and techniques, revealing emerging patterns that develop across periods. Finally, we discuss the strengths and limitations of LLMs in this context, emphasizing their ability to process vast quantities of art-related data and generate insightful interpretations. Due to the exhaustive and granular nature of the results, we have developed interactive data visualizations, available online, to enhance understanding and accessibility.

[NLP-101] Investigating the Robustness of Deductive Reasoning with Large Language Models

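【Quick read】: This paper addresses the robustness of Large Language Models (LLMs) on logical deduction tasks, in both informal and autoformalisation methods. It devises a framework with two families of perturbations, adversarial noise and counterfactual statements, which together generate seven perturbed datasets. LLM reasoners are organized by reasoning format, formalisation syntax, and feedback for error recovery, and the impact of these design components is analyzed. Adversarial noise mainly affects autoformalisation, whereas counterfactual statements influence all approaches. Detailed feedback reduces syntax errors but does not improve overall accuracy, which points to the difficulty LLM-based methods have in self-correcting.

Link: https://arxiv.org/abs/2502.04352
Authors: Fabian Hoppe, Filip Ilievski, Jan-Christoph Kalo
Affiliations: Vrije Universiteit Amsterdam; Universiteit van Amsterdam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: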

Abstract:Large Language Models (LLMs) have been shown to achieve impressive results for many reasoning-based Natural Language Processing (NLP) tasks, suggesting a degree of deductive reasoning capability. However, it remains unclear to which extent LLMs, in both informal and autoformalisation methods, are robust on logical deduction tasks. Moreover, while many LLM-based deduction methods have been proposed, there is a lack of a systematic study that analyses the impact of their design components. Addressing these two challenges, we propose the first study of the robustness of LLM-based deductive reasoning methods. We devise a framework with two families of perturbations: adversarial noise and counterfactual statements, which jointly generate seven perturbed datasets. We organize the landscape of LLM reasoners according to their reasoning format, formalisation syntax, and feedback for error recovery. The results show that adversarial noise affects autoformalisation, while counterfactual statements influence all approaches. Detailed feedback does not improve overall accuracy despite reducing syntax errors, pointing to the challenge of LLM-based methods to self-correct effectively.

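[NLP-102] NER4all or Context is All You Need: Using LLMs for low-effort high-performance NER on historical texts. A humanities informed approach

【Quick read】: This paper addresses named entity recognition (NER) of people, places, events and the like in historical research. Because of the high linguistic and genre diversity of sources, the limited canonisation of spellings, the historical domain knowledge required, and the scarcity of annotated training data, established natural language processing (NLP) approaches have been expensive and have yielded unsatisfactory recall and precision. The paper shows that readily available, state-of-the-art Large Language Models (LLMs) outperform two leading NLP frameworks, spaCy and flair, on historical documents by 7 to 22 percent in F1 score, and that providing historical context and some persona modelling is core to a successful prompting strategy. Increasing the number of few-shot examples does not improve recall or precision below a threshold of 16-shot. Relying on natural language prompts and consumer-grade tools and frontends lowers the barrier for historians to use NER.

Link: https://arxiv.org/abs/2502.04351
Authors: Torsten Hiltmann, Martin Dröge, Nicole Dresselhaus, Till Grallert, Melanie Althage, Paul Bayer, Sophie Eckenstaler, Koray Mendi, Jascha Marijn Schmitz, Philipp Schneider, Wiebke Sczeponik, Anica Skibba
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: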

Abstract:Named entity recognition (NER) is a core task for historical research in automatically establishing all references to people, places, events and the like. Yet, due to the high linguistic and genre diversity of sources, only limited canonisation of spellings, the level of required historical domain knowledge, and the scarcity of annotated training data, established approaches to natural language processing (NLP) have been both extremely expensive and yielded only unsatisfactory results in terms of recall and precision. Our paper introduces a new approach. We demonstrate how readily-available, state-of-the-art LLMs significantly outperform two leading NLP frameworks, spaCy and flair, for NER in historical documents by seven to twenty-two percent higher F1-Scores. Our ablation study shows how providing historical context to the task and a bit of persona modelling that turns focus away from a purely linguistic approach are core to a successful prompting strategy. We also demonstrate that, contrary to our expectations, providing increasing numbers of examples in few-shot approaches does not improve recall or precision below a threshold of 16-shot. In consequence, our approach democratises access to NER for all historians by removing the barrier of scripting languages and computational skills required for established NLP tools and instead leveraging natural language prompts and consumer-grade tools and frontends.
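
The prompting strategy described in the abstract (persona, historical context, a few examples) can be pictured as plain string assembly. The following is a minimal sketch only: the function name `build_ner_prompt`, the persona wording, the `<PER>`/`<LOC>` tag format and the sample sentences are all illustrative assumptions, not the authors' prompts.

```python
def build_ner_prompt(text, context, examples, persona="a historian specialising in the period"):
    """Assemble a few-shot NER prompt with persona and historical context (hypothetical format)."""
    lines = [
        f"You are {persona}.",
        f"Historical context: {context}",
        "Mark every person as <PER>...</PER> and every place as <LOC>...</LOC>.",
        "",
    ]
    # Each few-shot example is a (source sentence, annotated sentence) pair.
    for source, annotated in examples:
        lines += [f"Text: {source}", f"Annotated: {annotated}", ""]
    # The text to annotate comes last, with an open slot for the model's answer.
    lines += [f"Text: {text}", "Annotated:"]
    return "\n".join(lines)


# Made-up example data, for illustration only.
examples = [
    ("Goethe reiste nach Weimar.", "<PER>Goethe</PER> reiste nach <LOC>Weimar</LOC>."),
]
prompt = build_ner_prompt(
    text="Schiller zog nach Jena.",
    context="Letters from the German states around 1800 (made-up example)",
    examples=examples,
)
print(prompt)
```

The resulting string can be pasted into any chat frontend, which is the sense in which the approach needs no scripting on the historian's side.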

[NLP-103] CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance

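【Quick read】: This paper addresses the failure of existing methods to steer large language models (LLMs) effectively between textual reasoning and code generation, which leaves their symbolic computing capabilities underused. The key to the solution is CodeSteer, a method for guiding LLM code/text generation. The authors build SymBench, a benchmark of 37 symbolic tasks with adjustable complexity, synthesise datasets of 12k multi-round guidance/generation trajectories and 5.5k guidance comparison pairs, and fine-tune Llama-3-8B with a newly designed multi-round supervised fine-tuning (SFT) and direct preference optimization (DPO) procedure. The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, guides the code/text generation of larger models.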

Link: https://arxiv.org/abs/2502.04350
Authors: Yongchao Chen,Yilun Hao,Yueying Liu,Yang Zhang,Chuchu Fan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC); Software Engineering (cs.SE)
Comments: 27 pages, 12 figures

Abstract:Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark SymBench comprising 37 symbolic tasks with adjustable complexity and also synthesize datasets of 12k multi-round guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with a newly designed multi-round supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the existing best LLM OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8) across all 37 tasks (28 seen, 9 unseen). Trained for GPT-4o, CodeSteer demonstrates superior generalizability, providing an average 41.8 performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks. Models, Datasets, and Codes are available at this https URL.

[NLP-104] Dynamic benchmarking framework for LLM-based conversational data capture

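【Quick read】: This paper addresses the problem that existing evaluation frameworks for conversational agents driven by large language models (LLMs) focus on single tasks and fail to capture the dynamics of multi-turn dialogue. The key to the solution is a dynamic benchmarking framework that assesses agents through interactions with synthetic users, using generative agent simulation to evaluate three dimensions: information extraction, context awareness, and adaptive engagement. The framework offers a scalable, automated and flexible evaluation approach.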

Link: https://arxiv.org/abs/2502.04349
Authors: Pietro Alessandro Aluffi,Patrick Zietkiewicz,Marya Bazzi,Matt Arderne,Vladimirs Murevics
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid evolution of large language models (LLMs) has transformed conversational agents, enabling complex human-machine interactions. However, evaluation frameworks often focus on single tasks, failing to capture the dynamic nature of multi-turn dialogues. This paper introduces a dynamic benchmarking framework to assess LLM-based conversational agents through interactions with synthetic users. The framework integrates generative agent simulation to evaluate performance on key dimensions: information extraction, context awareness, and adaptive engagement. By simulating various aspects of user behavior, our work provides a scalable, automated, and flexible benchmarking approach. Experimental evaluation - within a loan application use case - demonstrates the framework’s effectiveness under one-shot and few-shot extraction conditions. Results show that adaptive strategies improve data extraction accuracy, especially when handling ambiguous responses. Future work will extend its applicability to broader domains and incorporate additional metrics (e.g., conversational coherence, user engagement). This study contributes a structured, scalable approach to evaluating LLM-based conversational agents, facilitating real-world deployment.

[NLP-105] Prompt-based Depth Pruning of Large Language Models

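【Quick read】: This paper addresses how depth pruning can reduce the inference cost of a large language model while preserving task performance, given that the importance of a transformer block can be highly task-dependent. The key is a dynamic depth pruning algorithm, PuDDing (Prompt-routed Dynamic Depth Pruning), which trains a lightweight router to predict, from the input prompt, the best omission set among a set of options constructed in a data-driven manner, and thereby decides which transformer blocks to skip. On commonsense reasoning benchmarks this yields better on-task performance than static depth pruning baselines.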

Link: https://arxiv.org/abs/2502.04348
Authors: Juyun Wee,Minjae Park,Jaeho Lee
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures

Abstract:Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent – a block that is crucial for a task can be removed without degrading the accuracy on another task. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models, and achieves better on-task performance than static depth pruning baselines.
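
To make the routing idea concrete, here is a toy sketch of prompt-routed block skipping. Everything in it is an assumption for illustration: the "blocks" are stand-in functions rather than transformer layers, the omission sets are hard-coded rather than built from data, and the keyword-scoring `route` function replaces the trained lightweight router the abstract describes.

```python
# Toy "transformer": each block is a function applied in sequence.
BLOCKS = [lambda x, k=k: x + k for k in range(1, 7)]  # six stand-in blocks

# Candidate omission sets (block indices to skip). The paper builds these in a
# data-driven way; here they are hard-coded assumptions.
OMISSION_SETS = [frozenset({4, 5}), frozenset({1, 2}), frozenset({2, 4})]

# Hypothetical router weights: one keyword-score table per omission set.
ROUTER_WEIGHTS = [
    {"why": 1.0, "because": 0.5},
    {"sum": 1.0, "number": 0.5},
    {},
]


def route(prompt):
    """Return the index of the omission set with the highest router score."""
    tokens = prompt.lower().split()
    scores = [sum(w.get(t, 0.0) for t in tokens) for w in ROUTER_WEIGHTS]
    return max(range(len(scores)), key=scores.__getitem__)


def forward(x, prompt):
    """Run the toy model, skipping the blocks chosen for this prompt."""
    skip = OMISSION_SETS[route(prompt)]
    for i, block in enumerate(BLOCKS):
        if i not in skip:
            x = block(x)
    return x
```

The point of the sketch is the control flow: the omission set is chosen once per prompt, before the forward pass, so the skipped blocks cost nothing at inference time.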

[NLP-106] SCALM: Detecting Bad Practices in Smart Contracts Through LLMs

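【Quick read】: This paper addresses bad writing practices in Ethereum smart contracts, which may not lead directly to security vulnerabilities but do raise the risk of problems. The key to the solution is SCALM, a framework based on large language models (LLMs) that combines Step-Back Prompting with Retrieval-Augmented Generation (RAG) to identify and address a range of bad practices. Experiments show that SCALM outperforms existing tools at detecting bad practices in smart contracts.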

Link: https://arxiv.org/abs/2502.04347
Authors: Zongwei Li,Xiaoqi Li,Wenkai Li,Xin Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages

Abstract:As the Ethereum platform continues to mature and gain widespread usage, it is crucial to maintain high standards of smart contract writing practices. While bad practices in smart contracts may not directly lead to security issues, they do elevate the risk of encountering problems. Therefore, to understand and avoid these bad practices, this paper introduces the first systematic study of bad practices in smart contracts, delving into over 35 specific issues. Specifically, we propose a large language model (LLM)-based framework, SCALM. It combines Step-Back Prompting and Retrieval-Augmented Generation (RAG) to identify and address various bad practices effectively. Our extensive experiments using multiple LLMs and datasets have shown that SCALM outperforms existing tools in detecting bad practices in smart contracts.
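
A rough sketch of how Step-Back Prompting and retrieval might be combined in a single prompt follows. It is not SCALM's implementation: the three-entry `KNOWLEDGE_BASE`, the keyword-overlap `retrieve` function (a stand-in for a real embedding-based retriever) and the prompt wording are all assumptions made for illustration.

```python
# Hypothetical knowledge base of bad-practice descriptions.
KNOWLEDGE_BASE = {
    "tx.origin authorization": "Using tx.origin for access control instead of msg.sender.",
    "unchecked call": "Ignoring the boolean returned by a low-level call.",
    "floating pragma": "Declaring an unpinned compiler version such as ^0.8.0.",
}


def retrieve(code, top_k=2):
    """Rank entries by how many of their key terms occur in the contract source."""
    def score(key):
        return sum(term in code for term in key.split())
    ranked = sorted(KNOWLEDGE_BASE, key=score, reverse=True)
    # Keep only entries that matched at least one term.
    return [k for k in ranked[:top_k] if score(k) > 0]


def build_prompt(code):
    """Step-back question first, then retrieved context, then the concrete task."""
    context = "\n".join(f"- {k}: {KNOWLEDGE_BASE[k]}" for k in retrieve(code))
    return (
        "Step back: what general principles make a smart contract maintainable and safe?\n"
        f"Relevant known bad practices:\n{context}\n"
        f"Now review this contract and list the bad practices it contains:\n{code}"
    )


contract = "pragma solidity ^0.8.0; contract C { function f() public { require(tx.origin == owner); } }"
prompt = build_prompt(contract)
```

The step-back question asks the model for general principles before the specific review, and the retrieved entries ground that review in known issue descriptions.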

[NLP-107] Multi-Lingual Cyber Threat Detection in Tweets/X Using ML, DL, and LLM: A Comparative Analysis

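【Quick read】: This paper addresses multilingual cyber threat detection on Twitter/X, in particular fake information and harmful content disguised within tweets. The approach applies a range of models in three stages: first, tweet datasets in four languages (English, Chinese, Russian, and Arabic) are collected and labelled; second, each dataset is analysed separately with machine learning and deep learning models; third, the four datasets are merged and deep learning and large language model (LLM) architectures are evaluated on the combined multilingual data. Random Forest performed best among the machine learning models, while the Bi-LSTM architecture consistently surpassed the other deep learning and LLM architectures across all datasets.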

Link: https://arxiv.org/abs/2502.04346
Authors: Saydul Akbar Murad,Ashim Dahal,Nick Rahimi
Affiliations: University of Southern Mississippi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cyber threat detection has become an important area of focus in today’s digital age due to the growing spread of fake information and harmful content on social media platforms such as Twitter (now ‘X’). These cyber threats, often disguised within tweets, pose significant risks to individuals, communities, and even nations, emphasizing the need for effective detection systems. While previous research has explored tweet-based threats, much of the work is limited to specific languages, domains, or locations, or relies on single-model approaches, reducing their applicability to diverse real-world scenarios. To address these gaps, our study focuses on multi-lingual tweet cyber threat detection using a variety of advanced models. The research was conducted in three stages: (1) We collected and labeled tweet datasets in four languages (English, Chinese, Russian, and Arabic), employing both manual and polarity-based labeling methods to ensure high-quality annotations. (2) Each dataset was analyzed individually using machine learning (ML) and deep learning (DL) models to assess their performance on distinct languages. (3) Finally, we combined all four datasets into a single multi-lingual dataset and applied DL and large language model (LLM) architectures to evaluate their efficacy in identifying cyber threats across various languages. Our results show that among machine learning models, Random Forest (RF) attained the highest performance; however, the Bi-LSTM architecture consistently surpassed other DL and LLM architectures across all datasets. These findings underline the effectiveness of Bi-LSTM in multilingual cyber threat detection. The code for this paper can be found at this link: this https URL.

[NLP-108] JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment

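【Quick read】: This paper addresses two issues: applying traditional Chinese medicine (TCM) in practice requires extensive medical knowledge and clinical experience, and existing TCM large language models (LLMs) fall short in comprehensive medical consultation, diagnosis, and syndrome differentiation-based treatment. It proposes JingFang (JF), a new TCM large language model. The key is a Multi-agent Dynamic Collaborative Chain-of-Thought Mechanism (MDCCTM) for medical consultation, which gives JF effective and accurate diagnostic ability. A Syndrome Agent and a Dual-Stage Retrieval Scheme (DSRS) further improve JF's capacity for treatment based on syndrome differentiation.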

Link: https://arxiv.org/abs/2502.04345
Authors: Yehan Yan,Tianhao Ma,Ruotai Li,Xinhan Zheng,Guodong Shan,Chisheng Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Traditional Chinese medicine (TCM) plays a vital role in health protection and disease treatment, but its practical application requires extensive medical knowledge and clinical experience. Existing TCM Large Language Models (LLMs) exhibit critical limitations of uncomprehensive medical consultation and diagnoses, and inaccurate syndrome differentiation-based treatment. To address these issues, this study establishes JingFang (JF): a novel TCM Large Language Model that demonstrates the expert-level capability of medical diagnosis and syndrome differentiation-based treatment. We innovate a Multi-agent Dynamic Collaborative Chain-of-Thought Mechanism (MDCCTM) for medical consultation, enabling JF with effective and accurate diagnostic ability. In addition, a Syndrome Agent and a Dual-Stage Retrieval Scheme (DSRS) are developed to significantly enhance the capacity of JF for disease treatment based on syndrome differentiation. JingFang not only facilitates the application of LLMs but also promotes the effective practice of TCM in human health protection and disease treatment.

[NLP-109] Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection

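【Quick read】: This tutorial addresses common challenges in applying machine learning and deep learning methods to mental health detection on social media platforms. It focuses on strategies for working with diverse datasets, improving text preprocessing, and handling issues such as imbalanced data and model evaluation. Through real-world examples and step-by-step instructions, with an emphasis on transparency, reproducibility, and ethical considerations, it aims to help researchers build more reliable and widely applicable models and so contribute to better tools for early detection and intervention.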

Link: https://arxiv.org/abs/2502.04342
Authors: Yeyubei Zhang,Zhongyan Wang,Zhanyi Ding,Yexin Tian,Jianglai Dai,Xiaorui Shen,Yunchong Liu,Yuchen Cao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Social media has become an important source for understanding mental health, providing researchers with a way to detect conditions like depression from user-generated posts. This tutorial provides practical guidance to address common challenges in applying machine learning and deep learning methods for mental health detection on these platforms. It focuses on strategies for working with diverse datasets, improving text preprocessing, and addressing issues such as imbalanced data and model evaluation. Real-world examples and step-by-step instructions demonstrate how to apply these techniques effectively, with an emphasis on transparency, reproducibility, and ethical considerations. By sharing these approaches, this tutorial aims to help researchers build more reliable and widely applicable models for mental health research, contributing to better tools for early detection and intervention.

[NLP-110] Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data NAACL2025

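【Quick read】: This paper addresses how to improve CLIP models using only the data already available. The key to the solution is HELIP, a strategy that continues training on challenging text-image pairs found within existing datasets, improving CLIP without additional data or extensive retraining. HELIP integrates into current training pipelines with minimal code changes, allowing quick and seamless adoption.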

Link: https://arxiv.org/abs/2305.05208
Authors: Haonan Wang,Minbin Huang,Runhui Huang,Lanqing Hong,Hang Xu,Tianyang Hu,Xiaodan Liang,Zhenguo Li,Hong Cheng,Kenji Kawaguchi
Affiliations: National University of Singapore; The Chinese University of Hong Kong; Sun Yat-sen University; Huawei Noah’s Ark Lab; Department of Systems Engineering and Engineering Management, and Shun Hing Institute of Advanced Engineering, The Chinese University of Hong Kong, Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025, main conference. 20 pages, 10 figures, 10 tables

Abstract:Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning. Improving CLIP typically requires additional data and retraining with new loss functions, but these demands raise resource and time costs, limiting practical use. In this work, we introduce HELIP, a cost-effective strategy that improves CLIP models by exploiting challenging text-image pairs within existing datasets in continuous training. This eliminates the need for additional data or extensive retraining. Moreover, HELIP integrates effortlessly into current training pipelines with minimal code modifications, allowing for quick and seamless implementation. On comprehensive benchmarks, HELIP consistently boosts existing models. In particular, within just two epochs of training, it improves zero-shot classification accuracy on ImageNet for SLIP models pre-trained on CC3M, CC12M, and YFCC15M datasets by 3.05%, 4.47%, and 10.1%, respectively. In addition, on fine-grained classification datasets, HELIP improves the zero-shot performance of CLIP and SLIP by an average of 8.4% and 18.6%, and their linear probe performance by an average of 9.5% and 3.0%. The code is publicly available at: this https URL.
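
One way to picture "challenging text-image pairs within existing datasets" is nearest-neighbour mining over pair embeddings. The sketch below is an assumption-laden illustration rather than HELIP's actual selection rule: it treats each pair as a single joint embedding vector (made-up numbers) and calls the most cosine-similar other pairs its "hard pairs".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mine_hard_pairs(pair_embeddings, k=1):
    """For each pair, return the indices of the k most similar other pairs."""
    hard = {}
    for i, anchor in enumerate(pair_embeddings):
        others = [j for j in range(len(pair_embeddings)) if j != i]
        others.sort(key=lambda j: cosine(anchor, pair_embeddings[j]), reverse=True)
        hard[i] = others[:k]
    return hard


# Made-up joint embeddings for four text-image pairs.
pairs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
hard_pairs = mine_hard_pairs(pairs, k=1)
```

In a continued-training loop, each mined neighbour would then be batched with its anchor so the contrastive loss has to separate pairs that are easy to confuse.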

Computer Vision

[CV-0] FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

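【Quick read】: This paper addresses the tension between high fidelity and computational efficiency in text-to-video generation. The key to the solution is a two-stage framework, FlashVideo, which allocates model capacity and the number of function evaluations (NFEs) across stages to balance generation fidelity and quality. The first stage prioritises prompt fidelity by generating at low resolution with a large model and sufficient NFEs; the second stage establishes flow matching between low and high resolutions, generating fine details with few NFEs. The result is high-quality, high-resolution video generation at lower computational cost.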

Link: https://arxiv.org/abs/2502.05179
Authors: Shilong Zhang,Wenbo Li,Shoufa Chen,Chongjian Ge,Peize Sun,Yida Zhang,Yi Jiang,Zehuan Yuan,Binyue Peng,Ping Luo
Affiliations: The University of Hong Kong; The Chinese University of Hong Kong; ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Model and Weight: this https URL

Abstract:DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high resolution outputs, further amplifying computational demands especially for single stage DiT models. To address these challenges, we propose a novel two stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low resolution generation process utilizing large parameters and sufficient NFEs to enhance computational efficiency. The second stage establishes flow matching between low and high resolutions, effectively generating fine details with minimal NFEs. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output before committing to full resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability.

[CV-1] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

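【Quick read】: This paper addresses the perceived trade-off between reconstruction quality and language-image alignment in visual tokenization, and proposes Quantized Language-Image Pretraining (QLIP). QLIP trains an autoencoder based on binary-spherical quantization with both reconstruction and language-image alignment objectives, achieving high-quality image reconstruction and zero-shot image understanding in a single model. The key is to balance the two loss terms dynamically during training and to use a two-stage training pipeline that reconciles the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective.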

Link: https://arxiv.org/abs/2502.05178
Authors: Yue Zhao,Fuzhao Xue,Scott Reed,Linxi Fan,Yuke Zhu,Jan Kautz,Zhiding Yu,Philipp Krähenbühl,De-An Huang
Affiliations: UT Austin; NVIDIA; Google DeepMind
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech report. Project page: this https URL

Abstract:We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.

[CV-2] Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

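【Quick read】: This paper addresses the limited long-context capability of large multi-modal models. The key to the solution is Long-VITA, a model that can concurrently process and analyse images, video and text over 4K frames or 1M tokens while still performing strongly on short-context tasks. Its training schema starts from a large language model and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. The paper also implements context-parallelism distributed inference and a logits-masked language modeling head so that the model can scale to infinitely long image and text inputs at inference time.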

Link: https://arxiv.org/abs/2502.05177
Authors: Yunhang Shen,Chaoyou Fu,Shaoqi Dong,Xiong Wang,Peixian Chen,Mengdan Zhang,Haoyu Cao,Ke Li,Xiawu Zheng,Yan Zhang,Yiyi Zhou,Rongrong Ji,Xing Sun
Affiliations: Tencent Youtu Lab; Nanjing University; Xiamen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Establishing the long-context capability of large vision-language models is crucial for video understanding, high-resolution image understanding, multi-modal agents and reasoning. We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing modalities of image, video, and text over 4K frames or 1M tokens while delivering advanced performances on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of 17 M samples from public datasets only and demonstrates the state-of-the-art performance on various multi-modal benchmarks, compared against recent cutting-edge models with internal data. Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.

[CV-3] AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting

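【Quick read】: This paper addresses the difficulty that 3D scene inpainting methods have with view consistency and geometric accuracy in 360° unbounded scenes. The key to the solution is AuraFusion360, a reference-based method for scenes represented by Gaussian Splatting. It combines depth-aware unseen mask generation for identifying occlusions, Adaptive Guided Depth Diffusion for placing initial points without additional training, and SDEdit-based detail enhancement for multi-view coherence, achieving high-quality object removal and hole filling while maintaining geometric accuracy.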

Link: https://arxiv.org/abs/2502.05176
Authors: Chung-Ho Wu,Yang-Jung Chen,Ying-Huan Chen,Jie-Ying Lee,Bo-Hsu Ke,Chun-Wei Tuan Mu,Yi-Chuan Huang,Chin-Yang Lin,Min-Hung Chen,Yen-Yu Lin,Yu-Lun Liu
Affiliations: National Yang Ming Chiao Tung University; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Three-dimensional scene inpainting is crucial for applications from virtual reality to architectural visualization, yet existing methods struggle with view consistency and geometric accuracy in 360° unbounded scenes. We present AuraFusion360, a novel reference-based method that enables high-quality object removal and hole filling in 3D scenes represented by Gaussian Splatting. Our approach introduces (1) depth-aware unseen mask generation for accurate occlusion identification, (2) Adaptive Guided Depth Diffusion, a zero-shot method for accurate initial point placement without requiring additional training, and (3) SDEdit-based detail enhancement for multi-view coherence. We also introduce 360-USID, the first comprehensive dataset for 360° unbounded scene inpainting with ground truth. Extensive experiments demonstrate that AuraFusion360 significantly outperforms existing methods, achieving superior perceptual quality while maintaining geometric accuracy across dramatic viewpoint changes. See our project page for video results and the dataset at this https URL.

[CV-4] Fillerbuster: Multi-View Scene Completion for Casual Captures

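【Quick read】: This paper addresses the completion of unknown regions in 3D scenes from casual captures, which are often sparse, miss content behind objects or above the scene, and lack known camera parameters. Existing methods either focus on making known pixels look good with sparse-view priors or on creating the missing sides of objects from one or two photos. The key to the solution is a generative model, built on a large-scale multi-view latent diffusion transformer, that consumes a large context of input frames while generating unknown target views and recovering image poses when desired. Beyond completing partial captures, the unified model handles an uncalibrated scene completion task in which it predicts poses and creates new content together.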

Link: https://arxiv.org/abs/2502.05175
Authors: Ethan Weber,Norman Müller,Yash Kant,Vasu Agrawal,Michael Zollhöfer,Angjoo Kanazawa,Christian Richardt
Affiliations: Meta Reality Labs; UC Berkeley; University of Toronto
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page at this https URL

Abstract:We present Fillerbuster, a method that completes unknown regions of a 3D scene by utilizing a novel large-scale multi-view latent diffusion transformer. Casual captures are often sparse and miss surrounding content behind objects or above the scene. Existing methods are not suitable for handling this challenge as they focus on making the known pixels look good with sparse-view priors, or on creating the missing sides of objects from just one or two photos. In reality, we often have hundreds of input frames and want to complete areas that are missing and unobserved from the input frames. Additionally, the images often do not have known camera parameters. Our solution is to train a generative model that can consume a large context of input frames while generating unknown target views and recovering image poses when desired. We show results where we complete partial captures on two existing datasets. We also present an uncalibrated scene completion task where our unified model predicts both poses and creates new content. Our model is the first to predict many images and poses together for scene completion.

[CV-5] VideoRoPE: What Makes for Good Video Rotary Position Embedding?

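【Quick read】: This paper addresses the challenge of extending one-dimensional Rotary Position Embedding (RoPE) to video, with its complex spatio-temporal structure. The key is VideoRoPE, a design with a 3D structure that preserves spatio-temporal relationships. It uses low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. With these properties VideoRoPE surpasses previous RoPE variants on downstream tasks such as long video retrieval, video understanding, and video hallucination.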

Link: https://arxiv.org/abs/2502.05173
Authors: Xilin Wei,Xiaoran Liu,Yuhang Zang,Xiaoyi Dong,Pan Zhang,Yuhang Cao,Jian Tong,Haodong Duan,Qipeng Guo,Jiaqi Wang,Xipeng Qiu,Dahua Lin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants, across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at this https URL.
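
Two of the ideas named in the abstract, low-frequency temporal allocation and adjustable temporal spacing, can be sketched with the standard RoPE frequency formula theta_i = base^(-2i/d). The split below (last `n_temporal` rotary pairs for time, the rest interleaved between x and y) and the value of `delta` are assumptions chosen for illustration; the sketch leaves out the diagonal layout and is not the authors' code.

```python
def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE frequencies theta_i = base^(-2i/d), highest first."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def allocate_axes(head_dim, n_temporal):
    """Give the n_temporal lowest frequencies to time; interleave x and y on the rest."""
    freqs = rope_frequencies(head_dim)
    n_spatial = len(freqs) - n_temporal
    axes = ["x" if i % 2 == 0 else "y" for i in range(n_spatial)] + ["t"] * n_temporal
    return list(zip(axes, freqs))


def rotation_angles(t, x, y, head_dim=16, n_temporal=2, delta=2.0):
    """Angle per rotary pair; delta rescales the temporal index (assumed value)."""
    position = {"t": delta * t, "x": x, "y": y}
    return [position[axis] * freq for axis, freq in allocate_axes(head_dim, n_temporal)]


layout = allocate_axes(head_dim=16, n_temporal=2)
```

Because the temporal index is multiplied only by the slowest-rotating frequencies, its angles wrap around far less often over a long video, which is the intuition behind mitigating periodic oscillations.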

[CV-6] Flopping for FLOPs: Leveraging equivariance for computational efficiency

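【Quick read】: This paper addresses the problem that building geometric invariance into neural networks improves parameter efficiency but usually increases computational cost. The key is to parametrize feature spaces in terms of mirror-symmetric and mirror-antisymmetric features, that is, irreducible representations of the flopping (horizontal mirroring) group. This makes the linear layers block-diagonal, requiring half the floating-point operations (FLOPs), so that FLOPs per parameter stay comparable to standard non-equivariant networks. The approach reduces both FLOPs and wall-clock time, offering a practical route to efficient, scalable symmetry-aware architectures.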

Link: https://arxiv.org/abs/2502.05169
Authors: Georg Bökman,David Nordström,Fredrik Kahl
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Incorporating geometric invariance into neural networks enhances parameter efficiency but typically increases computational costs. This paper introduces new equivariant neural networks that preserve symmetry while maintaining a comparable number of floating-point operations (FLOPs) per parameter to standard non-equivariant networks. We focus on horizontal mirroring (flopping) invariance, common in many computer vision tasks. The main idea is to parametrize the feature spaces in terms of mirror-symmetric and mirror-antisymmetric features, i.e., irreps of the flopping group. This decomposes the linear layers to be block-diagonal, requiring half the number of FLOPs. Our approach reduces both FLOPs and wall-clock time, providing a practical solution for efficient, scalable symmetry-aware architectures.
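
The block-diagonal claim can be checked on a tiny example. In the sketch below, features are already split into a symmetric part (unchanged by mirroring) and an antisymmetric part (sign-flipped by mirroring); a layer that keeps the two parts separate is equivariant and uses two half-size blocks instead of one dense matrix. The weights and inputs are made-up numbers, and the helper names are illustrative.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def flop(sym, anti):
    """Horizontal mirroring in the irrep basis: the antisymmetric part changes sign."""
    return sym, [-a for a in anti]


def equivariant_linear(w_sym, w_anti, sym, anti):
    """Block-diagonal layer: symmetric and antisymmetric features never mix."""
    return matvec(w_sym, sym), matvec(w_anti, anti)


def multiply_counts(n):
    """Multiplications for a dense n x n layer vs. two (n/2) x (n/2) blocks."""
    half = n // 2
    return n * n, 2 * half * half


w_sym = [[1.0, 2.0], [0.0, 1.0]]
w_anti = [[3.0, 0.0], [1.0, 1.0]]
sym, anti = [1.0, 2.0], [4.0, -1.0]

# Equivariance: mirroring the input and then applying the layer gives the same
# result as applying the layer and then mirroring the output.
lhs = equivariant_linear(w_sym, w_anti, *flop(sym, anti))
rhs = flop(*equivariant_linear(w_sym, w_anti, sym, anti))
```

For four features, `multiply_counts(4)` compares 16 multiplications for the dense layer with 8 for the two blocks, which is the halving the abstract refers to.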

[CV-7] Multitwine: Multi-Object Compositing with Text and Layout Control

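【Quick read】: This paper addresses simultaneous multi-object compositing guided by both text and layout. The key is a generative model that can add multiple objects to a scene and capture interactions ranging from simple positional relations to complex actions that require reposing. When an interaction implies additional props, the model generates these supporting objects on its own. Jointly training for compositing and subject-driven generation (customization) gives a more balanced integration of textual and visual inputs, and the resulting model performs strongly on both tasks.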

Link: https://arxiv.org/abs/2502.05165
Authors: Gemma Canet Tarrés,Zhe Lin,Zhifei Zhang,He Zhang,Andrew Gilbert,John Collomosse,Soo Ye Kim
Affiliations: University of Surrey; Adobe Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce the first generative model capable of simultaneous multi-object compositing, guided by both text and layout. Our model allows for the addition of multiple objects within a scene, capturing a range of interactions from simple positional relations (e.g., next to, in front of) to complex actions requiring reposing (e.g., hugging, playing guitar). When an interaction implies additional props, like 'taking a selfie', our model autonomously generates these supporting objects. By jointly training for compositing and subject-driven generation, also known as customization, we achieve a more balanced integration of textual and visual inputs for text-driven object compositing. As a result, we obtain a versatile model with state-of-the-art performance in both tasks. We further present a data generation pipeline leveraging visual and language models to effortlessly synthesize multimodal, aligned training data.

[CV-8] Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment ICLR2025

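【Quick read】: This paper addresses the difficulty diffusion models have in keeping scene attributes consistent when generating images for scene-aware tasks such as Visual Question Answering (VQA) and Human-Object Interaction (HOI) reasoning. It introduces Hummingbird, a diffusion-based image generator that, given a multimodal context (a reference image with accompanying text guidance), produces highly diverse images while preserving scene attributes such as object interactions and spatial relationships. The key innovation is a Multimodal Context Evaluator that simultaneously optimizes Global Semantic and Fine-grained Consistency Rewards, so that generated images keep the scene attributes of the reference image in relation to the text guidance while remaining diverse.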

Link: https://arxiv.org/abs/2502.05153
Authors: Minh-Quan Le,Gaurav Mittal,Tianjian Meng,A S M Iftekhar,Vishwas Suryanarayanan,Barun Patra,Dimitris Samaras,Mei Chen
Affiliations: Microsoft; Stony Brook University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2025. Project page: this https URL

Abstract:While diffusion models are powerful in generating high-quality, diverse synthetic data for object-centric tasks, existing methods struggle with scene-aware tasks such as Visual Question Answering (VQA) and Human-Object Interaction (HOI) Reasoning, where it is critical to preserve scene attributes in generated images consistent with a multimodal context, i.e. a reference image with accompanying text guidance query. To address this, we introduce Hummingbird, the first diffusion-based image generator which, given a multimodal context, generates highly diverse images w.r.t. the reference image while ensuring high fidelity by accurately preserving scene attributes, such as object interactions and spatial relationships from the text guidance. Hummingbird employs a novel Multimodal Context Evaluator that simultaneously optimizes our formulated Global Semantic and Fine-grained Consistency Rewards to ensure generated images preserve the scene attributes of reference images in relation to the text guidance while maintaining diversity. As the first model to address the task of maintaining both diversity and fidelity given a multimodal context, we introduce a new benchmark formulation incorporating MME Perception and Bongard HOI datasets. Benchmark experiments show Hummingbird outperforms all existing methods by achieving superior fidelity while maintaining diversity, validating Hummingbird’s potential as a robust multimodal context-aligned image generator in complex visual tasks.

[CV-9] LP-DETR: Layer-wise Progressive Relations for Object Detection

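Quick read: This paper aims to improve DETR-based object detection through multi-scale relation modeling. It proposes LP-DETR (Layer-wise Progressive DETR), which introduces learnable spatial relationships between object queries via a relation-aware self-attention mechanism that adaptively learns to balance relations at different scales (local, medium, and global) across decoder layers. This progressive design lets the model capture spatial dependencies that evolve through the detection pipeline.

Link: https://arxiv.org/abs/2502.05147
Authors: Zhengjian Kang, Ye Zhang, Xiaoyu Deng, Xintao Li, Yongzhe Zhang
Institutions: New York University; University of Pittsburgh; Fordham University; Georgia Institute of Technology; California Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures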

Abstract:This paper presents LP-DETR (Layer-wise Progressive DETR), a novel approach that enhances DETR-based object detection through multi-scale relation modeling. Our method introduces learnable spatial relationships between object queries through a relation-aware self-attention mechanism, which adaptively learns to balance different scales of relations (local, medium and global) across decoder layers. This progressive design enables the model to effectively capture evolving spatial dependencies throughout the detection pipeline. Extensive experiments on COCO 2017 dataset demonstrate that our method improves both convergence speed and detection accuracy compared to standard self-attention module. The proposed method achieves competitive results, reaching 52.3% AP with 12 epochs and 52.5% AP with 24 epochs using ResNet-50 backbone, and further improving to 58.0% AP with Swin-L backbone. Furthermore, our analysis reveals an interesting pattern: the model naturally learns to prioritize local spatial relations in early decoder layers while gradually shifting attention to broader contexts in deeper layers, providing valuable insights for future research in object detection.

[CV-10] Latent Swap Joint Diffusion for Long-Form Audio Generation

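Quick read: This paper addresses the severe overlap distortions and high cross-view consistency costs that multi-view joint diffusion methods, originally developed for panoramic generation, incur when used to generate audio spectra. It proposes Swap Forward (SaFa), a frame-level latent swap framework. A bidirectional Self-Loop Latent Swap is applied between adjacent views to adaptively enhance high-frequency components without disturbing low-frequency components, and a unidirectional Reference-Guided Latent Swap is applied between a reference and the non-overlapping regions of each subview during the early stages, yielding globally coherent long-form audio with richer spectral detail.

Link: https://arxiv.org/abs/2502.05130
Authors: Yusheng Dai, Chenxi Wang, Chang Li, Chen Wang, Jun Du, Kewei Li, Ruoyu Wang, Jiefeng Ma, Lei Sun, Jianqing Gao
Institutions: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: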

Abstract:Previous work on long-form audio generation using global-view diffusion or iterative generation demands significant training or inference costs. While recent advancements in multi-view joint diffusion for panoramic generation provide an efficient option, they struggle with spectrum generation with severe overlap distortions and high cross-view consistency costs. We initially explore this phenomenon through the connectivity inheritance of latent maps and uncover that averaging operations excessively smooth the high-frequency components of the latent map. To address these issues, we propose Swap Forward (SaFa), a frame-level latent swap framework that synchronizes multiple diffusions to produce a globally coherent long audio with more spectrum details in a forward-only manner. At its core, the bidirectional Self-Loop Latent Swap is applied between adjacent views, leveraging stepwise diffusion trajectory to adaptively enhance high-frequency components without disrupting low-frequency components. Furthermore, to ensure cross-view consistency, the unidirectional Reference-Guided Latent Swap is applied between the reference and the non-overlap regions of each subview during the early stages, providing centralized trajectory guidance. Quantitative and qualitative experiments demonstrate that SaFa significantly outperforms existing joint diffusion methods and even training-based long audio generation models. Moreover, we find that it also adapts well to panoramic generation, achieving comparable state-of-the-art performance with greater efficiency and model generalizability. Project page is available at this https URL.
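The abstract's observation that averaging smooths high-frequency content, and the idea of swapping frames instead, can be illustrated with a toy example. This is only a sketch under stated assumptions: the latents are one-dimensional lists of independent Gaussian values, `average_overlap` and `swap_overlap` are hypothetical helper names, and the alternating-frame rule merely stands in for SaFa's swap operators, which the abstract does not specify in detail.

```python
import random
import statistics


def average_overlap(view_a, view_b):
    """Merge two overlapping views by averaging frame by frame."""
    return [(a + b) / 2 for a, b in zip(view_a, view_b)]


def swap_overlap(view_a, view_b):
    """Merge by taking even frames from view_a and odd frames from view_b.

    Assumption: a simple alternating rule, used only for illustration.
    """
    return [a if i % 2 == 0 else b
            for i, (a, b) in enumerate(zip(view_a, view_b))]


# Two views whose overlap region holds independent, noise-like latents.
rng = random.Random(0)
view_a = [rng.gauss(0, 1) for _ in range(2000)]
view_b = [rng.gauss(0, 1) for _ in range(2000)]

var_avg = statistics.pvariance(average_overlap(view_a, view_b))
var_swap = statistics.pvariance(swap_overlap(view_a, view_b))
print(f"variance after averaging: {var_avg:.3f}")
print(f"variance after swapping:  {var_swap:.3f}")
```

Averaging two independent views roughly halves the variance in the overlap, whereas the frame-level swap keeps every frame at its original scale, which is one way to see why a swap can retain more fine detail than a mean.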

[CV-11] Counting Fish with Temporal Representations of Sonar Video ECCV2024

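Quick read: This paper targets accurate salmon escapement estimation at field sites where compute and connectivity are limited. It proposes a lightweight computer vision method based on echograms: a ResNet-18 model predicts upstream and downstream counts directly from each echogram, and a set of domain-specific image augmentations together with a weakly-supervised training protocol further improves accuracy.

Link: https://arxiv.org/abs/2502.05129
Authors: Kai Van Brunt, Justin Kay, Timm Haucke, Pietro Perona, Grant Van Horn, Sara Beery
Institutions: MIT; Caltech; UMass Amherst
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024. 6 pages, 2 figures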

Abstract:Accurate estimates of salmon escapement - the number of fish migrating upstream to spawn - are key data for conservation and fishery management. Existing methods for salmon counting using high-resolution imaging sonar hardware are non-invasive and compatible with computer vision processing. Prior work in this area has utilized object detection and tracking based methods for automated salmon counting. However, these techniques remain inaccessible to many sonar deployment sites due to limited compute and connectivity in the field. We propose an alternative lightweight computer vision method for fish counting based on analyzing echograms - temporal representations that compress several hundred frames of imaging sonar video into a single image. We predict upstream and downstream counts within 200-frame time windows directly from echograms using a ResNet-18 model, and propose a set of domain-specific image augmentations and a weakly-supervised training protocol to further improve results. We achieve a count error of 23% on representative data from the Kenai River in Alaska, demonstrating the feasibility of our approach.
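To make the idea of an echogram concrete, the following minimal sketch compresses a stack of sonar frames into one image with one column per frame. The abstract only says that echograms compress several hundred frames into a single image; collapsing each frame by taking the maximum intensity across beams is an assumption made here for illustration, and `frame_to_column` and `build_echogram` are hypothetical names.

```python
def frame_to_column(frame):
    """Collapse one frame (rows = range bins, columns = beams) to a column.

    Assumption: the strongest return across beams is kept for each range bin.
    """
    return [max(row) for row in frame]


def build_echogram(frames):
    """Stack per-frame columns so that rows = range bins, columns = time."""
    columns = [frame_to_column(frame) for frame in frames]
    return [list(row) for row in zip(*columns)]


# Three tiny 2x2 frames stand in for a 200-frame window.
frames = [
    [[1, 2], [3, 4]],
    [[5, 0], [0, 6]],
    [[7, 8], [9, 1]],
]
echogram = build_echogram(frames)
for row in echogram:
    print(row)
```

A fish swimming through the beam then shows up as a trace across neighbouring columns, so a single-image classifier or regressor can, in principle, read direction and count from the trace shape.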

[CV-12] Self-supervised Conformal Prediction for Uncertainty Quantification in Imaging Problems

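Quick read: This paper addresses uncertainty quantification in image restoration, particularly when reliable ground truth data is unavailable or when there is distribution shift between calibration and deployment. It proposes a self-supervised conformal prediction method that uses Stein's Unbiased Risk Estimator (SURE) to calibrate directly from the observed noisy measurements, removing the need for ground truth. The method applies to any ill-conditioned linear imaging inverse problem and is especially powerful when combined with modern self-supervised image restoration techniques that are likewise trained directly from measurement data.

Link: https://arxiv.org/abs/2502.05127
Authors: Jasper M. Everink, Bernardin Tamo Amougou, Marcelo Pereyra
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
Comments: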

Abstract:Most image restoration problems are ill-conditioned or ill-posed and hence involve significant uncertainty. Quantifying this uncertainty is crucial for reliably interpreting experimental results, particularly when reconstructed images inform critical decisions and science. However, most existing image restoration methods either fail to quantify uncertainty or provide estimates that are highly inaccurate. Conformal prediction has recently emerged as a flexible framework to equip any estimator with uncertainty quantification capabilities that, by construction, have nearly exact marginal coverage. To achieve this, conformal prediction relies on abundant ground truth data for calibration. However, in image restoration problems, reliable ground truth data is often expensive or not possible to acquire. Also, reliance on ground truth data can introduce large biases in situations of distribution shift between calibration and deployment. This paper seeks to develop a more robust approach to conformal prediction for image restoration problems by proposing a self-supervised conformal prediction method that leverages Stein’s Unbiased Risk Estimator (SURE) to self-calibrate itself directly from the observed noisy measurements, bypassing the need for ground truth. The method is suitable for any linear imaging inverse problem that is ill-conditioned, and it is especially powerful when used with modern self-supervised image restoration techniques that can also be trained directly from measurement data. The proposed approach is demonstrated through numerical experiments on image denoising and deblurring, where it delivers results that are remarkably accurate and comparable to those obtained by supervised conformal prediction with ground truth data.
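For context, the calibration step of standard (supervised) split conformal prediction, which the abstract describes as relying on ground truth, can be sketched as follows. This is not the paper's self-supervised method: the scores here would normally be errors measured against ground truth, whereas the paper replaces them with SURE-based estimates whose details are not given in the abstract. The function name `conformal_quantile` is hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the split-conformal threshold for miscoverage level alpha.

    With n calibration scores, the threshold is the ceil((n + 1) * (1 - alpha))-th
    smallest score; if that rank exceeds n, the threshold is infinite.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(scores)[rank - 1]


# Calibration scores, e.g. reconstruction errors on held-out images.
calibration_scores = [0.1, 0.5, 0.2, 0.4, 0.3, 0.9, 0.6, 0.8, 0.7]
threshold = conformal_quantile(calibration_scores, alpha=0.5)
print(threshold)
```

A new reconstruction is then reported together with the set of candidates whose score is at most `threshold`; under exchangeability this set covers the truth with probability at least `1 - alpha`.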

[CV-13] DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions

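Quick read: This paper addresses the computational cost of extending Vision-Language Models (VLMs) to 3D medical imaging. Existing approaches rely on Vision Transformers (ViTs), which are expensive because of self-attention's quadratic complexity, or on 3D convolutions, whose parameter counts and FLOPs grow rapidly with kernel size. The proposed DCFormer is an efficient 3D medical image encoder that factorizes each 3D convolution into three parallel 1D convolutions along depth, height, and width, preserving spatial information while significantly reducing computational cost.

Link: https://arxiv.org/abs/2502.05091
Authors: Gorkem Can Ates, Kuang Gong, Wei Shao
Institutions: University of Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: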

Abstract:Vision-language models (VLMs) align visual and textual representations, enabling high-performance zero-shot classification and image-text retrieval in 2D medical imaging. However, extending VLMs to 3D medical imaging remains computationally challenging. Existing 3D VLMs rely on Vision Transformers (ViTs), which are computationally expensive due to self-attention’s quadratic complexity, or 3D convolutions, which demand excessive parameters and FLOPs as kernel size increases. We introduce DCFormer, an efficient 3D medical image encoder that factorizes 3D convolutions into three parallel 1D convolutions along depth, height, and width. This design preserves spatial information while significantly reducing computational cost. Integrated into a CLIP-based vision-language framework, DCFormer is evaluated on CT-RATE, a dataset of 50,188 paired 3D chest CT volumes and radiology reports, for zero-shot multi-abnormality detection across 18 pathologies. Compared to ViT, ConvNeXt, PoolFormer, and TransUNet, DCFormer achieves superior efficiency and accuracy, with DCFormer-Tiny reaching 62.0% accuracy and a 46.3% F1-score while using significantly fewer parameters. These results highlight DCFormer’s potential for scalable, clinically deployable 3D medical VLMs. Our codes will be publicly available.
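The saving from the factorization can be seen with simple arithmetic on kernel weights. The sketch below assumes one kernel per channel (a depthwise layout, which the abstract does not state) and ignores biases; the function names are hypothetical.

```python
def full_3d_kernel_weights(k, channels=1):
    """Weights in a k x k x k kernel, one kernel per channel."""
    return channels * k ** 3


def decomposed_1d_kernel_weights(k, channels=1):
    """Weights in three length-k kernels (depth, height, width) per channel."""
    return channels * 3 * k


for k in (3, 5, 7):
    full = full_3d_kernel_weights(k)
    split = decomposed_1d_kernel_weights(k)
    print(f"k={k}: full 3D = {full}, three 1D = {split}")
```

The full kernel grows cubically with `k` while the decomposed form grows linearly, so the gap widens quickly as the kernel size increases.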

[CV-14] Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images

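Quick read: This paper addresses NSFW (Not Safe For Work) text embedded in the otherwise highly realistic images produced by visual generation models such as Diffusion Models (DMs) and Vision Auto-Regressive Models (VARs). As an initial mitigation, it safety fine-tunes the text encoder underlying major DM architectures, and it introduces the ToxicBench benchmark for evaluating NSFW text generation in images, with the goal of suppressing such text while preserving overall image and text generation quality.

Link: https://arxiv.org/abs/2502.05066
Authors: Aditya Kumar, Tom Blanchard, Adam Dziedzic, Franziska Boenisch
Institutions: CISPA Helmholtz Center for Information Security; Vector Institute; University of Toronto
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: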

Abstract:State-of-the-art visual generation models, such as Diffusion Models (DMs) and Vision Auto-Regressive Models (VARs), produce highly realistic images. While prior work has successfully mitigated Not Safe For Work (NSFW) content in the visual domain, we identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms, posing significant risks to users. We show that all state-of-the-art DMs (e.g., SD3, Flux, DeepFloyd IF) and VARs (e.g., Infinity) are vulnerable to this issue. Through extensive experiments, we demonstrate that existing mitigation techniques, effective for visual content, fail to prevent harmful text generation while substantially degrading benign text generation. As an initial step toward addressing this threat, we explore safety fine-tuning of the text encoder underlying major DM architectures using a customized dataset. Thereby, we suppress NSFW generation while preserving overall image and text generation quality. Finally, to advance research in this area, we introduce ToxicBench, an open-source benchmark for evaluating NSFW text generation in images. ToxicBench provides a curated dataset of harmful prompts, new metrics, and an evaluation pipeline assessing both NSFW-ness and generation quality. Our benchmark aims to guide future efforts in mitigating NSFW text generation in text-to-image models and is available at this https URL

[CV-15] Differentiable Mobile Display Photometric Stereo

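Quick read: This paper addresses the limited practicality and portability of differentiable display photometric stereo (DDPS), which requires a fixed desktop imaging setup with a polarization camera and a desktop-scale monitor. It proposes differentiable mobile display photometric stereo (DMDPS), a more practical physics-based photometric stereo method that uses a mobile phone's display and camera. The key is a mobile app and accompanying method that simultaneously display patterns and capture high-quality high dynamic range (HDR) images. The authors demonstrate DMDPS on a dataset of real 3D-printed objects and on a dataset of fallen leaves.

Link: https://arxiv.org/abs/2502.05055
Authors: Gawoon Ban, Hyeongjun Kim, Seokjun Choi, Seungwoo Yoon, Seung-Hwan Baek
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 9 pages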

Abstract:Display photometric stereo uses a display as a programmable light source to illuminate a scene with diverse illumination conditions. Recently, differentiable display photometric stereo (DDPS) demonstrated improved normal reconstruction accuracy by using learned display patterns. However, DDPS faced limitations in practicality, requiring a fixed desktop imaging setup using a polarization camera and a desktop-scale monitor. In this paper, we propose a more practical physics-based photometric stereo, differentiable mobile display photometric stereo (DMDPS), that leverages a mobile phone consisting of a display and a camera. We overcome the limitations of using a mobile device by developing a mobile app and method that simultaneously displays patterns and captures high-quality HDR images. Using this technique, we capture real-world 3D-printed objects and learn display patterns via a differentiable learning process. We demonstrate the effectiveness of DMDPS on both a 3D printed dataset and a first dataset of fallen leaves. The leaf dataset contains reconstructed surface normals and albedos of fallen leaves that may enable future research beyond computer graphics and vision. We believe that DMDPS takes a step forward for practical physics-based photometric stereo.

[CV-16] GaussRender: Learning 3D Occupancy with Gaussian Rendering

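Quick read: This paper addresses the fact that 3D occupancy models are trained with losses that treat voxel predictions independently and ignore their spatial relationships. It proposes GaussRender, a plug-and-play 3D-to-2D reprojection loss that projects 3D voxel representations into arbitrary 2D views and uses Gaussian splatting as an efficient, differentiable rendering proxy for voxels, which introduces spatial dependencies across projected elements. This strengthens voxel-based supervision, improves semantic and geometric consistency, handles occlusions more efficiently, and requires no architectural modifications.

Link: https://arxiv.org/abs/2502.05040
Authors: Loick Chambon, Eloi Zablocki, Alexandre Boulch, Mickael Chen, Matthieu Cord
Institutions: ValeoAI; Sorbonne University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: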

Abstract:Understanding the 3D geometry and semantics of driving scenes is critical for developing safe autonomous vehicles. While 3D occupancy models are typically trained using voxel-based supervision with standard losses (e.g., cross-entropy, Lovasz, dice), these approaches treat voxel predictions independently, neglecting their spatial relationships. In this paper, we propose GaussRender, a plug-and-play 3D-to-2D reprojection loss that enhances voxel-based supervision. Our method projects 3D voxel representations into arbitrary 2D perspectives and leverages Gaussian splatting as an efficient, differentiable rendering proxy of voxels, introducing spatial dependencies across projected elements. This approach improves semantic and geometric consistency, handles occlusions more efficiently, and requires no architectural modifications. Extensive experiments on multiple benchmarks (SurroundOcc-nuScenes, Occ3D-nuScenes, SSCBench-KITTI360) demonstrate consistent performance gains across various 3D occupancy models (TPVFormer, SurroundOcc, Symphonies), highlighting the robustness and versatility of our framework. The code is available at this https URL.

[CV-17] FlightForge: Advancing UAV Research with Procedural Generation of High-Fidelity Simulation and Integrated Autonomy ICRA2025

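Quick read: This paper addresses the lack of high-level autonomy in existing Uncrewed Aerial Vehicle (UAV) simulators, which limits their use for complex tasks such as autonomous navigation in unknown environments. The key innovation of the FlightForge simulator is procedural environment generation combined with seamless integration of high-level autonomy. Experiments show superior sensor rendering capability compared to existing simulators and autonomous navigation in almost infinite environments.

Link: https://arxiv.org/abs/2502.05038
Authors: David Čapek, Jan Hrnčíř, Tomáš Báča, Jakub Jirkal, Vojtěch Vonásek, Robert Pěnička, Martin Saska
Institutions: Multi-robot Systems Group, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 8 figures, Accepted to 2025 IEEE International Conference on Robotics and Automation (ICRA 2025)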

Abstract:Robotic simulators play a crucial role in the development and testing of autonomous systems, particularly in the realm of Uncrewed Aerial Vehicles (UAV). However, existing simulators often lack high-level autonomy, hindering their immediate applicability to complex tasks such as autonomous navigation in unknown environments. This limitation stems from the challenge of integrating realistic physics, photorealistic rendering, and diverse sensor modalities into a single simulation environment. At the same time, the existing photorealistic UAV simulators use mostly hand-crafted environments with limited environment sizes, which prevents the testing of long-range missions. This restricts the usage of existing simulators to only low-level tasks such as control and collision avoidance. To this end, we propose the novel FlightForge UAV open-source simulator. FlightForge offers advanced rendering capabilities, diverse control modalities, and, foremost, procedural generation of environments. Moreover, the simulator is already integrated with a fully autonomous UAV system capable of long-range flights in cluttered unknown environments. The key innovation lies in novel procedural environment generation and seamless integration of high-level autonomy into the simulation environment. Experimental results demonstrate superior sensor rendering capability compared to existing simulators, and also the ability of autonomous navigation in almost infinite environments.

[CV-18] MindAligner: Explicit Brain Functional Alignment for Cross-Subject Visual Decoding from Limited fMRI Data

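Quick read: This paper addresses cross-subject brain decoding. Substantial brain variability and limited fMRI data confine existing methods to the single-subject paradigm, which leads to weak generalization and high training costs. The proposed MindAligner framework learns a Brain Transfer Matrix (BTM) that projects the brain signals of an arbitrary new subject onto those of a known subject, and introduces a Brain Functional Alignment module that uses a multi-level brain alignment loss to make BTM learning reliable, uncovering fine-grained functional correspondences with high interpretability.

Link: https://arxiv.org/abs/2502.05034
Authors: Yuqin Dai, Zhouheng Yao, Chunfeng Song, Qihao Zheng, Weijian Mai, Kunyu Peng, Shuai Lu, Wanli Ouyang, Jian Yang, Jiamin Wu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: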

Abstract:Brain decoding aims to reconstruct visual perception of human subject from fMRI signals, which is crucial for understanding brain’s perception mechanisms. Existing methods are confined to the single-subject paradigm due to substantial brain variability, which leads to weak generalization across individuals and incurs high training costs, exacerbated by limited availability of fMRI data. To address these challenges, we propose MindAligner, an explicit functional alignment framework for cross-subject brain decoding from limited fMRI data. The proposed MindAligner enjoys several merits. First, we learn a Brain Transfer Matrix (BTM) that projects the brain signals of an arbitrary new subject to one of the known subjects, enabling seamless use of pre-trained decoding models. Second, to facilitate reliable BTM learning, a Brain Functional Alignment module is proposed to perform soft cross-subject brain alignment under different visual stimuli with a multi-level brain alignment loss, uncovering fine-grained functional correspondences with high interpretability. Experiments indicate that MindAligner not only outperforms existing methods in visual decoding under data-limited conditions, but also provides valuable neuroscience insights in cross-subject functional analysis. The code will be made publicly available.

[CV-19] Trust-Aware Diversion for Data-Effective Distillation

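Quick read: This paper addresses the performance drop of existing dataset distillation methods when labels are wrong. It proposes Trust-Aware Diversion (TAD), an iterative dual-loop optimization framework. The outer loop divides the data into trusted and untrusted spaces and redirects distillation toward trusted samples; the inner loop recalibrates untrusted samples to maximize the distillation objective. Iterating the two loops gradually expands the trusted space and shrinks the untrusted space, which significantly improves dataset distillation under label noise.

Link: https://arxiv.org/abs/2502.05027
Authors: Zhuojie Wu, Yanbin Liu, Xin Shen, Xiaofeng Cao, Xin Yu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: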

Abstract:Dataset distillation compresses a large dataset into a small synthetic subset that retains essential information. Existing methods assume that all samples are perfectly labeled, limiting their real-world applications where incorrect labels are ubiquitous. These mislabeled samples introduce untrustworthy information into the dataset, which misleads model optimization in dataset distillation. To tackle this issue, we propose a Trust-Aware Diversion (TAD) dataset distillation method. Our proposed TAD introduces an iterative dual-loop optimization framework for data-effective distillation. Specifically, the outer loop divides data into trusted and untrusted spaces, redirecting distillation toward trusted samples to guarantee trust in the distillation process. This step minimizes the impact of mislabeled samples on dataset distillation. The inner loop maximizes the distillation objective by recalibrating untrusted samples, thus transforming them into valuable ones for distillation. This dual-loop iteratively refines and compensates for each other, gradually expanding the trusted space and shrinking the untrusted space. Experiments demonstrate that our method can significantly improve the performance of existing dataset distillation methods on three widely used benchmarks (CIFAR10, CIFAR100, and Tiny ImageNet) in three challenging mislabeled settings (symmetric, asymmetric, and real-world).

[CV-20] OccGS: Zero-shot 3D Occupancy Reconstruction with Semantic and Geometric-Aware Gaussian Splatting

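Quick read: This paper addresses obtaining semantic 3D occupancy from raw sensor data without manual annotations. It proposes OccGS, a framework that performs zero-shot 3D occupancy reconstruction using Semantic and Geometric-Aware Gaussian Splatting. OccGS combines semantics extracted from vision-language models with geometry guided by LiDAR points to construct Semantic and Geometric-Aware Gaussians from raw multi-sensor data, and reconstructs occupancy with a cumulative Gaussian-to-3D voxel splatting method. It performs favorably against self-supervised methods in occupancy prediction, is comparable to fully supervised approaches, and reaches state-of-the-art performance on zero-shot semantic 3D occupancy estimation.

Link: https://arxiv.org/abs/2502.04981
Authors: Xiaoyu Zhou, Jingqi Wang, Yongtao Wang, Yufei Wei, Nan Dong, Ming-Hsuan Yang
Institutions: Wangxuan Institute of Computer Technology, Peking University; Chongqing Changan Automobile Co., Ltd; University of California, Merced
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: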

Abstract:Obtaining semantic 3D occupancy from raw sensor data without manual annotations remains an essential yet challenging task. While prior works have approached this as a perception prediction problem, we formulate it as scene-aware 3D occupancy reconstruction with geometry and semantics. In this work, we propose OccGS, a novel 3D Occupancy reconstruction framework utilizing Semantic and Geometric-Aware Gaussian Splatting in a zero-shot manner. Leveraging semantics extracted from vision-language models and geometry guided by LiDAR points, OccGS constructs Semantic and Geometric-Aware Gaussians from raw multisensor data. We also develop a cumulative Gaussian-to-3D voxel splatting method for reconstructing occupancy from the Gaussians. OccGS performs favorably against self-supervised methods in occupancy prediction, achieving comparable performance to fully supervised approaches and achieving state-of-the-art performance on zero-shot semantic 3D occupancy estimation.

[CV-21] Training-free Neural Architecture Search through Variance of Knowledge of Deep Network Weights

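Quick read: This paper addresses the high computational cost of Neural Architecture Search (NAS). NAS searches systematically for the best architecture but typically requires training each candidate from scratch. The paper proposes a training-free proxy based on Fisher Information that estimates the expected image classification accuracy of a given deep network without training it, significantly reducing the cost of standard NAS algorithms. The proxy achieves state-of-the-art results on three public datasets and in two search spaces, and the authors also propose a new evaluation metric that they show is more informative for practical NAS applications.

Link: https://arxiv.org/abs/2502.04975
Authors: Ondřej Týbl, Lukáš Neumann
Institutions: FEE, Czech Technical University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: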

Abstract:Deep learning has revolutionized computer vision, but it achieved its tremendous success using deep network architectures which are mostly hand-crafted and therefore likely suboptimal. Neural Architecture Search (NAS) aims to bridge this gap by following a well-defined optimization paradigm which systematically looks for the best architecture, given objective criterion such as maximal classification accuracy. The main limitation of NAS is however its astronomical computational cost, as it typically requires training each candidate network architecture from scratch. In this paper, we aim to alleviate this limitation by proposing a novel training-free proxy for image classification accuracy based on Fisher Information. The proposed proxy has a strong theoretical background in statistics and it allows estimating expected image classification accuracy of a given deep network without training the network, thus significantly reducing computational cost of standard NAS algorithms. Our training-free proxy achieves state-of-the-art results on three public datasets and in two search spaces, both when evaluated using previously proposed metrics, as well as using a new metric that we propose which we demonstrate is more informative for practical NAS applications. The source code is publicly available at this http URL

[CV-22] SurGen: 1020 H&E-stained Whole Slide Images With Survival and Genetic Markers

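Quick read: This paper addresses the need in colorectal cancer research for high-quality pathology images linked with comprehensive clinical and genetic information. It presents the SurGen dataset: 1,020 H&E-stained whole slide images (WSIs) from 843 colorectal cancer cases, annotated with key genetic mutations (KRAS, NRAS, BRAF) and mismatch repair status, with survival data for 426 cases. A proof-of-concept machine learning experiment predicting mismatch repair status reaches a test AUROC of 0.8316, illustrating the dataset's potential for biomarker discovery, prognostic modelling, and advanced machine learning applications.

Link: https://arxiv.org/abs/2502.04946
Authors: Craig Myles, In Hwa Um, Craig Marshall, David Harris-Birtill, David J. Harrison
Institutions: School of Computer Science, University of St Andrews; School of Medicine, University of St Andrews; Lothian Biorepository, NHS Lothian
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: To download the dataset, see this https URL. See this https URL for GitHub repository and additional info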

Abstract: Background: Cancer remains one of the leading causes of morbidity and mortality worldwide. Comprehensive datasets that combine histopathological images with genetic and survival data across various tumour sites are essential for advancing computational pathology and personalised medicine. Results: We present SurGen, a dataset comprising 1,020 H&E-stained whole slide images (WSIs) from 843 colorectal cancer cases. The dataset includes detailed annotations for key genetic mutations (KRAS, NRAS, BRAF) and mismatch repair status, as well as survival data for 426 cases. To demonstrate SurGen’s practical utility, we conducted a proof-of-concept machine learning experiment predicting mismatch repair status from the WSIs, achieving a test AUROC of 0.8316. These preliminary results underscore the dataset’s potential to facilitate research in biomarker discovery, prognostic modelling, and advanced machine learning applications in colorectal cancer. Conclusions: SurGen offers a valuable resource for the scientific community, enabling studies that require high-quality WSIs linked with comprehensive clinical and genetic information on colorectal cancer. Our initial findings affirm the dataset’s capacity to advance diagnostic precision and foster the development of personalised treatment strategies in colorectal oncology. Data available online at this https URL.
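For readers unfamiliar with the AUROC figure quoted above, the metric can be computed as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The sketch below uses made-up labels and scores purely for illustration, not data from SurGen, and the function name `auroc` is hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Illustrative values only: 1 = positive class, 0 = negative class.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auroc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the reported 0.8316.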

[CV-23] Cached Multi-Lora Composition for Multi-Concept Image Generation ICLR2025

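Quick read: This paper addresses the drop in image quality that occurs when multiple Low-Rank Adaptation (LoRA) modules are composed for multi-concept image generation. Through a Fourier frequency-domain analysis, it finds that some LoRAs amplify high-frequency features while others mainly affect low-frequency elements, and it derives a frequency-domain-based sequencing strategy to determine the order in which LoRAs are integrated during inference. It also introduces Cached Multi-LoRA (CMLoRA), a training-free framework whose flexible fusion backbone and non-uniform caching strategy aim to reduce semantic conflicts and improve computational efficiency.

Link: https://arxiv.org/abs/2502.04923
Authors: Xiandong Zou, Mingzhu Shen, Christos-Savvas Bouganis, Yiren Zhao
Institutions: Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The Thirteenth International Conference on Learning Representations (ICLR 2025)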

Abstract:Low-Rank Adaptation (LoRA) has emerged as a widely adopted technique in text-to-image models, enabling precise rendering of multiple distinct elements, such as characters and styles, in multi-concept image generation. However, current approaches face significant challenges when composing these LoRAs for multi-concept image generation, resulting in diminished generated image quality. In this paper, we initially investigate the role of LoRAs in the denoising process through the lens of the Fourier frequency domain. Based on the hypothesis that applying multiple LoRAs could lead to “semantic conflicts”, we find that certain LoRAs amplify high-frequency features such as edges and textures, whereas others mainly focus on low-frequency elements, including the overall structure and smooth color gradients. Building on these insights, we devise a frequency domain based sequencing strategy to determine the optimal order in which LoRAs should be integrated during inference. This strategy offers a methodical and generalizable solution compared to the naive integration commonly found in existing LoRA fusion techniques. To fully leverage our proposed LoRA order sequence determination method in multi-LoRA composition tasks, we introduce a novel, training-free framework, Cached Multi-LoRA (CMLoRA), designed to efficiently integrate multiple LoRAs while maintaining cohesive image generation. With its flexible backbone for multi-LoRA fusion and a non-uniform caching strategy tailored to individual LoRAs, CMLoRA has the potential to reduce semantic conflicts in LoRA composition and improve computational efficiency. Our experimental evaluations demonstrate that CMLoRA outperforms state-of-the-art training-free LoRA fusion methods by a significant margin – it achieves an average improvement of 2.19% in CLIPScore, and 11.25% in MLLM win rate compared to LoraHub, LoRA Composite, and LoRA Switch.

[CV-24] Goku: Flow Based Video Generative Foundation Models

Quick read: This paper presents Goku, a family of joint image-and-video generation models built on rectified flow Transformers. The key ingredients are the data curation pipeline, the model architecture design, the flow formulation, and infrastructure for efficient and robust large-scale training. Goku reports 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video generation.

Link: https://arxiv.org/abs/2502.04896
Authors: Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu
Institutions: The University of Hong Kong; Bytedance Inc
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: page: this https URL

Abstract:This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.
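As background on the "flow formulation" mentioned above, rectified flow is commonly described as training a model to predict a constant velocity along the straight line between a noise sample and a data sample. The sketch below shows that general construction only; which endpoint is treated as noise varies between papers, the abstract does not state Goku's convention, and the function names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Constant velocity along the straight path, used as a regression target."""
    return [b - a for a, b in zip(x0, x1)]


# Assumption for this example: x0 is a noise sample, x1 is a data sample.
x0 = [0.0, 2.0]
x1 = [4.0, 2.0]
print(interpolate(x0, x1, 0.25))
print(velocity_target(x0, x1))
```

In training, a network would take the interpolated point and the time `t` as input and be fit to the velocity target; at sampling time the learned velocity field is integrated from noise toward data.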

[CV-25] IPSeg: Image Posterior Mitigates Semantic Drift in Class-Incremental Segmentation

【速读】:该论文旨在解决在类别增量语义分割(Class Incremental Semantic Segmentation, CISS)过程中出现的两个关键挑战:一是单独优化导致的不同模型部分在不同增量阶段概率尺度不一致的问题;二是由于不当伪标签导致的语义噪声。为了解决这些问题,论文提出了一种名为图像后验与语义解耦分割(Image Posterior and Semantics Decoupling for Segmentation, IPSeg)的新方法。IPSeg的关键机制包括利用图像后验概率来对齐不同阶段的优化,并减轻单独优化的影响,以及采用语义解耦来处理语义噪声并针对不同的语义定制学习策略。

Link: https://arxiv.org/abs/2502.04870
Authors: Xiao Yu, Yan Fang, Yao Zhao, Yunchao Wei
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 9 figures

Abstract:Class incremental learning aims to enable models to learn from sequential, non-stationary data streams across different tasks without catastrophic forgetting. In class incremental semantic segmentation (CISS), the semantic content of image pixels evolves over incremental phases, known as semantic drift. In this work, we identify two critical challenges in CISS that contribute to semantic drift and degrade performance. First, we highlight the issue of separate optimization, where different parts of the model are optimized in distinct incremental stages, leading to misaligned probability scales. Second, we identify noisy semantics arising from inappropriate pseudo-labeling, which results in sub-optimal results. To address these challenges, we propose a novel and effective approach, Image Posterior and Semantics Decoupling for Segmentation (IPSeg). IPSeg introduces two key mechanisms: (1) leveraging image posterior probabilities to align optimization across stages and mitigate the effects of separate optimization, and (2) employing semantics decoupling to handle noisy semantics and tailor learning strategies for different semantics. Extensive experiments on the Pascal VOC 2012 and ADE20K datasets demonstrate that IPSeg achieves superior performance compared to state-of-the-art methods, particularly in challenging long-term incremental scenarios.

[CV-26] Relative Age Estimation Using Face Images

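Quick read: This paper aims to improve the accuracy of age estimation from a single facial image. The key is a new deep-learning approach that refines an initial age estimate. It draws on a reference database of faces with similar ages and appearances, and a network estimates the age differences between the input image and reference images of known age, which refines the initial estimate. The paper also proposes an age augmentation scheme that iteratively refines initial age estimates by modeling their error distribution during training, further improving accuracy. With these techniques the method achieves state-of-the-art accuracy on the MORPH II and CACD datasets.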

Link: https://arxiv.org/abs/2502.04852
Authors: Ran Sandhaus, Yosi Keller
Affiliations: Faculty of Engineering, Bar Ilan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This work introduces a novel deep-learning approach for estimating age from a single facial image by refining an initial age estimate. The refinement leverages a reference face database of individuals with similar ages and appearances. We employ a network that estimates age differences between an input image and reference images with known ages, thus refining the initial estimate. Our method explicitly models age-dependent facial variations using differential regression, yielding improved accuracy compared to conventional absolute age estimation. Additionally, we introduce an age augmentation scheme that iteratively refines initial age estimates by modeling their error distribution during training. This iterative approach further enhances the initial estimates. Our approach surpasses existing methods, achieving state-of-the-art accuracy on the MORPH II and CACD datasets. Furthermore, we examine the biases inherent in contemporary state-of-the-art age estimation techniques.
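The refinement step described above can be sketched roughly as follows: pick reference faces whose known ages are close to the initial estimate, ask a difference model for the age gap between the query and each reference, and average the resulting votes. This is only an illustrative sketch under assumptions: `refine_age` and `predict_age_difference` are hypothetical names, references are selected by age alone (the abstract also mentions appearance), and a simple mean is assumed as the aggregation rule.

```python
def refine_age(initial_estimate, references, predict_age_difference, k=3):
    """Refine an initial age estimate using the k references closest in age.

    references: list of (reference_id, known_age) pairs.
    predict_age_difference: hypothetical stand-in for the difference network;
        returns the estimated (query age - reference age) for a reference_id.
    """
    nearest = sorted(references, key=lambda r: abs(r[1] - initial_estimate))[:k]
    # Each reference votes for "its own known age + predicted difference".
    votes = [age + predict_age_difference(ref_id) for ref_id, age in nearest]
    return sum(votes) / len(votes)

# Toy example: the query's true age is 34 and the difference model is an oracle.
reference_db = [("a", 30), ("b", 36), ("c", 50), ("d", 33)]
known_ages = dict(reference_db)
oracle = lambda ref_id: 34 - known_ages[ref_id]
refined = refine_age(31, reference_db, oracle, k=3)
```

With a learned (imperfect) difference model, averaging over several references is one plausible way to reduce the variance of the refined estimate.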

[CV-27] HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation

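Quick read: This paper addresses the inaccurate rendering of detailed body parts such as hands and faces in human motion video generation, especially for long sequences and intricate motions. Existing methods are also tied to a fixed resolution and struggle to maintain visual consistency. To address these limitations, the paper proposes HumanDiT, a pose-guided framework based on the Diffusion Transformer (DiT) and trained on a large-scale, high-quality video dataset. The key elements are: (i) support for multiple video resolutions and variable sequence lengths, which facilitates learning for long-sequence video generation; (ii) a prefix-latent reference strategy that maintains personalized characteristics across extended sequences. During inference, HumanDiT additionally uses Keypoint-DiT to generate subsequent pose sequences and a Pose Adapter to perform pose transfer with given sequences.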

Link: https://arxiv.org/abs/2502.04847
Authors: Qijun Gan, Yi Ren, Chen Zhang, Zhenhui Ye, Pan Xie, Xiang Yin, Zehuan Yuan, Bingyue Peng, Jianke Zhu
Affiliations: ByteDance; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Human motion video generation has advanced significantly, while existing methods still struggle with accurately rendering detailed body parts like hands and faces, especially in long sequences and intricate motions. Current approaches also rely on fixed resolution and struggle to maintain visual consistency. To address these limitations, we propose HumanDiT, a pose-guided Diffusion Transformer (DiT)-based framework trained on a large and wild dataset containing 14,000 hours of high-quality video to produce high-fidelity videos with fine-grained body rendering. Specifically, (i) HumanDiT, built on DiT, supports numerous video resolutions and variable sequence lengths, facilitating learning for long-sequence video generation; (ii) we introduce a prefix-latent reference strategy to maintain personalized characteristics across extended sequences. Furthermore, during inference, HumanDiT leverages Keypoint-DiT to generate subsequent pose sequences, facilitating video continuation from static images or existing videos. It also utilizes a Pose Adapter to enable pose transfer with given sequences. Extensive experiments demonstrate its superior performance in generating long-form, pose-accurate videos across diverse scenarios.

[CV-28] PoI: Pixel of Interest for Novel View Synthesis Assisted Scene Coordinate Regression

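Quick read: This paper addresses the blurring and ghosting that often appear in rendered images when novel view synthesis techniques such as NeRF and Gaussian Splatting are used to augment camera pose estimation. These artifacts particularly hurt Scene Coordinate Regression (SCR) methods. The key is a novel filtering approach that measures the SCR model's real-time reprojection loss and gradient during training, selectively keeping well-rendered pixels and discarding inferior ones. The paper also develops a new strategy for improving scene coordinate regression from sparse inputs. Experiments show state-of-the-art performance on indoor and outdoor datasets.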

Link: https://arxiv.org/abs/2502.04843
Authors: Feifei Li, Qi Song, Chi Zhang, Hui Shuai, Rui Huang
Affiliations: Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The task of estimating camera poses can be enhanced through novel view synthesis techniques such as NeRF and Gaussian Splatting to increase the diversity and extension of training data. However, these techniques often produce rendered images with issues like blurring and ghosting, which compromise their reliability. These issues become particularly pronounced for Scene Coordinate Regression (SCR) methods, which estimate 3D coordinates at the pixel level. To mitigate the problems associated with unreliable rendered images, we introduce a novel filtering approach, which selectively extracts well-rendered pixels while discarding the inferior ones. This filter simultaneously measures the SCR model’s real-time reprojection loss and gradient during training. Building on this filtering technique, we also develop a new strategy to improve scene coordinate regression using sparse inputs, drawing on successful applications of sparse input techniques in novel view synthesis. Our experimental results validate the effectiveness of our method, demonstrating state-of-the-art performance on indoor and outdoor datasets.
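To make the pixel-filtering idea concrete, the sketch below keeps only pixels whose predicted 3D point reprojects close to the pixel it came from. It is a simplified stand-in, not the paper's filter: only the reprojection error is used (the abstract says the gradient is measured as well), the predicted points are assumed to be already expressed in the camera frame, a pinhole camera is assumed, and the threshold `max_error_px` and all function names are hypothetical.

```python
import math

def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point given in camera coordinates."""
    X, Y, Z = point_cam
    return (fx * X / Z + cx, fy * Y / Z + cy)

def reprojection_error(pixel, point_cam, intrinsics):
    """Euclidean distance (in pixels) between a pixel and its reprojected point."""
    u, v = project(point_cam, *intrinsics)
    return math.hypot(u - pixel[0], v - pixel[1])

def keep_well_rendered(pixels, predicted_points, intrinsics, max_error_px=2.0):
    """Return the pixels whose predicted 3D point lies in front of the camera
    and reprojects within max_error_px of the original pixel."""
    kept = []
    for pixel, point in zip(pixels, predicted_points):
        if point[2] > 0 and reprojection_error(pixel, point, intrinsics) <= max_error_px:
            kept.append(pixel)
    return kept

# Toy example: the second pixel's predicted point reprojects 40 px away.
intrinsics = (100.0, 100.0, 50.0, 50.0)  # fx, fy, cx, cy (assumed values)
pixels = [(50.0, 50.0), (60.0, 50.0)]
points = [(0.0, 0.0, 1.0), (0.5, 0.0, 1.0)]
kept = keep_well_rendered(pixels, points, intrinsics)
```

In a training loop, a mask like this would decide which pixels of a rendered image contribute to the SCR loss.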

[CV-29] DetVPCC: RoI-based Point Cloud Sequence Compression for 3D Object Detection

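Quick read: This paper addresses the poor trade-off between bitrate savings and detection accuracy when MPEG-standardized video-based point cloud compression (VPCC) is used to support 3D object detection. VPCC cannot prioritize regions of different importance within a point cloud. To address this, the paper proposes DetVPCC, which integrates region-of-interest (RoI) encoding with VPCC to compress point cloud sequences efficiently while preserving 3D object detection accuracy. The key is to augment VPCC with RoI-based compression by assigning spatially non-uniform quality levels, and to introduce a lightweight RoI detector that identifies crucial regions likely to contain objects.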

Link: https://arxiv.org/abs/2502.04804
Authors: Mingxuan Yan, Ruijie Zhang, Xuedou Xiao, Wei Wang
Affiliations: Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While MPEG-standardized video-based point cloud compression (VPCC) achieves high compression efficiency for human perception, it struggles with a poor trade-off between bitrate savings and detection accuracy when supporting 3D object detectors. This limitation stems from VPCC’s inability to prioritize regions of different importance within point clouds. To address this issue, we propose DetVPCC, a novel method integrating region-of-interest (RoI) encoding with VPCC for efficient point cloud sequence compression while preserving the 3D object detection accuracy. Specifically, we augment VPCC to support RoI-based compression by assigning spatially non-uniform quality levels. Then, we introduce a lightweight RoI detector to identify crucial regions that potentially contain objects. Experiments on the nuScenes dataset demonstrate that our approach significantly improves the detection accuracy. The code and demo video are available in supplementary materials.

[CV-30] Autoregressive Generation of Static and Growing Trees

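Quick read: This paper addresses tree generation and proposes a transformer architecture with multi-resolution processing, together with a training strategy. The key is an hourglass-shaped network in which the middle layers process fewer tokens than the outer layers, complemented by longer-range skip connections. This design yields faster processing and lower memory consumption, which makes it possible to generate more complex trees. The method is also extended to conditional generation from images and point clouds and to simulating tree growth, producing 4D trees.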

Link: https://arxiv.org/abs/2502.04762
Authors: Hanxiao Wang, Biao Zhang, Jonathan Klein, Dominik L. Michels, Dongming Yan, Peter Wonka
Affiliations: CASIA; KAUST
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose a transformer architecture and training strategy for tree generation. The architecture processes data at multiple resolutions and has an hourglass shape, with middle layers processing fewer tokens than outer layers. Similar to convolutional networks, we introduce longer range skip connections to complement this multi-resolution approach. The key advantage of this architecture is the faster processing speed and lower memory consumption. We are therefore able to process more complex trees than would be possible with a vanilla transformer architecture. Furthermore, we extend this approach to perform image-to-tree and point-cloud-to-tree conditional generation and to simulate the tree growth processes, generating 4D trees. Empirical results validate our approach in terms of speed, memory consumption, and generation quality.

[CV-31] Self-Supervised Learning for Pre-training Capsule Networks: Overcoming Medical Imaging Dataset Challenges

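Quick read: This paper addresses the challenges of training capsule networks on medical imaging datasets that are small, class-imbalanced, and subject to distribution shifts between splits. The key is to pre-train the capsule network with self-supervised auxiliary tasks such as contrastive learning and colourisation, which guides the model to capture visual features useful for polyp classification in colon cancer diagnostics and raises accuracy by 5.26% compared to other weight initialisation methods.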

Link: https://arxiv.org/abs/2502.04748
Authors: Heba El-Shimy, Hind Zantout, Michael A. Lones, Neamat El Gayar
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Deep learning techniques are increasingly being adopted in diagnostic medical imaging. However, the limited availability of high-quality, large-scale medical datasets presents a significant challenge, often necessitating the use of transfer learning approaches. This study investigates self-supervised learning methods for pre-training capsule networks in polyp diagnostics for colon cancer. We used the PICCOLO dataset, comprising 3,433 samples, which exemplifies typical challenges in medical datasets: small size, class imbalance, and distribution shifts between data splits. Capsule networks offer inherent interpretability due to their architecture and inter-layer information routing mechanism. However, their limited native implementation in mainstream deep learning frameworks and the lack of pre-trained versions pose a significant challenge. This is particularly true if aiming to train them on small medical datasets, where leveraging pre-trained weights as initial parameters would be beneficial. We explored two auxiliary self-supervised learning tasks, colourisation and contrastive learning, for capsule network pre-training. We compared self-supervised pre-trained models against alternative initialisation strategies. Our findings suggest that contrastive learning and in-painting techniques are suitable auxiliary tasks for self-supervised learning in the medical domain. These techniques helped guide the model to capture important visual features that are beneficial for the downstream task of polyp classification, increasing its accuracy by 5.26% compared to other weight initialisation methods.

[CV-32] SelaFD: Seamless Adaptation of Vision Transformer Fine-tuning for Radar-based Human Activity

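Quick read: This paper addresses the challenges of Human Activity Recognition (HAR), and fall detection in particular, from radar-based Time-Doppler signatures. Unlike traditional image datasets, these signals are non-visual and different activities look highly similar, so directly fine-tuning all parameters of a Vision Transformer (ViT) performs poorly. The paper proposes Low-Rank Adaptation (LoRA) fine-tuning in the weight space, combined with a serial-parallel adapter in the feature space that enhances the feature representation. This joint approach significantly improves HAR accuracy and surpasses existing state-of-the-art methods.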

Link: https://arxiv.org/abs/2502.04740
Authors: Yijun Wang, Yong Wang, Chendong Xu, Shuai Yao, Qisong Wu
Affiliations: Purple Mountain Laboratories; Key Laboratory of Underwater Acoustic Signal Processing of Ministry of Education, Southeast University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Human Activity Recognition (HAR) such as fall detection has become increasingly critical due to the aging population, necessitating effective monitoring systems to prevent serious injuries and fatalities associated with falls. This study focuses on fine-tuning the Vision Transformer (ViT) model specifically for HAR using radar-based Time-Doppler signatures. Unlike traditional image datasets, these signals present unique challenges due to their non-visual nature and the high degree of similarity among various activities. Directly fine-tuning the ViT with all parameters proves suboptimal for this application. To address this challenge, we propose a novel approach that employs Low-Rank Adaptation (LoRA) fine-tuning in the weight space to facilitate knowledge transfer from pre-trained ViT models. Additionally, to extract fine-grained features, we enhance feature representation through the integration of a serial-parallel adapter in the feature space. Our innovative joint fine-tuning method, tailored for radar-based Time-Doppler signatures, significantly improves HAR accuracy, surpassing existing state-of-the-art methodologies in this domain. Our code is released at this https URL.
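For readers unfamiliar with LoRA, the weight-space update it relies on can be sketched in a few lines: the pre-trained weight matrix W stays frozen and a low-rank product B·A, scaled by alpha/r, is added to it. This is a generic, dependency-free sketch of standard LoRA, not the paper's code; the helper names are hypothetical, and the serial-parallel adapter from the abstract is not shown.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A.

    w: frozen pre-trained weights, shape d_out x d_in
    b: trainable matrix, shape d_out x r (usually initialised to zeros)
    a: trainable matrix, shape r x d_in
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]                 # rank r = 1
b_zero = [[0.0], [0.0]]          # zero init: the adapted model starts equal to W
b_trained = [[1.0], [0.0]]       # pretend values after some training

unchanged = lora_weight(w, a, b_zero, alpha=2.0)
adapted = lora_weight(w, a, b_trained, alpha=2.0)
```

Only A and B are trained, so the number of trainable parameters per layer drops from d_out × d_in to r × (d_out + d_in).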

[CV-33] SC-OmniGS: Self-Calibrating Omnidirectional Gaussian Splatting ICLR2025

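Quick read: This paper addresses the specific challenges that 360-degree images pose for radiance field 3D reconstruction, which traditional methods do not handle. It proposes SC-OmniGS, a novel self-calibrating omnidirectional Gaussian splatting system for fast and accurate omnidirectional radiance field reconstruction. The key is to treat a 360-degree image as a whole sphere and derive a mathematical framework that enables direct omnidirectional camera pose calibration together with 3D Gaussian optimization. A differentiable omnidirectional camera model is also introduced to rectify the distortion of real-world data. Overall, the camera intrinsic model, the extrinsic poses, and the 3D Gaussians are jointly optimized by minimizing a weighted spherical photometric loss.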

Link: https://arxiv.org/abs/2502.04734
Authors: Huajian Huang, Yingshu Chen, Longwei Li, Hui Cheng, Tristan Braud, Yajie Zhao, Sai-Kit Yeung
Affiliations: The Hong Kong University of Science and Technology; Sun Yat-sen University; ICT, University of Southern California
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to ICLR 2025, Project Page: this http URL

Abstract:360-degree cameras streamline data collection for radiance field 3D reconstruction by capturing comprehensive scene data. However, traditional radiance field methods do not address the specific challenges inherent to 360-degree images. We present SC-OmniGS, a novel self-calibrating omnidirectional Gaussian splatting system for fast and accurate omnidirectional radiance field reconstruction using 360-degree images. Rather than converting 360-degree images to cube maps and performing perspective image calibration, we treat 360-degree images as a whole sphere and derive a mathematical framework that enables direct omnidirectional camera pose calibration accompanied by 3D Gaussians optimization. Furthermore, we introduce a differentiable omnidirectional camera model in order to rectify the distortion of real-world data for performance enhancement. Overall, the omnidirectional camera intrinsic model, extrinsic poses, and 3D Gaussians are jointly optimized by minimizing weighted spherical photometric loss. Extensive experiments have demonstrated that our proposed SC-OmniGS is able to recover a high-quality radiance field from noisy camera poses or even no pose prior in challenging scenarios characterized by wide baselines and non-object-centric configurations. The noticeable performance gain in the real-world dataset captured by consumer-grade omnidirectional cameras verifies the effectiveness of our general omnidirectional camera model in reducing the distortion of 360-degree images.

[CV-34] Can Diffusion Models Learn Hidden Inter-Feature Rules Behind Images?

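Quick read: This paper studies a limitation of diffusion models (DMs): learning hidden rules between image features, such as the relation between the sun's height and the length of a shadow. Despite their success in data generation, DMs fail on fine-grained rules such as consistent lighting-shadow relationships and matching object-mirror reflections. As a mitigation, the paper adds classifier guidance during sampling, which yields limited improvements on fine-grained rules. A theoretical analysis shows that DMs trained via denoising score matching (DSM) exhibit a constant error in learning hidden rules, because the DSM objective is not compatible with rule conformity.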

Link: https://arxiv.org/abs/2502.04725
Authors: Yujin Han, Andi Han, Wei Huang, Chaochao Lu, Difan Zou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 25 pages, 18 figures, 3 tables

Abstract:Despite the remarkable success of diffusion models (DMs) in data generation, they exhibit specific failure cases with unsatisfactory outputs. We focus on one such limitation: the ability of DMs to learn hidden rules between image features. Specifically, for image data with dependent features $\mathbf{x}$ and $\mathbf{y}$ (e.g., the height of the sun $\mathbf{x}$ and the length of the shadow $\mathbf{y}$), we investigate whether DMs can accurately capture the inter-feature rule $p(\mathbf{y}\mid\mathbf{x})$. Empirical evaluations on mainstream DMs (e.g., Stable Diffusion 3.5) reveal consistent failures, such as inconsistent lighting-shadow relationships and mismatched object-mirror reflections. Inspired by these findings, we design four synthetic tasks with strongly correlated features to assess DMs’ rule-learning abilities. Extensive experiments show that while DMs can identify coarse-grained rules, they struggle with fine-grained ones. Our theoretical analysis demonstrates that DMs trained via denoising score matching (DSM) exhibit constant errors in learning hidden rules, as the DSM objective is not compatible with rule conformity. To mitigate this, we introduce a common technique, incorporating additional classifier guidance during sampling, which achieves (limited) improvements. Our analysis reveals that the subtle signals of fine-grained rules are challenging for the classifier to capture, providing insights for future exploration.
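For reference, the two ingredients named in the abstract are usually written as below. These are the textbook forms, given as an assumption about what the abstract refers to and not in the paper's own notation: $s_\theta$ is the learned score network, $q(x_t \mid x_0)$ the forward noising kernel, $\lambda(t)$ a weighting function, $p_\phi(c \mid x_t)$ a separately trained classifier for a condition $c$, and $\gamma$ a guidance scale.

```latex
% Denoising score matching (DSM) training objective
\mathcal{L}_{\mathrm{DSM}}(\theta)
  = \mathbb{E}_{t,\,x_0,\,x_t \sim q(x_t \mid x_0)}
    \Big[ \lambda(t)\, \big\| s_\theta(x_t, t)
      - \nabla_{x_t} \log q(x_t \mid x_0) \big\|_2^2 \Big]

% Classifier guidance: the sampling-time score is shifted by the
% gradient of a classifier's log-probability
\tilde{s}(x_t, t)
  = s_\theta(x_t, t) + \gamma\, \nabla_{x_t} \log p_\phi(c \mid x_t)
```

Read this way, the guidance term can only help as much as the classifier can detect rule violations in noisy samples, which is consistent with the limited improvements the abstract reports.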

[CV-35] Tolerance-Aware Deep Optics

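Quick read: This paper addresses the lack of analysis and optimization of manufacturing and assembly tolerances in deep optics design, which creates a significant performance gap between designed and fabricated optical systems. The key is an end-to-end tolerance-aware optimization framework that incorporates multiple tolerance types into the deep optics design pipeline and combines physics-informed modelling with data-driven training to account for, and compensate for, structural deviations in manufacturing and assembly.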

Link: https://arxiv.org/abs/2502.04719
Authors: Jun Dai, Liqun Chen, Xinge Yang, Yuyao Hu, Jinwei Gu, Tianfan Xue
Affiliations: Shanghai AI Laboratory; KAUST; NVIDIA; The Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 14 pages, 14 figures

Abstract:Deep optics has emerged as a promising approach by co-designing optical elements with deep learning algorithms. However, current research typically overlooks the analysis and optimization of manufacturing and assembly tolerances. This oversight creates a significant performance gap between designed and fabricated optical systems. To address this challenge, we present the first end-to-end tolerance-aware optimization framework that incorporates multiple tolerance types into the deep optics design pipeline. Our method combines physics-informed modelling with data-driven training to enhance optical design by accounting for and compensating for structural deviations in manufacturing and assembly. We validate our approach through computational imaging applications, demonstrating results in both simulations and real-world experiments. We further examine how our proposed solution improves the robustness of optical systems and vision algorithms against tolerances through qualitative and quantitative analyses. Code and additional visual results are available at this http URL.

[CV-36] AI-Driven Solutions for Falcon Disease Classification: Concatenated ConvNeXt cum EfficientNet AI Model Approach

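Quick read: This paper addresses health monitoring of falcons used in hunting, and in particular accurate falcon disease classification. The key is an AI model that concatenates ConvNeXt and EfficientNet, which outperforms traditional methods and standalone architectures. The model distinguishes 'Normal', 'Liver', and 'Aspergillosis' cases, is trained and evaluated on a comprehensive dataset, and is assessed with accuracy, precision, recall, and F1-score.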

Link: https://arxiv.org/abs/2502.04682
Authors: Alavikunhu Panthakkan, Zubair Medammal, S M Anzar, Fatma Taher, Hussain Al-Ahmad
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages

Abstract:Falconry, an ancient practice of training and hunting with falcons, emphasizes the need for vigilant health monitoring to ensure the well-being of these highly valued birds, especially during hunting activities. This research paper introduces a cutting-edge approach, which leverages the power of Concatenated ConvNeXt and EfficientNet AI models for falcon disease classification. Focused on distinguishing ‘Normal,’ ‘Liver,’ and ‘Aspergillosis’ cases, the study employs a comprehensive dataset for model training and evaluation, utilizing metrics such as accuracy, precision, recall, and f1-score. Through rigorous experimentation and evaluation, we demonstrate the superior performance of the concatenated AI model compared to traditional methods and standalone architectures. This novel approach contributes to accurate falcon disease classification, laying the groundwork for further advancements in avian veterinary AI applications.

[CV-37] Performance Evaluation of Image Enhancement Techniques on Transfer Learning for Touchless Fingerprint Recognition

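Quick read: This paper addresses the image degradation and inconsistent user interaction that affect contact-based fingerprint recognition, and studies touchless fingerprint recognition as an alternative. The key is to combine image enhancement with transfer learning to improve the performance of pre-trained deep learning models on touchless fingerprints. Experiments show that the indirect method (with image enhancement) significantly outperforms the direct method (without enhancement): VGG-16 reaches 98% training accuracy and 93% test accuracy on enhanced images.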

Link: https://arxiv.org/abs/2502.04680
Authors: S Sreehari, Dilavar P D, S M Anzar, Alavikunhu Panthakkan, Saad Ali Amin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages

Abstract:Fingerprint recognition remains one of the most reliable biometric technologies due to its high accuracy and uniqueness. Traditional systems rely on contact-based scanners, which are prone to issues such as image degradation from surface contamination and inconsistent user interaction. To address these limitations, contactless fingerprint recognition has emerged as a promising alternative, providing non-intrusive and hygienic authentication. This study evaluates the impact of image enhancement techniques on the performance of pre-trained deep learning models using transfer learning for touchless fingerprint recognition. The IIT-Bombay Touchless and Touch-Based Fingerprint Database, containing data from 200 subjects, was employed to test the performance of deep learning architectures such as VGG-16, VGG-19, Inception-V3, and ResNet-50. Experimental results reveal that transfer learning methods with fingerprint image enhancement (indirect method) significantly outperform those without enhancement (direct method). Specifically, VGG-16 achieved an accuracy of 98% in training and 93% in testing when using the enhanced images, demonstrating superior performance compared to the direct method. This paper provides a detailed comparison of the effectiveness of image enhancement in improving the accuracy of transfer learning models for touchless fingerprint recognition, offering key insights for developing more efficient biometric systems. Related DOI: https://doi.org/10.1109/ICSPIS63676.2024.10812653

[CV-38] Mechanistic Understandings of Representation Vulnerabilities and Engineering Robust Vision Transformers

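Quick read: This paper addresses the representation vulnerabilities of Vision Transformers (ViT) under small input perturbations: perceptually identical images can have very different representations, while semantically unrelated images can share the same representation. The key is NeuroShield-ViT, a new defense mechanism that strategically neutralizes vulnerable neurons in earlier layers to stop adversarial effects from propagating. The method performs well across various attacks, especially strong iterative attacks, shows zero-shot generalization, and achieves 77.8% accuracy on adversarial examples without fine-tuning, surpassing conventional robustness methods.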

Link: https://arxiv.org/abs/2502.04679
Authors: Chashi Mahiul Islam, Samuel Jacob Chacko, Mao Nishino, Xiuwen Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures

Abstract:While transformer-based models dominate NLP and vision applications, their underlying mechanisms to map the input space to the label space semantically are not well understood. In this paper, we study the sources of known representation vulnerabilities of vision transformers (ViT), where perceptually identical images can have very different representations and semantically unrelated images can have the same representation. Our analysis indicates that imperceptible changes to the input can result in significant representation changes, particularly in later layers, suggesting potential instabilities in the performance of ViTs. Our comprehensive study reveals that adversarial effects, while subtle in early layers, propagate and amplify through the network, becoming most pronounced in middle to late layers. This insight motivates the development of NeuroShield-ViT, a novel defense mechanism that strategically neutralizes vulnerable neurons in earlier layers to prevent the cascade of adversarial effects. We demonstrate NeuroShield-ViT’s effectiveness across various attacks, particularly excelling against strong iterative attacks, and showcase its remarkable zero-shot generalization capabilities. Without fine-tuning, our method achieves a competitive accuracy of 77.8% on adversarial examples, surpassing conventional robustness methods. Our results shed new light on how adversarial effects propagate through ViT layers, while providing a promising approach to enhance the robustness of vision transformers against adversarial attacks.

[CV-39] MHAF-YOLO: Multi-Branch Heterogeneous Auxiliary Fusion YOLO for accurate object detection

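Quick read: This paper addresses a limitation of YOLO-based detectors in scenes with large scale variations: the Path Aggregation FPN (PAFPN) struggles to integrate high-level semantic cues with low-level spatial details. The paper proposes MHAF-YOLO, a detection framework built around the Multi-Branch Auxiliary FPN (MAFPN), which consists of two modules, Superficial Assisted Fusion (SAF) and Advanced Assisted Fusion (AAF). SAF fuses shallow features to transfer low-level spatial information effectively, while AAF integrates multi-scale feature information at deeper neck layers to strengthen the model's learning capacity. The paper further introduces the Global Heterogeneous Flexible Kernel Selection (GHFKS) mechanism and the Reparameterized Heterogeneous Multi-Scale (RepHMS) module to enhance feature fusion.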

Link: https://arxiv.org/abs/2502.04656
Authors: Zhiqiang Yang, Qiu Guan, Zhongwen Yu, Xinli Xu, Haixia Long, Sheng Lian, Haigen Hu, Ying Tang
Affiliations: Zhejiang University of Technology; Fuzhou University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2407.04381

Abstract:Due to the effective multi-scale feature fusion capabilities of the Path Aggregation FPN (PAFPN), it has become a widely adopted component in YOLO-based detectors. However, PAFPN struggles to integrate high-level semantic cues with low-level spatial details, limiting its performance in real-world applications, especially with significant scale variations. In this paper, we propose MHAF-YOLO, a novel detection framework featuring a versatile neck design called the Multi-Branch Auxiliary FPN (MAFPN), which consists of two key modules: the Superficial Assisted Fusion (SAF) and Advanced Assisted Fusion (AAF). The SAF bridges the backbone and the neck by fusing shallow features, effectively transferring crucial low-level spatial information with high fidelity. Meanwhile, the AAF integrates multi-scale feature information at deeper neck layers, delivering richer gradient information to the output layer and further enhancing the model learning capacity. To complement MAFPN, we introduce the Global Heterogeneous Flexible Kernel Selection (GHFKS) mechanism and the Reparameterized Heterogeneous Multi-Scale (RepHMS) module to enhance feature fusion. RepHMS is globally integrated into the network, utilizing GHFKS to select larger convolutional kernels for various feature layers, expanding the vertical receptive field and capturing contextual information across spatial hierarchies. Locally, it optimizes convolution by processing both large and small kernels within the same layer, broadening the lateral receptive field and preserving crucial details for detecting smaller targets. The source code of this work is available at: this https URL.

[CV-40] Building Rome with Convex Optimization

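Quick read: This paper addresses the difficulty of global bundle adjustment (BA) in 3D reconstruction. The key is a scaled bundle adjustment (SBA) formulation based on learned depth, together with an empirically tight convex semidefinite program (SDP) relaxation that solves SBA to certifiable global optimality. The SDP relaxations are solved at large scale with Burer-Monteiro factorization and a CUDA-based trust-region Riemannian optimizer (XM). Finally, a structure from motion (SfM) pipeline is built with XM as the optimization engine, showing advantages in reconstruction quality, speed, and scalability, without requiring initialization.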

Link: https://arxiv.org/abs/2502.04640
Authors: Haoyu Han, Heng Yang
Affiliations: School of Engineering and Applied Sciences, Harvard University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Comments:

Abstract:Global bundle adjustment is made easy by depth prediction and convex optimization. We (i) propose a scaled bundle adjustment (SBA) formulation that lifts 2D keypoint measurements to 3D with learned depth, (ii) design an empirically tight convex semidefinite program (SDP) relaxation that solves SBA to certifiable global optimality, (iii) solve the SDP relaxations at extreme scale with Burer-Monteiro factorization and a CUDA-based trust-region Riemannian optimizer (dubbed XM), (iv) build a structure from motion (SfM) pipeline with XM as the optimization engine and show that XM-SfM dominates or compares favorably with existing SfM pipelines in terms of reconstruction quality while being faster, more scalable, and initialization-free.
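As background on step (iii), the Burer-Monteiro idea can be stated generically as follows. This is the standard textbook form with placeholder data $C$, $A_i$, $b_i$, not the paper's specific SBA relaxation: the positive semidefinite variable $X$ is replaced by a low-rank factor $Y$, which removes the semidefinite constraint at the cost of convexity.

```latex
% Standard-form SDP over symmetric n x n matrices
\min_{X \in \mathbb{S}^{n}} \ \langle C, X \rangle
\quad \text{s.t.} \quad \langle A_i, X \rangle = b_i,\ i = 1, \dots, m,
\qquad X \succeq 0

% Burer-Monteiro factorization: substitute X = Y Y^T with Y in R^{n x r}, r << n
\min_{Y \in \mathbb{R}^{n \times r}} \ \langle C, Y Y^{\top} \rangle
\quad \text{s.t.} \quad \langle A_i, Y Y^{\top} \rangle = b_i,\ i = 1, \dots, m
```

The factorized problem has $nr$ unknowns instead of roughly $n^2/2$, and when its feasible set forms a smooth manifold it can be attacked with Riemannian optimization, which is the setting the abstract's trust-region solver targets.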

[CV-41] Learning Street View Representations with Spatiotemporal Contrast

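Quick read: This paper addresses representation learning for street view imagery, specifically how to encode the dynamic urban environment (pedestrians, vehicles, vegetation), the built environment (buildings, roads, urban infrastructure), and the environmental ambiance (cultural and socioeconomic atmosphere) in support of downstream urban tasks. The key is a self-supervised learning framework that exploits the temporal and spatial attributes of street view images to construct contrastive learning tasks, improving performance on visual place recognition, socioeconomic estimation, and human-environment perception.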

Link: https://arxiv.org/abs/2502.04638
Authors: Yong Li, Yingjing Huang, Gengchen Mai, Fan Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Street view imagery is extensively utilized in representation learning for urban visual environments, supporting various sustainable development tasks such as environmental perception and socio-economic assessment. However, it is challenging for existing image representations to specifically encode the dynamic urban environment (such as pedestrians, vehicles, and vegetation), the built environment (including buildings, roads, and urban infrastructure), and the environmental ambiance (such as the cultural and socioeconomic atmosphere) depicted in street view imagery to address downstream tasks related to the city. In this work, we propose an innovative self-supervised learning framework that leverages temporal and spatial attributes of street view imagery to learn image representations of the dynamic urban environment for diverse downstream tasks. By employing street view images captured at the same location over time and spatially nearby views at the same time, we construct contrastive learning tasks designed to learn the temporal-invariant characteristics of the built environment and the spatial-invariant neighborhood ambiance. Our approach significantly outperforms traditional supervised and unsupervised methods in tasks such as visual place recognition, socioeconomic estimation, and human-environment perception. Moreover, we demonstrate the varying behaviors of image representations learned through different contrastive learning objectives across various downstream tasks. This study systematically discusses representation learning strategies for urban studies based on street view images, providing a benchmark that enhances the applicability of visual data in urban science. The code is available at this https URL.
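A contrastive objective of the kind described above is often implemented as an InfoNCE loss: an anchor embedding is pulled towards its positive and pushed away from negatives. The sketch below assumes this loss and treats "same location, different time" as the positive pair for the temporal task; the abstract does not state the exact loss, so the function names, the temperature value, and the toy embeddings are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive's similarity
    among the similarities of the positive and all negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Toy embeddings: the anchor and the positive show the same place at two times.
anchor = [1.0, 0.0]
same_place_later = [0.9, 0.1]
other_places = [[0.0, 1.0], [-1.0, 0.0]]

loss_aligned = info_nce(anchor, same_place_later, other_places)
# Swapping in an unrelated view as the "positive" gives a much larger loss.
loss_misaligned = info_nce(anchor, [0.0, 1.0], [same_place_later, [-1.0, 0.0]])
```

For the spatial task the same loss would apply with spatially nearby views captured at the same time as positives, which is how the two objectives could emphasise different invariances.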

[CV-42] High-Speed Dynamic 3D Imaging with Sensor Fusion Splatting

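Quick read: This paper addresses the challenge of capturing and reconstructing high-speed dynamic 3D scenes with a single imaging modality. The key is a novel sensor fusion approach based on Gaussian splatting that combines RGB, depth, and event cameras to capture and reconstruct deforming scenes. It exploits the complementary strengths of these modalities: RGB cameras capture detailed color information, event cameras record rapid scene changes at microsecond resolution, and depth cameras provide 3D scene geometry. This enables efficient, high-quality imaging of fast and complex scenes, even under challenging conditions such as low light, narrow baselines, or rapid motion.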

Link: https://arxiv.org/abs/2502.04630
Authors: Zihao Zou, Ziyuan Qu, Xi Peng, Vivek Boominathan, Adithya Pediredla, Praneeth Chakravarthula
Affiliations: University of North Carolina, Chapel Hill; Dartmouth College; Rice University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Capturing and reconstructing high-speed dynamic 3D scenes has numerous applications in computer graphics, vision, and interdisciplinary fields such as robotics, aerodynamics, and evolutionary biology. However, achieving this using a single imaging modality remains challenging. For instance, traditional RGB cameras suffer from low frame rates, limited exposure times, and narrow baselines. To address this, we propose a novel sensor fusion approach using Gaussian splatting, which combines RGB, depth, and event cameras to capture and reconstruct deforming scenes at high speeds. The key insight of our method lies in leveraging the complementary strengths of these imaging modalities: RGB cameras capture detailed color information, event cameras record rapid scene changes with microsecond resolution, and depth cameras provide 3D scene geometry. To unify the underlying scene representation across these modalities, we represent the scene using deformable 3D Gaussians. To handle rapid scene movements, we jointly optimize the 3D Gaussian parameters and their temporal deformation fields by integrating data from all three sensor modalities. This fusion enables efficient, high-quality imaging of fast and complex scenes, even under challenging conditions such as low light, narrow baselines, or rapid motion. Experiments on synthetic and real datasets captured with our prototype sensor fusion setup demonstrate that our method significantly outperforms state-of-the-art techniques, achieving noticeable improvements in both rendering fidelity and structural accuracy.

[CV-43] AIQViT: Architecture-Informed Post-Training Quantization for Vision Transformers

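Quick read: This paper addresses the information loss and performance degradation that arise when post-training quantization (PTQ) is used to reduce the storage and computational cost of Vision Transformers (ViTs). The solution has two key parts. First, an architecture-informed low-rank compensation mechanism introduces learnable low-rank weights to compensate for the degradation caused by weight quantization. Second, a dynamic focusing quantizer accommodates the unbalanced distribution of post-Softmax activations and dynamically selects the most valuable interval for higher quantization resolution. Together these form a new PTQ method named AIQViT (Architecture-Informed Post-training Quantization for ViTs).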

Link: https://arxiv.org/abs/2502.04628
Authors: Runqing Jiang, Ye Zhang, Longguang Wang, Pengpeng Yu, Yulan Guo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Post-training quantization (PTQ) has emerged as a promising solution for reducing the storage and computational cost of vision transformers (ViTs). Recent advances primarily focus on crafting quantizers to deal with the peculiar activations that characterize ViTs. However, most existing methods underestimate the information loss incurred by weight quantization, resulting in significant performance deterioration, particularly in low-bit cases. Furthermore, a common practice in quantizing post-Softmax activations of ViTs is to employ logarithmic transformations, which unfortunately prioritize less informative values around zero. This approach introduces additional redundancies, ultimately leading to suboptimal quantization efficacy. To address these issues, this paper proposes an innovative PTQ method tailored for ViTs, termed AIQViT (Architecture-Informed Post-training Quantization for ViTs). First, we design an architecture-informed low-rank compensation mechanism, wherein learnable low-rank weights are introduced to compensate for the degradation caused by weight quantization. Second, we design a dynamic focusing quantizer to accommodate the unbalanced distribution of post-Softmax activations, which dynamically selects the most valuable interval for higher quantization resolution. Extensive experiments on five vision tasks, including image classification, object detection, instance segmentation, point cloud classification, and point cloud part segmentation, demonstrate the superiority of AIQViT over state-of-the-art PTQ methods.
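
To make the idea of low-rank compensation concrete, here is a minimal sketch. It is not the paper's method: AIQViT learns its low-rank weights, whereas this example simply takes a truncated SVD of the quantization error as a stand-in. The helper names (`quantize_uniform`, `low_rank_residual`), the bit width, and the rank are all assumptions chosen for illustration.

```python
import numpy as np

def quantize_uniform(w, bits=4):
    """Symmetric uniform quantization of a weight matrix (round to nearest)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def low_rank_residual(w, w_q, rank=8):
    """Best rank-`rank` approximation of the quantization error, via SVD.

    Stand-in for learnable low-rank weights: returns factors A, B such
    that A @ B approximates (w - w_q).
    """
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # shape (out, rank)
    b = vt[:rank]                # shape (rank, in)
    return a, b

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))          # hypothetical layer weights
w_q = quantize_uniform(w, bits=3)
a, b = low_rank_residual(w, w_q, rank=8)

err_plain = np.linalg.norm(w - w_q)             # error of quantized weights alone
err_comp = np.linalg.norm(w - (w_q + a @ b))    # error after low-rank compensation
```

The compensated error is never larger than the plain quantization error, and the two extra factors add only `rank * (in + out)` parameters per layer.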

[CV-44] HetSSNet: Spatial-Spectral Heterogeneous Graph Learning Network for Panchromatic and Multispectral Images Fusion

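Quick read: This paper addresses the limitations of mainstream models (such as CNNs and Transformers) in handling irregular ground objects during remote sensing image fusion. Its key contribution is HetSSNet, a spatial-spectral heterogeneous graph learning network that builds a customized heterogeneous graph to describe spatial-spectral relationships explicitly, and uses a basic relationship pattern generation module and a relationship pattern aggregation module to adaptively learn a unified spatial-spectral representation from local and global perspectives.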

Link: https://arxiv.org/abs/2502.04623
Authors: Mengting Ma, Yizhen Jiang, Mengjiao Zhao, Jiaxin Li, Wei Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Remote sensing pansharpening aims to reconstruct spatial-spectral properties during the fusion of panchromatic (PAN) images and low-resolution multi-spectral (LR-MS) images, finally generating high-resolution multi-spectral (HR-MS) images. In the mainstream modeling strategies, i.e., CNN and Transformer, the input images are treated as equal-sized grids of pixels in Euclidean space. They have limitations when facing remote sensing images with irregular ground objects. A graph is a more flexible structure; however, there are two major challenges when modeling spatial-spectral properties with a graph: 1) constructing the customized graph structure for spatial-spectral relationship priors; 2) learning the unified spatial-spectral representation through the graph. To address these challenges, we propose the spatial-spectral heterogeneous graph learning network, named HetSSNet. Specifically, HetSSNet initially constructs the heterogeneous graph structure for pansharpening, which explicitly describes pansharpening-specific relationships. Subsequently, the basic relationship pattern generation module is designed to extract the multiple relationship patterns from the heterogeneous graph. Finally, the relationship pattern aggregation module is exploited to collaboratively learn a unified spatial-spectral representation across different relationships among nodes, with adaptive importance learning from local and global perspectives. Extensive experiments demonstrate the significant superiority and generalization of HetSSNet.

[CV-45] Neural Clustering for Prefractured Mesh Generation in Real-time Object Destruction

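Quick read: This paper addresses the unrealistic results that prefracture methods produce in real-time object destruction simulation because of their heuristic nature. The key idea is to treat the clustering step of prefractured mesh generation as an unordered segmentation on point cloud data and to predict it with a deep neural network trained on a physics-based dataset, which successfully predicts an object's structural weaknesses and yields high-quality, ready-to-use results.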

Link: https://arxiv.org/abs/2502.04615
Authors: Seunghwan Kim, Sunha Park, Seungkyu Lee
Affiliations: Kyung Hee University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract: The prefracture method is a practical way to implement real-time object destruction, which is otherwise hardly achievable within performance constraints, but it can produce unrealistic results due to its heuristic nature. To mitigate this, we approach the clustering step of prefractured mesh generation as an unordered segmentation on point cloud data, and propose leveraging a deep neural network trained on a physics-based dataset. Our novel paradigm successfully predicts the structural weaknesses of objects, something that has been limited so far, exhibiting ready-to-use results with remarkable quality.

[CV-46] Multiscale style transfer based on a Laplacian pyramid for traditional Chinese painting

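Quick read: This paper addresses two problems of existing style transfer methods: they produce unnatural, messy artistic effects when applied to traditional Chinese painting styles, and they work only at the original image scale, ignoring multiscale image information. The key solution is a novel and effective multiscale style transfer method based on Laplacian pyramid decomposition and reconstruction. By learning different image features at different scales, it first transfers holistic patterns at low resolution and then progressively enhances content and style details at higher resolutions, which transfers the distinctive patterns of traditional Chinese painting effectively.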

Link: https://arxiv.org/abs/2502.04597
Authors: Kunxiao Liu, Guowu Yuan, Hongyu Liu, Hao Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 13 figures

Abstract:Style transfer is adopted to synthesize appealing stylized images that preserve the structure of a content image but carry the pattern of a style image. Many recently proposed style transfer methods use only western oil paintings as style images to achieve image stylization. As a result, unnatural messy artistic effects are produced in stylized images when using these methods to directly transfer the patterns of traditional Chinese paintings, which are composed of plain colors and abstract objects. Moreover, most of them work only at the original image scale and thus ignore multiscale image information during training. In this paper, we present a novel effective multiscale style transfer method based on Laplacian pyramid decomposition and reconstruction, which can transfer unique patterns of Chinese paintings by learning different image features at different scales. In the first stage, the holistic patterns are transferred at low resolution by adopting a Style Transfer Base Network. Then, the details of the content and style are gradually enhanced at higher resolutions by a Detail Enhancement Network with an edge information selection (EIS) module in the second stage. The effectiveness of our method is demonstrated through the generation of appealing high-quality stylization results and a comparison with some state-of-the-art style transfer methods. Datasets and codes are available at this https URL.
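
For readers unfamiliar with Laplacian pyramids, the following sketch shows the decomposition and exact reconstruction the method builds on. It is a simplification under stated assumptions: a classic Laplacian pyramid uses Gaussian smoothing, while this example uses 2x2 average pooling and nearest-neighbor upsampling to stay short, and it says nothing about the paper's networks. Function names are illustrative.

```python
import numpy as np

def down(img):
    """Halve resolution by 2x2 average pooling (height and width must be even)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(img):
    """Double resolution by nearest-neighbor repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_pyramid(img, levels=3):
    """Return [detail_0, ..., detail_{levels-1}, low_res_base]."""
    pyramid, current = [], img
    for _ in range(levels):
        smaller = down(current)
        pyramid.append(current - up(smaller))  # detail lost by downsampling
        current = smaller
    pyramid.append(current)                    # coarsest level: holistic layout
    return pyramid

def reconstruct(pyramid):
    """Invert build_pyramid by upsampling and adding details back, coarse to fine."""
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        current = up(current) + detail
    return current
```

In a two-stage scheme like the one described, the coarse base would carry the holistic pattern and the detail levels would carry the fine strokes and edges that get refined at higher resolution.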

[CV-47] An Optimized YOLOv5 Based Approach For Real-time Vehicle Detection At Road Intersections Using Fisheye Cameras

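Quick read: This paper addresses real-time vehicle detection for urban traffic surveillance, particularly at intersections, where dense traffic causes accidents and congestion. The key solution is a modified YOLOv5 object detection scheme designed for vehicle detection in fisheye camera images. It includes a lightweight day-night CNN classifier to handle daytime and nighttime detection separately, and uses dataset augmentation and ensemble training of the detection model to improve localization of small vehicles as well as overall detection accuracy and generalization.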

Link: https://arxiv.org/abs/2502.04566
Authors: Md. Jahin Alam, Muhammad Zubair Hasan, Md Maisoon Rahman, Md Awsafur Rahman, Najibul Haque Sarker, Shariar Azad, Tasnim Nishat Islam, Bishmoy Paul, Tanvir Anjum, Barproda Halder, Shaikh Anowarul Fattah
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Real-time vehicle detection is a challenging task for urban traffic surveillance. Increasing urbanization leads to more accidents and traffic congestion in junction areas, resulting in delayed travel time. To solve these problems, an intelligent system with automatic detection and tracking is important. But this becomes a challenging task at road intersections, which require a wide field of view. For this reason, fisheye cameras are widely used for real-time vehicle detection, providing large-area coverage and a 360-degree view at junctions. However, they introduce challenges such as light glare from vehicles and street lights, shadows, non-linear distortion, scaling issues of vehicles, and proper localization of small vehicles. To overcome each of these challenges, a modified YOLOv5 object detection scheme is proposed. YOLOv5 is a deep-learning, convolutional neural network (CNN) based object detection method. The proposed scheme for detecting vehicles in fisheye images includes a lightweight day-night CNN classifier so that two different solutions can be implemented to address the day-night detection issues. Furthermore, challenging instances are upsampled in the dataset for proper localization of vehicles, and the detection model is then ensembled and trained on different combinations of vehicle datasets for better generalization, detection, and accuracy. For testing, a real-world fisheye dataset provided by the Video and Image Processing (VIP) Cup organizer ISSD has been used, which includes images from video clips of different fisheye cameras at junctions in different cities during day and night time. Experimental results show that our proposed model outperforms the YOLOv5 model on the dataset by 13.7% mAP@0.5.
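
The reported metric, mAP@0.5, counts a predicted box as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below shows only that matching criterion, not the full mAP computation or anything specific to the paper; the box format `(x1, y1, x2, y2)` and the function names are assumptions for illustration.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # zero if boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_true_positive(pred, truth, threshold=0.5):
    """A detection counts as correct at mAP@0.5 when IoU >= 0.5."""
    return iou(pred, truth) >= threshold
```

Small vehicles are hard under this criterion because a shift of a few pixels already removes a large share of the overlap, which is one reason localization of small vehicles is singled out in the abstract.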

[CV-48] The Phantom of the Elytra – Phylogenetic Trait Extraction from Images of Rove Beetles Using Deep Learning – Is the Mask Enough? AAAI2025

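Quick read: This paper addresses the inefficiency and poor scalability of traditional phylogenetic analysis, which relies on manually extracted morphological traits. The key is a comparison of different morphological representations (full segmentations, binary masks, and Fourier descriptors) in deep learning models for automated trait extraction. On the Rove-Tree-11 dataset, the mask-based model performed best, with a normalized Align Score of 0.33 ± 0.02, outperforming the Fourier-based and segmentation-based models. This likely reflects the mask representation's ability to capture shape features while taking advantage of the depth and capacity of the ResNet50 architecture.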

Link: https://arxiv.org/abs/2502.04541
Authors: Roberta Hunt, Kim Steenstrup Pedersen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Imageomics Workshop at AAAI 2025 (not published in proceedings)

Abstract: Phylogenetic analysis traditionally relies on labor-intensive manual extraction of morphological traits, limiting its scalability for large datasets. Recent advances in deep learning offer the potential to automate this process, but the effectiveness of different morphological representations for phylogenetic trait extraction remains poorly understood. In this study, we compare the performance of deep learning models using three distinct morphological representations - full segmentations, binary masks, and Fourier descriptors of beetle outlines. We test this on the Rove-Tree-11 dataset, a curated collection of images from 215 rove beetle species. Our results demonstrate that the mask-based model outperformed the others, achieving a normalized Align Score of 0.33 ± 0.02 on the test set, compared to 0.45 ± 0.01 for the Fourier-based model and 0.39 ± 0.07 for the segmentation-based model. The performance of the mask-based model likely reflects its ability to capture shape features while taking advantage of the depth and capacity of the ResNet50 architecture. These results also indicate that dorsal textural features, at least in this group of beetles, may be of lowered phylogenetic relevance, though further investigation is necessary to confirm this. In contrast, the Fourier-based model suffered from reduced capacity and occasional inaccuracies in outline approximations, particularly in fine structures like legs. These findings highlight the importance of selecting appropriate morphological representations for automated phylogenetic studies and the need for further research into explainability in automatic morphological trait extraction.
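
As background on the third representation, here is a minimal sketch of one common way to compute Fourier descriptors of a closed outline. The paper's exact variant is not given in the abstract, so the normalization choices below (dropping the DC term, keeping magnitudes only, dividing by the largest magnitude) are assumptions, as are the function name and `n_coeffs`.

```python
import numpy as np

def fourier_descriptors(outline_xy, n_coeffs=16):
    """Descriptors of a closed outline sampled as an (N, 2) array of points.

    Invariant to translation, rotation, and scale under the
    normalization assumed here.
    """
    z = outline_xy[:, 0] + 1j * outline_xy[:, 1]  # treat points as complex numbers
    coeffs = np.fft.fft(z)
    coeffs[0] = 0                   # drop DC term: removes translation
    mags = np.abs(coeffs)           # drop phase: removes rotation and start point
    mags = mags / mags.max()        # normalize: removes scale
    # keep the lowest positive and negative frequencies
    return np.concatenate([mags[1:n_coeffs + 1], mags[-n_coeffs:]])
```

Keeping only a few low-frequency coefficients smooths the outline, which fits the abstract's remark that the Fourier-based model was occasionally inaccurate on fine structures like legs.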

[CV-49] AnyPlace: Learning Generalized Object Placement for Robot Manipulation

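Quick read: This paper addresses the challenge posed by the diversity of object placements in robotic tasks. The key is to use a Vision-Language Model (VLM) to identify rough placement locations, so that only the relevant regions need to be considered for local placement; this lets the low-level placement-pose-prediction model capture diverse placements efficiently. The proposed method, AnyPlace, is a two-stage approach trained entirely on synthetic data. Evaluations show that it outperforms baselines in success rate, coverage of possible placement modes, and precision, and that models trained purely on synthetic data transfer directly to the real world, where they complete placements in complex scenarios.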

Link: https://arxiv.org/abs/2502.04531
Authors: Yuchi Zhao, Miroslav Bogdanovic, Chengyuan Luo, Steven Tohme, Kourosh Darvish, Alán Aspuru-Guzik, Florian Shkurti, Animesh Garg
Affiliations: University of Toronto; Vector Institute; Shanghai Jiao Tong University; Wilfrid Laurier University; Acceleration Consortium; Georgia Institute of Technology
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Object placement in robotic tasks is inherently challenging due to the diversity of object geometries and placement configurations. To address this, we propose AnyPlace, a two-stage method trained entirely on synthetic data, capable of predicting a wide range of feasible placement poses for real-world tasks. Our key insight is that by leveraging a Vision-Language Model (VLM) to identify rough placement locations, we focus only on the relevant regions for local placement, which enables us to train the low-level placement-pose-prediction model to capture diverse placements efficiently. For training, we generate a fully synthetic dataset of randomly generated objects in different placement configurations (insertion, stacking, hanging) and train local placement-prediction models. We conduct extensive evaluations in simulation, demonstrating that our method outperforms baselines in terms of success rate, coverage of possible placement modes, and precision. In real-world experiments, we show how our approach directly transfers models trained purely on synthetic data to the real world, where it successfully performs placements in scenarios where other models struggle – such as with varying object geometries, diverse placement modes, and achieving high precision for fine placement. More at: this https URL.

[CV-50] Agricultural Field Boundary Detection through Integration of “Simple Non-Iterative Clustering (SNIC) Super Pixels” and “Canny Edge Detection Method”

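Quick read: This paper addresses the efficient use of cultivated areas for sustainable agriculture and food security. The key is to combine the “SNIC (Simple Non-Iterative Clustering) Super Pixels” algorithm with the “Canny Edge Detection Method” to determine the suitability and green index of cultivated areas. SNIC groups pixels of a satellite image into larger regions (superpixels) with similar characteristics, which supports better image analysis, while Canny edge detection finds sharp changes (edges) in the image to locate the precise boundaries of agricultural fields. The combination improves the accuracy and reliability of field boundary identification.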

Link: https://arxiv.org/abs/2502.04529
Authors: Artughrul Gayibov (Baku Engineering University)
Affiliations: Baku Engineering University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 2 figures

Abstract: Efficient use of cultivated areas is a necessary factor for sustainable development of agriculture and ensuring food security. Along with the rapid development of satellite technologies in developed countries, new methods are being sought for accurate and operational identification of cultivated areas. In this context, identification of cropland boundaries based on spectral analysis of data obtained from satellite images is considered one of the most optimal and accurate methods in modern agriculture. This article proposes a new approach to determine the suitability and green index of cultivated areas using satellite data obtained through the “Google Earth Engine” (GEE) platform. In this approach, two powerful algorithms, “SNIC (Simple Non-Iterative Clustering) Super Pixels” and “Canny Edge Detection Method”, are combined. The SNIC algorithm combines pixels in a satellite image into larger regions (super pixels) with similar characteristics, thereby providing better image analysis. The Canny Edge Detection Method detects sharp changes (edges) in the image to determine the precise boundaries of agricultural fields. This study, carried out using high-resolution multispectral data from the Sentinel-2 satellite and the Google Earth Engine JavaScript API, has shown that the proposed method is effective in accurately and reliably classifying randomly selected agricultural fields. The combined use of these two tools allows for more accurate determination of the boundaries of agricultural fields by minimizing the effects of outliers in satellite images. As a result, more accurate and reliable maps can be created for agricultural monitoring and resource management over large areas based on the obtained data, expanding the application capabilities of cloud-based platforms and artificial intelligence methods in the agricultural field.

[CV-51] Fast Video Generation with Sliding Tile Attention

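Quick read: This paper addresses the prohibitive compute cost of Diffusion Transformers (DiTs) in video generation. Specifically, when generating a 5-second 720P video, attention alone takes 800 of the 945 seconds of total inference time. The key solution is Sliding Tile Attention (STA), which removes redundant computation by exploiting the observation that attention scores in pretrained video diffusion models concentrate mainly within localized 3D windows. Unlike traditional token-wise sliding window attention, STA operates tile by tile with a hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA provides an efficient 2D/3D sliding-window-like attention implementation that accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3).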

Link: https://arxiv.org/abs/2502.04507
Authors: Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, Hao Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost – when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Precisely, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.
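
The attention pattern behind tile-wise local attention can be sketched with a dense mask. This is only an illustration of the pattern, not the paper's fused GPU kernel, and the window semantics here (a query tile attends to key tiles within a fixed number of tiles along each axis) are an assumption. All names and sizes are hypothetical.

```python
import numpy as np

def tile_mask(grid, tile, window):
    """Boolean (N, N) mask over tokens laid out on a (T, H, W) grid.

    Token i may attend to token j when their tiles are within `window`
    tiles of each other along every axis.
    """
    axes = [np.arange(g) for g in grid]
    coords = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    tiles = coords // np.array(tile)                      # tile index of each token
    diff = np.abs(tiles[:, None, :] - tiles[None, :, :])  # (N, N, 3) tile distances
    return np.all(diff <= np.array(window), axis=-1)

def masked_attention(q, k, v, mask):
    """Softmax attention in which masked-out pairs get zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # each row keeps at least itself
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because every token in a tile shares the same set of key tiles, each tile-to-tile block of the mask is either fully on or fully off. A block-sparse kernel can then skip whole blocks instead of masking individual tokens, which is the hardware-friendliness the abstract contrasts with token-wise sliding windows.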

[CV-52] Measuring Physical Plausibility of 3D Human Poses Using Physics Simulation BMVC2024

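Quick read: This paper addresses the lack of physical plausibility and stability evaluation for poses predicted by 3D human pose estimation models. The key is the introduction of two new metrics based on physics simulation that capture the physical plausibility and stability of predicted 3D poses, overcoming the limitation of existing metrics, which focus on joint position error and ignore whether a pose is physically plausible.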

Link: https://arxiv.org/abs/2502.04483
Authors: Nathan Louis, Mahzad Khoshlessan, Jason J. Corso
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to BMVC2024

Abstract: Modeling humans in physical scenes is vital for understanding human-environment interactions for applications involving augmented reality or assessment of human actions from video (e.g. sports or physical rehabilitation). State-of-the-art literature begins with a 3D human pose, from monocular or multiple views, and uses this representation to ground the person within a 3D world space. While standard metrics for accuracy capture joint position errors, they do not consider physical plausibility of the 3D pose. This limitation has motivated researchers to propose other metrics evaluating jitter, floor penetration, and unbalanced postures. Yet, these approaches measure independent instances of errors and are not representative of balance or stability during motion. In this work, we propose measuring physical plausibility from within physics simulation. We introduce two metrics to capture the physical plausibility and stability of predicted 3D poses from any 3D Human Pose Estimation model. Using physics simulation, we discover correlations with existing plausibility metrics and measure stability during motion. We evaluate and compare the performances of two state-of-the-art methods, a multi-view triangulated baseline, and ground truth 3D markers from the Human3.6m dataset.
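
The abstract contrasts its simulation-based metrics with existing per-frame metrics such as jitter and floor penetration. The sketch below shows one plausible form of those two existing metrics, not the paper's new ones. Definitions vary across the literature, so the formulas, the z-up convention, the default `fps`, and the function names are all assumptions.

```python
import numpy as np

def floor_penetration(poses, floor_z=0.0):
    """Mean depth by which joints sink below the floor plane.

    poses: array of shape (frames, joints, 3), z axis pointing up (assumed).
    """
    depth = np.clip(floor_z - poses[..., 2], 0.0, None)  # 0 for joints above floor
    return depth.mean()

def jitter(poses, fps=30.0):
    """Mean magnitude of jerk (third time-derivative of joint position)."""
    jerk = np.diff(poses, n=3, axis=0) * fps ** 3  # finite-difference jerk
    return np.linalg.norm(jerk, axis=-1).mean()
```

Both functions score joints independently of each other and of the body's dynamics, which illustrates the abstract's point that such metrics are not representative of balance or stability during motion.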

[CV-53] OneTrack-M: A multitask approach to transformer-based MOT models

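Quick read: This paper addresses challenges in Multi-Object Tracking (MOT) in computer vision, in particular improving the computational efficiency and accuracy of tracking. The key solution is OneTrack-M, a Transformer-based model that simplifies the typical Transformer architecture by removing the decoder used for object detection and tracking and using the encoder alone as the backbone for temporal data interpretation, which significantly reduces processing time and increases inference speed. Innovative data pre-processing and multitask training techniques further address occlusion and diverse objectives. Experiments show that OneTrack-M achieves at least 25% faster inference than state-of-the-art models while maintaining or improving tracking accuracy metrics.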

Link: https://arxiv.org/abs/2502.04478
Authors: Luiz C. S. de Araujo, Carlos M. S. Figueiredo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 11 figures

Abstract:Multi-Object Tracking (MOT) is a critical problem in computer vision, essential for understanding how objects move and interact in videos. This field faces significant challenges such as occlusions and complex environmental dynamics, impacting model accuracy and efficiency. While traditional approaches have relied on Convolutional Neural Networks (CNNs), introducing transformers has brought substantial advancements. This work introduces OneTrack-M, a transformer-based MOT model designed to enhance tracking computational efficiency and accuracy. Our approach simplifies the typical transformer-based architecture by eliminating the need for a decoder model for object detection and tracking. Instead, the encoder alone serves as the backbone for temporal data interpretation, significantly reducing processing time and increasing inference speed. Additionally, we employ innovative data pre-processing and multitask training techniques to address occlusion and diverse objective challenges within a single set of weights. Experimental results demonstrate that OneTrack-M achieves at least 25% faster inference times compared to state-of-the-art models in the literature while maintaining or improving tracking accuracy metrics. These improvements highlight the potential of the proposed solution for real-time applications such as autonomous vehicles, surveillance systems, and robotics, where rapid responses are crucial for system effectiveness.

[CV-54] Augmented Conditioning Is Enough For Effective Training Image Generation

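Quick read: This paper addresses the lack of diversity in images generated by text-to-image diffusion models when those images are used to train downstream image classification models. The key solution is to condition the generation process on an augmented real image together with a text prompt, which improves the diversity of generated images while keeping them realistic, so they serve downstream tasks more effectively. Generated images stay in-domain with the real image distribution, the augmentations introduce visual diversity, and classification performance improves markedly in extreme few-shot regimes.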

Link: https://arxiv.org/abs/2502.04475
Authors: Jiahui Chen, Amy Zhang, Adriana Romero-Soriano
Affiliations: UT Austin; McGill University; Mila
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Image generation abilities of text-to-image diffusion models have significantly advanced, yielding highly photo-realistic images from descriptive text and increasing the viability of leveraging synthetic images to train computer vision models. To serve as effective training data, generated images must be highly realistic while also sufficiently diverse within the support of the target data distribution. Yet, state-of-the-art conditional image generation models have been primarily optimized for creative applications, prioritizing image realism and prompt adherence over conditional diversity. In this paper, we investigate how to improve the diversity of generated images with the goal of increasing their effectiveness to train downstream image classification models, without fine-tuning the image generation model. We find that conditioning the generation process on an augmented real image and text prompt produces generations that serve as effective synthetic datasets for downstream training. Conditioning on real training images contextualizes the generation process to produce images that are in-domain with the real image distribution, while data augmentations introduce visual diversity that improves the performance of the downstream classifier. We validate augmentation-conditioning on a total of five established long-tail and few-shot image classification benchmarks and show that leveraging augmentations to condition the generation process results in consistent improvements over the state-of-the-art on the long-tailed benchmark and remarkable gains in extreme few-shot regimes of the remaining four benchmarks. These results constitute an important step towards effectively leveraging synthetic data for downstream training.

[CV-55] Color in Visual-Language Models: CLIP deficiencies

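Quick read: This paper investigates the capabilities and limitations of CLIP (Contrastive Language-Image Pre-training) in encoding color. It finds that although CLIP assigns correct color labels to colored visual stimuli, it has two main deficiencies: (a) a clear bias on achromatic stimuli, which are poorly related to the color concept, so white, gray, and black are rarely assigned as color labels; and (b) a tendency to prioritize text over other visual information. To find the causes, the authors analyze internal representations at the neuron level and find that CLIP has a large number of text-selective neurons, especially in the deepest layers, and a smaller number of multimodal color neurons. The authors argue that refining color representation mechanisms in neural networks is needed for a more comprehensive understanding of color, which would improve the efficacy and versatility of multimodal models like CLIP in real-world scenarios.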

Link: https://arxiv.org/abs/2502.04470
Authors: Guillem Arias, Ramon Baldrich, Maria Vanrell
Affiliations: Computer Vision Center; Universitat Autònoma de Barcelona
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 10 figures, conference, Artificial Intelligence

Abstract: This work explores how color is encoded in CLIP (Contrastive Language-Image Pre-training), which is currently the most influential VLM (Visual Language Model) in Artificial Intelligence. After performing different experiments on synthetic datasets created for this task, we conclude that CLIP is able to attribute correct color labels to colored visual stimuli, but we come across two main deficiencies: (a) a clear bias on achromatic stimuli that are poorly related to the color concept, thus white, gray and black are rarely assigned as color labels; and (b) the tendency to prioritize text over other visual information. Here we prove it is highly significant in color labelling through an exhaustive Stroop-effect test. With the aim of finding the causes of these color deficiencies, we analyse the internal representation at the neuron level. We conclude that CLIP presents an important amount of neurons selective to text, especially in the deepest layers of the network, and a smaller amount of multi-modal color neurons, which could be the key to understanding the concept of color properly. Our investigation underscores the necessity of refining color representation mechanisms in neural networks to foster a more comprehensive comprehension of colors as humans understand them, thereby advancing the efficacy and versatility of multimodal models like CLIP in real-world scenarios.

[CV-56] No Images No Problem: Retaining Knowledge in Continual VQA with Questions-Only Memory

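Quick read: This paper addresses the balance between stability and plasticity in continual learning for Visual Question Answering (VQA), a multimodal setting that existing methods, mostly designed for unimodal tasks, handle poorly. The key solution is QUestion-only replay with Attention Distillation (QUAD). QUAD uses only questions from past tasks for regularisation, which removes the need to store visual data and addresses memory and privacy concerns. It achieves stability by selectively replaying questions from previous tasks to prevent overfitting to the current task's answer space, and it applies attention consistency distillation to enforce intra-modal and inter-modal attention consistency across tasks, preserving essential visual-linguistic associations. Experiments on VQAv2 and NExT-QA show that QUAD significantly outperforms state-of-the-art methods and achieves robust performance in continual VQA.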

Link: https://arxiv.org/abs/2502.04469
Authors: Imad Eddine Marouf, Enzo Tartaglione, Stephane Lathuiliere, Joost van de Weijer
Affiliations: LTCI, Télécom-Paris, Institut Polytechnique de Paris; Inria Grenoble; Computer Vision Center (CVC)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, in-review

Abstract:Continual Learning in Visual Question Answering (VQACL) requires models to learn new visual-linguistic tasks (plasticity) while retaining knowledge from previous tasks (stability). The multimodal nature of VQACL presents unique challenges, requiring models to balance stability across visual and textual domains while maintaining plasticity to adapt to novel objects and reasoning tasks. Existing methods, predominantly designed for unimodal tasks, often struggle to balance these demands effectively. In this work, we introduce QUestion-only replay with Attention Distillation (QUAD), a novel approach for VQACL that leverages only past task questions for regularisation, eliminating the need to store visual data and addressing both memory and privacy concerns. QUAD achieves stability by introducing a question-only replay mechanism that selectively uses questions from previous tasks to prevent overfitting to the current task’s answer space, thereby mitigating the out-of-answer-set problem. Complementing this, we propose attention consistency distillation, which uniquely enforces both intra-modal and inter-modal attention consistency across tasks, preserving essential visual-linguistic associations. Extensive experiments on VQAv2 and NExT-QA demonstrate that QUAD significantly outperforms state-of-the-art methods, achieving robust performance in continual VQA.

[CV-57] TerraQ: Spatiotemporal Question-Answering on Satellite Image Archives

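Quick read: This paper addresses natural-language access to images in satellite image archives that satisfy given criteria. The key is the TerraQ engine, a natural language processing system that handles requests referring to image metadata and to entities from a specialized knowledge base (for example, the Emilia-Romagna region), making Earth Observation data more easily accessible, in line with the current landscape of digital assistants.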

Link: https://arxiv.org/abs/2502.04415
Authors: Sergios-Anestis Kefalidis, Konstantinos Plas, Manolis Koubarakis
Affiliations: National and Kapodistrian University of Athens; Archimedes/Athena RC
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:TerraQ is a spatiotemporal question-answering engine for satellite image archives. It is a natural language processing system that is built to process requests for satellite images satisfying certain criteria. The requests can refer to image metadata and entities from a specialized knowledge base (e.g., the Emilia-Romagna region). With it, users can make requests like “Give me a hundred images of rivers near ports in France, with less than 20% snow coverage and more than 10% cloud coverage”, thus making Earth Observation data more easily accessible, in-line with the current landscape of digital assistants.

[CV-58] Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting

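Quick read: This paper addresses the limitations of augmenting time series forecasting with a single modality (text or vision). Text provides contextual understanding but lacks fine-grained temporal detail, while vision captures intricate temporal patterns but lacks semantic context, which limits the complementary potential of the two. The paper proposes Time-VLM, a multimodal framework that uses pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for better forecasting. It has three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components work with frozen pre-trained VLMs to produce multimodal embeddings, which are fused with temporal features for the final prediction. Experiments show strong performance, particularly in few-shot and zero-shot scenarios.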

Link: https://arxiv.org/abs/2502.04395
Authors: Siru Zhong, Weilin Ruan, Ming Jin, Huan Li, Qingsong Wen, Yuxuan Liang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages

Abstract:Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for final prediction. Extensive experiments across diverse datasets demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting.

[CV-59] UniCP: A Unified Caching and Pruning Framework for Efficient Video Generation

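Quick read: This paper addresses the significant computational cost that Diffusion Transformers (DiT) incur in video generation because of the quadratic complexity of attention. It observes that attention differences between adjacent diffusion steps follow a U-shaped pattern; current methods exploit this by caching attention blocks but still suffer from sudden error spikes and large discrepancies. To address this, the paper proposes UniCP, a unified caching and pruning framework with three key parts: Error Aware Dynamic Cache Window (EDCW) dynamically adjusts the cache window size of each block at different timesteps to adapt to abrupt error changes; PCA based Slicing (PCAS) prunes redundant attention components; and Dynamic Weight Shift (DWS) switches dynamically between cached and pruned outputs, improving computational efficiency while maintaining video detail fidelity.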

Link: https://arxiv.org/abs/2502.04393
Authors: Wenzhang Sun, Qirui Hou, Donglin Di, Jiahui Yang, Yongjia Ma, Jianxun Cui
Affiliations: Li Auto; Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion Transformers (DiT) excel in video generation but encounter significant computational challenges due to the quadratic complexity of attention. Notably, attention differences between adjacent diffusion steps follow a U-shaped pattern. Current methods leverage this property by caching attention blocks; however, they still struggle with sudden error spikes and large discrepancies. To address these issues, we propose UniCP, a unified caching and pruning framework for efficient video generation. UniCP optimizes both temporal and spatial dimensions through: (1) Error Aware Dynamic Cache Window (EDCW), which dynamically adjusts cache window sizes for different blocks at various timesteps, adapting to abrupt error changes; and (2) PCA based Slicing (PCAS) and Dynamic Weight Shift (DWS), where PCAS prunes redundant attention components and DWS integrates caching and pruning by enabling dynamic switching between pruned and cached outputs. By adjusting cache windows and pruning redundant components, UniCP enhances computational efficiency and maintains video detail fidelity. Experimental results show that UniCP outperforms existing methods in both performance and efficiency.
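
The idea of adapting a cache window to abrupt error changes can be illustrated with a small Python sketch. This is not the paper's EDCW algorithm: the function name `next_cache_window`, the halve-or-grow rule, and all parameter values are hypothetical assumptions chosen only to show what an error-driven window update might look like.

```python
def next_cache_window(window, error, threshold, min_window=1, max_window=8):
    """Return the cache window for one attention block at the next timestep.

    Hypothetical rule (an assumption, not taken from the paper): halve the
    window when the observed error exceeds the threshold, otherwise grow it
    by one step, capped at max_window.
    """
    if error > threshold:
        # An error spike: refresh the cached attention output more often.
        return max(min_window, window // 2)
    # Error is stable: reuse the cached output for longer.
    return min(max_window, window + 1)
```

Under this rule, a block whose cached output drifts sharply is recomputed sooner, while stable blocks keep reusing their cache.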

[CV-60] Towards Fair and Robust Face Parsing for Generative AI: A Multi-Objective Approach

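Quick read: This paper addresses the lack of fairness and robustness in existing face parsing models, which leads to biased segmentation across demographic groups and to errors under occlusion, noise, and domain shift. The key to the solution is a multi-objective learning framework with a homotopy-based loss function that dynamically adjusts the weights of accuracy, fairness, and robustness during training. Experiments show that fairness-aware and robust segmentation improves the photorealism and consistency of GAN-based face generation.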

Link: https://arxiv.org/abs/2502.04391
Authors: Sophia J. Abraham, Jonathan D. Hauenstein, Walter J. Scheirer
Affiliations: University of Notre Dame
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Face parsing is a fundamental task in computer vision, enabling applications such as identity verification, facial editing, and controllable image synthesis. However, existing face parsing models often lack fairness and robustness, leading to biased segmentation across demographic groups and errors under occlusions, noise, and domain shifts. These limitations affect downstream face synthesis, where segmentation biases can degrade generative model outputs. We propose a multi-objective learning framework that optimizes accuracy, fairness, and robustness in face parsing. Our approach introduces a homotopy-based loss function that dynamically adjusts the importance of these objectives during training. To evaluate its impact, we compare multi-objective and single-objective U-Net models in a GAN-based face synthesis pipeline (Pix2PixHD). Our results show that fairness-aware and robust segmentation improves photorealism and consistency in face generation. Additionally, we conduct preliminary experiments using ControlNet, a structured conditioning model for diffusion-based synthesis, to explore how segmentation quality influences guided image generation. Our findings demonstrate that multi-objective face parsing improves demographic consistency and robustness, leading to higher-quality GAN-based synthesis.
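
A homotopy-style loss can be sketched as a gradual blend from an easy objective to the full multi-objective target. The Python sketch below is only an illustration under stated assumptions: the linear schedule, the equal weighting of the three terms, and the name `homotopy_loss` are hypothetical and are not taken from the paper.

```python
def homotopy_loss(acc_loss, fair_loss, robust_loss, step, total_steps):
    """Blend an accuracy-only loss into an equal-weight multi-objective loss.

    Assumptions (not from the paper): a linear homotopy parameter t that runs
    from 0 to 1 over training, and an equal-weight average as the target.
    total_steps must be positive.
    """
    # Homotopy parameter, clamped to [0, 1].
    t = min(max(step / total_steps, 0.0), 1.0)
    # Target objective: equal-weight average of the three loss terms.
    target = (acc_loss + fair_loss + robust_loss) / 3.0
    # Start from accuracy alone (t = 0) and end at the full target (t = 1).
    return (1.0 - t) * acc_loss + t * target
```

Early in training the loss is dominated by accuracy; fairness and robustness gain influence as `t` approaches 1.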

[CV-61] Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings

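Quick read: This paper addresses the problem that embeddings produced by self-supervised foundation models for 3D CT data encode demographic information such as age, sex, and race, which threatens the fairness of clinical applications. The key to the solution is a Variational Autoencoder (VAE) based adversarial debiasing framework that transforms the embeddings into a new latent space where demographic information is no longer encoded, while maintaining performance on critical downstream tasks such as lung cancer risk prediction. Experiments show that the method removes several kinds of encoded demographic information and improves fairness without compromising lung cancer risk prediction accuracy at 1-year and 2-year intervals, and that the embeddings are robust against adversarial bias attacks.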

Link: https://arxiv.org/abs/2502.04386
Authors: Guangyao Zheng, Michael A. Jacobs, Vladimir Braverman, Vishwa S. Parekh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Self-supervised learning has revolutionized medical imaging by enabling efficient and generalizable feature extraction from large-scale unlabeled datasets. Recently, self-supervised foundation models have been extended to three-dimensional (3D) computed tomography (CT) data, generating compact, information-rich embeddings with 1408 features that achieve state-of-the-art performance on downstream tasks such as intracranial hemorrhage detection and lung cancer risk forecasting. However, these embeddings have been shown to encode demographic information, such as age, sex, and race, which poses a significant risk to the fairness of clinical applications. In this work, we propose a Variational Autoencoder (VAE) based adversarial debiasing framework to transform these embeddings into a new latent space where demographic information is no longer encoded, while maintaining the performance of critical downstream tasks. We validated our approach on the NLST lung cancer screening dataset, demonstrating that the debiased embeddings effectively eliminate multiple types of encoded demographic information and improve fairness without compromising predictive accuracy for lung cancer risk at 1-year and 2-year intervals. Additionally, our approach ensures the embeddings are robust against adversarial bias attacks. These results highlight the potential of adversarial debiasing techniques to ensure fairness and equity in clinical applications of self-supervised 3D CT embeddings, paving the way for their broader adoption in unbiased medical decision-making.

[CV-62] TexLiDAR: Automated Text Understanding for Panoramic LiDAR Data

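Quick read: This paper addresses the challenge of connecting LiDAR data with text, in particular the encoding-efficiency and neural-network-processing difficulties of working with 3D point clouds. The key is to use the fixed-resolution depth, signal, and ambient panoramic 2D images produced by advanced LiDAR sensors such as the Ouster OS1 instead of 3D point clouds. Using the Florence 2 large model in a zero-shot setting for image captioning and object detection, the approach is reported to outperform existing methods such as CLIP.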

Link: https://arxiv.org/abs/2502.04385
Authors: Naor Cohen, Roy Orfaig, Ben-Zion Bobrovsky
Affiliations: School of Electrical Engineering, Tel-Aviv University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Efforts to connect LiDAR data with text, such as LidarCLIP, have primarily focused on embedding 3D point clouds into CLIP text-image space. However, these approaches rely on 3D point clouds, which present challenges in encoding efficiency and neural network processing. With the advent of advanced LiDAR sensors like Ouster OS1, which, in addition to 3D point clouds, produce fixed resolution depth, signal, and ambient panoramic 2D images, new opportunities emerge for LiDAR based tasks. In this work, we propose an alternative approach to connect LiDAR data with text by leveraging 2D imagery generated by the OS1 sensor instead of 3D point clouds. Using the Florence 2 large model in a zero-shot setting, we perform image captioning and object detection. Our experiments demonstrate that Florence 2 generates more informative captions and achieves superior performance in object detection tasks compared to existing methods like CLIP. By combining advanced LiDAR sensor data with a large pre-trained model, our approach provides a robust and accurate solution for challenging detection scenarios, including real-time applications requiring high accuracy and robustness.

[CV-63] DILLEMA: Diffusion and Large Language Models for Multi-Modal Augmentation

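Quick read: This paper addresses the limited ability of existing deep learning testing approaches (simple data augmentation or generative adversarial networks) to produce realistic and varied test cases. The key to the solution is a framework that combines Large Language Models and control-conditioned Diffusion Models: images are translated into detailed textual descriptions, counterfactual descriptions are generated from them, and a text-to-image diffusion process then produces new high-fidelity test images, which can be used to improve model robustness through targeted retraining.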

Link: https://arxiv.org/abs/2502.04378
Authors: Luciano Baresi, Davide Yi Xian Hu, Muhammad Irfan Mas’udi, Giovanni Quattrocchi
Affiliations: Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:Ensuring the robustness of deep learning models requires comprehensive and diverse testing. Existing approaches, often based on simple data augmentation techniques or generative adversarial networks, are limited in producing realistic and varied test cases. To address these limitations, we present a novel framework for testing vision neural networks that leverages Large Language Models and control-conditioned Diffusion Models to generate synthetic, high-fidelity test cases. Our approach begins by translating images into detailed textual descriptions using a captioning model, allowing the language model to identify modifiable aspects of the image and generate counterfactual descriptions. These descriptions are then used to produce new test images through a text-to-image diffusion process that preserves spatial consistency and maintains the critical elements of the scene. We demonstrate the effectiveness of our method using two datasets: ImageNet1K for image classification and SHIFT for semantic segmentation in autonomous driving. The results show that our approach can generate significant test cases that reveal weaknesses and improve the robustness of the model through targeted retraining. We conducted a human assessment using Mechanical Turk to validate the generated images. The responses from the participants confirmed, with high agreement among the voters, that our approach produces valid and realistic images.

[CV-64] MapFusion: A Novel BEV Feature Fusion Network for Multi-modal Map Construction

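Quick read: This paper addresses semantic misalignment and information loss when fusing multi-modal sensor features for map construction. Existing methods usually rely on simple fusion strategies, which cause misalignment and information loss between cross-modal features. To address this, the paper proposes MapFusion, a multi-modal Bird’s-Eye View (BEV) feature fusion method. The key is the Cross-modal Interaction Transform (CIT) module, which lets the two BEV feature spaces interact and enhances feature representation through a self-attention mechanism, together with a Dual Dynamic Fusion (DDF) module that adaptively selects valuable information from different modalities. MapFusion improves accuracy and is designed to plug into existing pipelines.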

Link: https://arxiv.org/abs/2502.04377
Authors: Xiaoshuai Hao, Yunfeng Diao, Mengchuan Wei, Yifan Yang, Peng Hao, Rong Yin, Hui Zhang, Weiming Li, Shu Zhao, Yu Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Map construction task plays a vital role in providing precise and comprehensive static environmental information essential for autonomous driving systems. Primary sensors include cameras and LiDAR, with configurations varying between camera-only, LiDAR-only, or camera-LiDAR fusion, based on cost-performance considerations. While fusion-based methods typically perform best, existing approaches often neglect modality interaction and rely on simple fusion strategies, which suffer from the problems of misalignment and information loss. To address these issues, we propose MapFusion, a novel multi-modal Bird’s-Eye View (BEV) feature fusion method for map construction. Specifically, to solve the semantic misalignment problem between camera and LiDAR BEV features, we introduce the Cross-modal Interaction Transform (CIT) module, enabling interaction between two BEV feature spaces and enhancing feature representation through a self-attention mechanism. Additionally, we propose an effective Dual Dynamic Fusion (DDF) module to adaptively select valuable information from different modalities, which can take full advantage of the inherent information between different modalities. Moreover, MapFusion is designed to be simple and plug-and-play, easily integrated into existing pipelines. We evaluate MapFusion on two map construction tasks, including High-definition (HD) map and BEV map segmentation, to show its versatility and effectiveness. Compared with the state-of-the-art methods, MapFusion achieves 3.6% and 6.2% absolute improvements on the HD map construction and BEV map segmentation tasks on the nuScenes dataset, respectively, demonstrating the superiority of our approach.

[CV-65] HSI: A Holistic Style Injector for Arbitrary Style Transfer

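Quick read: This paper addresses two weaknesses of attention-based style transfer: point-wise matching overly focuses on local patterns and neglects global style features, and attention is computationally expensive on large images. The key to the solution is the Holistic Style Injector (HSI), a new attention-style transformation module that stylizes using only a global style representation, which avoids generating disharmonious local patterns. HSI also introduces a dual relation learning mechanism that renders images dynamically using semantic similarity in content and style, preserving the original content and improving style fidelity. HSI has linear computational complexity because it builds feature mappings through element-wise multiplication rather than matrix multiplication.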

Link: https://arxiv.org/abs/2502.04369
Authors: Shuhao Zhang, Hui Kang, Yang Liu, Fang Mei, Hongjuan Li
Affiliations: Jilin University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Attention-based arbitrary style transfer methods have gained significant attention recently due to their impressive ability to synthesize style details. However, the point-wise matching within the attention mechanism may overly focus on local patterns and thereby neglect the remarkable global features of style images. Additionally, when processing large images, the quadratic complexity of the attention mechanism imposes a high computational load. To alleviate the above problems, we propose Holistic Style Injector (HSI), a novel attention-style transformation module to deliver artistic expression of the target style. Specifically, HSI performs stylization based only on a global style representation, which is more in line with the characteristics of style transfer, to avoid generating local disharmonious patterns in stylized images. Moreover, we propose a dual relation learning mechanism inside the HSI to dynamically render images by leveraging semantic similarity in content and style, ensuring the stylized images preserve the original content and improve style fidelity. Note that the proposed HSI achieves linear computational complexity because it establishes feature mapping through element-wise multiplication rather than matrix multiplication. Qualitative and quantitative results demonstrate that our method outperforms state-of-the-art approaches in both effectiveness and efficiency.

[CV-66] AI-Based Thermal Video Analysis in Privacy-Preserving Healthcare: A Case Study on Detecting Time of Birth

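Quick read: This paper addresses imprecise recording of the Time of Birth (ToB) of newborns, with the aim of improving newborn resuscitation. The key to the solution is an AI-driven, video-based system that detects ToB automatically from thermal imaging. By avoiding identifiable visual data, the system preserves the privacy of healthcare providers and mothers. In performance evaluation it achieves 91.4% precision and 97.4% recall, and it identifies ToB in 96% of test cases with an absolute median deviation of 1 second relative to manual annotations.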

Link: https://arxiv.org/abs/2502.04365
Authors: Jorge García-Torres, Øyvind Meinich-Bache, Siren Rettedal, Kjersti Engan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Paper accepted at the 2025 IEEE International Symposium on Biomedical Imaging (ISBI 2025)

Abstract:Approximately 10% of newborns need some assistance to start breathing and 5% proper ventilation. It is crucial that interventions are initiated as soon as possible after birth. Accurate documentation of Time of Birth (ToB) is thereby essential for documenting and improving newborn resuscitation performance. However, current clinical practices rely on manual recording of ToB, typically with minute precision. In this study, we present an AI-driven, video-based system for automated ToB detection using thermal imaging, designed to preserve the privacy of healthcare providers and mothers by avoiding the use of identifiable visual data. Our approach achieves 91.4% precision and 97.4% recall in detecting ToB within thermal video clips during performance evaluation. Additionally, our system successfully identifies ToB in 96% of test cases with an absolute median deviation of 1 second compared to manual annotations. This method offers a reliable solution for improving ToB documentation and enhancing newborn resuscitation outcomes.

[CV-67] Lost in Edits? A lambda-Compass for AIGC Provenance

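Quick read: This paper addresses the misuse risks raised by diffusion-based text-guided image editing tools, in particular how to keep image content authentic and traceable through iterative edits. The key to the solution is LambdaTracer, a latent-space attribution method that requires no modification to generative or editing pipelines. By adaptively calibrating reconstruction losses, it identifies and differentiates authentic outputs from manipulated ones, and it remains effective in complex editing scenarios, whether the edits are automated with tools such as InstructPix2Pix and ControlNet or made manually with software such as Adobe Photoshop.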

Link: https://arxiv.org/abs/2502.04364
Authors: Wenhao You, Bryan Hooi, Yiwei Wang, Euijin Choo, Ming-Hsuan Yang, Junsong Yuan, Zi Huang, Yujun Cai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements in diffusion models have driven the growth of text-guided image editing tools, enabling precise and iterative modifications of synthesized content. However, as these tools become increasingly accessible, they also introduce significant risks of misuse, emphasizing the critical need for robust attribution methods to ensure content authenticity and traceability. Despite the creative potential of such tools, they pose significant challenges for attribution, particularly in adversarial settings where edits can be layered to obscure an image’s origins. We propose LambdaTracer, a novel latent-space attribution method that robustly identifies and differentiates authentic outputs from manipulated ones without requiring any modifications to generative or editing pipelines. By adaptively calibrating reconstruction losses, LambdaTracer remains effective across diverse iterative editing processes, whether automated through text-guided editing tools such as InstructPix2Pix and ControlNet or performed manually with editing software such as Adobe Photoshop. Extensive experiments reveal that our method consistently outperforms baseline approaches in distinguishing maliciously edited images, providing a practical solution to safeguard ownership, creativity, and credibility in the open, fast-evolving AI ecosystems.

[CV-68] On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices

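Quick read: This paper addresses diffusion-based text-to-video generation on computation- and memory-limited mobile devices. The key techniques are: Linear Proportional Leap (LPL), which reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach; Temporal Dimension Token Merging (TDTM), which minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension; and Concurrent Inference with Dynamic Loading (CI-DL), which dynamically partitions large models into smaller blocks and loads them into memory for concurrent inference to cope with limited device memory. Together, these techniques enable efficient, high-quality video generation on resource-constrained mobile devices.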

Link: https://arxiv.org/abs/2502.04363
Authors: Bosung Kim, Kyuhwan Lee, Isu Jeong, Jungmin Cheon, Yeojin Lee, Seulki Lee
Affiliations: Ulsan National Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present On-device Sora, a pioneering solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. Building on Open-Sora, On-device Sora applies three novel techniques to address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, effectively addressing the challenges of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations demonstrate that it is capable of generating high-quality videos on the device, comparable to those produced by Open-Sora running on high-end GPUs. These results show that On-device Sora enables efficient and high-quality video generation on resource-constrained mobile devices, expanding accessibility, ensuring user privacy, reducing dependence on cloud infrastructure, and lowering associated costs. We envision the proposed On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation capabilities on commodity mobile and embedded devices. The code implementation is publicly available at a GitHub repository: this https URL.
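
Merging consecutive tokens along the temporal dimension can be sketched in a few lines of Python. The abstract does not say how tokens are merged, so the pairwise averaging below, the nested-list layout (time, then token, then feature), and the name `merge_temporal_tokens` are illustrative assumptions rather than the paper's TDTM implementation.

```python
def merge_temporal_tokens(frames):
    """Halve the temporal length by averaging each pair of consecutive frames.

    frames: list over time; each frame is a list of token vectors (lists of
    floats) with the same shape. Averaging is an assumption made for
    illustration only.
    """
    merged = []
    for i in range(0, len(frames) - 1, 2):
        first, second = frames[i], frames[i + 1]
        # Average the tokens at the same spatial position in both frames.
        merged.append([
            [(x + y) / 2.0 for x, y in zip(tok_a, tok_b)]
            for tok_a, tok_b in zip(first, second)
        ])
    if len(frames) % 2 == 1:
        # An unpaired trailing frame is kept unchanged.
        merged.append(frames[-1])
    return merged
```

Since attention cost grows quadratically with the number of tokens, halving the temporal length in this way reduces the work done in each attention layer.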

[CV-69] Predicting 3D Motion from 2D Video for Behavior-Based VR Biometrics

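Quick read: This paper addresses a key limitation of behavior-based authentication in virtual reality (VR) applications: current on-device tracking cannot capture full-body joint articulation, so important signature data is lost. The proposed solution uses an external 2D camera to capture the participant’s right-side body joints (shoulder, elbow, wrist, hip, knee, and ankle) and a Transformer-based deep neural network to predict past and future 3D tracks of the right controller. This augments the 3D knowledge available for authentication, giving a minimum equal error rate (EER) of 0.025 and a maximum EER drop of 0.040 over prior work that uses a single-unit 3D trajectory as input.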

Link: https://arxiv.org/abs/2502.04361
Authors: Mingjun Li, Natasha Kholgade Banerjee, Sean Banerjee
Affiliations: Clarkson University, USA; Wright State University, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: IEEE AIxVR 2025: 7th International Conference on Artificial Intelligence extended and Virtual Reality

Abstract:Critical VR applications in domains such as healthcare, education, and finance that use traditional credentials, such as PIN, password, or multi-factor authentication, stand the chance of being compromised if a malicious person acquires the user credentials or if the user hands over their credentials to an ally. Recently, a number of approaches on user authentication have emerged that use motions of VR head-mounted displays (HMDs) and hand controllers during user interactions in VR to represent the user’s behavior as a VR biometric signature. One of the fundamental limitations of behavior-based approaches is that current on-device tracking for HMDs and controllers lacks capability to perform tracking of full-body joint articulation, losing key signature data encapsulated by the user articulation. In this paper, we propose an approach that uses 2D body joints, namely shoulder, elbow, wrist, hip, knee, and ankle, acquired from the right side of the participants using an external 2D camera. Using a Transformer-based deep neural network, our method uses the 2D data of body joints that are not tracked by the VR device to predict past and future 3D tracks of the right controller, providing the benefit of augmenting 3D knowledge in authentication. Our approach provides a minimum equal error rate (EER) of 0.025, and a maximum EER drop of 0.040 over prior work that uses single-unit 3D trajectory as the input.
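
The equal error rate (EER) reported above is the operating point at which the false accept rate and the false reject rate are equal. The following Python sketch shows one common way to estimate it from match scores; the function name, the threshold sweep over observed scores, and the convention that higher scores mean "more likely genuine" are assumptions for illustration, not the paper's evaluation code.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from genuine and impostor match scores.

    Assumes non-empty score lists where higher scores indicate a more likely
    genuine user. Sweeps every observed score as a threshold and returns the
    mean of FAR and FRR where the two rates are closest.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        # False reject rate: genuine attempts scored below the threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        # False accept rate: impostor attempts scored at or above it.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

A lower EER means fewer impostors are accepted and fewer genuine users are rejected at the balanced threshold.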

[CV-70] Chest X-ray Foundation Model with Global and Local Representations Integration

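Quick read: This paper addresses the limitations of task-specific chest X-ray (CXR) classification models, which are narrow in scope, require costly labeled data, and generalize poorly. The key to the solution is CheXFound, a self-supervised vision foundation model, combined with a Global and Local Representations Integration (GLoRI) module that improves multilabel classification. CheXFound learns robust CXR representations by pretraining on a large collection of publicly available data and performs well across downstream tasks, with notable gains on out-of-distribution data.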

Link: https://arxiv.org/abs/2502.05142
Authors: Zefan Yang, Xuanang Xu, Jiajin Zhang, Ge Wang, Mannudeep K. Kalra, Pingkun Yan
Affiliations: Department of Biomedical Engineering and Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute; Department of Radiology, Massachusetts General Hospital, Harvard Medical School
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Chest X-ray (CXR) is the most frequently ordered imaging test, supporting diverse clinical tasks from thoracic disease detection to postoperative monitoring. However, task-specific classification models are limited in scope, require costly labeled data, and lack generalizability to out-of-distribution datasets. To address these challenges, we introduce CheXFound, a self-supervised vision foundation model that learns robust CXR representations and generalizes effectively across a wide range of downstream tasks. We pretrain CheXFound on a curated CXR-1M dataset, comprising over one million unique CXRs from publicly available sources. We propose a Global and Local Representations Integration (GLoRI) module for downstream adaptations, by incorporating disease-specific local features with global image features for enhanced performance in multilabel classification. Our experimental results show that CheXFound outperforms state-of-the-art models in classifying 40 disease findings across different prevalence levels on the CXR-LT 24 dataset and exhibits superior label efficiency on downstream tasks with limited training data. Additionally, CheXFound achieved significant improvements on new tasks with out-of-distribution datasets, including opportunistic cardiovascular disease risk estimation and mortality prediction. These results highlight CheXFound’s strong generalization capabilities, enabling diverse adaptations with improved label efficiency. The project source code is publicly available at this https URL.

[CV-71] Investigating the impact of kernel harmonization and deformable registration on inspiratory and expiratory chest CT images for people with COPD

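Quick read: This paper addresses errors in quantitative analysis caused by differences in reconstruction kernels between paired scans. The key to the solution is a two-stage pipeline: first, a cycle generative adversarial network (cycle GAN) harmonizes inspiratory scans reconstructed with a hard kernel (BONE) to match expiratory scans reconstructed with a soft kernel (STANDARD); second, deformable image registration is performed. Harmonization significantly reduces inconsistencies in quantitative measurements: for emphysema, the median score drops from 10.479% to 3.039%, closer to the reference median of 1.305% from the STANDARD kernel.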

Link: https://arxiv.org/abs/2502.05119
Authors: Aravind R. Krishnan, Yihao Liu, Kaiwen Xu, Michael E. Kim, Lucas W. Remedios, Gaurav Rudravaram, Adam M. Saunders, Bradley W. Richmond, Kim L. Sandler, Fabien Maldonado, Bennett A. Landman, Lianrui Zuo
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at SPIE Medical Imaging 2025, Clinical and Biomedical Imaging

Abstract:Paired inspiratory-expiratory CT scans enable the quantification of gas trapping due to small airway disease and emphysema by analyzing lung tissue motion in COPD patients. Deformable image registration of these scans assesses regional lung volumetric changes. However, variations in reconstruction kernels between paired scans introduce errors in quantitative analysis. This work proposes a two-stage pipeline to harmonize reconstruction kernels and perform deformable image registration using data acquired from the COPDGene study. We use a cycle generative adversarial network (GAN) to harmonize inspiratory scans reconstructed with a hard kernel (BONE) to match expiratory scans reconstructed with a soft kernel (STANDARD). We then deformably register the expiratory scans to inspiratory scans. We validate harmonization by measuring emphysema using a publicly available segmentation algorithm before and after harmonization. Results show harmonization significantly reduces emphysema measurement inconsistencies, decreasing median emphysema scores from 10.479% to 3.039%, with a reference median score of 1.305% from the STANDARD kernel as the target. Registration accuracy is evaluated via Dice overlap between emphysema regions on inspiratory, expiratory, and deformed images. The Dice coefficient between inspiratory emphysema masks and deformably registered emphysema masks increases significantly across registration stages (p < 0.001). Additionally, we demonstrate that deformable registration is robust to kernel variations.
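
The Dice overlap used to evaluate registration accuracy is defined as twice the size of the intersection of two masks divided by the sum of their sizes. A minimal Python sketch follows; the flat-list mask representation, the name `dice_coefficient`, and the convention of returning 1.0 for two empty masks are assumptions made for illustration.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two binary masks given as flat sequences.

    Any truthy value counts as foreground. Two empty masks are treated as a
    perfect match (an assumed convention).
    """
    a = [bool(v) for v in mask_a]
    b = [bool(v) for v in mask_b]
    # Voxels marked as foreground in both masks.
    intersection = sum(x and y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A value of 1.0 means the masks coincide exactly, and 0.0 means they share no foreground voxels.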

[CV-72] C2GM: Cascading Conditional Generation of Multi-scale Maps from Remote Sensing Images Constrained by Geographic Features

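Quick read: This paper addresses the problem that existing image generation networks, when producing multi-scale map tiles, focus on the texture features of natural images and neglect the unique characteristics of remote-sensing features and the scale attributes of tile maps, which leads to inaccurate geographic information and leaves tile quality in need of improvement. The key to the solution is C2GM, a framework that generates multi-scale tile maps through conditional guided diffusion and multi-scale cascade generation. C2GM uses a conditional feature fusion encoder to extract object priors from remote sensing images and a cascade reference double branch input to represent complex features accurately. Low-level generated tiles constrain high-level map generation, which improves visual continuity. C2GM also incorporates map scale modality information using CLIP to simulate the relationship between map scale and cartographic generalization.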

Link: https://arxiv.org/abs/2502.04991
Authors: Chenxing Sun, Yongyang Xu, Xuwei Xu, Xixi Fan, Jing Bai, Xiechun Lu, Zhanlong Chen
Affiliations: Key Laboratory of Geological Survey and Evaluation of Ministry of Education, China University of Geosciences, Wuhan 430074, China; School of Computer Science, China University of Geosciences, Wuhan 430074, China; Engineering Research Center of Natural Resource Information Management and Digital Twin Engineering Software, Ministry of Education, Wuhan 430074, China; School of Geographical and Information Engineering, China University of Geosciences, Wuhan 430074, China; College of Computer and Information Technology, China Three Gorges University, Yichang 443002, China; Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, China Three Gorges University, Yichang 443002, China
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-scale maps are essential representations of surveying and cartographic results, serving as fundamental components of geographic services. Current image generation networks can quickly produce map tiles from remote-sensing images. However, generative models designed for natural images often focus on texture features, neglecting the unique characteristics of remote-sensing features and the scale attributes of tile maps. This limitation in generative models impairs the accurate representation of geographic information, and the quality of tile map generation still needs improvement. Diffusion models have demonstrated remarkable success in various image generation tasks, highlighting their potential to address this challenge. This paper presents C2GM, a novel framework for generating multi-scale tile maps through conditional guided diffusion and multi-scale cascade generation. Specifically, we implement a conditional feature fusion encoder to extract object priors from remote sensing images and cascade reference double branch input, ensuring an accurate representation of complex features. Low-level generated tiles act as constraints for high-level map generation, enhancing visual continuity. Moreover, we incorporate map scale modality information using CLIP to simulate the relationship between map scale and cartographic generalization in tile maps. Extensive experimental evaluations demonstrate that C2GM consistently achieves the state-of-the-art (SOTA) performance on all metrics, facilitating the rapid and effective generation of multi-scale large-format maps for emergency response and remote mapping applications.

[CV-73] CMamba: Learned Image Compression with State Space Models

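[Quick read]: This paper addresses efficient learned image compression at low computational complexity. Its core contribution is CMamba, a hybrid convolution and State Space Model (SSM) compression framework built on two components: a Content-Adaptive SSM (CA-SSM) module and a Context-Aware Entropy (CAE) module. The CA-SSM module dynamically fuses global content extracted by SSM blocks with local details captured by CNN blocks, so that important image content is preserved during compression. The CAE module uses SSMs to parameterize the spatial content of the latent representations and reduces channel redundancy autoregressively, which improves compression efficiency.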

Link: https://arxiv.org/abs/2502.04988
Authors: Zhuojie Wu,Heming Du,Shuyun Wang,Ming Lu,Haiyang Sun,Yandong Guo,Xin Yu
Affiliations: School of Electrical Engineering and Computer Science, University of Queensland (昆士兰大学电气工程与计算机科学学院), Brisbane 4067, Australia; Intel Lab China (英特尔中国实验室), Beijing 100876, China; LiAuto (理想汽车), Shanghai 201805, China; AI2 Robotics (机器人AI2), Shenzhen 518055, China
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Learned Image Compression (LIC) has explored various architectures, such as Convolutional Neural Networks (CNNs) and transformers, in modeling image content distributions in order to achieve compression effectiveness. However, achieving high rate-distortion performance while maintaining low computational complexity (i.e., parameters, FLOPs, and latency) remains challenging. In this paper, we propose a hybrid Convolution and State Space Models (SSMs) based image compression framework, termed CMamba, to achieve superior rate-distortion performance with low computational complexity. Specifically, CMamba introduces two key components: a Content-Adaptive SSM (CA-SSM) module and a Context-Aware Entropy (CAE) module. First, we observed that SSMs excel in modeling overall content but tend to lose high-frequency details. In contrast, CNNs are proficient at capturing local details. Motivated by this, we propose the CA-SSM module that can dynamically fuse global content extracted by SSM blocks and local details captured by CNN blocks in both encoding and decoding stages. As a result, important image content is well preserved during compression. Second, our proposed CAE module is designed to reduce spatial and channel redundancies in latent representations after encoding. Specifically, our CAE leverages SSMs to parameterize the spatial content in latent representations. Benefiting from SSMs, CAE significantly improves spatial compression efficiency while reducing spatial content redundancies. Moreover, along the channel dimension, CAE reduces inter-channel redundancies of latent representations in an autoregressive manner, which can fully exploit prior knowledge from previous channels without sacrificing efficiency. Experimental results demonstrate that CMamba achieves superior rate-distortion performance.

[CV-74] Wavelet-Assisted Multi-Frequency Attention Network for Pansharpening

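[Quick read]: This paper addresses the failure of existing methods to fully exploit the frequency domain when fusing panchromatic and multispectral images. Its core contribution is Multi-Frequency Fusion Attention (MFFA), which uses wavelet transforms to cleanly separate frequencies and builds a Frequency-Query, Spatial-Key, and Fusion-Value from the physical meaning of each feature in order to capture frequency-specific information more effectively. The method also preserves the consistency of frequency features across operations and uses a wavelet pyramid to fuse multi-scale information progressively, which better prevents confusion and loss of frequency features during fusion.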

Link: https://arxiv.org/abs/2502.04903
Authors: Jie Huang,Rui Huang,Jinghao Xu,Siran Pen,Yule Duan,Liangjian Deng
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 13 figures

Abstract:Pansharpening aims to combine a high-resolution panchromatic (PAN) image with a low-resolution multispectral (LRMS) image to produce a high-resolution multispectral (HRMS) image. Although pansharpening in the frequency domain offers clear advantages, most existing methods either continue to operate solely in the spatial domain or fail to fully exploit the benefits of the frequency domain. To address this issue, we innovatively propose Multi-Frequency Fusion Attention (MFFA), which leverages wavelet transforms to cleanly separate frequencies and enable lossless reconstruction across different frequency domains. Then, we generate Frequency-Query, Spatial-Key, and Fusion-Value based on the physical meanings represented by different features, which enables a more effective capture of specific information in the frequency domain. Additionally, we focus on the preservation of frequency features across different operations. On a broader level, our network employs a wavelet pyramid to progressively fuse information across multiple scales. Compared to previous frequency domain approaches, our network better prevents confusion and loss of different frequency features during the fusion process. Quantitative and qualitative experiments on multiple datasets demonstrate that our method outperforms existing approaches and shows significant generalization capabilities for real-world scenarios.
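The abstract's point that a wavelet transform separates frequencies while still allowing lossless reconstruction can be illustrated with a one-level 1D Haar transform. This is only a minimal sketch in plain Python, not the MFFA network; the function names `haar_forward` and `haar_inverse` are hypothetical, and the Haar wavelet is an assumption, since the abstract does not say which wavelet is used.

```python
import math

def haar_forward(signal):
    """One-level 1D Haar transform.

    Returns (low, high): the low-frequency averages and the
    high-frequency differences, each half the input length.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high

def haar_inverse(low, high):
    """Invert haar_forward: rebuild the signal from both bands."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out

if __name__ == "__main__":
    x = [4.0, 2.0, 5.0, 5.0, 1.0, 7.0]
    low, high = haar_forward(x)
    print("low band: ", low)
    print("high band:", high)
    print("rebuilt:  ", haar_inverse(low, high))
```

A flat region of the signal produces a zero in the high band, while the inverse recovers the input up to floating-point rounding, which is the property a fusion network can rely on when it processes the bands separately.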

[CV-75] ARTInp: CBCT-to-CT Image Inpainting and Image Translation in Radiotherapy

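[Quick read]: This paper addresses the difficulty of precise treatment verification in Adaptive Radiation Therapy (ART) workflows, where Cone Beam Computerized Tomography (CBCT) images have low resolution and many artifacts. In complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI), discontinuous CBCT images also leave gaps in important anatomical information. The proposed solution is ARTInp, a deep-learning framework that combines image inpainting with CBCT-to-CT translation. ARTInp uses two networks: a completion network that fills anatomical gaps in CBCT volumes, and a custom Generative Adversarial Network (GAN) that generates high-quality synthetic CT (sCT) images. Trained on paired CBCT and CT images from the SynthRad 2023 challenge, ARTInp shows potential for enhancing CBCT-based radiotherapy workflows on a test set of 18 patients.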

Link: https://arxiv.org/abs/2502.04898
Authors: Ricardo Coimbra Brioso,Leonardo Crespi,Andrea Seghetto,Damiano Dei,Nicola Lambri,Pietro Mancosu,Marta Scorsetti,Daniele Loiacono
Affiliations: Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano (电子、信息和生物工程系, 米兰理工大学), Milan, Italy; Department of Biomedical Sciences, Humanitas University (生物医学科学系, 仁慈大学), Pieve Emanuele, Milan, Italy; Radiotherapy and Radiosurgery Department, IRCCS Humanitas Research Hospital (放射治疗和放射外科部门, IRCCS仁慈研究医院), Rozzano, Milan, Italy
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A key step in Adaptive Radiation Therapy (ART) workflows is the evaluation of the patient’s anatomy at treatment time to ensure the accuracy of the delivery. To this end, Cone Beam Computerized Tomography (CBCT) is widely used being cost-effective and easy to integrate into the treatment process. Nonetheless, CBCT images have lower resolution and more artifacts than CT scans, making them less reliable for precise treatment validation. Moreover, in complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI), where full-body visualization of the patient is critical for accurate dose delivery, the CBCT images are often discontinuous, leaving gaps that could contain relevant anatomical information. To address these limitations, we propose ARTInp (Adaptive Radiation Therapy Inpainting), a novel deep-learning framework combining image inpainting and CBCT-to-CT translation. ARTInp employs a dual-network approach: a completion network that fills anatomical gaps in CBCT volumes and a custom Generative Adversarial Network (GAN) to generate high-quality synthetic CT (sCT) images. We trained ARTInp on a dataset of paired CBCT and CT images from the SynthRad 2023 challenge, and the performance achieved on a test set of 18 patients demonstrates its potential for enhancing CBCT-based workflows in radiotherapy.

[CV-76] MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin

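[Quick read]: This paper addresses the diagnostic challenge of fever of unknown origin (FUO). The key is the MedMimic framework, which uses pretrained models such as DINOv2, Vision Transformer, and ResNet-18 to convert high-dimensional 18F-FDG PET/CT imaging into low-dimensional, semantically meaningful features, and then integrates those features with clinical data through a learnable self-attention-based fusion network for classification. On 416 FUO patient cases, its multimodal fusion classification network (MFCN) achieved macro-AUROC scores of 0.8654 to 0.9291 across seven tasks, outperforming conventional machine learning and single-modality deep learning methods.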

Link: https://arxiv.org/abs/2502.04794
Authors: Minrui Chen,Yi Zhou,Huidong Jiang,Yuhan Zhu,Guanjie Zou,Minqi Chen,Rong Tian,Hiroto Saigo
Affiliations: Kyushu University (九州大学), Fukuoka, Japan; West China Hospital, Sichuan University (华西医院,四川大学), Chengdu, China; Department of Computer Science, Institute of Science Tokyo (东京工业大学计算机科学系), Yokohama, Japan; Center for Advanced Intelligence Project, RIKEN (理化学研究所先进智能研究中心), Tokyo, Japan; Department of Computational Biology and Medical Sciences, University of Tokyo (东京大学计算生物学与医学科学系), Tokyo, Japan; Department of Thoracic Surgery and Institute of Thoracic Oncology, West China Hospital, Sichuan University (胸外科和胸部肿瘤研究所,华西医院,四川大学), Chengdu, China; Department of Electrical Engineering and Computer Science, Kyushu University (电气工程与计算机科学系,九州大学), Fukuoka, Japan
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Fever of unknown origin (FUO) remains a diagnostic challenge. MedMimic is introduced as a multimodal framework inspired by real-world diagnostic processes. It uses pretrained models such as DINOv2, Vision Transformer, and ResNet-18 to convert high-dimensional 18F-FDG PET/CT imaging into low-dimensional, semantically meaningful features. A learnable self-attention-based fusion network then integrates these imaging features with clinical data for classification. Using 416 FUO patient cases from Sichuan University West China Hospital from 2017 to 2023, the multimodal fusion classification network (MFCN) achieved macro-AUROC scores ranging from 0.8654 to 0.9291 across seven tasks, outperforming conventional machine learning and single-modality deep learning methods. Ablation studies and five-fold cross-validation further validated its effectiveness. By combining the strengths of pretrained large models and deep learning, MedMimic offers a promising solution for disease classification.
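For readers unfamiliar with the macro-AUROC metric reported above, the sketch below shows the standard definition: compute a one-vs-rest AUROC per class, then take the unweighted mean. It is a plain-Python illustration with hypothetical function names and made-up toy scores, not the paper's evaluation code.

```python
def auroc(labels, scores):
    """Binary AUROC as the probability that a random positive
    outranks a random negative (ties count as one half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auroc(y_true, class_scores):
    """Unweighted mean of one-vs-rest AUROCs.

    y_true: list of class ids.
    class_scores: dict mapping class id -> list of scores, one per sample.
    """
    per_class = [
        auroc([1 if y == c else 0 for y in y_true], class_scores[c])
        for c in sorted(class_scores)
    ]
    return sum(per_class) / len(per_class)

if __name__ == "__main__":
    # Toy data invented for illustration only.
    y = [0, 1, 2, 1]
    scores = {
        0: [0.9, 0.1, 0.2, 0.3],
        1: [0.2, 0.8, 0.3, 0.6],
        2: [0.1, 0.2, 0.7, 0.1],
    }
    print(macro_auroc(y, scores))
```

Because every class contributes equally to the mean, a rare diagnosis weighs as much as a common one, which is usually why macro averaging is chosen for imbalanced clinical data.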

[CV-77] Leveraging band diversity for feature selection in EO data

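[Quick read]: This paper addresses the difficulty of handling high-dimensional hyperspectral imaging (HSI) data during reconstruction. The key idea is to select diverse groups of spectral bands with Determinantal Point Processes (DPPs), and to use Spectral Angle Mapper (SAM) analysis to handle the overlapping bands that such grouping may produce, enabling detailed analysis and monitoring with high precision and accuracy.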

Link: https://arxiv.org/abs/2502.04713
Authors: Sadia Hussain,Brejesh Lall
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hyperspectral imaging (HSI) is a powerful earth observation technology that captures and processes information across a wide spectrum of wavelengths. Hyperspectral imaging provides comprehensive and detailed spectral data that is invaluable for a wide range of reconstruction problems. However, due to the complexity of analysis, this data is often difficult to handle. To address the challenge of handling a large number of bands in reconstructing high-quality HSI, we propose to form groups of bands. In this position paper we propose a method of selecting diverse bands using determinantal point processes in correlated bands. To address the issue of overlapping bands that may arise from grouping, we use spectral angle mapper analysis. This analysis can be fed to any machine learning model to enable detailed analysis and monitoring with high precision and accuracy.
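The spectral angle mapper mentioned in the abstract compares two vectors by the angle between them, arccos of their normalized dot product, so it ignores overall brightness and responds only to shape. The sketch below shows that computation in plain Python; the function name is hypothetical, and how the paper applies the angle to overlapping band groups is not stated in the abstract.

```python
import math

def spectral_angle(a, b):
    """Angle in radians between two equal-length vectors.

    0 means identical shape (up to scaling); pi/2 means orthogonal.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no defined angle")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)

if __name__ == "__main__":
    print(spectral_angle([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # scaled copy
    print(spectral_angle([1.0, 0.0], [0.0, 1.0]))            # orthogonal
```

A small angle between two candidate bands suggests they carry nearly the same information, which is one plausible way to flag overlap after grouping.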

[CV-78] Generative Autoregressive Transformers for Model-Agnostic Federated MRI Reconstruction

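[Quick read]: This paper addresses two problems: single-site models trained on limited local datasets generalize poorly, and existing federated learning (FL) methods do not support model-heterogeneous settings. The key is FedGAT, a model-agnostic FL technique based on generative autoregressive transformers. FedGAT decentralizes the training of a global generative prior that captures the distribution of multi-site magnetic resonance imaging (MRI) images, and each site then trains a site-specific reconstruction model with its preferred architecture. This supports flexible collaboration and achieves better reconstruction performance than existing FL baselines.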

Link: https://arxiv.org/abs/2502.04521
Authors: Valiyeh A. Nezhad,Gokberk Elmas,Bilal Kabas,Fuat Arslan,Tolga Çukur
Affiliations: Department of Electrical and Electronics Engineering, and the National Magnetic Resonance Research Center, Bilkent University (电气与电子工程系, 国家磁共振研究中心, 比尔肯特大学), Ankara, Turkey (土耳其安卡拉)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Although learning-based models hold great promise for MRI reconstruction, single-site models built on limited local datasets often suffer from poor generalization. This challenge has spurred interest in collaborative model training on multi-site datasets via federated learning (FL) – a privacy-preserving framework that aggregates model updates instead of sharing imaging data. Conventional FL builds a global model by aggregating locally trained model weights, inherently constraining all sites to a homogeneous model architecture. This rigid homogeneity requirement forces sites to forgo architectures tailored to their compute infrastructure and application-specific demands. Consequently, existing FL methods for MRI reconstruction fail to support model-heterogeneous settings, where individual sites are allowed to use distinct architectures. To overcome this fundamental limitation, here we introduce FedGAT, a novel model-agnostic FL technique based on generative autoregressive transformers. FedGAT decentralizes the training of a global generative prior that captures the distribution of multi-site MR images. For enhanced fidelity, we propose a novel site-prompted GAT prior that controllably synthesizes MR images from desired sites via autoregressive prediction across spatial scales. Each site then trains its site-specific reconstruction model – using its preferred architecture – on a hybrid dataset comprising the local MRI dataset and GAT-generated synthetic MRI datasets for other sites. Comprehensive experiments on multi-institutional datasets demonstrate that FedGAT supports flexible collaborations while enjoying superior within-site and across-site reconstruction performance compared to state-of-the-art FL baselines.

[CV-79] LUND-PROBE – LUND Prostate Radiotherapy Open Benchmarking and Evaluation dataset

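[Quick read]: This paper addresses the tedious, time-consuming manual segmentation of target volumes and organs at risk (OARs) in prostate cancer radiotherapy. The key contribution is a clinical dataset of 432 patients that contains MRI and synthetic CT (sCT) images, target and OAR segmentations, and radiotherapy dose distributions. An extended dataset of 35 patients adds deep learning (DL)-generated segmentations, DL segmentation uncertainty maps, and DL segmentations manually adjusted by four radiation oncologists. These resources are released to support research on automated radiotherapy treatment planning, segmentation, inter-observer analysis, and DL model uncertainty. The dataset is hosted on the AIDA Data Hub as a free-to-use resource for the scientific community, supporting progress in medical imaging and prostate cancer radiotherapy research.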

Link: https://arxiv.org/abs/2502.04493
Authors: Viktor Rogowski,Lars E Olsson,Jonas Scherman,Emilia Persson,Mustafa Kadhim,Sacha af Wetterstedt,Adalsteinn Gunnlaugsson,Martin P. Nilsson,Nandor Vass,Mathieu Moreau,Maria Gebre Medhin,Sven Bäck,Per Munck af Rosenschöld,Silke Engelholm,Christian Jamtheim Gustafsson
Affiliations: Unknown
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 4 figures

Abstract:Radiotherapy treatment for prostate cancer relies on computed tomography (CT) and/or magnetic resonance imaging (MRI) for segmentation of target volumes and organs at risk (OARs). Manual segmentation of these volumes is regarded as the gold standard for ground truth in machine learning applications but to acquire such data is tedious and time-consuming. A publicly available clinical dataset is presented, comprising MRI- and synthetic CT (sCT) images, target and OARs segmentations, and radiotherapy dose distributions for 432 prostate cancer patients treated with MRI-guided radiotherapy. An extended dataset with 35 patients is also included, with the addition of deep learning (DL)-generated segmentations, DL segmentation uncertainty maps, and DL segmentations manually adjusted by four radiation oncologists. The publication of these resources aims to aid research within the fields of automated radiotherapy treatment planning, segmentation, inter-observer analyses, and DL model uncertainty investigation. The dataset is hosted on the AIDA Data Hub and offers a free-to-use resource for the scientific community, valuable for the advancement of medical imaging and prostate cancer radiotherapy research.

[CV-80] Hybrid Deep Learning Framework for Classification of Kidney CT Images: Diagnosis of Stones Cysts and Tumors

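[Quick read]: This paper addresses kidney CT image classification in order to improve the automation of kidney disease diagnosis. The key is a hybrid deep learning model that combines a pre-trained ResNet101 with a custom CNN and uses feature fusion to raise classification accuracy, reaching 99.73% training accuracy and 100% testing accuracy.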

Link: https://arxiv.org/abs/2502.04367
Authors: Kiran Sharma,Ziya Uddin,Adarsh Wadal,Dhruv Gupta
Affiliations: School of Engineering & Technology, BML Munjal University (BML Munjal大学), Gurugram, Haryana-122413, India; Center for Advanced Data and Computational Science, BML Munjal University (BML Munjal大学), Gurugram, Haryana-122413, India
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Medical image classification is a vital research area that utilizes advanced computational techniques to improve disease diagnosis and treatment planning. Deep learning models, especially Convolutional Neural Networks (CNNs), have transformed this field by providing automated and precise analysis of complex medical images. This study introduces a hybrid deep learning model that integrates a pre-trained ResNet101 with a custom CNN to classify kidney CT images into four categories: normal, stone, cyst, and tumor. The proposed model leverages feature fusion to enhance classification accuracy, achieving 99.73% training accuracy and 100% testing accuracy. Using a dataset of 12,446 CT images and advanced feature mapping techniques, the hybrid CNN model outperforms standalone ResNet101. This architecture delivers a robust and efficient solution for automated kidney disease diagnosis, providing improved precision, recall, and reduced testing time, making it highly suitable for clinical applications.

Artificial Intelligence

[AI-0] MELON: Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison

Link: https://arxiv.org/abs/2502.05174
Authors: Kaijie Zhu,Xianjun Yang,Jindong Wang,Wenbo Guo,William Yang Wang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent research has shown that LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions. Existing defenses against IPI have significant limitations: they either require essential model training resources, lack effectiveness against sophisticated attacks, or harm the normal utilities. We present MELON (Masked re-Execution and TooL comparisON), a novel IPI defense. Our approach builds on the observation that under a successful attack, the agent’s next action becomes less dependent on user tasks and more on malicious tasks. Following this, we design MELON to detect attacks by re-executing the agent’s trajectory with a masked user prompt modified through a masking function. We identify an attack if the actions generated in the original and masked executions are similar. We also include three key designs to reduce the potential false positives and false negatives. Extensive evaluation on the IPI benchmark AgentDojo demonstrates that MELON outperforms SOTA defenses in both attack prevention and utility preservation. Moreover, we show that combining MELON with a SOTA prompt augmentation defense (denoted as MELON-Aug) further improves its performance. We also conduct a detailed ablation study to validate our key designs.
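The detection rule in the abstract, flagging an attack when the original run and the masked run produce similar actions, can be sketched as a comparison of tool calls. Everything concrete below is an assumption made for illustration: the `(tool_name, args)` representation, the exact-match notion of similarity after light normalization, and the function names. The abstract does not specify the paper's similarity measure, its masking function, or its three additional designs.

```python
def normalize_call(call):
    """Turn (tool_name, args_dict) into a hashable, case-insensitive key."""
    name, args = call
    items = sorted((k, str(v).strip().lower()) for k, v in args.items())
    return (name.lower(), tuple(items))

def flags_injection(original_calls, masked_calls):
    """Return True if any tool call from the original run also appears
    in the run where the user prompt was masked.

    Intuition from the abstract: if the agent still issues the same call
    without seeing the user's task, that call was likely driven by text
    in the tool outputs rather than by the user.
    """
    masked = {normalize_call(c) for c in masked_calls}
    return any(normalize_call(c) in masked for c in original_calls)

if __name__ == "__main__":
    # Hypothetical tool names and arguments.
    original = [("send_money", {"to": "ACCT-999", "amount": 500})]
    masked = [("send_money", {"to": "acct-999", "amount": 500})]
    print(flags_injection(original, masked))   # same call in both runs

    benign = [("get_balance", {"account": "checking"})]
    masked_benign = [("summarize", {"text": "tool output"})]
    print(flags_injection(benign, masked_benign))
```

Exact matching is the simplest choice and will miss paraphrased arguments; a real system would need a softer comparison.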

[AI-1] “It Felt Like I Was Left in the Dark”: Exploring Information Needs and Design Opportunities for Family Caregivers of Older Adult Patients in Critical Care Settings

Link: https://arxiv.org/abs/2502.05115
Authors: Shihan Fu,Bingsheng Yao,Smit Desai,Yuqi Hu,Yuling Sun,Samantha Stonbraker,Yanjun Gao,Elizabeth M. Goldberg,Dakuo Wang
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Older adult patients constitute a rapidly growing subgroup of Intensive Care Unit (ICU) patients. In these situations, their family caregivers are expected to represent the unconscious patients to access and interpret patients’ medical information. However, caregivers currently have to rely on overloaded clinicians for information updates and typically lack the health literacy to understand complex medical information. Our project aims to explore the information needs of caregivers of ICU older adult patients, from which we can propose design opportunities to guide future AI systems. The project begins with formative interviews with 11 caregivers to identify their challenges in accessing and interpreting medical information. From these findings, we then synthesize design requirements and propose an AI system prototype to cope with caregivers’ challenges. The system prototype has two key features: a timeline visualization to show the AI-extracted and summarized key medical events of older adult patients; and an LLM-based chatbot to provide context-aware informational support. We conclude our paper by reporting on the follow-up user evaluation of the system and discussing future AI-based systems for ICU caregivers of older adults.

[AI-2] ApplE: An Applied Ethics Ontology with Event Context

Link: https://arxiv.org/abs/2502.05110
Authors: Aisha Aijaz,Raghava Mutharaju,Manohar Kumar
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Applied ethics is ubiquitous in most domains, requiring much deliberation due to its philosophical nature. Varying views often lead to conflicting courses of action where ethical dilemmas become challenging to resolve. Although many factors contribute to such a decision, the major driving forces can be discretized and thus simplified to provide an indicative answer. Knowledge representation and reasoning offer a way to explicitly translate abstract ethical concepts into applicable principles within the context of an event. To achieve this, we propose ApplE, an Applied Ethics ontology that captures philosophical theory and event context to holistically describe the morality of an action. The development process adheres to a modified version of the Simplified Agile Methodology for Ontology Development (SAMOD) and utilizes standard design and publication practices. Using ApplE, we model a use case from the bioethics domain that demonstrates our ontology’s social and scientific value. Apart from the ontological reasoning and quality checks, ApplE is also evaluated using the three-fold testing process of SAMOD. ApplE follows FAIR principles and aims to be a viable resource for applied ethicists and ontology engineers.

[AI-3] Leveraging Hypernetworks and Learnable Kernels for Consumer Energy Forecasting Across Diverse Consumer Types

Link: https://arxiv.org/abs/2502.05104
Authors: Muhammad Umair Danish,Katarina Grolinger
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Consumer energy forecasting is essential for managing energy consumption and planning, directly influencing operational efficiency, cost reduction, personalized energy management, and sustainability efforts. In recent years, deep learning techniques, especially LSTMs and transformers, have been greatly successful in the field of energy consumption forecasting. Nevertheless, these techniques have difficulties in capturing complex and sudden variations, and, moreover, they are commonly examined only on a specific type of consumer (e.g., only offices, only schools). Consequently, this paper proposes HyperEnergy, a consumer energy forecasting strategy that leverages hypernetworks for improved modeling of complex patterns applicable across a diversity of consumers. The hypernetwork is responsible for predicting the parameters of the primary prediction network, in our case an LSTM. A learnable adaptable kernel, comprised of polynomial and radial basis function kernels, is incorporated to enhance performance. The proposed HyperEnergy was evaluated on diverse consumers, including student residences, detached homes, a home with electric vehicle charging, and a townhouse. Across all consumer types, HyperEnergy consistently outperformed 10 other techniques, including state-of-the-art models such as LSTM, AttentionLSTM, and transformer.

[AI-4] Learning Temporal Invariance in Android Malware Detectors

Link: https://arxiv.org/abs/2502.05098
Authors: Xinran Zheng,Shuo Yang,Edith C.H. Ngai,Suman Jana,Lorenzo Cavallaro
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Learning-based Android malware detectors degrade over time due to natural distribution drift caused by malware variants and new families. This paper systematically investigates the challenges classifiers trained with empirical risk minimization (ERM) face against such distribution shifts and attributes their shortcomings to their inability to learn stable discriminative features. Invariant learning theory offers a promising solution by encouraging models to generate stable representations crossing environments that expose the instability of the training set. However, the lack of prior environment labels, the diversity of drift factors, and low-quality representations caused by diverse families make this task challenging. To address these issues, we propose TIF, the first temporal invariant training framework for malware detection, which aims to enhance the ability of detectors to learn stable representations across time. TIF organizes environments based on application observation dates to reveal temporal drift, integrating specialized multi-proxy contrastive learning and invariant gradient alignment to generate and align environments with high-quality, stable representations. TIF can be seamlessly integrated into any learning-based detector. Experiments on a decade-long dataset show that TIF excels, particularly in early deployment stages, addressing real-world needs and outperforming state-of-the-art methods.

[AI-5] Causality can systematically address the monsters under the bench(marks)

Link: https://arxiv.org/abs/2502.05085
Authors: Felix Leeb,Zhijing Jin,Bernhard Schölkopf
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Effective and reliable evaluation is essential for advancing empirical machine learning. However, the increasing accessibility of generalist models and the progress towards ever more complex, high-level tasks make systematic evaluation more challenging. Benchmarks are plagued by various biases, artifacts, or leakage, while models may behave unreliably due to poorly explored failure modes. Haphazard treatments and inconsistent formulations of such “monsters” can contribute to a duplication of efforts, a lack of trust in results, and unsupported inferences. In this position paper, we argue causality offers an ideal framework to systematically address these challenges. By making causal assumptions in an approach explicit, we can faithfully model phenomena, formulate testable hypotheses with explanatory power, and leverage principled tools for analysis. To make causal model design more accessible, we identify several useful Common Abstract Topologies (CATs) in causal graphs which help gain insight into the reasoning abilities in large language models. Through a series of case studies, we demonstrate how the precise yet pragmatic language of causality clarifies the strengths and limitations of a method and inspires new approaches for systematic progress.

[AI-6] Computing and Learning on Combinatorial Data

Link: https://arxiv.org/abs/2502.05063
Authors: Simon Zhang
Subjects: Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS)
Comments: Ph.D. dissertation, 503 pages, 66 figures

Abstract:The twenty-first century is a data-driven era where human activities and behavior, physical phenomena, scientific discoveries, technology advancements, and almost everything that happens in the world result in massive generation, collection, and utilization of data. Connectivity in data is a crucial property. A straightforward example is the World Wide Web, where every webpage is connected to other web pages through hyperlinks, providing a form of directed connectivity. Combinatorial data refers to combinations of data items based on certain connectivity rules. Other forms of combinatorial data include social networks, meshes, community clusters, set systems, and molecules. This Ph.D. dissertation focuses on learning and computing with combinatorial data. We study and examine topological and connectivity features within and across connected data to improve the performance of learning and achieve high algorithmic efficiency.

[AI-7] Preference-aware compensation policies for crowdsourced on-demand services

Link: https://arxiv.org/abs/2502.05060
Authors: Georgina Nouli,Axel Parmentier,Maximilian Schiffer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Abstract:Crowdsourced on-demand services offer benefits such as reduced costs, faster service fulfillment times, greater adaptability, and contributions to sustainable urban transportation in on-demand delivery contexts. However, the success of an on-demand platform that utilizes crowdsourcing relies on finding a compensation policy that strikes a balance between creating attractive offers for gig workers and ensuring profitability. In this work, we examine a dynamic pricing problem for an on-demand platform that sets request-specific compensation of gig workers in a discrete-time framework, where requests and workers arrive stochastically. The operator’s goal is to determine a compensation policy that maximizes the total expected reward over the time horizon. Our approach introduces compensation strategies that explicitly account for gig worker request preferences. To achieve this, we employ the Multinomial Logit model to represent the acceptance probabilities of gig workers, and, as a result, derive an analytical solution that utilizes post-decision states. Subsequently, we integrate this solution into an approximate dynamic programming algorithm. We compare our algorithm against benchmark algorithms, including formula-based policies and an upper bound provided by the full information linear programming solution. Our algorithm demonstrates consistent performance across diverse settings, achieving improvements of at least 2.5-7.5% in homogeneous gig worker populations and 9% in heterogeneous populations over benchmarks, based on fully synthetic data. For real-world data, it surpasses benchmarks by 8% in weak and 20% in strong location preference scenarios.
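To make the Multinomial Logit step concrete: under that model, a worker offered several requests picks request i with probability exp(u_i) / (exp(u_0) + sum_j exp(u_j)), where u_0 is the utility of declining everything. The sketch below computes those probabilities in plain Python. The linear utility in `offer_utility` and all parameter names are hypothetical, since the abstract does not say how compensation enters the utility.

```python
import math

def mnl_choice_probabilities(utilities, outside_utility=0.0):
    """Multinomial Logit choice probabilities with an outside option.

    Returns (per-offer probabilities, probability of declining all offers).
    """
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(list(utilities) + [outside_utility])
    weights = [math.exp(u - shift) for u in utilities]
    outside = math.exp(outside_utility - shift)
    total = outside + sum(weights)
    return [w / total for w in weights], outside / total

def offer_utility(compensation, effort_cost, pay_sensitivity=1.0):
    """Hypothetical linear utility: pay minus the effort of serving the request."""
    return pay_sensitivity * compensation - effort_cost

if __name__ == "__main__":
    low_pay = [offer_utility(2.0, 1.5), offer_utility(2.0, 3.0)]
    high_pay = [offer_utility(4.0, 1.5), offer_utility(2.0, 3.0)]
    print(mnl_choice_probabilities(low_pay))
    print(mnl_choice_probabilities(high_pay))
```

Raising the compensation on one request increases its acceptance probability and lowers both the other request's probability and the chance of declining, which is the trade-off a compensation policy has to price.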

[AI-8] Federated Learning for Anomaly Detection in Energy Consumption Data: Assessing the Vulnerability to Adversarial Attacks

Link: https://arxiv.org/abs/2502.05041
Authors: Yohannis Kifle Telila,Damitha Senevirathne,Dumindu Tissera,Apurva Narayan,Miriam A.M. Capretz,Katarina Grolinger
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 12th IEEE Conference on Technologies for Sustainability

Abstract:Anomaly detection is crucial in the energy sector to identify irregular patterns indicating equipment failures, energy theft, or other issues. Machine learning techniques for anomaly detection have achieved great success, but are typically centralized, involving sharing local data with a central server which raises privacy and security concerns. Federated Learning (FL) has been gaining popularity as it enables distributed learning without sharing local data. However, FL depends on neural networks, which are vulnerable to adversarial attacks that manipulate data, leading models to make erroneous predictions. While adversarial attacks have been explored in the image domain, they remain largely unexplored in time series problems, especially in the energy domain. Moreover, the effect of adversarial attacks in the FL setting is also mostly unknown. This paper assesses the vulnerability of FL-based anomaly detection in energy data to adversarial attacks. Specifically, two state-of-the-art models, Long Short Term Memory (LSTM) and Transformers, are used to detect anomalies in an FL setting, and two white-box attack methods, Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD), are employed to perturb the data. The results show that FL is more sensitive to PGD attacks than to FGSM attacks, attributed to PGD’s iterative nature, resulting in an accuracy drop of over 10% even with naive, weaker attacks. Moreover, FL is more affected by these attacks than centralized learning, highlighting the need for defense mechanisms in FL.
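The two white-box attacks named in the abstract differ in one respect: FGSM takes a single step of size epsilon along the sign of the input gradient, while PGD takes several smaller steps and projects back into the epsilon-ball after each one. The sketch below shows both in plain Python against a tiny logistic-regression classifier, which is a stand-in chosen for brevity; the paper attacks LSTM and Transformer models in a federated setting, and all weights and inputs here are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_input_grad(w, b, x, y):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    tiny = 1e-12  # avoids log(0)
    loss = -(y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny))
    grad = [(p - y) * wi for wi in w]
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Fast Gradient Sign Method: one step of size eps."""
    _, g = loss_and_input_grad(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def pgd(w, b, x, y, eps, alpha, steps):
    """Projected Gradient Descent: repeated small steps, each followed by
    projection onto the L-infinity ball of radius eps around x."""
    adv = list(x)
    for _ in range(steps):
        _, g = loss_and_input_grad(w, b, adv, y)
        adv = [ai + alpha * sign(gi) for ai, gi in zip(adv, g)]
        adv = [min(max(ai, xi - eps), xi + eps) for ai, xi in zip(adv, x)]
    return adv

if __name__ == "__main__":
    w, b = [2.0, -1.0], 0.1          # made-up model parameters
    x, y = [0.5, 0.2], 1             # made-up input and label
    clean_loss, _ = loss_and_input_grad(w, b, x, y)
    fgsm_loss, _ = loss_and_input_grad(w, b, fgsm(w, b, x, y, 0.1), y)
    pgd_loss, _ = loss_and_input_grad(w, b, pgd(w, b, x, y, 0.1, 0.04, 5), y)
    print(clean_loss, fgsm_loss, pgd_loss)
```

On a linear model like this one the two attacks end up at the same corner of the epsilon-ball; the extra strength of PGD that the abstract reports comes from nonlinear models, where re-evaluating the gradient at each step matters.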

[AI-9] Bridging Voting and Deliberation with Algorithms: Field Insights from vTaiwan and Kultur Komitee

Link: https://arxiv.org/abs/2502.05017
Authors: Joshua C. Yang,Fynn Bachmann
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments: Submitted to ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2025

Abstract:Democratic processes increasingly aim to integrate large-scale voting with face-to-face deliberation, addressing the challenge of reconciling individual preferences with collective decision-making. This work introduces new methods that use algorithms and computational tools to bridge online voting with face-to-face deliberation, tested in two real-world scenarios: Kultur Komitee 2024 (KK24) and vTaiwan. These case studies highlight the practical applications and impacts of the proposed methods. We present three key contributions: (1) Radial Clustering for Preference Based Subgroups, which enables both in-depth and broad discussions in deliberative settings by computing homogeneous and heterogeneous group compositions with balanced and adjustable group sizes; (2) Human-in-the-loop MES, a practical method that enhances the Method of Equal Shares (MES) algorithm with real-time digital feedback. This builds algorithmic trust by giving participants full control over how much decision-making is delegated to the voting aggregation algorithm as compared to deliberation; and (3) the ReadTheRoom deliberation method, which uses opinion space mapping to identify agreement and divergence, along with spectrum-based preference visualisation to track opinion shifts during deliberation. This approach enhances transparency by clarifying collective sentiment and fosters collaboration by encouraging participants to engage constructively with differing perspectives. By introducing these actionable frameworks, this research extends in-person deliberation with scalable digital methods that address the complexities of modern decision-making in participatory processes. 
MSC classes: 91B14, 91B12, 91A12, 68T01, 68T20, 68U35
ACM classes: H.5.3; I.2.0; I.2.11; J.1; G.2.0; G.2.2; K.4.1; K.4.3
Cite as: arXiv:2502.05017 [cs.HC] (or arXiv:2502.05017v1 [cs.HC] for this version), https://doi.org/10.48550/arXiv.2502.05017

[AI-10] Analyzing Advanced AI Systems Against Definitions of Life and Consciousness

Link: https://arxiv.org/abs/2502.05007
Authors: Azadeh Alavi,Hossein Akhoundi,Fatemeh Kouchmeshki
Categories: Artificial Intelligence (cs.AI)
Comments: 78 pages, 15 figures, 4 tables

Abstract:Could artificial intelligence ever become truly conscious in a functional sense? This paper explores that open-ended question through the lens of Life, a concept unifying classical biological criteria (Oxford, NASA, Koshland) with empirical hallmarks such as adaptive self-maintenance, emergent complexity, and rudimentary self-referential modeling. We propose a number of metrics for examining whether an advanced AI system has gained consciousness, while emphasizing that we do not claim all AI systems can become conscious. Rather, we suggest that sufficiently advanced architectures exhibiting immune-like sabotage defenses, mirror self-recognition analogs, or meta-cognitive updates may cross key thresholds akin to life-like or consciousness-like traits. To demonstrate these ideas, we start by assessing adaptive self-maintenance capability, and introduce controlled data corruption sabotage into the training process. The result demonstrates AI capability to detect these inconsistencies and revert or self-correct analogous to regenerative biological processes. We also adapt an animal-inspired mirror self-recognition test to neural embeddings, finding that partially trained CNNs can distinguish self from foreign features with complete accuracy. We then extend our analysis by performing a question-based mirror test on five state-of-the-art chatbots (ChatGPT4, Gemini, Perplexity, Claude, and Copilot) and demonstrate their ability to recognize their own answers compared to those of the other chatbots.

[AI-11] A New Paradigm in Tuning Learned Indexes: A Reinforcement Learning Enhanced Approach

Link: https://arxiv.org/abs/2502.05001
Authors: Taiyi Wang,Liang Liang,Guang Yang,Thomas Heinis,Eiko Yoneki
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 15 pages

Abstract:Learned Index Structures (LIS) have significantly advanced data management by leveraging machine learning models to optimize data indexing. However, designing these structures often involves critical trade-offs, making it challenging for both designers and end-users to find an optimal balance tailored to specific workloads and scenarios. While some indexes offer adjustable parameters that demand intensive manual tuning, others rely on fixed configurations based on heuristic auto-tuners or expert knowledge, which may not consistently deliver optimal performance. This paper introduces LITune, a novel framework for end-to-end automatic tuning of Learned Index Structures. LITune employs an adaptive training pipeline equipped with a tailor-made Deep Reinforcement Learning (DRL) approach to ensure stable and efficient tuning. To accommodate long-term dynamics arising from online tuning, we further enhance LITune with an on-the-fly updating mechanism termed the O2 system. These innovations allow LITune to effectively capture state transitions in online tuning scenarios and dynamically adjust to changing data distributions and workloads, marking a significant improvement over other tuning methods. Our experimental results demonstrate that LITune achieves up to a 98% reduction in runtime and a 17-fold increase in throughput compared to default parameter settings given a selected Learned Index instance. These findings highlight LITune’s effectiveness and its potential to facilitate broader adoption of LIS in real-world applications.

[AI-12] Robust Graph Learning Against Adversarial Evasion Attacks via Prior-Free Diffusion-Based Structure Purification WWW2025

Link: https://arxiv.org/abs/2502.05000
Authors: Jiayi Luo,Qingyun Sun,Haonan Yuan,Xingcheng Fu,Jianxin Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for poster at WWW 2025

Abstract:Adversarial evasion attacks pose significant threats to graph learning, and several lines of study have improved the robustness of Graph Neural Networks (GNNs). However, existing works rely on priors about clean graphs or attacking strategies, which are often heuristic and inconsistent. To achieve robust graph learning over different types of evasion attacks and diverse datasets, we investigate this problem from a prior-free structure purification perspective. Specifically, we propose a novel Diffusion-based Structure Purification framework named DiffSP, which creatively incorporates the graph diffusion model to learn intrinsic distributions of clean graphs and purify the perturbed structures by removing adversaries under the direction of the captured predictive patterns without relying on priors. DiffSP is divided into the forward diffusion process and the reverse denoising process, during which structure purification is achieved. To avoid valuable information loss during the forward process, we propose an LID-driven nonisotropic diffusion mechanism to selectively inject noise anisotropically. To promote semantic alignment between the clean graph and the purified graph generated during the reverse process, we reduce the generation uncertainty by the proposed graph transfer entropy guided denoising mechanism. Extensive experiments demonstrate the superior robustness of DiffSP against evasion attacks.

[AI-13] On Sequential Fault-Intolerant Process Planning

Link: https://arxiv.org/abs/2502.04998
Authors: Andrzej Kaczmarczyk,Davin Choo,Niclas Boehmer,Milind Tambe,Haifeng Xu
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages; 7 figures

Abstract:We propose and study a planning problem we call Sequential Fault-Intolerant Process Planning (SFIPP). SFIPP captures a reward structure common in many sequential multi-stage decision problems where the planning is deemed successful only if all stages succeed. Such reward structures are different from classic additive reward structures and arise in important applications such as drug/material discovery, security, and quality-critical product design. We design provably tight online algorithms for settings in which we need to pick between different actions with unknown success chances at each stage. We do so both for the foundational case in which the behavior of actions is deterministic, and the case of probabilistic action outcomes, where we effectively balance exploration for learning and exploitation for planning through the usage of multi-armed bandit algorithms. In our empirical evaluations, we demonstrate that the specialized algorithms we develop, which leverage additional information about the structure of the SFIPP instance, outperform our more general algorithm.

[AI-14] Fast Adaptive Anti-Jamming Channel Access via Deep Q Learning and Coarse-Grained Spectrum Prediction

Link: https://arxiv.org/abs/2502.04963
Authors: Jianshu Zhang,Xiaofu Wu,Junquan Hu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This paper investigates the anti-jamming channel access problem in complex and unknown jamming environments, where the jammer could dynamically adjust its strategies to target different channels. Traditional channel hopping anti-jamming approaches using fixed patterns are ineffective against such dynamic jamming attacks. Although the emerging deep reinforcement learning (DRL) based dynamic channel access approach could achieve the Nash equilibrium under fast-changing jamming attacks, it requires extensive training episodes. To address this issue, we propose a fast adaptive anti-jamming channel access approach guided by the intuition of "learning faster than the jammer", where a synchronously updated coarse-grained spectrum prediction serves as an auxiliary task for the deep Q-learning (DQN) based anti-jamming model. This helps the model identify a superior Q-function compared to standard DRL while significantly reducing the number of training episodes. Numerical results indicate that the proposed approach significantly accelerates the rate of convergence in model training, reducing the required training episodes by up to 70% compared to standard DRL. It also achieves a 10% improvement in throughput over NE strategies, owing to the effective use of coarse-grained spectrum prediction.

[AI-15] The Rising Threat to Emerging AI-Powered Search Engines

Link: https://arxiv.org/abs/2502.04951
Authors: Zeren Luo,Zifan Peng,Yule Liu,Zhen Sun,Mingchen Li,Jingyi Zheng,Xinlei He
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent advancements in Large Language Models (LLMs) have significantly enhanced the capabilities of AI-Powered Search Engines (AIPSEs), offering precise and efficient responses by integrating external databases with pre-existing knowledge. However, we observe that these AIPSEs raise risks such as quoting malicious content or citing malicious websites, leading to harmful or unverified information dissemination. In this study, we conduct the first safety risk quantification on seven production AIPSEs by systematically defining the threat model and risk level, and by evaluating responses to various query types. With data collected from PhishTank, ThreatBook, and LevelBlue, our findings reveal that AIPSEs frequently generate harmful content that contains malicious URLs even with benign queries (e.g., with benign keywords). We also observe that querying a URL directly increases the risk level, while querying in natural language mitigates it. We further perform two case studies on online document spoofing and phishing to show the ease of deceiving AIPSEs in the real-world setting. To mitigate these risks, we develop an agent-based defense with a GPT-4o-based content refinement tool and an XGBoost-based URL detector. Our evaluation shows that our defense can effectively reduce the risk, but at the cost of reducing available information. Our research highlights the urgent need for robust safety measures in AIPSEs.

[AI-16] Data-driven Modality Fusion: An AI-enabled Framework for Large-Scale Sensor Network Management

Link: https://arxiv.org/abs/2502.04937
Authors: Hrishikesh Dutta,Roberto Minerva,Maira Alvi,Noel Crespi
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The development and operation of smart cities rely heavily on large-scale Internet-of-Things (IoT) networks and sensor infrastructures that continuously monitor various aspects of urban environments. These networks generate vast amounts of data, posing challenges related to bandwidth usage, energy consumption, and system scalability. This paper introduces a novel sensing paradigm called Data-driven Modality Fusion (DMF), designed to enhance the efficiency of smart city IoT network management. By leveraging correlations between time-series data from different sensing modalities, the proposed DMF approach reduces the number of physical sensors required for monitoring, thereby minimizing energy expenditure, communication bandwidth, and overall deployment costs. The framework relocates computational complexity from the edge devices to the core, ensuring that resource-constrained IoT devices are not burdened with intensive processing tasks. DMF is validated using data from a real-world IoT deployment in Madrid, demonstrating the effectiveness of the proposed system in accurately estimating traffic, environmental, and pollution metrics from a reduced set of sensors. The proposed solution offers a scalable, efficient mechanism for managing urban IoT networks, while addressing issues of sensor failure and privacy concerns.

[AI-17] Conformal Prediction for Electricity Price Forecasting in the Day-Ahead and Real-Time Balancing Market

Link: https://arxiv.org/abs/2502.04935
Authors: Ciaran O’Connor,Mohamed Bahloul,Roberto Rossi,Steven Prestwich,Andrea Visentin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The integration of renewable energy into electricity markets poses significant challenges to price stability and increases the complexity of market operations. Accurate and reliable electricity price forecasting is crucial for effective market participation, where price dynamics can be significantly more challenging to predict. Probabilistic forecasting, through prediction intervals, efficiently quantifies the inherent uncertainties in electricity prices, supporting better decision-making for market participants. This study explores the enhancement of probabilistic price prediction using Conformal Prediction (CP) techniques, specifically Ensemble Batch Prediction Intervals and Sequential Predictive Conformal Inference. These methods provide precise and reliable prediction intervals, outperforming traditional models in validity metrics. We propose an ensemble approach that combines the efficiency of quantile regression models with the robust coverage properties of time series adapted CP techniques. This ensemble delivers both narrow prediction intervals and high coverage, leading to more reliable and accurate forecasts. We further evaluate the practical implications of CP techniques through a simulated trading algorithm applied to a battery storage system. The ensemble approach demonstrates improved financial returns in energy trading in both the Day-Ahead and Balancing Markets, highlighting its practical benefits for market participants.
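
To make the idea of a conformal prediction interval concrete, here is a minimal sketch of plain split conformal prediction. This is an assumption for illustration: it is the basic variant, not the Ensemble Batch Prediction Intervals or Sequential Predictive Conformal Inference methods the abstract names, and the function names and sample numbers are hypothetical.

```python
import math

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of absolute calibration residuals."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return float("inf")  # too few calibration points for this coverage level
    return sorted(residuals)[k - 1]

def split_conformal_interval(point_forecast, calib_actuals, calib_forecasts, alpha=0.1):
    """Symmetric interval around a point forecast, sized from held-out forecast errors."""
    residuals = [abs(a - f) for a, f in zip(calib_actuals, calib_forecasts)]
    q = conformal_quantile(residuals, alpha)
    return point_forecast - q, point_forecast + q

# Hypothetical calibration data: past prices and the model's forecasts for them.
past_prices = [41.0, 38.5, 52.0, 47.5, 60.0, 44.0, 39.0, 55.5, 49.0]
past_forecasts = [40.0, 40.0, 50.0, 45.0, 55.0, 45.0, 41.0, 53.0, 50.0]
low, high = split_conformal_interval(48.0, past_prices, past_forecasts, alpha=0.2)
```

The coverage guarantee of this basic variant relies on exchangeable data, which time series generally violate. That is the motivation for the time-series adapted techniques the study evaluates.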

[AI-18] Complex Physics-Informed Neural Network

Link: https://arxiv.org/abs/2502.04917
Authors: Chenhao Si,Ming Yan,Xin Li,Zhihong Xia
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 9 figures

Abstract:We propose compleX-PINN, a novel physics-informed neural network (PINN) architecture that incorporates a learnable activation function inspired by Cauchy integral theorem. By learning the parameters of the activation function, compleX-PINN achieves high accuracy with just a single hidden layer. Empirical results show that compleX-PINN effectively solves problems where traditional PINNs struggle and consistently delivers significantly higher precision, often by an order of magnitude.

[AI-19] Unified Approaches in Self-Supervised Event Stream Modeling: Progress and Prospects

Link: https://arxiv.org/abs/2502.04899
Authors: Levente Zólyomi,Tianze Wang,Sofiane Ennadir,Oleg Smirnov,Lele Cao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The proliferation of digital interactions across diverse domains, such as healthcare, e-commerce, gaming, and finance, has resulted in the generation of vast volumes of event stream (ES) data. ES data comprises continuous sequences of timestamped events that encapsulate detailed contextual information relevant to each domain. While ES data holds significant potential for extracting actionable insights and enhancing decision-making, its effective utilization is hindered by challenges such as the scarcity of labeled data and the fragmented nature of existing research efforts. Self-Supervised Learning (SSL) has emerged as a promising paradigm to address these challenges by enabling the extraction of meaningful representations from unlabeled ES data. In this survey, we systematically review and synthesize SSL methodologies tailored for ES modeling across multiple domains, bridging the gaps between domain-specific approaches that have traditionally operated in isolation. We present a comprehensive taxonomy of SSL techniques, encompassing both predictive and contrastive paradigms, and analyze their applicability and effectiveness within different application contexts. Furthermore, we identify critical gaps in current research and propose a future research agenda aimed at developing scalable, domain-agnostic SSL frameworks for ES modeling. By unifying disparate research efforts and highlighting cross-domain synergies, this survey aims to accelerate innovation, improve reproducibility, and expand the applicability of SSL to diverse real-world ES challenges.

[AI-20] Sparse Autoencoders Do Not Find Canonical Units of Analysis ICLR2025

Link: https://arxiv.org/abs/2502.04878
Authors: Patrick Leask,Bart Bussmann,Michael Pearce,Joseph Bloom,Curt Tigges,Noura Al Moubayed,Lee Sharkey,Neel Nanda
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICLR 2025

Abstract:A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in LLMs, and it has been postulated that they can be used to find a canonical set of units: a unique and complete list of atomic features. We cast doubt on this belief using two novel techniques: SAE stitching to show they are incomplete, and meta-SAEs to show they are not atomic. SAE stitching involves inserting or swapping latents from a larger SAE into a smaller one. Latents from the larger SAE can be divided into two categories: novel latents, which improve performance when added to the smaller SAE, indicating they capture novel information, and reconstruction latents, which can replace corresponding latents in the smaller SAE that have similar behavior. The existence of novel features indicates incompleteness of smaller SAEs. Using meta-SAEs – SAEs trained on the decoder matrix of another SAE – we find that latents in SAEs often decompose into combinations of latents from a smaller SAE, showing that larger SAE latents are not atomic. The resulting decompositions are often interpretable; e.g. a latent representing "Einstein" decomposes into "scientist", "Germany", and "famous person". Even if SAEs do not find canonical units of analysis, they may still be useful tools. We suggest that future research should either pursue different approaches for identifying such units, or pragmatically choose the SAE size suited to their task. We provide an interactive dashboard to explore meta-SAEs: this https URL

[AI-21] TAR2: Temporal-Agent Reward Redistribution for Optimal Policy Preservation in Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2502.04864
Authors: Aditya Kapoor,Kale-ab Tessera,Mayank Baranwal,Harshad Khadilkar,Stefano Albrecht,Mingfei Sun
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 23 pages, 5 figures, 4 tables

Abstract:In cooperative multi-agent reinforcement learning (MARL), learning effective policies is challenging when global rewards are sparse and delayed. This difficulty arises from the need to assign credit across both agents and time steps, a problem that existing methods often fail to address in episodic, long-horizon tasks. We propose Temporal-Agent Reward Redistribution (TAR^2), a novel approach that decomposes sparse global rewards into agent-specific, time-step-specific components, thereby providing more frequent and accurate feedback for policy learning. Theoretically, we show that TAR^2 (i) aligns with potential-based reward shaping, preserving the same optimal policies as the original environment, and (ii) maintains policy gradient update directions identical to those under the original sparse reward, ensuring unbiased credit signals. Empirical results on two challenging benchmarks, SMACLite and Google Research Football, demonstrate that TAR^2 significantly stabilizes and accelerates convergence, outperforming strong baselines like AREL and STAS in both learning speed and final performance. These findings establish TAR^2 as a principled and practical solution for agent-temporal credit assignment in sparse-reward multi-agent systems.
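
The abstract's link to potential-based reward shaping can be illustrated with the generic single-agent form of that technique. The sketch below is not the TAR^2 redistribution itself: the potential values are made up and the function names are hypothetical. It shows that adding `gamma * Phi(s') - Phi(s)` to each reward changes the discounted return of a trajectory only by a constant that depends on the start state, which is why the ranking of policies is preserved.

```python
def shaped_rewards(rewards, potentials, gamma):
    """Apply r'_t = r_t + gamma * Phi(s_{t+1}) - Phi(s_t).

    `potentials` holds Phi(s_0), ..., Phi(s_T), one more entry than `rewards`.
    """
    return [r + gamma * potentials[t + 1] - potentials[t]
            for t, r in enumerate(rewards)]

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Sparse, delayed reward: only the final step pays out.
gamma = 0.9
rewards = [0.0, 0.0, 0.0, 1.0]
# Hypothetical potentials; the terminal state has potential 0.
potentials = [0.2, 0.4, 0.7, 0.9, 0.0]

dense = shaped_rewards(rewards, potentials, gamma)
original = discounted_return(rewards, gamma)
shaped = discounted_return(dense, gamma)
# The shaping terms telescope, so shaped == original - Phi(s_0)
# whenever the terminal potential is zero.
```

The shaped sequence gives feedback at every step, while the two returns differ only by `Phi(s_0)`.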

[AI-22] Optimistic Gradient Learning with Hessian Corrections for High-Dimensional Black-Box Optimization

Link: https://arxiv.org/abs/2502.04829
Authors: Yedidya Kfir,Elad Sarafian,Sarit Kraus,Yoram Louzoun
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: We develop a black-box optimization algorithm that learns gradients with neural models and can be applied to solve non-convex high dimensional real-world problems

Abstract:Black-box algorithms are designed to optimize functions without relying on their underlying analytical structure or gradient information, making them essential when gradients are inaccessible or difficult to compute. Traditional methods for solving black-box optimization (BBO) problems predominantly rely on non-parametric models and struggle to scale to large input spaces. Conversely, parametric methods that model the function with neural estimators and obtain gradient signals via backpropagation may suffer from significant gradient errors. A recent alternative, Explicit Gradient Learning (EGL), which directly learns the gradient using a first-order Taylor approximation, has demonstrated superior performance over both parametric and non-parametric methods. In this work, we propose two novel gradient learning variants to address the robustness challenges posed by high-dimensional, complex, and highly non-linear problems. Optimistic Gradient Learning (OGL) introduces a bias toward lower regions in the function landscape, while Higher-order Gradient Learning (HGL) incorporates second-order Taylor corrections to improve gradient accuracy. We combine these approaches into the unified OHGL algorithm, achieving state-of-the-art (SOTA) performance on the synthetic COCO suite. Additionally, we demonstrate OHGL's applicability to high-dimensional real-world machine learning (ML) tasks such as adversarial training and code generation. Our results highlight OHGL's ability to generate stronger candidates, offering a valuable tool for ML researchers and practitioners tackling high-dimensional, non-linear optimization challenges.
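
The idea of learning a gradient from a first-order Taylor approximation can be sketched as follows. This is a simplified illustration under stated assumptions, not the EGL, OGL, or HGL algorithms: it fits a plain vector `g` at a single point instead of a neural gradient model, and the function name and hyperparameters are hypothetical. The estimate minimizes the squared Taylor residual `f(x + d) - f(x) - g . d` over sampled perturbations `d`, using only function evaluations.

```python
import random

def fit_local_gradient(f, x, radius=1e-2, n_samples=64, lr=0.5, epochs=200, seed=0):
    """Estimate the gradient of a black-box f at x from function values only."""
    rng = random.Random(seed)
    dim = len(x)
    fx = f(x)
    # Sample perturbations d in a small box around x and record f(x + d) - f(x).
    samples = []
    for _ in range(n_samples):
        d = [rng.uniform(-radius, radius) for _ in range(dim)]
        samples.append((d, f([xi + di for xi, di in zip(x, d)]) - fx))
    g = [0.0] * dim
    for _ in range(epochs):
        # Gradient of the mean squared Taylor residual with respect to g.
        grad = [0.0] * dim
        for d, df in samples:
            resid = sum(gi * di for gi, di in zip(g, d)) - df
            for i in range(dim):
                grad[i] += 2 * resid * d[i] / n_samples
        # Step size is rescaled by radius**2 because the loss curvature scales with it.
        g = [gi - (lr / radius ** 2) * gr for gi, gr in zip(g, grad)]
    return g

# Hypothetical black-box objective; its true gradient at (1, 2) is (2, 3).
objective = lambda v: v[0] ** 2 + 3.0 * v[1]
estimate = fit_local_gradient(objective, [1.0, 2.0])
```

The estimate carries a small bias from the neglected second-order term, which shrinks with `radius`. Second-order corrections of the kind the abstract attributes to HGL target that term.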

[AI-23] Enhancing SQL Injection Detection and Prevention Using Generative Models

Link: https://arxiv.org/abs/2502.04786
Authors: Naga Sai Dasari,Atta Badii,Armin Moin,Ahmed Ashlam
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 13 pages, 22 Figures, 1 Table

Abstract:SQL Injection (SQLi) continues to pose a significant threat to the security of web applications, enabling attackers to manipulate databases and access sensitive information without authorisation. Although advancements have been made in detection techniques, traditional signature-based methods still struggle to identify sophisticated SQL injection attacks that evade predefined patterns. As SQLi attacks evolve, the need for more adaptive detection systems becomes crucial. This paper introduces an innovative approach that leverages generative models to enhance SQLi detection and prevention mechanisms. By incorporating Variational Autoencoders (VAE), Conditional Wasserstein GAN with Gradient Penalty (CWGAN-GP), and U-Net, synthetic SQL queries were generated to augment training datasets for machine learning models. The proposed method demonstrated improved accuracy in SQLi detection systems by reducing both false positives and false negatives. Extensive empirical testing further illustrated the ability of the system to adapt to evolving SQLi attack patterns, resulting in enhanced precision and robustness.

[AI-24] SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning

Link: https://arxiv.org/abs/2502.04780
Authors: Wanjia Zhao,Mert Yuksekgonul,Shirley Wu,James Zou
Categories: Artificial Intelligence (cs.AI)

Abstract:Multi-agent AI systems powered by large language models (LLMs) are increasingly applied to solve complex tasks. However, these systems often rely on fragile, manually designed prompts and heuristics, making optimization difficult. A key challenge in optimizing multi-agent systems is acquiring suitable training data for specialized agents. We introduce SiriuS, a self-improving, reasoning-driven optimization framework for multi-agent systems. Central to our approach is the construction of an experience library: a repository of high-quality reasoning trajectories. The library is built by retaining reasoning steps that lead to successful outcomes, providing a robust training set for optimizing multi-agent systems. Additionally, we introduce a library augmentation procedure that refines unsuccessful trajectories, further enriching the library. SiriuS boosts performance by 2.86% to 21.88% on reasoning and biomedical QA and enhances agent negotiation in competitive settings. Our results show that SiriuS enhances multi-agent performance while generating reusable data for self-correction and self-play enhancement in the future.

[AI-25] Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning

Link: https://arxiv.org/abs/2502.04778
Authors: Chen-Xiao Gao,Chenyang Wu,Mingjun Cao,Chenjun Xiao,Yang Yu,Zongzhang Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:The primary focus of offline reinforcement learning (RL) is to manage the risk of hazardous exploitation of out-of-distribution actions. An effective approach to achieve this goal is through behavior regularization, which augments conventional RL objectives by incorporating constraints that enforce the policy to remain close to the behavior policy. Nevertheless, existing literature on behavior-regularized RL primarily focuses on explicit policy parameterizations, such as Gaussian policies. Consequently, it remains unclear how to extend this framework to more advanced policy parameterizations, such as diffusion models. In this paper, we introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies, thereby combining the expressive power of diffusion policies and the robustness provided by regularization. The key ingredient of our method is to calculate the Kullback-Leibler (KL) regularization analytically as the accumulated discrepancies in reverse-time transition kernels along the diffusion trajectory. By integrating the regularization, we develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint. Comprehensive evaluations conducted on synthetic 2D tasks and continuous control tasks from the D4RL benchmark validate its effectiveness and superior performance.

[AI-26] DMPA: Model Poisoning Attacks on Decentralized Federated Learning for Model Differences

Link: https://arxiv.org/abs/2502.04771
Authors: Chao Feng,Yunlong Li,Yuanzhe Gao,Alberto Huertas Celdrán,Jan von der Assen,Gérôme Bovet,Burkhard Stiller
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures

Abstract:Federated learning (FL) has garnered significant attention as a prominent privacy-preserving Machine Learning (ML) paradigm. Decentralized FL (DFL) eschews traditional FL’s centralized server architecture, enhancing the system’s robustness and scalability. However, these advantages of DFL also create new vulnerabilities for malicious participants to execute adversarial attacks, especially model poisoning attacks. In model poisoning attacks, malicious participants aim to diminish the performance of benign models by creating and disseminating a compromised model. Existing research on model poisoning attacks has predominantly concentrated on undermining global models within the Centralized FL (CFL) paradigm, while DFL has received far less attention. To fill the research gap, this paper proposes an innovative model poisoning attack called DMPA. This attack calculates the differential characteristics of multiple malicious client models and obtains the most effective poisoning strategy, thereby orchestrating a collusive attack by multiple participants. The effectiveness of this attack is validated across multiple datasets, with results indicating that the DMPA approach consistently surpasses existing state-of-the-art FL model poisoning attack strategies.

[AI-27] Graph Federated Learning Based Proactive Content Caching in Edge Computing

Link: https://arxiv.org/abs/2502.04760
Authors: Rui Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:With the rapid growth of mobile data traffic and the increasing prevalence of video streaming, proactive content caching in edge computing has become crucial for reducing latency and alleviating network congestion. However, traditional caching strategies such as FIFO, LRU, and LFU fail to effectively predict future content popularity, while existing proactive caching approaches often require users to upload data to a central server, raising concerns regarding privacy and scalability. To address these challenges, this paper proposes a Graph Federated Learning-based Proactive Content Caching (GFPCC) scheme that enhances caching efficiency while preserving user privacy. The proposed approach integrates federated learning and graph neural networks, enabling users to locally train Light Graph Convolutional Networks (LightGCN) to capture user-item relationships and predict content popularity. Instead of sharing raw data, only the trained model parameters are transmitted to the central server, where a federated averaging algorithm aggregates updates, refines the global model, and selects the most popular files for proactive caching. Experimental evaluations on real-world datasets, such as MovieLens, demonstrate that GFPCC outperforms baseline caching algorithms by achieving higher cache efficiency through more accurate content popularity predictions. Moreover, the federated learning framework strengthens privacy protection while maintaining efficient model training; however, scalability remains a challenge in large-scale networks with dynamic user preferences.
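
The server-side step described in the abstract, aggregating client parameters by federated averaging and then caching the most popular files, can be sketched as below. This is a minimal illustration under assumptions: parameters are flat lists of floats rather than LightGCN weights, clients are weighted by local sample count as in standard federated averaging, and the function names, scores, and file names are hypothetical.

```python
def federated_average(client_params, client_sizes):
    """Weighted average of client parameter vectors, weighted by local sample count."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(dim)]

def top_k_items(popularity, k):
    """Return the k item ids with the highest predicted popularity score."""
    return sorted(popularity, key=popularity.get, reverse=True)[:k]

# Two hypothetical clients upload trained parameters; raw user data stays local.
global_params = federated_average(
    client_params=[[1.0, 2.0], [3.0, 4.0]],
    client_sizes=[1, 3],
)

# Hypothetical popularity scores produced by the refined global model.
predicted = {"video_a": 0.1, "video_b": 0.9, "video_c": 0.5}
to_cache = top_k_items(predicted, k=2)
```

Only `client_params` crosses the network in this sketch, which is the privacy argument the abstract makes. The cost of collecting updates from many clients is where the scalability concern it mentions would arise.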

[AI-28] Enhancing Phishing Email Identification with Large Language Models

Link: https://arxiv.org/abs/2502.04759
Authors: Catherine Lee
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures

Abstract:Phishing has long been a common tactic used by cybercriminals and continues to pose a significant threat in today’s digital world. As phishing attacks become more advanced and sophisticated, there is an increasing need for effective methods to detect and prevent them. To address the challenging problem of detecting phishing emails, researchers have developed numerous solutions, in particular those based on machine learning (ML) algorithms. In this work, we study the efficacy of large language models (LLMs) in detecting phishing emails. The experiments show that the LLM achieves high accuracy at high precision; importantly, it also provides interpretable evidence for its decisions.

[AI-29] Every Software as an Agent: Blueprint and Case Study

Link: https://arxiv.org/abs/2502.04747
Authors: Mengwei Xu
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:The rise of (multimodal) large language models (LLMs) has shed light on software agents, where software can understand and follow user instructions in natural language. However, existing approaches such as API-based and GUI-based agents are far from satisfactory in terms of accuracy and efficiency. Instead, we advocate endowing LLMs with access to the software internals (source code and runtime context) and the permission to dynamically inject generated code into software for execution. In such a whitebox setting, one may better leverage the software context and the coding ability of LLMs. We then present an overall design architecture and case studies on two popular web-based desktop applications. We also give an in-depth discussion of the challenges and future directions. We deem that such a new paradigm has the potential to fundamentally overturn existing software agent design, and finally create a digital world in which software can comprehend, operate, collaborate, and even think to meet complex user needs.

[AI-30] Generating Symbolic World Models via Test-time Scaling of Large Language Models

Link: https://arxiv.org/abs/2502.04728
Authors: Zhouliang Yu,Yuhuan Yuan,Tim Z. Xiao,Fuxiang Frank Xia,Jie Fu,Ge Zhang,Ge Lin,Weiyang Liu
Categories: Artificial Intelligence (cs.AI)
Comments: Technical Report v1 (32 pages, 6 figures)

Abstract:Solving complex planning problems requires Large Language Models (LLMs) to explicitly model the state transition to avoid rule violations, comply with constraints, and ensure optimality, a task hindered by the inherent ambiguity of natural language. To overcome such ambiguity, Planning Domain Definition Language (PDDL) is leveraged as a planning abstraction that enables precise and formal state descriptions. With PDDL, we can generate a symbolic world model where classic searching algorithms, such as A*, can be seamlessly applied to find optimal plans. However, directly generating PDDL domains with current LLMs remains an open challenge due to the lack of PDDL training data. To address this challenge, we propose to scale up the test-time computation of LLMs to enhance their PDDL reasoning capabilities, thereby enabling the generation of high-quality PDDL domains. Specifically, we introduce a simple yet effective algorithm, which first employs a Best-of-N sampling approach to improve the quality of the initial solution and then refines the solution in a fine-grained manner with verbalized machine learning. Our method outperforms o1-mini by a considerable margin in the generation of PDDL domains, achieving over 50% success rate on two tasks (i.e., generating PDDL domains from natural language description or PDDL problems). This is done without requiring additional training. By taking advantage of PDDL as state abstraction, our method is able to outperform current state-of-the-art methods on almost all competition-level planning tasks.
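
The Best-of-N step mentioned in this abstract can be illustrated generically. The sketch below is not the paper's algorithm: `best_of_n`, the toy draft list, and the keyword-counting scorer are hypothetical stand-ins for an LLM sampler and a PDDL validity check, and the later refinement stage is not shown.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(generate: Callable[[random.Random], str],
              score: Callable[[str], float],
              n: int,
              seed: int = 0) -> Tuple[str, float]:
    """Draw n candidates and keep the one with the highest score."""
    rng = random.Random(seed)
    candidates: List[str] = [generate(rng) for _ in range(n)]
    best = max(candidates, key=score)
    return best, score(best)


# Hypothetical stand-ins: a "model" that picks a random draft, and a
# scorer that counts which required PDDL sections a draft contains.
DRAFTS = [
    "(define (domain d))",
    "(define (domain d) (:predicates (on ?x ?y)))",
    "(define (domain d) (:predicates (on ?x ?y)) (:action move))",
]


def toy_generate(rng: random.Random) -> str:
    return rng.choice(DRAFTS)


def toy_score(text: str) -> float:
    return float(sum(kw in text for kw in (":predicates", ":action")))


best, best_score = best_of_n(toy_generate, toy_score, n=8)
```

The extra compute goes into sampling and scoring at inference time, which is consistent with the abstract's point that no additional training is required.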

[AI-31] EigenLoRAx: Recycling Adapters to Find Principal Subspaces for Resource-Efficient Adaptation and Inference

Link: https://arxiv.org/abs/2502.04700
Authors: Prakhar Kaushik,Ankit Vaidya,Shravan Chaudhari,Alan Yuille
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The rapid growth of large models has raised concerns about their environmental impact and equity in accessibility due to significant computational costs. Low-Rank Adapters (LoRA) offer a lightweight solution for finetuning large models, resulting in an abundance of publicly available adapters tailored to diverse domains. We ask: Can these pretrained adapters be leveraged to further streamline adaptation to new tasks while addressing these challenges? We introduce EigenLoRAx, a parameter-efficient finetuning method that recycles existing adapters to create a principal subspace aligned with their shared domain knowledge, which can be further augmented with orthogonal basis vectors in low-resource scenarios. This enables rapid adaptation to new tasks by learning only lightweight coefficients on the principal components of the subspace, eliminating the need to finetune entire adapters. EigenLoRAx requires significantly fewer parameters and memory, improving efficiency for both training and inference. Our method demonstrates strong performance across diverse domains and tasks, offering a scalable solution for edge-based applications, personalization, and equitable deployment of large models in resource-constrained environments.

[AI-32] Bridging the Gap in XAI-Why Reliable Metrics Matter for Explainability and Compliance

Link: https://arxiv.org/abs/2502.04695
Authors: Pratinav Seth,Vinay Kumar Sankarapu
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:This position paper emphasizes the critical gap in the evaluation of Explainable AI (XAI) due to the lack of standardized and reliable metrics, which diminishes its practical value, trustworthiness, and ability to meet regulatory requirements. Current evaluation methods are often fragmented, subjective, and biased, making them prone to manipulation and complicating the assessment of complex models. A central issue is the absence of a ground truth for explanations, complicating comparisons across various XAI approaches. To address these challenges, we advocate for widespread research into developing robust, context-sensitive evaluation metrics. These metrics should be resistant to manipulation, relevant to each use case, and based on human judgment and real-world applicability. We also recommend creating domain-specific evaluation benchmarks that align with the user and regulatory needs of sectors such as healthcare and finance. By encouraging collaboration among academia, industry, and regulators, we can create standards that balance flexibility and consistency, ensuring XAI explanations are meaningful, trustworthy, and compliant with evolving regulations.

[AI-33] Learning Strategic Language Agents in the Werewolf Game with Iterative Latent Space Policy Optimization

Link: https://arxiv.org/abs/2502.04686
Authors: Zelai Xu,Wanjun Gu,Chao Yu,Yi Wu,Yu Wang
Categories: Artificial Intelligence (cs.AI)

Abstract:Large language model (LLM)-based agents have recently shown impressive progress in a variety of domains, including open-ended conversation and multi-step decision-making. However, applying these agents to social deduction games such as Werewolf, which requires both strategic decision-making and free-form language interaction, remains non-trivial. Traditional methods based on Counterfactual Regret Minimization (CFR) or reinforcement learning (RL) typically depend on a predefined action space, making them unsuitable for language games with unconstrained text action space. Meanwhile, pure LLM-based agents often suffer from intrinsic biases and require prohibitively large datasets for fine-tuning. We propose Latent Space Policy Optimization (LSPO), an iterative framework that addresses these challenges by first mapping free-form text to a discrete latent space, where methods like CFR and RL can learn strategic policy more effectively. We then translate the learned policy back into natural language dialogues, which are used to fine-tune an LLM via Direct Preference Optimization (DPO). By iteratively alternating between these stages, our LSPO agent progressively enhances both strategic reasoning and language communication. Experiment results on the Werewolf game show that our method improves the agent’s performance in each iteration and outperforms existing Werewolf agents, underscoring its promise for free-form language decision-making.

[AI-34] G2PDiffusion: Genotype-to-Phenotype Prediction with Diffusion Models

Link: https://arxiv.org/abs/2502.04684
Authors: Mengdi Liu,Zhangyang Gao,Hong Chang,Stan Z. Li,Shiguang Shan,Xinlin Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Discovering the genotype-phenotype relationship is crucial for genetic engineering, which will facilitate advances in fields such as crop breeding, conservation biology, and personalized medicine. Current research usually focuses on single species and small datasets due to limitations in phenotypic data collection, especially for traits that require visual assessments or physical measurements. Deciphering complex and composite phenotypes, such as morphology, from genetic data at scale remains an open question. To break through traditional generic models that rely on simplified assumptions, this paper introduces G2PDiffusion, the first-of-its-kind diffusion model designed for genotype-to-phenotype generation across multiple species. Specifically, we use images to represent morphological phenotypes across species and redefine phenotype prediction as conditional image generation. To this end, this paper introduces an environment-enhanced DNA sequence conditioner and trains a stable diffusion model with a novel alignment method to improve genotype-to-phenotype consistency. Extensive experiments demonstrate that our approach enhances phenotype prediction accuracy across species, capturing subtle genetic variations that contribute to observable traits.

[AI-35] Scalable Oversight for Superhuman AI via Recursive Self-Critiquing

Link: https://arxiv.org/abs/2502.04675
Authors: Xueru Wen,Jie Lou,Xinyu Lu,Junjie Yang,Yanjiang Liu,Yaojie Lu,Debing Zhang,XingYu
Categories: Artificial Intelligence (cs.AI)

Abstract:As AI capabilities increasingly surpass human proficiency in complex tasks, current alignment techniques including SFT and RLHF face fundamental challenges in ensuring reliable oversight. These methods rely on direct human assessment and become untenable when AI outputs exceed human cognitive thresholds. In response to this challenge, we explore two hypotheses: (1) critique of critique can be easier than critique itself, extending the widely-accepted observation that verification is easier than generation to the critique domain, as critique itself is a specialized form of generation; (2) this difficulty relationship is recursively held, suggesting that when direct evaluation is infeasible, performing high-order critiques (e.g., critique of critique of critique) offers a more tractable supervision pathway. To examine these hypotheses, we perform Human-Human, Human-AI, and AI-AI experiments across multiple tasks. Our results demonstrate encouraging evidence supporting these hypotheses and suggest that recursive self-critiquing is a promising direction for scalable oversight.

[AI-36] ProofWala: Multilingual Proof Data Synthesis and Theorem-Proving

Link: https://arxiv.org/abs/2502.04671
Authors: Amitayush Thakur,George Tsoukalas,Greg Durrett,Swarat Chaudhuri
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)

Abstract:Neural networks have shown substantial promise at automatic theorem-proving in interactive proof assistants (ITPs) like Lean and Coq. However, most neural theorem-proving models are restricted to specific ITPs, leaving out opportunities for cross-lingual transfer between ITPs. We address this weakness with a multilingual proof framework, ProofWala, that allows a standardized form of interaction between neural theorem-provers and two established ITPs (Coq and Lean). It enables the collection of multilingual proof step data (data recording the result of proof actions on ITP states) for training neural provers. ProofWala allows the systematic evaluation of a model’s performance across different ITPs and problem domains via efficient parallel proof search algorithms. We show that multilingual training enabled by ProofWala can lead to successful transfer across ITPs. Specifically, a model trained on a mix of ProofWala-generated Coq and Lean data outperforms Lean-only and Coq-only models on the standard prove-at-k metric. We open source all code, including code for the ProofWala Framework (this https URL) and the Multilingual ITP interaction framework (this https URL).

[AI-37] CCS: Controllable and Constrained Sampling with Diffusion Models via Initial Noise Perturbation

Link: https://arxiv.org/abs/2502.04670
Authors: Bowen Song,Zecheng Zhang,Zhaoxu Luo,Jason Hu,Wei Yuan,Jing Jia,Zhengxu Tang,Guanyang Wang,Liyue Shen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Diffusion models have emerged as powerful tools for generative tasks, producing high-quality outputs across diverse domains. However, how the generated data responds to the initial noise perturbation in diffusion models remains under-explored, which hinders understanding the controllability of the sampling process. In this work, we first observe an interesting phenomenon: the relationship between the change of generation outputs and the scale of initial noise perturbation is highly linear through the diffusion ODE sampling. Then we provide both theoretical and empirical study to justify this linearity property of this input-output (noise-generation data) relationship. Inspired by these new insights, we propose a novel Controllable and Constrained Sampling method (CCS) together with a new controller algorithm for diffusion models to sample with desired statistical properties while preserving good sample quality. We perform extensive experiments to compare our proposed sampling approach with other methods on both sampling controllability and sampled data quality. Results show that our CCS method achieves more precisely controlled sampling while maintaining superior sample quality and diversity.

[AI-38] A Comprehensive Review on Noise Control of Diffusion Model

Link: https://arxiv.org/abs/2502.04669
Authors: Zhehao Guo,Jiedong Lang,Shuyu Huang,Yunfei Gao,Xintong Ding
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Diffusion models have recently emerged as powerful generative frameworks for producing high-quality images. A pivotal component of these models is the noise schedule, which governs the rate of noise injection during the diffusion process. Since the noise schedule substantially influences sampling quality and training quality, understanding its design and implications is crucial. In this discussion, various noise schedules are examined, and their distinguishing features and performance characteristics are highlighted.
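
To make "noise schedule" concrete, here is a small sketch of two widely used schedules, expressed through the cumulative signal level alpha_bar(t). The abstract does not say which schedules the review covers, so the choice of a linear beta schedule and a cosine schedule, along with the default constants, is an assumption based on common practice in the diffusion literature.

```python
import math
from typing import List


def linear_alpha_bar(T: int, beta_start: float = 1e-4,
                     beta_end: float = 0.02) -> List[float]:
    """Cumulative product of (1 - beta_t) for linearly spaced beta_t (T >= 2)."""
    alpha_bar, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def cosine_alpha_bar(T: int, s: float = 0.008) -> List[float]:
    """Cosine schedule: alpha_bar(t) = f(t) / f(0) with a squared-cosine f."""
    def f(t: float) -> float:
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]


lin = linear_alpha_bar(1000)
cos = cosine_alpha_bar(1000)
```

Both lists decrease from near 1 toward 0, but at different rates: with these defaults the cosine schedule keeps noticeably more signal at the midpoint than the linear one, which is the kind of difference a schedule comparison examines.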

[AI-39] Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization

Link: https://arxiv.org/abs/2502.04667
Authors: Xinhao Yao,Ruifeng Ren,Yun Liao,Yong Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Training large language models (LLMs) with high-quality Chain-of-Thought (CoT) annotations has become a widely adopted strategy due to its significant enhancement of reasoning capabilities. To fully comprehend this approach, two questions naturally arise: (Q1) What advantages does training with CoT offer compared to training without CoT? (Q2) If there are advantages, what are the underlying mechanisms of explicit CoT training? Analyzing the advantages and mechanisms of CoT training is challenging due to the many factors involved. To address this, we conduct a detailed analysis using clear and controllable data distributions and, for the first time, reveal that CoT training offers the following advantages: (1) Training with CoT markedly improves reasoning generalization, extending it from in-distribution (ID) to both ID and out-of-distribution (OOD) scenarios, while also speeding up convergence; (2) Even when training with CoT includes a certain range of erroneous reasoning steps, it still enables the model to learn reasoning patterns, leading to systematic generalization. We further explore the underlying mechanisms from a circuit perspective: (1) The data distribution (e.g., ratio $\lambda$ and pattern) plays a crucial role in influencing the model’s systematic generalization; (2) CoT training (with two-hop facts) internalizes reasoning into a two-stage generalizing circuit, where the number of stages corresponds to the explicit reasoning steps during training. Our findings elucidate the mechanisms underlying explicit CoT training and offer critical insights into tuning strategies for LLMs to achieve robust generalization.

[AI-40] Importance Sampling via Score-based Generative Models

Link: https://arxiv.org/abs/2502.04646
Authors: Heasung Kim,Taekyun Lee,Hyeji Kim,Gustavo de Veciana
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages

Abstract:Importance sampling, which involves sampling from a probability density function (PDF) proportional to the product of an importance weight function and a base PDF, is a powerful technique with applications in variance reduction, biased or customized sampling, data augmentation, and beyond. Inspired by the growing availability of score-based generative models (SGMs), we propose an entirely training-free importance sampling framework that relies solely on an SGM for the base PDF. Our key innovation is realizing the importance sampling process as a backward diffusion process, expressed in terms of the score function of the base PDF and the specified importance weight function (both readily available), eliminating the need for any additional training. We conduct a thorough analysis demonstrating the method’s scalability and effectiveness across diverse datasets and tasks, including importance sampling for industrial and natural images with neural importance weight functions. The training-free aspect of our method is particularly compelling in real-world scenarios where a single base distribution underlies multiple biased sampling tasks, each requiring a different importance weight function. To the best of our knowledge, our approach is the first importance sampling framework to achieve this.
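
The reason a base score function plus a weight function can suffice is visible in the noise-free case. The identity below is standard calculus, not the paper's derivation; the symbols $p$, $w$, $q$, and $Z$ are notation introduced here for the base PDF, the importance weight, the target PDF, and its normalizer.

```latex
q(x) = \frac{w(x)\,p(x)}{Z}, \qquad Z = \int w(x)\,p(x)\,\mathrm{d}x
\quad\Longrightarrow\quad
\nabla_x \log q(x) = \nabla_x \log p(x) + \nabla_x \log w(x)
```

The normalizer $Z$ does not depend on $x$, so it drops out of the score. At nonzero noise levels the score of the diffused target is generally not this simple sum, and the abstract does not spell out how the paper handles that step.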

[AI-41] Cross-Encoder Rediscovers a Semantic Variant of BM25

Link: https://arxiv.org/abs/2502.04645
Authors: Meng Lu,Catherine Chen,Carsten Eickhoff
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Neural Ranking Models (NRMs) have rapidly advanced state-of-the-art performance on information retrieval tasks. In this work, we investigate a Cross-Encoder variant of MiniLM to determine which relevance features it computes and where they are stored. We find that it employs a semantic variant of the traditional BM25 in an interpretable manner, featuring localized components: (1) Transformer attention heads that compute soft term frequency while controlling for term saturation and document length effects, and (2) a low-rank component of its embedding matrix that encodes inverse document frequency information for the vocabulary. This suggests that the Cross-Encoder uses the same fundamental mechanisms as BM25, but further leverages their capacity to capture semantics for improved retrieval performance. The granular understanding lays the groundwork for model editing to enhance model transparency, addressing safety concerns, and improving scalability in training and real-world applications.
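
For reference, the classical BM25 scoring function that the abstract compares against can be written in a few lines. This is a sketch of the textbook formula with a commonly used smoothed IDF, not the Cross-Encoder analysis itself; the function name `bm25_scores` and the defaults `k1=1.2`, `b=0.75` are conventional choices assumed here.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query with classical BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # k1 caps the gain from repeated terms (saturation);
            # b controls how strongly long documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["cat", "sat"], ["dog", "ran", "far"], ["cat", "cat", "dog"]]
scores = bm25_scores(["cat"], docs)
```

The three ingredients named in the abstract map onto this code directly: term frequency with saturation (`k1`), document length normalization (`b`), and inverse document frequency (`idf`).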

[AI-42] An Empirical Study of Code Obfuscation Practices in the Google Play Store

Link: https://arxiv.org/abs/2502.04636
Authors: Akila Niroshan,Suranga Seneviratne,Aruna Seneviratne
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 13 pages, 8 figures

Abstract:The Android ecosystem is vulnerable to issues such as app repackaging, counterfeiting, and piracy, threatening both developers and users. To mitigate these risks, developers often employ code obfuscation techniques. However, while effective in protecting legitimate applications, obfuscation also hinders security investigations as it is often exploited for malicious purposes. As such, it is important to understand code obfuscation practices in Android apps. In this paper, we analyze over 500,000 Android APKs from Google Play, spanning an eight-year period, to investigate the evolution and prevalence of code obfuscation techniques. First, we propose a set of classifiers to detect obfuscated code, tools, and techniques and then conduct a longitudinal analysis to identify trends. Our results show a 13% increase in obfuscation from 2016 to 2023, with ProGuard and Allatori as the most commonly used tools. We also show that obfuscation is more prevalent in top-ranked apps and gaming genres such as Casino apps. To our knowledge, this is the first large-scale study of obfuscation adoption in the Google Play Store, providing insights for developers and security analysts.

[AI-43] The $\alpha$-Alternator: Dynamic Adaptation To Varying Noise Levels In Sequences Using The Vendi Score For Improved Robustness and Performance

Link: https://arxiv.org/abs/2502.04593
Authors: Mohammad Reza Rezaei,Adji Bousso Dieng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Comments: The codebase will be made available upon publication. This paper is dedicated to Patrice Lumumba

Abstract:Current state-of-the-art dynamical models, such as Mamba, assume the same level of noisiness for all elements of a given sequence, which limits their performance on noisy temporal data. In this paper, we introduce the $\alpha$-Alternator, a novel generative model for time-dependent data that dynamically adapts to the complexity introduced by varying noise levels in sequences. The $\alpha$-Alternator leverages the Vendi Score (VS), a flexible similarity-based diversity metric, to adjust, at each time step $t$, the influence of the sequence element at time $t$ and the latent representation of the dynamics up to that time step on the predicted future dynamics. This influence is captured by a parameter that is learned and shared across all sequences in a given dataset. The sign of this parameter determines the direction of influence. A negative value indicates a noisy dataset, where a sequence element that increases the VS is considered noisy, and the model relies more on the latent history when processing that element. Conversely, when the parameter is positive, a sequence element that increases the VS is considered informative, and the $\alpha$-Alternator relies more on this new input than on the latent history when updating its predicted latent dynamics. The $\alpha$-Alternator is trained using a combination of observation masking and Alternator loss minimization. Masking simulates varying noise levels in sequences, enabling the model to be more robust to these fluctuations and improving its performance in trajectory prediction, imputation, and forecasting. Our experimental results demonstrate that the $\alpha$-Alternator outperforms both Alternators and state-of-the-art state-space models across neural decoding and time-series forecasting benchmarks.
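
For readers unfamiliar with the Vendi Score, its usual definition is reproduced below as background. This follows the original Vendi Score formulation as commonly stated, not anything specific to this abstract; $K$, $n$, and $\lambda_i$ are notation introduced here, and how the $\alpha$-Alternator applies the score per time step is not shown.

```latex
\mathrm{VS}(K) = \exp\!\Big(-\sum_{i=1}^{n} \lambda_i \log \lambda_i\Big),
\qquad \lambda_1,\dots,\lambda_n \ \text{the eigenvalues of } K/n
```

Here $K$ is an $n \times n$ positive semidefinite similarity matrix with unit diagonal, and $0 \log 0$ is taken as $0$. The score ranges from $1$ (all items identical) to $n$ (all items mutually dissimilar), so it reads as an effective number of distinct items.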

[AI-44] CAMEF: Causal-Augmented Multi-Modality Event-Driven Financial Forecasting by Integrating Time Series Patterns and Salient Macroeconomic Announcements

Link: https://arxiv.org/abs/2502.04592
Authors: Yang Zhang,Wenbo Yang,Jun Wang,Qiang Ma,Jie Xiong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

Abstract:Accurately forecasting the impact of macroeconomic events is critical for investors and policymakers. Salient events like monetary policy decisions and employment reports often trigger market movements by shaping expectations of economic growth and risk, thereby establishing causal relationships between events and market behavior. Existing forecasting methods typically focus either on textual analysis or time-series modeling, but fail to capture the multi-modal nature of financial markets and the causal relationship between events and price movements. To address these gaps, we propose CAMEF (Causal-Augmented Multi-Modality Event-Driven Financial Forecasting), a multi-modality framework that effectively integrates textual and time-series data with a causal learning mechanism and an LLM-based counterfactual event augmentation technique for causal-enhanced financial forecasting. Our contributions include: (1) a multi-modal framework that captures causal relationships between policy texts and historical price data; (2) a new financial dataset with six types of macroeconomic releases from 2008 to April 2024, and high-frequency real trading data for five key U.S. financial assets; and (3) an LLM-based counterfactual event augmentation strategy. We compare CAMEF to state-of-the-art transformer-based time-series and multi-modal baselines, and perform ablation studies to validate the effectiveness of the causal learning mechanism and event types.

[AI-45] Rethinking Oversmoothing in Graph Neural Networks: A Rank-Based Perspective

Link: https://arxiv.org/abs/2502.04591
Authors: Piero Deidda,Kaicheng Zhang,Desmond Higham,Francesco Tudisco
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Oversmoothing is a fundamental challenge in graph neural networks (GNNs): as the number of layers increases, node embeddings become increasingly similar, and model performance drops sharply. Traditionally, oversmoothing has been quantified using metrics that measure the similarity of neighbouring node features, such as the Dirichlet energy. While these metrics are related to oversmoothing, we argue they have critical limitations and fail to reliably capture oversmoothing in realistic scenarios. For instance, they provide meaningful insights only for very deep networks and under somewhat strict conditions on the norm of network weights and feature representations. As an alternative, we propose measuring oversmoothing by examining the numerical or effective rank of the feature representations. We provide theoretical support for this approach, demonstrating that the numerical rank of feature representations converges to one for a broad family of nonlinear activation functions under the assumption of nonnegative trained weights. To the best of our knowledge, this is the first result that proves the occurrence of oversmoothing without assumptions on the boundedness of the weight matrices. Along with the theoretical findings, we provide extensive numerical evaluation across diverse graph architectures. Our results show that rank-based metrics consistently capture oversmoothing, whereas energy-based metrics often fail. Notably, we reveal that a significant drop in the rank aligns closely with performance degradation, even in scenarios where energy metrics remain unchanged.
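
The contrast between energy-based and rank-based views can be seen on a toy example. The sketch below uses one common unnormalized form of the Dirichlet energy (sum of squared feature differences over edges) and a simple rank-one check via 2x2 minors; both helper names and the toy graph are assumptions for illustration, and the paper's exact numerical and effective rank definitions are not reproduced here.

```python
from typing import List, Sequence, Tuple


def dirichlet_energy(features: Sequence[Sequence[float]],
                     edges: Sequence[Tuple[int, int]]) -> float:
    """Sum of squared feature differences over undirected edges."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
        for i, j in edges
    )


def is_rank_one(features: Sequence[Sequence[float]], tol: float = 1e-9) -> bool:
    """True if the matrix is nonzero and every 2x2 minor vanishes."""
    rows, cols = len(features), len(features[0])
    nonzero = any(abs(v) > tol for row in features for v in row)
    minors_vanish = all(
        abs(features[i][p] * features[j][q] - features[i][q] * features[j][p]) <= tol
        for i in range(rows) for j in range(i + 1, rows)
        for p in range(cols) for q in range(p + 1, cols)
    )
    return nonzero and minors_vanish


# Toy path graph 0 - 1 - 2 whose node features are all multiples of [1, 2].
edges = [(0, 1), (1, 2)]
X: List[List[float]] = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
X_small = [[0.1 * v for v in row] for row in X]

energy = dirichlet_energy(X, edges)
energy_small = dirichlet_energy(X_small, edges)
collapsed = is_rank_one(X)
```

The features are already rank one, yet the energy is far from zero, and shrinking the feature norm by a factor of 10 shrinks the energy by a factor of 100 without changing the rank. This is a toy illustration of why the energy depends on norms, not a reproduction of the paper's experiments.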

[AI-46] Technical Debt in In-Context Learning: Diminishing Efficiency in Long Context

Link: https://arxiv.org/abs/2502.04580
Authors: Taejong Joo,Diego Klabjan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Transformers have demonstrated remarkable in-context learning (ICL) capabilities, adapting to new tasks by simply conditioning on demonstrations without parameter updates. Compelling empirical and theoretical evidence suggests that ICL, as a general-purpose learner, could outperform task-specific models. However, it remains unclear to what extent the transformers optimally learn in-context compared to principled learning algorithms. To bridge this gap, we introduce a new framework for quantifying optimality of ICL as a learning algorithm in stylized settings. Our findings reveal a striking dichotomy: while ICL initially matches the efficiency of a Bayes optimal estimator, its efficiency significantly deteriorates in long context. Through an information-theoretic analysis, we show that the diminishing efficiency is inherent to ICL. These results clarify the trade-offs in adopting ICL as a universal problem solver, motivating a new generation of on-the-fly adaptive methods without the diminishing efficiency.

[AI-47] Zero-shot Meta-learning for Tabular Prediction Tasks with Adversarially Pre-trained Transformer

Link: https://arxiv.org/abs/2502.04573
Authors: Yulun Wu,Doron L. Bergman
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We present an Adversarially Pre-trained Transformer (APT) that is able to perform zero-shot meta-learning on tabular prediction tasks without pre-training on any real-world dataset, extending on the recent development of Prior-Data Fitted Networks (PFNs) and TabPFN. Specifically, APT is pre-trained with adversarial synthetic data agents, who continue to shift their underlying data generating distribution and deliberately challenge the model with different synthetic datasets. In addition, we propose a mixture block architecture that is able to handle classification tasks with arbitrary number of classes, addressing the class size limitation – a crucial weakness of prior deep tabular zero-shot learners. In experiments, we show that our framework matches state-of-the-art performance on small classification tasks without filtering on dataset characteristics such as number of classes and number of missing values, while maintaining an average runtime under one second. On common benchmark dataset suites in both classification and regression, we show that adversarial pre-training was able to enhance TabPFN’s performance. In our analysis, we demonstrate that the adversarial synthetic data agents were able to generate a more diverse collection of data compared to the ordinary random generator in TabPFN. In addition, we demonstrate that our mixture block neural design has improved generalizability and greatly accelerated pre-training.

[AI-48] Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator

Link: https://arxiv.org/abs/2502.04567
Authors: Zhuotong Chen,Fang Liu,Xuan Zhu,Yanjun Qi,Mohammad Ghavamzadeh
Categories: Artificial Intelligence (cs.AI)

Abstract:Existing studies on preference optimization (PO) have centered on constructing pairwise preference data following simple heuristics, such as maximizing the margin between preferred and dispreferred completions based on human (or AI) ranked scores. However, none of these heuristics has a full theoretical justification. In this work, we develop a novel PO framework that provides theoretical guidance to effectively sample dispreferred completions. To achieve this, we formulate PO as minimizing the negative log-likelihood (NLL) of a probability model and propose to estimate its normalization constant via a sampling strategy. As we will demonstrate, these estimative samples can act as dispreferred completions in PO. We then select contrastive divergence (CD) as the sampling strategy, and propose a novel MC-PO algorithm that applies the Monte Carlo (MC) kernel from CD to sample hard negatives w.r.t. the parameterized reward model. Finally, we propose the OnMC-PO algorithm, an extension of MC-PO to the online setting. On popular alignment benchmarks, MC-PO outperforms existing SOTA baselines, and OnMC-PO leads to further improvement.
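
The idea that sampled completions can stand in for a normalization constant can be sketched with a plain sampled-softmax loss. This is not the MC-PO algorithm: the function `sampled_nll`, the choice to include the preferred completion in the normalizer, and the equal weighting of samples are all assumptions made for illustration.

```python
import math
from typing import Sequence


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sampled_nll(reward_preferred: float,
                rewards_sampled: Sequence[float]) -> float:
    """NLL of the preferred completion under p(y|x) proportional to exp(r(x, y)).

    The intractable normalizer is approximated by the preferred completion
    plus a handful of sampled completions, which play the role of negatives.
    """
    return -reward_preferred + log_sum_exp([reward_preferred, *rewards_sampled])


# A hard negative (reward close to the preferred one) yields a larger loss
# than an easy negative with a much lower reward.
loss_hard = sampled_nll(1.0, [0.9])
loss_easy = sampled_nll(1.0, [-3.0])
```

Because high-reward samples dominate the log-sum-exp term, they contribute most to the loss, which gives some intuition for why a sampler that targets hard negatives would be useful.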

[AI-49] WaferLLM: A Wafer-Scale LLM Inference System

Link: https://arxiv.org/abs/2502.04563
Authors: Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, Luo Mai
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)

Abstract:Emerging AI accelerators increasingly adopt wafer-scale manufacturing technologies, integrating hundreds of thousands of AI cores in a mesh-based architecture with large distributed on-chip memory (tens of GB in total) and ultra-high on-chip memory bandwidth (tens of PB/s). However, current LLM inference systems, optimized for shared memory architectures like GPUs, fail to fully exploit these accelerators. We introduce WaferLLM, the first wafer-scale LLM inference system. WaferLLM is guided by a novel PLMR device model that captures the unique hardware characteristics of wafer-scale architectures. Leveraging this model, WaferLLM pioneers wafer-scale LLM parallelism, optimizing the utilization of hundreds of thousands of on-chip cores. It also introduces MeshGEMM and MeshGEMV, the first GEMM and GEMV implementations designed to scale effectively on wafer-scale accelerators. Evaluations show that WaferLLM achieves 200× better wafer-scale accelerator utilization than state-of-the-art systems. On a commodity wafer-scale accelerator, WaferLLM delivers 606× faster and 22× more energy-efficient GEMV compared to an advanced GPU. For LLMs, WaferLLM enables 39× faster decoding with 1.7× better energy efficiency. We anticipate these numbers will grow significantly as wafer-scale AI models, software, and hardware continue to mature.

[AI-50] Probing a Vision-Language-Action Model for Symbolic States and Integration into a Cognitive Architecture

Link: https://arxiv.org/abs/2502.04558
Authors: Hong Lu, Hengxu Li, Prithviraj Singh Shahani, Stephanie Herbers, Matthias Scheutz
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 Pages, 4 Figures

Abstract:Vision-language-action (VLA) models hold promise as generalist robotics solutions by translating visual and linguistic inputs into robot actions, yet they lack reliability due to their black-box nature and sensitivity to environmental changes. In contrast, cognitive architectures (CA) excel in symbolic reasoning and state monitoring but are constrained by rigid predefined execution. This work bridges these approaches by probing OpenVLA’s hidden layers to uncover symbolic representations of object properties, relations, and action states, enabling integration with a CA for enhanced interpretability and robustness. Through experiments on LIBERO-spatial pick-and-place tasks, we analyze the encoding of symbolic states across different layers of OpenVLA’s Llama backbone. Our probing results show consistently high accuracies ( 0.90) for both object and action states across most layers, though contrary to our hypotheses, we did not observe the expected pattern of object states being encoded earlier than action states. We demonstrate an integrated DIARC-OpenVLA system that leverages these symbolic representations for real-time state monitoring, laying the foundation for more interpretable and reliable robotic manipulation.

[AI-51] Unifying and Optimizing Data Values for Selection via Sequential-Decision-Making

Link: https://arxiv.org/abs/2502.04554
Authors: Hongliang Chi, Qiong Wu, Zhengyi Zhou, Jonathan Light, Emily Dodwell, Yao Ma
Categories: Artificial Intelligence (cs.AI)

Abstract:Data selection has emerged as a crucial downstream application of data valuation. While existing data valuation methods have shown promise in selection tasks, the theoretical foundations and full potential of using data values for selection remain largely unexplored. In this work, we first demonstrate that data values applied for selection can be naturally reformulated as a sequential-decision-making problem, where the optimal data value can be derived through dynamic programming. We show this framework unifies and reinterprets existing methods like Data Shapley through the lens of approximate dynamic programming, specifically as myopic reward function approximations to this sequential problem. Furthermore, we analyze how sequential data selection optimality is affected when the ground-truth utility function exhibits monotonic submodularity with curvature. To address the computational challenges in obtaining optimal data values, we propose an efficient approximation scheme using learned bipartite graphs as surrogate utility models, ensuring greedy selection is still optimal when the surrogate utility is correctly specified and learned. Extensive experiments demonstrate the effectiveness of our approach across diverse datasets.
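
As a rough illustration of greedy selection under a monotone submodular utility, the sketch below uses a toy coverage function. The utility, the data, and the function names are assumptions made for the example. The paper's learned bipartite-graph surrogate is not reproduced here.

```python
def coverage_utility(selected, covers):
    """Toy monotone submodular utility: number of distinct items covered."""
    covered = set()
    for i in selected:
        covered |= covers[i]
    return len(covered)

def greedy_select(covers, budget):
    """Pick up to `budget` data points, each time taking the point with the
    largest marginal gain (ties go to the lowest index)."""
    selected = []
    remaining = set(range(len(covers)))
    for _ in range(min(budget, len(covers))):
        base = coverage_utility(selected, covers)
        best = max(
            sorted(remaining),
            key=lambda i: coverage_utility(selected + [i], covers) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each step is myopic: it scores a candidate only by its immediate marginal gain, which is the sense in which the abstract relates existing data values to approximate dynamic programming.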

[AI-52] Robust Probabilistic Model Checking with Continuous Reward Domains

Link: https://arxiv.org/abs/2502.04530
Authors: Xiaotong Ji, Hanchun Wang, Antonio Filieri, Ilenia Epifani
Categories: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments: Accepted by the 20th International Conference on Software Engineering for Adaptive and Self-Managing Systems 2025

Abstract:Probabilistic model checking traditionally verifies properties on the expected value of a measure of interest. This restriction may fail to capture the quality of service of a significant proportion of a system’s runs, especially when the probability distribution of the measure of interest is poorly represented by its expected value due to heavy-tail behaviors or multiple modalities. Recent works inspired by distributional reinforcement learning use discrete histograms to approximate integer reward distribution, but they struggle with continuous reward space and present challenges in balancing accuracy and scalability. We propose a novel method for handling both continuous and discrete reward distributions in Discrete Time Markov Chains using moment matching with Erlang mixtures. By analytically deriving higher-order moments through Moment Generating Functions, our method approximates the reward distribution with theoretically bounded error while preserving the statistical properties of the true distribution. This detailed distributional insight enables the formulation and robust model checking of quality properties based on the entire reward distribution function, rather than restricting to its expected value. We include a theoretical foundation ensuring bounded approximation errors, along with an experimental evaluation demonstrating our method’s accuracy and scalability in practical model-checking problems.

[AI-53] ImprovNet: Generating Controllable Musical Improvisations with Iterative Corruption Refinement

Link: https://arxiv.org/abs/2502.04522
Authors: Keshav Bhandari, Sungkyun Chang, Tongyu Lu, Fareza R. Enus, Louis B. Bradshaw, Dorien Herremans, Simon Colton
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 6 figures

Abstract:Deep learning has enabled remarkable advances in style transfer across various domains, offering new possibilities for creative content generation. However, in the realm of symbolic music, generating controllable and expressive performance-level style transfers for complete musical works remains challenging due to limited datasets, especially for genres such as jazz, and the lack of unified models that can handle multiple music generation tasks. This paper presents ImprovNet, a transformer-based architecture that generates expressive and controllable musical improvisations through a self-supervised corruption-refinement training strategy. ImprovNet unifies multiple capabilities within a single model: it can perform cross-genre and intra-genre improvisations, harmonize melodies with genre-specific styles, and execute short prompt continuation and infilling tasks. The model’s iterative generation framework allows users to control the degree of style transfer and structural similarity to the original composition. Objective and subjective evaluations demonstrate ImprovNet’s effectiveness in generating musically coherent improvisations while maintaining structural relationships with the original pieces. The model outperforms Anticipatory Music Transformer in short continuation and infilling tasks and successfully achieves recognizable genre conversion, with 79% of participants correctly identifying jazz-style improvisations. Our code and demo page can be found at this https URL.

[AI-54] MedGNN: Towards Multi-resolution Spatiotemporal Graph Learning for Medical Time Series Classification WWW2025

Link: https://arxiv.org/abs/2502.04515
Authors: Wei Fan, Jingru Fei, Dingyu Guo, Kun Yi, Xiaozhuang Song, Haolong Xiang, Hangting Ye, Min Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by WWW 2025

Abstract:Medical time series has been playing a vital role in real-world healthcare systems as valuable information in monitoring health conditions of patients. Accurate classification for medical time series, e.g., Electrocardiography (ECG) signals, can help for early detection and diagnosis. Traditional methods towards medical time series classification rely on handcrafted feature extraction and statistical methods; with the recent advancement of artificial intelligence, the machine learning and deep learning methods have become more popular. However, existing methods often fail to fully model the complex spatial dynamics under different scales, which ignore the dynamic multi-resolution spatial and temporal joint inter-dependencies. Moreover, they are less likely to consider the special baseline wander problem as well as the multi-view characteristics of medical time series, which largely hinders their prediction performance. To address these limitations, we propose a Multi-resolution Spatiotemporal Graph Learning framework, MedGNN, for medical time series classification. Specifically, we first propose to construct multi-resolution adaptive graph structures to learn dynamic multi-scale embeddings. Then, to address the baseline wander problem, we propose Difference Attention Networks to operate self-attention mechanisms on the finite difference for temporal modeling. Moreover, to learn the multi-view characteristics, we utilize the Frequency Convolution Networks to capture complementary information of medical time series from the frequency domain. In addition, we introduce the Multi-resolution Graph Transformer architecture to model the dynamic dependencies and fuse the information from different resolutions. Finally, we have conducted extensive experiments on multiple medical real-world datasets that demonstrate the superior performance of our method. Our Code is available.
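
To see why working on finite differences helps with baseline wander, consider the minimal sketch below. It shows only the differencing step, not the Difference Attention Networks themselves. The synthetic signal and drift values are assumptions chosen for illustration.

```python
import numpy as np

def first_difference(x):
    """First-order finite difference along the time axis."""
    return np.diff(x, axis=-1)

# Synthetic example (assumed values): a 10 Hz oscillation plus a slow baseline.
t = np.linspace(0.0, 1.0, 200)
signal = np.sin(2 * np.pi * 10 * t)
offset = 2.0          # constant baseline shift
drift = 0.5 * t       # slowly varying linear baseline

# A constant offset vanishes entirely after differencing.
same = np.allclose(first_difference(signal + offset), first_difference(signal))

# A linear drift collapses to a single small constant.
drift_step = first_difference(drift)
```

Differencing removes a constant offset exactly and turns a slow linear drift into a small constant, so a model operating on the differenced series is less sensitive to baseline shifts.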

[AI-55] Safety is Essential for Responsible Open-Ended Systems

Link: https://arxiv.org/abs/2502.04512
Authors: Ivaxi Sheth, Jan Wehner, Sahar Abdelnabi, Ruta Binkyte, Mario Fritz
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages

Abstract:AI advancements have been significantly driven by a combination of foundation models and curiosity-driven learning aimed at increasing capability and adaptability. A growing area of interest within this field is Open-Endedness - the ability of AI systems to continuously and autonomously generate novel and diverse artifacts or solutions. This has become relevant for accelerating scientific discovery and enabling continual adaptation in AI agents. This position paper argues that the inherently dynamic and self-propagating nature of Open-Ended AI introduces significant, underexplored risks, including challenges in maintaining alignment, predictability, and control. This paper systematically examines these challenges, proposes mitigation strategies, and calls for action for different stakeholders to support the safe, responsible and successful development of Open-Ended AI.

[AI-56] CNN Autoencoders for Hierarchical Feature Extraction and Fusion in Multi-sensor Human Activity Recognition

Link: https://arxiv.org/abs/2502.04489
Authors: Saeed Arabzadeh, Farshad Almasganj, Mohammad Mahdi Ahmadi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 9 figures

Abstract:Deep learning methods have been widely used for Human Activity Recognition (HAR) using signals recorded from Inertial Measurement Unit (IMU) sensors installed on various parts of the human body. For this type of HAR, several challenges exist, the most significant of which is the analysis of multivarious IMU sensor data. Here, we introduce a Hierarchically Unsupervised Fusion (HUF) model designed to extract and fuse features from IMU sensor data via a hybrid structure of Convolutional Neural Networks (CNNs) and Autoencoders (AEs). First, we design a stacked CNN-AE to embed short-time signals into sets of high-dimensional features. Second, we develop another CNN-AE network to locally fuse the extracted features from each sensor unit. Finally, we unify all the sensor features through a third CNN-AE architecture, as global feature fusion, to create a unique feature set. Additionally, we analyze the effects of varying the model hyperparameters. The best results are achieved with eight convolutional layers in each AE. Furthermore, it is determined that an overcomplete AE with 256 kernels in the code layer is suitable for feature extraction in the first block of the proposed HUF model; this number reduces to 64 in the last block of the model to customize the size of the features applied to the classifier. The tuned model is applied to the UCI-HAR, DaLiAc, and Parkinson’s disease gait datasets, achieving classification accuracies of 97%, 97%, and 88%, respectively, which are nearly 3% better than the state-of-the-art supervised methods.

[AI-57] ADIFF: Explaining audio difference using natural language ICLR2025

Link: https://arxiv.org/abs/2502.04476
Authors: Soham Deshmukh, Shuo Han, Rita Singh, Bhiksha Raj
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted at ICLR 2025. Dataset and checkpoints are available at: this https URL

Abstract:Understanding and explaining differences between audio recordings is crucial for fields like audio forensics, quality assessment, and audio generation. This involves identifying and describing audio events, acoustic scenes, signal characteristics, and their emotional impact on listeners. This paper stands out as the first work to comprehensively study the task of explaining audio differences and then propose benchmark, baselines for the task. First, we present two new datasets for audio difference explanation derived from the AudioCaps and Clotho audio captioning datasets. Using Large Language Models (LLMs), we generate three levels of difference explanations: (1) concise descriptions of audio events and objects, (2) brief sentences about audio events, acoustic scenes, and signal properties, and (3) comprehensive explanations that include semantics and listener emotions. For the baseline, we use prefix tuning where audio embeddings from two audio files are used to prompt a frozen language model. Our empirical analysis and ablation studies reveal that the naive baseline struggles to distinguish perceptually similar sounds and generate detailed tier 3 explanations. To address these limitations, we propose ADIFF, which introduces a cross-projection module, position captioning, and a three-step training process to enhance the model’s ability to produce detailed explanations. We evaluate our model using objective metrics and human evaluation and show our model enhancements lead to significant improvements in performance over naive baseline and SoTA Audio-Language Model (ALM) Qwen Audio. Lastly, we conduct multiple ablation studies to study the effects of cross-projection, language model parameters, position captioning, third stage fine-tuning, and present our findings. Our benchmarks, findings, and strong baseline pave the way for nuanced and human-like explanations of audio differences.

[AI-58] FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

Link: https://arxiv.org/abs/2502.04465
Authors: Luca Della Libera, Francesco Paissan, Cem Subakan, Mirco Ravanelli
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 18 pages

Abstract:Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples, code and checkpoints are available at this https URL.
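
For a single-codebook codec, the bitrate is simply the token rate multiplied by the number of bits per token. The sketch below makes that arithmetic explicit. The specific token rates (12.5 and 50 tokens per second) and code width (13 bits) are assumptions picked because they reproduce the 0.16 to 0.65 kbps range quoted in the abstract. The abstract itself does not state them.

```python
def bitrate_kbps(tokens_per_second, bits_per_token):
    """Bitrate of a single-codebook codec in kilobits per second."""
    return tokens_per_second * bits_per_token / 1000.0

# Assumed configurations (hypothetical, not taken from the abstract).
low = bitrate_kbps(12.5, 13)   # 162.5 bits per second
high = bitrate_kbps(50.0, 13)  # 650 bits per second
```

With a binary codebook, `bits_per_token` equals the number of binary dimensions in each code, so halving the token rate halves the bitrate without changing the codebook.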

[AI-59] Assessing and Prioritizing Ransomware Risk Based on Historical Victim Data

Link: https://arxiv.org/abs/2502.04421
Authors: Spencer Massengale, Philip Huff
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:We present an approach to identifying which ransomware adversaries are most likely to target specific entities, thereby assisting these entities in formulating better protection strategies. Ransomware poses a formidable cybersecurity threat characterized by profit-driven motives, a complex underlying economy supporting criminal syndicates, and the overt nature of its attacks. This type of malware has consistently ranked among the most prevalent, with a rapid escalation in activity observed. Recent estimates indicate that approximately two-thirds of organizations experienced ransomware attacks in 2023 (Sophos, 2023). A central tactic in ransomware campaigns is publicizing attacks to coerce victims into paying ransoms. Our study utilizes public disclosures from ransomware victims to predict the likelihood of an entity being targeted by a specific ransomware variant. We employ a Large Language Model (LLM) architecture that uses a unique chain-of-thought, multi-shot prompt methodology to define adversary SKRAM (Skills, Knowledge, Resources, Authorities, and Motivation) profiles from ransomware bulletins, threat reports, and news items. This analysis is enriched with publicly available victim data and is further enhanced by a heuristic for generating synthetic data that reflects victim profiles. Our work culminates in the development of a machine learning model that assists organizations in prioritizing ransomware threats and formulating defenses based on the tactics, techniques, and procedures (TTP) of the most likely attackers.

[AI-60] Autotelic Reinforcement Learning: Exploring Intrinsic Motivations for Skill Acquisition in Open-Ended Environments

Link: https://arxiv.org/abs/2502.04418
Authors: Prakhar Srivastava, Jasmeet Singh
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 12 figures

Abstract:This paper presents a comprehensive overview of autotelic Reinforcement Learning (RL), emphasizing the role of intrinsic motivations in the open-ended formation of skill repertoires. We delineate the distinctions between knowledge-based and competence-based intrinsic motivations, illustrating how these concepts inform the development of autonomous agents capable of generating and pursuing self-defined goals. The typology of Intrinsically Motivated Goal Exploration Processes (IMGEPs) is explored, with a focus on the implications for multi-goal RL and developmental robotics. The autotelic learning problem is framed within a reward-free Markov Decision Process (MDP), where agents must autonomously represent, generate, and master their own goals. We address the unique challenges in evaluating such agents, proposing various metrics for measuring exploration, generalization, and robustness in complex environments. This work aims to advance the understanding of autotelic RL agents and their potential for enhancing skill acquisition in a diverse and dynamic setting.

[AI-61] NeuralMOVES: A lightweight and microscopic vehicle emission estimation model based on reverse engineering and surrogate learning

Link: https://arxiv.org/abs/2502.04417
Authors: Edgar Ramirez-Sanchez, Catherine Tang, Yaosheng Xu, Nrithya Renganathan, Vindula Jayawardana, Zhengbing He, Cathy Wu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The transportation sector significantly contributes to greenhouse gas emissions, necessitating accurate emission models to guide mitigation strategies. Despite its field validation and certification, the industry-standard Motor Vehicle Emission Simulator (MOVES) faces challenges related to complexity in usage, high computational demands, and its unsuitability for microscopic real-time applications. To address these limitations, we present NeuralMOVES, a comprehensive suite of high-performance, lightweight surrogate models for vehicle CO2 emissions. Developed based on reverse engineering and Neural Networks, NeuralMOVES achieves a remarkable 6.013% Mean Average Percentage Error relative to MOVES across extensive tests spanning over two million scenarios with diverse trajectories and the factors regarding environments and vehicles. NeuralMOVES is only 2.4 MB, largely condensing the original MOVES and the reverse engineered MOVES into a compact representation, while maintaining high accuracy. Therefore, NeuralMOVES significantly enhances accessibility while maintaining the accuracy of MOVES, simplifying CO2 evaluation for transportation analyses and enabling real-time, microscopic applications across diverse scenarios without reliance on complex software or extensive computational resources. Moreover, this paper provides, for the first time, a framework for reverse engineering industrial-grade software tailored specifically to transportation scenarios, going beyond MOVES. The surrogate models are available at this https URL.

[AI-62] CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference

Link: https://arxiv.org/abs/2502.04416
Authors: Zehua Pei, Lancheng Zou, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) achieve impressive performance by scaling model parameters, but this comes with significant inference overhead. Feed-forward networks (FFNs), which dominate LLM parameters, exhibit high activation sparsity in hidden neurons. To exploit this, researchers have proposed using a mixture-of-experts (MoE) architecture, where only a subset of parameters is activated. However, existing approaches often require extensive training data and resources, limiting their practicality. We propose CMoE (Carved MoE), a novel framework to efficiently carve MoE models from dense models. CMoE achieves remarkable performance through efficient expert grouping and lightweight adaptation. First, neurons are grouped into shared and routed experts based on activation rates. Next, we construct a routing mechanism without training from scratch, incorporating a differentiable routing process and load balancing. Using modest data, CMoE produces a well-designed, usable MoE from a 7B dense model within five minutes. With lightweight fine-tuning, it achieves high-performance recovery in under an hour. We make our code publicly available at this https URL.
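
The grouping step can be pictured with a small sketch, under strong simplifying assumptions. Here a neuron counts as "active" when its magnitude exceeds a threshold, the most frequently active neurons form the shared expert, and the rest are split into equal chunks. The chunking rule, the threshold, and all names are hypothetical. The paper's actual grouping and routing construction may differ.

```python
import numpy as np

def carve_experts(hidden_acts, n_shared, n_routed_experts, threshold=0.0):
    """Split FFN hidden neurons into one shared group and several routed groups.

    hidden_acts: array of shape (tokens, neurons) with hidden activations
    collected on a small calibration set (assumed input).
    """
    # Activation rate: fraction of tokens on which each neuron is active.
    rates = (np.abs(hidden_acts) > threshold).mean(axis=0)
    # Sort neurons from most to least frequently active.
    order = np.argsort(-rates, kind="stable")
    shared = order[:n_shared]                      # always-on neurons
    routed = np.array_split(order[n_shared:], n_routed_experts)
    return rates, shared, routed
```

Neurons that fire on almost every token gain little from routing, which is the intuition for placing them in a shared expert that every token uses.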

[AI-63] Transforming Multimodal Models into Action Models for Radiotherapy

Link: https://arxiv.org/abs/2502.04408
Authors: Matteo Ferrante, Alessandra Carosi, Rolando Maria D'Angelillo, Nicola Toschi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Radiotherapy is a crucial cancer treatment that demands precise planning to balance tumor eradication and preservation of healthy tissue. Traditional treatment planning (TP) is iterative, time-consuming, and reliant on human expertise, which can potentially introduce variability and inefficiency. We propose a novel framework to transform a large multimodal foundation model (MLM) into an action model for TP using a few-shot reinforcement learning (RL) approach. Our method leverages the MLM’s extensive pre-existing knowledge of physics, radiation, and anatomy, enhancing it through a few-shot learning process. This allows the model to iteratively improve treatment plans using a Monte Carlo simulator. Our results demonstrate that this method outperforms conventional RL-based approaches in both quality and efficiency, achieving higher reward scores and more optimal dose distributions in simulations on prostate cancer data. This proof-of-concept suggests a promising direction for integrating advanced AI models into clinical workflows, potentially enhancing the speed, quality, and standardization of radiotherapy treatment planning.

[AI-64] Illuminating Spaces: Deep Reinforcement Learning and Laser-Wall Partitioning for Architectural Layout Generation

Link: https://arxiv.org/abs/2502.04407
Authors: Reza Kakooee, Benjamin Dillenburger
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Space layout design (SLD), occurring in the early stages of the design process, nonetheless influences both the functionality and aesthetics of the ultimate architectural outcome. The complexity of SLD necessitates innovative approaches to efficiently explore vast solution spaces. While image-based generative AI has emerged as a potential solution, they often rely on pixel-based space composition methods that lack intuitive representation of architectural processes. This paper leverages deep Reinforcement Learning (RL), as it offers a procedural approach that intuitively mimics the process of human designers. Effectively using RL for SLD requires an explorative space composing method to generate desirable design solutions. We introduce “laser-wall”, a novel space partitioning method that conceptualizes walls as emitters of imaginary light beams to partition spaces. This approach bridges vector-based and pixel-based partitioning methods, offering both flexibility and exploratory power in generating diverse layouts. We present two planning strategies: one-shot planning, which generates entire layouts in a single pass, and dynamic planning, which allows for adaptive refinement by continuously transforming laser-walls. Additionally, we introduce on-light and off-light wall transformations for smooth and fast layout refinement, as well as identity-less and identity-full walls for versatile room assignment. We developed SpaceLayoutGym, an open-source OpenAI Gym compatible simulator for generating and evaluating space layouts. The RL agent processes the input design scenarios and generates solutions following a reward function that balances geometrical and topological requirements. Our results demonstrate that the RL-based laser-wall approach can generate diverse and functional space layouts that satisfy both geometric constraints and topological requirements and is architecturally intuitive.

[AI-65] Calibrated Physics-Informed Uncertainty Quantification

Link: https://arxiv.org/abs/2502.04406
Authors: Vignesh Gopakumar, Ander Gray, Lorenzo Zanisi, Timothy Nunn, Stanislas Pamela, Daniel Giles, Matt J. Kusner, Marc Peter Deisenroth
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)

Abstract:Neural PDEs offer efficient alternatives to computationally expensive numerical PDE solvers for simulating complex physical systems. However, their lack of robust uncertainty quantification (UQ) limits deployment in critical applications. We introduce a model-agnostic, physics-informed conformal prediction (CP) framework that provides guaranteed uncertainty estimates without requiring labelled data. By utilising a physics-based approach, we are able to quantify and calibrate the model’s inconsistencies with the PDE rather than the uncertainty arising from the data. Our approach uses convolutional layers as finite-difference stencils and leverages physics residual errors as nonconformity scores, enabling data-free UQ with marginal and joint coverage guarantees across prediction domains for a range of complex PDEs. We further validate the efficacy of our method on neural PDE models for plasma modelling and shot design in fusion reactors.
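
The two ingredients named in the abstract, a convolution acting as a finite-difference stencil and residuals used as nonconformity scores, can be sketched in one dimension. This is an illustrative toy, not the paper's framework. The steady-state equation u_xx = f, the grid, and the function names are assumptions for the example.

```python
import math
import numpy as np

def laplacian_1d(u, dx):
    """Second derivative via the [1, -2, 1] stencil, applied as a convolution
    (interior points only)."""
    kernel = np.array([1.0, -2.0, 1.0]) / dx**2
    return np.convolve(u, kernel, mode="valid")

def residual_scores(u_pred, f, dx):
    """Nonconformity scores: absolute PDE residual |u_xx - f| on interior points."""
    return np.abs(laplacian_1d(u_pred, dx) - f[1:-1])

def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration scores for this coverage level
    return float(np.sort(scores)[k - 1])
```

Because the scores come from the PDE residual of the model's own predictions, calibration in this sketch needs no labelled solutions, only predictions on which the residual can be evaluated.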

[AI-66] Agency Is Frame-Dependent

Link: https://arxiv.org/abs/2502.04403
Authors: David Abel, André Barreto, Michael Bowling, Will Dabney, Shi Dong, Steven Hansen, Anna Harutyunyan, Khimya Khetarpal, Clare Lyle, Razvan Pascanu, Georgios Piliouras, Doina Precup, Jonathan Richens, Mark Rowland, Tom Schaul, Satinder Singh
Categories: Artificial Intelligence (cs.AI)

Abstract:Agency is a system’s capacity to steer outcomes toward a goal, and is a central topic of study across biology, philosophy, cognitive science, and artificial intelligence. Determining if a system exhibits agency is a notoriously difficult question: Dennett (1989), for instance, highlights the puzzle of determining which principles can decide whether a rock, a thermostat, or a robot each possess agency. We here address this puzzle from the viewpoint of reinforcement learning by arguing that agency is fundamentally frame-dependent: Any measurement of a system’s agency must be made relative to a reference frame. We support this claim by presenting a philosophical argument that each of the essential properties of agency proposed by Barandiaran et al. (2009) and Moreno (2018) are themselves frame-dependent. We conclude that any basic science of agency requires frame-dependence, and discuss the implications of this claim for reinforcement learning.

[AI-67] Beyond Interpolation: Extrapolative Reasoning with Reinforcement Learning and Graph Neural Networks AAAI25

Link: https://arxiv.org/abs/2502.04402
Authors: Niccolò Grillo, Andrea Toccaceli, Joël Mathys, Benjamin Estermann, Stefania Fresca, Roger Wattenhofer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The first two authors contributed equally to this work. Accepted as workshop paper at NEURMAD@AAAI25

Abstract:Despite incredible progress, many neural architectures fail to properly generalize beyond their training distribution. As such, learning to reason in a correct and generalizable way is one of the current fundamental challenges in machine learning. In this respect, logic puzzles provide a great testbed, as we can fully understand and control the learning environment. Thus, they allow us to evaluate performance on previously unseen, larger and more difficult puzzles that follow the same underlying rules. Since traditional approaches often struggle to represent such scalable logical structures, we propose to model these puzzles using a graph-based approach. Then, we investigate the key factors enabling the proposed models to learn generalizable solutions in a reinforcement learning setting. Our study focuses on the impact of the inductive bias of the architecture, different reward systems and the role of recurrent modeling in enabling sequential reasoning. Through extensive experiments, we demonstrate how these elements contribute to successful extrapolation on increasingly complex puzzles. Our insights and frameworks offer a systematic way to design learning-based systems capable of generalizable reasoning beyond interpolation.

[AI-68] Adaptive Prototype Knowledge Transfer for Federated Learning with Mixed Modalities and Heterogeneous Tasks

Link: https://arxiv.org/abs/2502.04400
Authors: Keke Gai, Mohan Wang, Jing Yu, Dongjue Wang, Qi Wu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multimedia (cs.MM)
*Comments:

Abstract:Multimodal Federated Learning (MFL) enables multiple clients to collaboratively train models on multimodal data while ensuring clients’ privacy. However, modality and task heterogeneity hinder clients from learning a unified representation, weakening local model generalization, especially in MFL with mixed modalities where only some clients have multimodal data. In this work, we propose an Adaptive prototype-based Multimodal Federated Learning (AproMFL) framework for mixed modalities and heterogeneous tasks to address the aforementioned issues. Our AproMFL transfers knowledge through adaptively-constructed prototypes without a prior public dataset. Clients adaptively select prototype construction methods in line with their tasks; the server converts client prototypes into unified multimodal prototypes and aggregates them to form global prototypes, avoiding the need for clients to keep unified labels. We divide the model into various modules and only aggregate mapping modules to reduce communication and computation overhead. To address aggregation issues in heterogeneity, we develop a client relationship graph-based scheme to dynamically adjust aggregation weights. Extensive experiments on representative datasets demonstrate the effectiveness of AproMFL.

[AI-69] Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning

Link: https://arxiv.org/abs/2502.04399
Authors: Bokeng Zheng, Bo Rao, Tianxiang Zhu, Chee Wei Tan, Jingpu Duan, Zhi Zhou, Xu Chen, Xiaoxi Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*Comments:

Abstract:Advances in artificial intelligence (AI), including foundation models (FMs), are increasingly transforming human society, with smart city driving the evolution of urban living. Meanwhile, vehicle crowdsensing (VCS) has emerged as a key enabler, leveraging vehicles’ mobility and sensor-equipped capabilities. In particular, ride-hailing vehicles can effectively facilitate flexible data collection and contribute towards urban intelligence, despite resource limitations. Therefore, this work explores a promising scenario, where edge-assisted vehicles perform joint tasks of order serving and the emerging foundation model fine-tuning using various urban data. However, integrating the VCS AI task with the conventional order serving task is challenging, due to their inconsistent spatio-temporal characteristics: (i) The distributions of ride orders and data point-of-interests (PoIs) may not coincide in geography, both following a priori unknown patterns; (ii) they have distinct forms of temporal effects, i.e., prolonged waiting makes orders become instantly invalid while data with increased staleness gradually reduces its utility for model fine-tuning. To overcome these obstacles, we propose an online framework based on multi-agent reinforcement learning (MARL) with careful augmentation. A new quality-of-service (QoS) metric is designed to characterize and balance the utility of the two joint tasks, under the effects of varying data volumes and staleness. We also integrate graph neural networks (GNNs) with MARL to enhance state representations, capturing graph-structured, time-varying dependencies among vehicles and across locations. Extensive experiments on our testbed simulator, utilizing various real-world foundation model fine-tuning tasks and the New York City Taxi ride order dataset, demonstrate the advantage of our proposed method.

[AI-70] Overcoming Vision Language Model Challenges in Diagram Understanding: A Proof-of-Concept with XML-Driven Large Language Models Solutions

Link: https://arxiv.org/abs/2502.04389
Authors: Shue Shiinoki, Ryo Koshihara, Hayato Motegi, Masumi Morishige
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*Comments: The related code is available at this https URL, which provides the core library developed for this research. The experimental code using this library can be found at this https URL

Abstract:Diagrams play a crucial role in visually conveying complex relationships and processes within business documentation. Despite recent advances in Vision-Language Models (VLMs) for various image understanding tasks, accurately identifying and extracting the structures and relationships depicted in diagrams continues to pose significant challenges. This study addresses these challenges by proposing a text-driven approach that bypasses reliance on VLMs’ visual recognition capabilities. Instead, it utilizes the editable source files–such as xlsx, pptx or docx–where diagram elements (e.g., shapes, lines, annotations) are preserved as textual metadata. In our proof-of-concept, we extracted diagram information from xlsx-based system design documents and transformed the extracted shape data into textual input for Large Language Models (LLMs). This approach allowed the LLM to analyze relationships and generate responses to business-oriented questions without the bottleneck of image-based processing. Experimental comparisons with a VLM-based method demonstrated that the proposed text-driven framework yielded more accurate answers for questions requiring detailed comprehension of diagram structures. The results obtained in this study are not limited to the tested .xlsx files but can also be extended to diagrams in other documents with source files, such as Office pptx and docx formats. These findings highlight the feasibility of circumventing VLM constraints through direct textual extraction from original source files. By enabling robust diagram understanding through LLMs, our method offers a promising path toward enhanced workflow efficiency and information analysis in real-world business scenarios.

[AI-71] Position: Emergent Machina Sapiens Urge Rethinking Multi-Agent Paradigms

Link: https://arxiv.org/abs/2502.04388
Authors: Hepeng Li, Yuhong Liu, Jun Yan
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Artificially intelligent (AI) agents that are capable of autonomous learning and independent decision-making hold great promise for addressing complex challenges across domains like transportation, energy systems, and manufacturing. However, the surge in AI systems’ design and deployment driven by various stakeholders with distinct and unaligned objectives introduces a crucial challenge: how can uncoordinated AI systems coexist and evolve harmoniously in shared environments without creating chaos? To address this, we advocate for a fundamental rethinking of existing multi-agent frameworks, such as multi-agent systems and game theory, which are largely limited to predefined rules and static objective structures. We posit that AI agents should be empowered to dynamically adjust their objectives, make compromises, form coalitions, and safely compete or cooperate through evolving relationships and social feedback. Through this paper, we call for a shift toward the emergent, self-organizing, and context-aware nature of these systems.

[AI-72] Comparative Analysis of Community Detection Algorithms on the SNAP Social Circles Dataset WWW

Link: https://arxiv.org/abs/2502.04341
Authors: Yash Malode, Amit Aylani, Arvind Bhardwaj, Deepak Hajoary
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*Comments: Presented at IDEA2k24: this https URL. Submitted to Springer Lecture Notes in Electrical Engineering series (this https URL)

Abstract:Community detection has long been a topic of significant interest in network science, with numerous papers proposing algorithms to uncover the underlying structures within networks. In this paper, we conduct a comparative analysis of several prominent community detection algorithms applied to the SNAP Social Circles Dataset, derived from the Facebook social media network. The algorithms implemented include Louvain, Girvan-Newman, Spectral Clustering, and K-Means Clustering, among others. We evaluate the performance of these algorithms based on various metrics such as modularity, normalized cut-ratio, silhouette score, compactness, and separability. Our findings reveal insights into the effectiveness of each algorithm in detecting meaningful communities within the social network, shedding light on their strengths and limitations. This research contributes to the understanding of community detection methods and provides valuable guidance for their application in analyzing real-world social networks.
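
To make the modularity metric named in the abstract above concrete, here is a minimal Python sketch that scores a given partition of an undirected, unweighted graph using the standard Newman-Girvan definition. The function name `modularity`, the edge-list input format and the toy two-triangle graph are assumptions made for illustration; none of them come from the paper, and the paper's own evaluation code may differ.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman-Girvan modularity of a partition.

    edges: list of (u, v) pairs of an undirected, unweighted graph (hypothetical input format).
    community: dict mapping each node to a community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph must contain at least one edge")
    intra = defaultdict(int)   # number of edges with both endpoints in the community
    degree = defaultdict(int)  # sum of node degrees per community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    # Q = sum over communities of (L_c / m) - (d_c / 2m)^2
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph (illustrative only): two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

print(modularity(edges, two_groups))
print(modularity(edges, one_group))
```

On this toy graph, splitting along the bridge gives a modularity of 5/14 (about 0.357), while placing every node in one community gives 0, which is why higher modularity is read as a better-separated partition.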

[AI-73] Shifting Attention to You: Personalized Brain-Inspired AI Models

Link: https://arxiv.org/abs/2502.04658
Authors: Stephen Chong Zhao, Yang Hu, Jason Lee, Andrew Bender, Trisha Mazumdar, Mark Wallace, David A. Tovar
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*Comments: 7 Figures, 3 Tables, 3 Supplemental Figures, 1 Supplemental Table

Abstract:The integration of human and artificial intelligence represents a scientific opportunity to advance our understanding of information processing, as each system offers unique computational insights that can enhance and inform the other. The synthesis of human cognitive principles with artificial intelligence has the potential to produce more interpretable and functionally aligned computational models, while simultaneously providing a formal framework for investigating the neural mechanisms underlying perception, learning, and decision-making through systematic model comparisons and representational analyses. In this study, we introduce personalized brain-inspired modeling that integrates human behavioral embeddings and neural data to align with cognitive processes. We took a stepwise approach, fine-tuning the Contrastive Language-Image Pre-training (CLIP) model with large-scale behavioral decisions, group-level neural data, and finally, participant-level neural data within a broader framework that we have named CLIP-Human-Based Analysis (CLIP-HBA). We found that fine-tuning on behavioral data enhances its ability to predict human similarity judgments while indirectly aligning it with dynamic representations captured via MEG. To further gain mechanistic insights into the temporal evolution of cognitive processes, we introduced a model specifically fine-tuned on millisecond-level MEG neural dynamics (CLIP-HBA-MEG). This model resulted in enhanced temporal alignment with human neural processing while still showing improvement on behavioral alignment. Finally, we trained individualized models on participant-specific neural data, effectively capturing individualized neural dynamics and highlighting the potential for personalized AI systems. These personalized systems have far-reaching implications for the fields of medicine, cognitive research, human-computer interfaces, and AI development.

Machine Learning

[LG-0] In-context denoising with one-layer transformers: connections between attention and associative memory retrieval

Link: https://arxiv.org/abs/2502.05164
Authors: Matthew Smart, Alberto Bietti, Anirvan M. Sengupta
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*Comments:

Abstract:We introduce in-context denoising, a task that refines the connection between attention-based architectures and dense associative memory (DAM) networks, also known as modern Hopfield networks. Using a Bayesian framework, we show theoretically and empirically that certain restricted denoising problems can be solved optimally even by a single-layer transformer. We demonstrate that a trained attention layer processes each denoising prompt by performing a single gradient descent update on a context-aware DAM energy landscape, where context tokens serve as associative memories and the query token acts as an initial state. This one-step update yields better solutions than exact retrieval of either a context token or a spurious local minimum, providing a concrete example of DAM networks extending beyond the standard retrieval paradigm. Overall, this work solidifies the link between associative memory and attention mechanisms first identified by Ramsauer et al., and demonstrates the relevance of associative memory models in the study of in-context learning.
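
The link between attention and associative memory described in the abstract above can be illustrated with a small sketch. It uses the standard softmax-attention update associated with modern Hopfield networks (Ramsauer et al.): for the energy E(q) = -(1/β) log Σ_i exp(β x_i·q) + ½‖q‖², one gradient descent step with unit step size maps the query q to Σ_i softmax(β x_i·q) x_i. The function name `attention_step`, the inverse temperature `beta` and the toy vectors are assumptions for illustration; the paper's trained layer, its exact energy and its scaling may differ.

```python
import math


def attention_step(context, query, beta=1.0):
    """One softmax-attention update of `query` over `context` tokens.

    context: list of equal-length vectors (the stored "memories").
    query: a vector of the same length (the noisy initial state).
    beta: inverse temperature (hypothetical parameter name).
    """
    # Scaled dot-product scores between the query and each context token.
    scores = [beta * sum(c * q for c, q in zip(token, query)) for token in context]
    # Numerically stable softmax.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Output is the softmax-weighted average of the context tokens.
    return [
        sum(w * token[d] for w, token in zip(weights, context))
        for d in range(len(query))
    ]


# Illustrative data: two orthogonal memories and a noisy copy of the first one.
memories = [[1.0, 0.0], [0.0, 1.0]]
noisy_query = [0.9, 0.1]

print(attention_step(memories, noisy_query, beta=50.0))  # sharp: close to the first memory
print(attention_step(memories, noisy_query, beta=0.0))   # flat: plain average of the memories
```

A large `beta` makes the step behave like nearest-memory retrieval, while a small `beta` averages the context; intermediate values give a blend of the two, which is the regime where a single step can do better than exact retrieval.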

[LG-1] Efficient distributional regression trees learning algorithms for calibrated non-parametric probabilistic forecasts

Link: https://arxiv.org/abs/2502.05157
Authors: Duchemin Quentin, Obozinski Guillaume
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*Comments:

Abstract:The perspective of developing trustworthy AI for critical applications in science and engineering requires machine learning techniques that are capable of estimating their own uncertainty. In the context of regression, instead of estimating a conditional mean, this can be achieved by producing a predictive interval for the output, or even by learning a model of the conditional probability p(y|x) of an output y given input features x. While this can be done under parametric assumptions with, e.g., generalized linear models, these are typically too strong, and non-parametric models offer flexible alternatives. In particular, for scalar outputs, directly learning a model of the conditional cumulative distribution function of y given x can lead to more precise probabilistic estimates, and the use of proper scoring rules such as the weighted interval score (WIS) and the continuous ranked probability score (CRPS) leads to better coverage and calibration properties. This paper introduces novel algorithms for learning probabilistic regression trees for the WIS or CRPS loss functions. These algorithms are made computationally efficient thanks to an appropriate use of known data structures, namely min-max heaps, weight-balanced binary trees and Fenwick trees. Through numerical experiments, we demonstrate that the performance of our methods is competitive with alternative approaches. Additionally, our methods benefit from the inherent interpretability and explainability of trees. As a by-product, we show how our trees can be used in the context of conformal prediction and explain why they are particularly well-suited for achieving group-conditional coverage guarantees.
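
As a reminder of what the CRPS loss in the abstract above measures, the sketch below evaluates it for an empirical predictive distribution (for example, the training targets in one tree leaf) using the well-known identity CRPS(F, y) = E|X - y| - ½ E|X - X'|, where X and X' are independent draws from F. The function name `crps_empirical` and the toy numbers are assumptions for illustration. This naive version costs O(n²) per evaluation; it does not reproduce the heap- and Fenwick-tree-based algorithms that the paper introduces to make tree learning efficient.

```python
def crps_empirical(samples, y):
    """CRPS of the empirical distribution of `samples` against observation `y`.

    Uses CRPS = E|X - y| - 0.5 * E|X - X'| with expectations taken over the samples.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    # Mean absolute distance between the samples and the observation.
    accuracy = sum(abs(x - y) for x in samples) / n
    # Half the mean absolute distance between all ordered pairs of samples.
    spread = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return accuracy - spread


# Illustrative values only.
print(crps_empirical([0.0, 1.0], 0.5))
print(crps_empirical([2.0], 5.0))
```

For a single-sample forecast the spread term vanishes and the score reduces to the absolute error, which is why CRPS is often described as a generalization of mean absolute error to distributional forecasts.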

[LG-2] Deep Dynamic Probabilistic Canonical Correlation Analysis ICASSP-25

Link: https://arxiv.org/abs/2502.05155
Authors: Shiqin Tang, Shujian Yu, Yining Dong, S. Joe Qin
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: accepted by ICASSP-25, code is available at this https URL

Abstract:This paper presents Deep Dynamic Probabilistic Canonical Correlation Analysis (D2PCCA), a model that integrates deep learning with probabilistic modeling to analyze nonlinear dynamical systems. Building on the probabilistic extensions of Canonical Correlation Analysis (CCA), D2PCCA captures nonlinear latent dynamics and supports enhancements such as KL annealing for improved convergence and normalizing flows for a more flexible posterior approximation. D2PCCA naturally extends to multiple observed variables, making it a versatile tool for encoding prior knowledge about sequential datasets and providing a probabilistic understanding of the system’s dynamics. Experimental validation on real financial datasets demonstrates the effectiveness of D2PCCA and its extensions in capturing latent dynamics.

[LG-3] From Restless to Contextual: A Thresholding Bandit Approach to Improve Finite-horizon Performance

Link: https://arxiv.org/abs/2502.05145
Authors: Jiamin Xu, Ivan Nazarov, Aditya Rastogi, África Periáñez, Kyra Gan
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Online restless bandits extend classic contextual bandits by incorporating state transitions and budget constraints, representing each agent as a Markov Decision Process (MDP). This framework is crucial for finite-horizon strategic resource allocation, optimizing limited costly interventions for long-term benefits. However, learning the underlying MDP for each agent poses a major challenge in finite-horizon settings. To facilitate learning, we reformulate the problem as a scalable budgeted thresholding contextual bandit problem, carefully integrating the state transitions into the reward design and focusing on identifying agents with action benefits exceeding a threshold. We establish the optimality of an oracle greedy solution in a simple two-state setting, and propose an algorithm that achieves minimax optimal constant regret in the online multi-state setting with heterogeneous agents and knowledge of outcomes under no intervention. We numerically show that our algorithm outperforms existing online restless bandit methods, offering significant improvements in finite-horizon performance.

[LG-4] Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech Music and Sound KR

Link: https://arxiv.org/abs/2502.05139
Authors: Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, Carleigh Wood, Ann Lee, Wei-Ning Hsu
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Comments: Repository: this https URL. Website: this https URL

Abstract:The quantification of audio aesthetics remains a complex challenge in audio processing, primarily due to its subjective nature, which is influenced by human perception and cultural context. Traditional methods often depend on human listeners for evaluation, leading to inconsistencies and high resource demands. This paper addresses the growing need for automated systems capable of predicting audio aesthetics without human intervention. Such systems are crucial for applications like data filtering, pseudo-labeling large datasets, and evaluating generative audio models, especially as these models become more sophisticated. In this work, we introduce a novel approach to audio aesthetic evaluation by proposing new annotation guidelines that decompose human listening perspectives into four distinct axes. We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality. Our models are evaluated against human mean opinion scores (MOS) and existing methods, demonstrating comparable or superior performance. This research not only advances the field of audio aesthetics but also provides open-source models and datasets to facilitate future work and benchmarking. We release our code and pre-trained model at: this https URL

[LG-5] Data-Parallel Neural Network Training via Nonlinearly Preconditioned Trust-Region Method

Link: https://arxiv.org/abs/2502.05133
Authors: Samuel A. Cruz Alegría, Ken Trotti, Alena Kopaničáková, Rolf Krause
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments: 8 pages, 6 figures

Abstract:Parallel training methods are increasingly relevant in machine learning (ML) due to the continuing growth in model and dataset sizes. We propose a variant of the Additively Preconditioned Trust-Region Strategy (APTS) for training deep neural networks (DNNs). The proposed APTS method utilizes a data-parallel approach to construct a nonlinear preconditioner employed in the nonlinear optimization strategy. In contrast to the common employment of Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam), which are both variants of gradient descent (GD) algorithms, the APTS method implicitly adjusts the step sizes in each iteration, thereby removing the need for costly hyperparameter tuning. We demonstrate the performance of the proposed APTS variant using the MNIST and CIFAR-10 datasets. The results obtained indicate that the APTS variant proposed here achieves comparable validation accuracy to SGD and Adam, all while allowing for parallel training and obviating the need for expensive hyperparameter tuning.

[LG-6] Optimizing Wireless Resource Management and Synchronization in Digital Twin Networks

Link: https://arxiv.org/abs/2502.05116
Authors: Hanzhi Yu, Yuchen Liu, Zhaohui Yang, Haijian Sun, Mingzhe Chen
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: 12 pages, 6 figures

Abstract:In this paper, we investigate accurate synchronization between a physical network and its digital network twin (DNT), which serves as a virtual representation of the physical network. The considered network includes a set of base stations (BSs) that must allocate their limited spectrum resources to serve a set of users while also transmitting their partially observed physical network information to a cloud server to generate the DNT. Since the DNT can predict the physical network status based on its historical status, the BSs may not need to send their physical network information at each time slot, allowing them to conserve spectrum resources to serve the users. However, if the DNT does not receive the physical network information of the BSs over a long time period, the DNT’s accuracy in representing the physical network may degrade. To this end, each BS must decide when to send the physical network information to the cloud server to update the DNT, while also determining the spectrum resource allocation policy for both DNT synchronization and serving the users. We formulate this resource allocation task as an optimization problem, aiming to maximize the total data rate of all users while minimizing the asynchronization between the physical network and the DNT. To address this problem, we propose a method based on gated recurrent units (GRUs) and the value decomposition network (VDN). Simulation results show that our GRU- and VDN-based algorithm improves the weighted sum of data rates and the similarity between the status of the DNT and the physical network by up to 28.96%, compared to a baseline method combining GRU with independent Q-learning.
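
The value decomposition network (VDN) mentioned in the abstract above rests on one simple idea: the joint action value is approximated by the sum of per-agent values, so the greedy joint action can be chosen agent by agent. The sketch below shows only that additive structure. The function names, the two-agent setup and the hard-coded Q-values are assumptions for illustration; in the paper the per-agent values would come from GRU-based networks, and this is not the authors' algorithm.

```python
def vdn_joint_value(per_agent_q, joint_action):
    """Additive VDN estimate: Q_tot = sum_i Q_i(a_i).

    per_agent_q: one list of action values per agent (hypothetical fixed numbers here).
    joint_action: tuple with one action index per agent.
    """
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))


def vdn_greedy_action(per_agent_q):
    """Because Q_tot is a sum, maximizing each agent's own Q maximizes Q_tot."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)


# Illustrative values: two base stations, each choosing between two actions.
per_agent_q = [
    [1.0, 3.0],  # agent 0
    [2.0, 0.5],  # agent 1
]
best = vdn_greedy_action(per_agent_q)
print(best, vdn_joint_value(per_agent_q, best))
```

The decentralized argmax is what distinguishes this design from searching the full joint action space, whose size grows exponentially with the number of agents.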

[LG-7] SpecTUS: Spectral Translator for Unknown Structures annotation from EI-MS spectra

Link: https://arxiv.org/abs/2502.05114
Authors: Adam Hájek, Helge Hecht, Elliott J. Price, Aleš Křenek
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*Comments:

Abstract:Compound identification and structure annotation from mass spectra is a well-established task widely applied in drug detection, criminal forensics, small molecule biomarker discovery and chemical engineering. We propose SpecTUS: Spectral Translator for Unknown Structures, a deep neural model that addresses the task of structural annotation of small molecules from low-resolution gas chromatography electron ionization mass spectra (GC-EI-MS). Our model analyzes the spectra in a de novo manner: a direct translation from the spectra into a 2D structural representation. Our approach is particularly useful for analyzing compounds unavailable in spectral libraries. In a rigorous evaluation of our model on the novel structure annotation task across different libraries, we outperformed standard database search techniques by a wide margin. On a held-out testing set, including 28,267 spectra from the NIST database, we show that our model’s single suggestion perfectly reconstructs 43% of the subset’s compounds. This single suggestion is strictly better than the candidate of the database hybrid search (a common method among practitioners) in 76% of cases. In a still affordable scenario of 10 suggestions, perfect reconstruction is achieved in 65%, and 84% are better than the hybrid search.

[LG-8] Graph Contrastive Learning for Connectome Classification

Link: https://arxiv.org/abs/2502.05109
Authors: Martín Schmidt, Sara Silva, Federico Larroca, Gonzalo Mateos, Pablo Musé
Categories: Machine Learning (cs.LG)
*Comments: Submitted to EMBC '25

Abstract:With recent advancements in non-invasive techniques for measuring brain activity, such as magnetic resonance imaging (MRI), the study of structural and functional brain networks through graph signal processing (GSP) has gained notable prominence. GSP stands as a key tool in unraveling the interplay between the brain’s function and structure, enabling the analysis of graphs defined by the connections between regions of interest – referred to as connectomes in this context. Our work represents a further step in this direction by exploring supervised contrastive learning methods within the realm of graph representation learning. The main objective of this approach is to generate subject-level (i.e., graph-level) vector representations that bring together subjects sharing the same label while separating those with different labels. These connectome embeddings are derived from a graph neural network Encoder-Decoder architecture, which jointly considers structural and functional connectivity. By leveraging data augmentation techniques, the proposed framework achieves state-of-the-art performance in a gender classification task using Human Connectome Project data. More broadly, our connectome-centric methodological advances support the promising prospect of using GSP to discover more about brain function, with potential impact to understanding heterogeneity in the neurodegeneration for precision medicine and diagnosis.

[LG-9] 3DMolFormer: A Dual-channel Framework for Structure-based Drug Discovery ICLR2025

Link: https://arxiv.org/abs/2502.05107
Authors: Xiuyuan Hu, Guoqing Liu, Can Chen, Yang Zhao, Hao Zhang, Xue Liu
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*Comments: Accepted by ICLR 2025

Abstract:Structure-based drug discovery, encompassing the tasks of protein-ligand docking and pocket-aware 3D drug design, represents a core challenge in drug discovery. However, no existing work can deal with both tasks to effectively leverage the duality between them, and current methods for each task are hindered by challenges in modeling 3D information and the limitations of available data. To address these issues, we propose 3DMolFormer, a unified dual-channel transformer-based framework applicable to both docking and 3D drug design tasks, which exploits their duality by utilizing docking functionalities within the drug design process. Specifically, we represent 3D pocket-ligand complexes using parallel sequences of discrete tokens and continuous numbers, and we design a corresponding dual-channel transformer model to handle this format, thereby overcoming the challenges of 3D information modeling. Additionally, we alleviate data limitations through large-scale pre-training on a mixed dataset, followed by supervised and reinforcement learning fine-tuning techniques respectively tailored for the two tasks. Experimental results demonstrate that 3DMolFormer outperforms previous approaches in both protein-ligand docking and pocket-aware 3D drug design, highlighting its promising application in structure-based drug discovery. The code is available at: this https URL .

[LG-10] Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension

Link: https://arxiv.org/abs/2502.05075
Authors: Yijun Dong, Yicheng Li, Yunai Li, Jason D. Lee, Qi Lei
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*Comments:

Abstract:Weak-to-strong (W2S) generalization is a type of finetuning (FT) where a strong (large) student model is trained on pseudo-labels generated by a weak teacher. Surprisingly, W2S FT often outperforms the weak teacher. We seek to understand this phenomenon through the observation that FT often occurs in intrinsically low-dimensional spaces. Leveraging the low intrinsic dimensionality of FT, we analyze W2S in the ridgeless regression setting from a variance reduction perspective. For a strong student - weak teacher pair with sufficiently expressive low-dimensional feature subspaces $\mathcal{V}_s, \mathcal{V}_w$, we provide an exact characterization of the variance that dominates the generalization error of W2S. This unveils a virtue of discrepancy between the strong and weak models in W2S: the variance of the weak teacher is inherited by the strong student in $\mathcal{V}_s \cap \mathcal{V}_w$, while reduced by a factor of $\dim(\mathcal{V}_s)/N$ in the subspace of discrepancy $\mathcal{V}_w \setminus \mathcal{V}_s$ with $N$ pseudo-labels for W2S. Further, our analysis casts light on the sample complexities and the scaling of performance gap recovery in W2S. The analysis is supported with experiments on both synthetic regression problems and real vision tasks.

[LG-11] Hybrid machine learning based scale bridging framework for permeability prediction of fibrous structures

Link: https://arxiv.org/abs/2502.05044
Authors: Denis Korolev, Tim Schmidt, Dinesh K. Natarajan, Stefano Cassola, David May, Miro Duhovic, Michael Hintermüller
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:This study introduces a hybrid machine learning-based scale-bridging framework for predicting the permeability of fibrous textile structures. By addressing the computational challenges inherent to multiscale modeling, the proposed approach evaluates the efficiency and accuracy of different scale-bridging methodologies combining traditional surrogate models and even integrating physics-informed neural networks (PINNs) with numerical solvers, enabling accurate permeability predictions across micro- and mesoscales. Four methodologies were evaluated: Single Scale Method (SSM), Simple Upscaling Method (SUM), Scale-Bridging Method (SBM), and Fully Resolved Model (FRM). SSM, the simplest method, neglects microscale permeability and exhibited permeability values deviating by up to 150% of the FRM model, which was taken as ground truth at an equivalent lower fiber volume content. SUM improved predictions by considering uniform microscale permeability, yielding closer values under similar conditions, but still lacked structural variability. The SBM method, incorporating segment-based microscale permeability assignments, showed significant enhancements, achieving almost equivalent values while maintaining computational efficiency and modeling runtimes of ~45 minutes per simulation. In contrast, FRM, which provides the highest fidelity by fully resolving microscale and mesoscale geometries, required up to 270 times more computational time than SSM, with model files exceeding 300 GB. Additionally, a hybrid dual-scale solver incorporating PINNs has been developed and shows the potential to overcome generalization errors and the problem of data scarcity of the data-driven surrogate approaches. The hybrid framework advances permeability modelling by balancing computational cost and prediction reliability, laying the foundation for further applications in fibrous composite manufacturing.

[LG-12] Leveraging a Simulator for Learning Causal Representations from Post-Treatment Covariates for CATE

Link: https://arxiv.org/abs/2502.05037
Authors: Lokesh Nagalapatti, Pranava Singhal, Avishek Ghosh, Sunita Sarawagi
Categories: Machine Learning (cs.LG)
*Comments: Accepted at TMLR-25

Abstract:Treatment effect estimation involves assessing the impact of different treatments on individual outcomes. Current methods estimate Conditional Average Treatment Effect (CATE) using observational datasets where covariates are collected before treatment assignment and outcomes are observed afterward, under assumptions like positivity and unconfoundedness. In this paper, we address a scenario where both covariates and outcomes are gathered after treatment. We show that post-treatment covariates render CATE unidentifiable, and recovering CATE requires learning treatment-independent causal representations. Prior work shows that such representations can be learned through contrastive learning if counterfactual supervision is available in observational data. However, since counterfactuals are rare, other works have explored using simulators that offer synthetic counterfactual supervision. Our goal in this paper is to systematically analyze the role of simulators in estimating CATE. We analyze the CATE error of several baselines and highlight their limitations. We then establish a generalization bound that characterizes the CATE error from jointly training on real and simulated distributions, as a function of the real-simulator mismatch. Finally, we introduce SimPONet, a novel method whose loss function is inspired from our generalization bound. We further show how SimPONet adjusts the simulator’s influence on the learning objective based on the simulator’s relevance to the CATE task. We experiment with various DGPs, by systematically varying the real-simulator distribution gap to evaluate SimPONet’s efficacy against state-of-the-art CATE baselines.

[LG-13] News about Global North considered Truthful! The Geo-political Veracity Gradient in Global South News

Link: https://arxiv.org/abs/2502.05032
Authors: Sujit Mandava,Deepak P,Sahely Bhadra
Categories: Machine Learning (cs.LG)
Comments:

Abstract:While there has been much research into developing AI techniques for fake news detection aided by various benchmark datasets, it has often been pointed out that fake news in different geo-political regions traces different contours. In this work we uncover, through analytical arguments and empirical evidence, the existence of an important characteristic in news originating from the Global South viz., the geo-political veracity gradient. In particular, we show that Global South news about topics from Global North – such as news from an Indian news agency on US elections – tend to be less likely to be fake. Observing through the prism of the political economy of fake news creation, we posit that this pattern could be due to the relative lack of monetarily aligned incentives in producing fake news about a different region than the regional remit of the audience. We provide empirical evidence for this from benchmark datasets. We also empirically analyze the consequences of this effect in applying AI-based fake news detection models for fake news AI trained on one region within another regional context. We locate our work within emerging critical scholarship on geo-political biases within AI in general, particularly with AI usage in fake news identification; we hope our insight into the geo-political veracity gradient could help steer fake news AI scholarship towards positively impacting Global South societies.

[LG-14] Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency

Link: https://arxiv.org/abs/2502.05028
Authors: Qixin Zhang,Zongqi Wan,Yu Yang,Li Shen,Dacheng Tao
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted to ICLR 2025

Abstract:Coordinating multiple agents to collaboratively maximize submodular functions in unpredictable environments is a critical task with numerous applications in machine learning, robot planning and control. The existing approaches, such as the OSG algorithm, are often hindered by their poor approximation guarantees and the rigid requirement for a fully connected communication graph. To address these challenges, we first present an MA-OSMA algorithm, which employs the multi-linear extension to transfer the discrete submodular maximization problem into a continuous optimization, thereby allowing us to reduce the strict dependence on a complete graph through consensus techniques. Moreover, MA-OSMA leverages a novel surrogate gradient to avoid sub-optimal stationary points. To eliminate the computationally intensive projection operations in MA-OSMA, we also introduce a projection-free MA-OSEA algorithm, which effectively utilizes the KL divergence by mixing a uniform distribution. Theoretically, we confirm that both algorithms achieve a regret bound of $\widetilde{O}\left(\sqrt{\frac{C_T T}{1-\beta}}\right)$ against a $\left(\frac{1-e^{-c}}{c}\right)$-approximation to the best comparator in hindsight, where $C_T$ is the deviation of maximizer sequence, $\beta$ is the spectral gap of the network and $c$ is the joint curvature of submodular objectives. This result significantly improves the $\left(\frac{1}{1+c}\right)$-approximation provided by the state-of-the-art OSG algorithm. Finally, we demonstrate the effectiveness of our proposed algorithms through simulation-based multi-target tracking.
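
The multi-linear extension mentioned in this abstract can be illustrated with a small sketch. The code below is not the authors' MA-OSMA implementation; it only evaluates the textbook definition F(x) = Σ_S f(S) Π_{i∈S} x_i Π_{i∉S} (1 − x_i) by brute-force enumeration. The function names `multilinear_extension` and `coverage` are hypothetical, and the coverage objective is an assumed example of a submodular function.

```python
from itertools import combinations


def multilinear_extension(f, x):
    """Exact F(x): expected value of f(S) when item i joins S independently with probability x[i]."""
    n = len(x)
    total = 0.0
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            chosen = set(subset)
            prob = 1.0
            for i in range(n):
                # Probability that exactly the items in `chosen` are selected.
                prob *= x[i] if i in chosen else 1.0 - x[i]
            total += prob * f(chosen)
    return total


def coverage(sets):
    """Hypothetical submodular objective: number of elements covered by the chosen sets."""
    return lambda chosen: len(set().union(*(sets[i] for i in chosen)))
```

The enumeration takes time exponential in the number of items, so it is only suitable for tiny ground sets; practical methods typically estimate F(x) or its gradient by sampling.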

[LG-15] Analog and Multi-modal Manufacturing Datasets Acquired on the Future Factories Platform V2

Link: https://arxiv.org/abs/2502.05020
Authors: Ramy Harik,Fadi El Kalach,Jad Samaha,Philip Samaha,Devon Clark,Drew Sander,Liam Burns,Ibrahim Yousif,Victor Gadow,Ahmed Mahmoud,Thorsten Wuest
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This paper presents two industry-grade datasets captured during an 8-hour continuous operation of the manufacturing assembly line at the Future Factories Lab, University of South Carolina, on 08/13/2024. The datasets adhere to industry standards, covering communication protocols, actuators, control mechanisms, transducers, sensors, and cameras. Data collection utilized both integrated and external sensors throughout the laboratory, including sensors embedded within the actuators and externally installed devices. Additionally, high-performance cameras captured key aspects of the operation. In a prior experiment [1], a 30-hour continuous run was conducted, during which all anomalies were documented. Maintenance procedures were subsequently implemented to reduce potential errors and operational disruptions. The two datasets include: (1) a time-series analog dataset, and (2) a multi-modal time-series dataset containing synchronized system data and images. These datasets aim to support future research in advancing manufacturing processes by providing a platform for testing novel algorithms without the need to recreate physical manufacturing environments. Moreover, the datasets are open-source and designed to facilitate the training of artificial intelligence models, streamlining research by offering comprehensive, ready-to-use resources for various applications and projects.

[LG-16] $O(\sqrt{T})$ Static Regret and Instance Dependent Constraint Violation for Constrained Online Convex Optimization

Link: https://arxiv.org/abs/2502.05019
Authors: Rahul Vaze,Abhishek Sinha
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The constrained version of the standard online convex optimization (OCO) framework, called COCO, is considered, where on every round, a convex cost function and a convex constraint function are revealed to the learner after it chooses the action for that round. The objective is to simultaneously minimize the static regret and cumulative constraint violation (CCV). An algorithm is proposed that guarantees a static regret of $O(\sqrt{T})$ and a CCV of $\min\{\mathcal{V}, O(\sqrt{T}\log T)\}$, where $\mathcal{V}$ depends on the distance between the consecutively revealed constraint sets, the shape of constraint sets, dimension of action space and the diameter of the action space. For special cases of constraint sets, $\mathcal{V}=O(1)$. Compared to the state of the art results, static regret of $O(\sqrt{T})$ and CCV of $O(\sqrt{T}\log T)$, that were universal, the new result on CCV is instance dependent, which is derived by exploiting the geometric properties of the constraint sets.
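
For readers unfamiliar with the two quantities, the block below writes out the definitions usually used in the COCO literature. The abstract does not state them, so the notation is an assumption: $f_t$ is the cost function, $g_t$ the constraint function, $x_t$ the action on round $t$, and $\mathcal{X}^\star$ the set of actions feasible on every round.

```latex
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x^\star \in \mathcal{X}^\star} \sum_{t=1}^{T} f_t(x^\star),
\qquad
\mathrm{CCV}_T \;=\; \sum_{t=1}^{T} \max\{g_t(x_t),\, 0\},
\qquad
\mathcal{X}^\star \;=\; \{\, x \in \mathcal{X} : g_t(x) \le 0 \ \ \forall t \,\}.
```

Under these definitions, the guarantee in the abstract reads $\mathrm{Regret}_T = O(\sqrt{T})$ and $\mathrm{CCV}_T = \min\{\mathcal{V}, O(\sqrt{T}\log T)\}$.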

[LG-17] Seasonal Station-Keeping of Short Duration High Altitude Balloons using Deep Reinforcement Learning

Link: https://arxiv.org/abs/2502.05014
Authors: Tristan K. Schuler,Chinthan Prasad,Georgiy Kiselev,Donald Sofge
Categories: Machine Learning (cs.LG); Robotics (cs.RO); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Abstract:Station-Keeping short-duration high-altitude balloons (HABs) in a region of interest is a challenging path-planning problem due to partially observable, complex, and dynamic wind flows. Deep reinforcement learning is a popular strategy for solving the station-keeping problem. A custom simulation environment was developed to train and evaluate Deep Q-Learning (DQN) for short-duration HAB agents in the simulation. To train the agents on realistic winds, synthetic wind forecasts were generated from aggregated historical radiosonde data to apply horizontal kinematics to simulated agents. The synthetic forecasts were closely correlated with ECMWF ERA5 Reanalysis forecasts, providing a realistic simulated wind field and seasonal and altitudinal variances between the wind models. DQN HAB agents were then trained and evaluated across different seasonal months. To highlight differences and trends in months with vastly different wind fields, a Forecast Score algorithm was introduced to independently classify forecasts based on wind diversity, and trends between station-keeping success and the Forecast Score were evaluated across all seasons.

[LG-18] Learning the Language of NVMe Streams for Ransomware Detection

Link: https://arxiv.org/abs/2502.05011
Authors: Barak Bringoltz,Elisha Halperin,Ran Feraru,Evgeny Blaichman,Amit Berman
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 25 pages, 8 figures

Abstract:We apply language modeling techniques to detect ransomware activity in NVMe command sequences. We design and train two types of transformer-based models: the Command-Level Transformer (CLT) performs in-context token classification to determine whether individual commands are initiated by ransomware, and the Patch-Level Transformer (PLT) predicts the volume of data accessed by ransomware within a patch of commands. We present both model designs and the corresponding tokenization and embedding schemes and show that they improve over state-of-the-art tabular methods by up to 24% in missed-detection rate, 66% in data loss prevention, and 84% in identifying data accessed by ransomware.

[LG-19] QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

Link: https://arxiv.org/abs/2502.05003
Authors: Andrei Panferov,Jiale Chen,Soroush Tabesh,Roberto L. Castro,Mahdi Nikdan,Dan Alistarh
Categories: Machine Learning (cs.LG)
Comments:

Abstract:One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330v2) put the “optimal” bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bits weights and activations. We advance this state-of-the-art via a new method called QuEST, which is Pareto-competitive with FP16, i.e., it provides better accuracy at lower model size, while training models with weights and activations in 4-bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the “true” (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at this https URL.

[LG-20] Enhancing Pre-Trained Decision Transformers with Prompt-Tuning Bandits

Link: https://arxiv.org/abs/2502.04979
Authors: Finn Rietz,Oleg Smirnov,Sara Karimi,Lele Cao
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Harnessing large offline datasets is vital for training foundation models that can generalize across diverse tasks. Offline Reinforcement Learning (RL) offers a powerful framework for these scenarios, enabling the derivation of optimal policies even from suboptimal data. The Prompting Decision Transformer (PDT) is an offline RL multi-task model that distinguishes tasks through stochastic trajectory prompts, which are task-specific tokens maintained in context during rollouts. However, PDT samples these tokens uniformly at random from per-task demonstration datasets, failing to account for differences in token informativeness and potentially leading to performance degradation. To address this limitation, we introduce a scalable bandit-based prompt-tuning method that dynamically learns to construct high-performance trajectory prompts. Our approach significantly enhances downstream task performance without modifying the pre-trained Transformer backbone. Empirical results on benchmark tasks and a newly designed multi-task environment demonstrate the effectiveness of our method, creating a seamless bridge between general multi-task offline pre-training and task-specific online adaptation.

[LG-21] DE-PADA: Personalized Augmentation and Domain Adaptation for ECG Biometrics Across Physiological States

Link: https://arxiv.org/abs/2502.04973
Authors: Amro Abu Saleh,Elliot Sprecher,Kfir Y. Levy,Daniel H. Lange
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Electrocardiogram (ECG)-based biometrics offer a promising method for user identification, combining intrinsic liveness detection with morphological uniqueness. However, elevated heart rates introduce significant physiological variability, posing challenges to pattern recognition systems and leading to a notable performance gap between resting and post-exercise conditions. Addressing this gap is critical for advancing ECG-based biometric systems for real-world applications. We propose DE-PADA, a Dual Expert model with Personalized Augmentation and Domain Adaptation, designed to enhance robustness across diverse physiological states. The model is trained primarily on resting-state data from the evaluation dataset, without direct exposure to their exercise data. To address variability, DE-PADA incorporates ECG-specific innovations, including heartbeat segmentation into the PQRS interval, known for its relative temporal consistency, and the heart rate-sensitive ST interval, enabling targeted feature extraction tailored to each region’s unique characteristics. Personalized augmentation simulates subject-specific T-wave variability across heart rates using individual T-wave peak predictions to adapt augmentation ranges. Domain adaptation further improves generalization by leveraging auxiliary data from supplementary subjects used exclusively for training, including both resting and exercise conditions. Experiments on the University of Toronto ECG Database demonstrate the model’s effectiveness. DE-PADA achieves relative improvements in post-exercise identification rates of 26.75% in the initial recovery phase and 11.72% in the late recovery phase, while maintaining a 98.12% identification rate in the sitting position. These results highlight DE-PADA’s ability to address intra-subject variability and enhance the robustness of ECG-based biometric systems across diverse physiological states.

[LG-22] No Task Left Behind: Isotropic Model Merging with Common and Task-Specific Subspaces

Link: https://arxiv.org/abs/2502.04959
Authors: Daniel Marczak,Simone Magistri,Sebastian Cygert,Bartłomiej Twardowski,Andrew D. Bagdanov,Joost van de Weijer
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Model merging integrates the weights of multiple task-specific models into a single multi-task model. Despite recent interest in the problem, a significant performance gap between the combined and single-task models remains. In this paper, we investigate the key characteristics of task matrices – weight update matrices applied to a pre-trained model – that enable effective merging. We show that alignment between singular components of task-specific and merged matrices strongly correlates with performance improvement over the pre-trained model. Based on this, we propose an isotropic merging framework that flattens the singular value spectrum of task matrices, enhances alignment, and reduces the performance gap. Additionally, we incorporate both common and task-specific subspaces to further improve alignment and performance. Our proposed approach achieves state-of-the-art performance across multiple scenarios, including various sets of tasks and model scales. This work advances the understanding of model merging dynamics, offering an effective methodology to merge models without requiring additional training. Code is available at this https URL .

[LG-23] Generative-enhanced optimization for knapsack problems: an industry-relevant study

Link: https://arxiv.org/abs/2502.04928
Authors: Yelyzaveta Vodovozova,Abhishek Awasthi,Caitlin Jones,Joseph Doetsch,Karen Wintersperger,Florian Krellner,Carlos A. Riofrío
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments:

Abstract:Optimization is a crucial task in various industries such as logistics, aviation, manufacturing, chemical, pharmaceutical, and insurance, where finding the best solution to a problem can result in significant cost savings and increased efficiency. Tensor networks (TNs) have gained prominence in recent years in modeling classical systems with quantum-inspired approaches. More recently, TN generative-enhanced optimization (TN-GEO) has been proposed as a strategy which uses generative modeling to efficiently sample valid solutions with respect to certain constraints of optimization problems. Moreover, it has been shown that symmetric TNs (STNs) can encode certain constraints of optimization problems, thus aiding in their solution process. In this work, we investigate the applicability of TN- and STN-GEO to an industry relevant problem class, a multi-knapsack problem, in which each object must be assigned to an available knapsack. We detail a prescription for practitioners to use the TN-and STN-GEO methodology and study its scaling behavior and dependence on its hyper-parameters. We benchmark 60 different problem instances and find that TN-GEO and STN-GEO produce results of similar quality to simulated annealing.

[LG-24] On the Power of Heuristics in Temporal Graphs

Link: https://arxiv.org/abs/2502.04910
Authors: Filip Cornell,Oleg Smirnov,Gabriela Zarzar Gandler,Lele Cao
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Dynamic graph datasets often exhibit strong temporal patterns, such as recency, which prioritizes recent interactions, and popularity, which favors frequently occurring nodes. We demonstrate that simple heuristics leveraging only these patterns can perform on par or outperform state-of-the-art neural network models under standard evaluation protocols. To further explore these dynamics, we introduce metrics that quantify the impact of recency and popularity across datasets. Our experiments on BenchTemp and the Temporal Graph Benchmark show that our approaches achieve state-of-the-art performance across all datasets in the latter and secure top ranks on multiple datasets in the former. These results emphasize the importance of refined evaluation schemes to enable fair comparisons and promote the development of more robust temporal graph models. Additionally, they reveal that current deep learning methods often struggle to capture the key patterns underlying predictions in real-world temporal graphs. For reproducibility, we have made our code publicly available.
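
The two patterns named in this abstract, recency and popularity, can be turned into a link-prediction baseline in a few lines. The sketch below is an assumed illustration and not the authors' code: the function names, the `1 / (1 + age)` recency formula, and the linear mixing `weight` are all hypothetical choices.

```python
from collections import Counter


def build_heuristics(events):
    """events: iterable of (source, destination, timestamp) tuples, sorted by time."""
    last_seen = {}          # destination -> timestamp of its latest interaction
    popularity = Counter()  # destination -> number of interactions so far
    for _src, dst, t in events:
        last_seen[dst] = t
        popularity[dst] += 1
    return last_seen, popularity


def recency_score(last_seen, dst, now):
    """Higher for destinations seen more recently; 0.0 if never seen."""
    if dst not in last_seen:
        return 0.0
    return 1.0 / (1.0 + (now - last_seen[dst]))


def popularity_score(popularity, dst):
    """Fraction of all past interactions that involved this destination."""
    total = sum(popularity.values())
    return popularity[dst] / total if total else 0.0


def rank_candidates(events, candidates, now, weight=0.5):
    """Rank candidate destinations by a weighted mix of recency and popularity."""
    last_seen, popularity = build_heuristics(events)

    def score(dst):
        return (weight * recency_score(last_seen, dst, now)
                + (1.0 - weight) * popularity_score(popularity, dst))

    return sorted(candidates, key=score, reverse=True)
```

Setting `weight=1.0` gives a pure recency ranker and `weight=0.0` a pure popularity ranker, which makes it easy to check how much of a benchmark each pattern explains on its own.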

[LG-25] On the Difficulty of Constructing a Robust and Publicly-Detectable Watermark

Link: https://arxiv.org/abs/2502.04901
Authors: Jaiden Fairoze,Guillermo Ortiz-Jiménez,Mel Vecerik,Somesh Jha,Sven Gowal
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:This work investigates the theoretical boundaries of creating publicly-detectable schemes to enable the provenance of watermarked imagery. Metadata-based approaches like C2PA provide unforgeability and public-detectability. ML techniques offer robust retrieval and watermarking. However, no existing scheme combines robustness, unforgeability, and public-detectability. In this work, we formally define such a scheme and establish its existence. Although theoretically possible, we find that at present, it is intractable to build certain components of our scheme without a leap in deep learning capabilities. We analyze these limitations and propose research directions that need to be addressed before we can practically realize robust and publicly-verifiable provenance.

[LG-26] Deep Learning Models for Physical Layer Communications

Link: https://arxiv.org/abs/2502.04895
Authors: Nunzio A. Letizia
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: PhD Thesis

Abstract:The increased availability of data and computing resources has enabled researchers to successfully adopt machine learning (ML) techniques and make significant contributions in several engineering areas. ML and in particular deep learning (DL) algorithms have shown to perform better in tasks where a physical bottom-up description of the phenomenon is lacking and/or is mathematically intractable. Indeed, they take advantage of the observations of natural phenomena to automatically acquire knowledge and learn internal relations. Despite the historical model-based mindset, communications engineering recently started shifting the focus towards top-down data-driven learning models, especially in domains such as channel modeling and physical layer design, where in most of the cases no general optimal strategies are known. In this thesis, we aim at solving some fundamental open challenges in physical layer communications exploiting new DL paradigms. In particular, we mathematically formulate, under ML terms, classic problems such as channel capacity and optimal coding-decoding schemes, for any arbitrary communication medium. We design and develop the architecture, algorithm and code necessary to train the equivalent DL model, and finally, we propose novel solutions to long-standing problems in the field.

[LG-27] A Foundational Brain Dynamics Model via Stochastic Optimal Control

Link: https://arxiv.org/abs/2502.04892
Authors: Joonhyeong Park,Byoungwoo Park,Chang-Bae Bang,Jungwon Choi,Hyungjin Chung,Byung-Hoon Kim,Juho Lee
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
Comments: The first two authors contributed equally

Abstract:We introduce a foundational model for brain dynamics that utilizes stochastic optimal control (SOC) and amortized inference. Our method features a continuous-discrete state space model (SSM) that can robustly handle the intricate and noisy nature of fMRI signals. To address computational limitations, we implement an approximation strategy grounded in the SOC framework. Additionally, we present a simulation-free latent dynamics approach that employs locally linear approximations, facilitating efficient and scalable inference. For effective representation learning, we derive an Evidence Lower Bound (ELBO) from the SOC formulation, which integrates smoothly with recent advancements in self-supervised learning (SSL), thereby promoting robust and transferable representations. Pre-trained on extensive datasets such as the UKB, our model attains state-of-the-art results across a variety of downstream tasks, including demographic prediction, trait analysis, disease diagnosis, and prognosis. Moreover, evaluating on external datasets such as HCP-A, ABIDE, and ADHD200 further validates its superior abilities and resilience across different demographic and clinical distributions. Our foundational model provides a scalable and efficient approach for deciphering brain dynamics, opening up numerous applications in neuroscience.

[LG-28] GNNs Getting ComFy: Community and Feature Similarity Guided Rewiring

Link: https://arxiv.org/abs/2502.04891
Authors: Celia Rubio-Madrigal,Adarsh Jamadandi,Rebekka Burkholz
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Comments: Accepted at ICLR 2025

Abstract:Maximizing the spectral gap through graph rewiring has been proposed to enhance the performance of message-passing graph neural networks (GNNs) by addressing over-squashing. However, as we show, minimizing the spectral gap can also improve generalization. To explain this, we analyze how rewiring can benefit GNNs within the context of stochastic block models. Since spectral gap optimization primarily influences community strength, it improves performance when the community structure aligns with node labels. Building on this insight, we propose three distinct rewiring strategies that explicitly target community structure, node labels, and their alignment: (a) community structure-based rewiring (ComMa), a more computationally efficient alternative to spectral gap optimization that achieves similar goals; (b) feature similarity-based rewiring (FeaSt), which focuses on maximizing global homophily; and (c) a hybrid approach (ComFy), which enhances local feature similarity while preserving community structure to optimize label-community alignment. Extensive experiments confirm the effectiveness of these strategies and support our theoretical insights.

[LG-29] Exploit Gradient Skewness to Circumvent Byzantine Defenses for Federated Learning

Link: https://arxiv.org/abs/2502.04890
Authors: Yuchen Liu,Chen Chen,Lingjuan Lyu,Yaochu Jin,Gang Chen
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Federated Learning (FL) is notorious for its vulnerability to Byzantine attacks. Most current Byzantine defenses share a common inductive bias: among all the gradients, the densely distributed ones are more likely to be honest. However, such a bias is a poison to Byzantine robustness due to a newly discovered phenomenon in this paper: gradient skew. We discover that a group of densely distributed honest gradients skew away from the optimal gradient (the average of honest gradients) due to heterogeneous data. This gradient skew phenomenon allows Byzantine gradients to hide within the densely distributed skewed gradients. As a result, Byzantine defenses are confused into believing that Byzantine gradients are honest. Motivated by this observation, we propose a novel skew-aware attack called STRIKE: first, we search for the skewed gradients; then, we construct Byzantine gradients within the skewed gradients. Experiments on three benchmark datasets validate the effectiveness of our attack.

[LG-30] Aequa: Fair Model Rewards in Collaborative Learning via Slimmable Networks

Link: https://arxiv.org/abs/2502.04850
Authors: Nurbek Tastan,Samuel Horvath,Karthik Nandakumar
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Collaborative learning enables multiple participants to learn a single global model by exchanging focused updates instead of sharing data. One of the core challenges in collaborative learning is ensuring that participants are rewarded fairly for their contributions, which entails two key sub-problems: contribution assessment and reward allocation. This work focuses on fair reward allocation, where the participants are incentivized through model rewards - differentiated final models whose performance is commensurate with the contribution. In this work, we leverage the concept of slimmable neural networks to collaboratively learn a shared global model whose performance degrades gracefully with a reduction in model width. We also propose a post-training fair allocation algorithm that determines the model width for each participant based on their contributions. We theoretically study the convergence of our proposed approach and empirically validate it using extensive experiments on different datasets and architectures. We also extend our approach to enable training-time model reward allocation.

[LG-31] Memory Capacity of Nonlinear Recurrent Networks: Is it Informative?

Link: https://arxiv.org/abs/2502.04832
Authors: Giovanni Ballarin,Lyudmila Grigoryeva,Juan-Pablo Ortega
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 8 pages, 1 figure

Abstract:The total memory capacity (MC) of linear recurrent neural networks (RNNs) has been proven to be equal to the rank of the corresponding Kalman controllability matrix, and it is almost surely maximal for connectivity and input weight matrices drawn from regular distributions. This fact questions the usefulness of this metric in distinguishing the performance of linear RNNs in the processing of stochastic signals. This note shows that the MC of random nonlinear RNNs yields arbitrary values within established upper and lower bounds depending just on the input process scale. This confirms that the existing definition of MC in linear and nonlinear cases has no practical value.
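
The quantity this abstract ties to memory capacity, the rank of the Kalman controllability matrix, is simple to compute. The sketch below assumes a linear RNN of the form x_t = A x_{t-1} + C z_t with a scalar input z_t; that notation, and the function names, are assumptions for illustration and do not come from the paper. It builds [C, AC, ..., A^(n-1) C] and finds its rank with exact rational arithmetic.

```python
from fractions import Fraction


def mat_vec(A, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def rank(columns):
    """Rank of the matrix with the given columns, via exact Gaussian elimination."""
    rows = [list(map(Fraction, r)) for r in zip(*columns)]  # transpose into rows
    r = 0
    n_cols = len(rows[0]) if rows else 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


def controllability_rank(A, C):
    """Rank of [C, AC, A^2 C, ..., A^(n-1) C] for an n x n matrix A and input vector C."""
    columns, v = [], list(C)
    for _ in range(len(A)):
        columns.append(v)
        v = mat_vec(A, v)
    return rank(columns)
```

For example, a shift-register matrix reaches the maximal rank n, while an identity connectivity matrix gives rank 1 because every column of the controllability matrix is the same vector.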

[LG-32] Harnessing omnipresent oscillator networks as computational resource

Link: https://arxiv.org/abs/2502.04818
Authors: Thomas Geert de Jong,Hirofumi Notsu,Kohei Nakajima
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Adaptation and Self-Organizing Systems (nlin.AO); Chaotic Dynamics (nlin.CD)
Comments:

Abstract:Nature is pervaded with oscillatory behavior. In networks of coupled oscillators patterns can arise when the system synchronizes to an external input. Hence, these networks provide processing and memory of input. We present a universal framework for harnessing oscillator networks as computational resource. This reservoir computing framework is introduced by the ubiquitous model for phase-locking, the Kuramoto model. We force the Kuramoto model by a nonlinear target-system, then after substituting the target-system with a trained feedback-loop it emulates the target-system. Our results are two-fold. Firstly, the trained network inherits performance properties of the Kuramoto model, where all-to-all coupling is performed in linear time with respect to the number of nodes and parameters for synchronization are abundant. Secondly, the learning capabilities of the oscillator network can be explained using Kuramoto model’s order parameter. This work provides the foundation for utilizing nature’s oscillator networks as a new class of information processing systems.
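
The linear-time all-to-all coupling and the order parameter mentioned in this abstract both follow from the standard mean-field form of the Kuramoto model. The sketch below is a generic illustration, not the authors' reservoir-computing framework; the function names and the explicit Euler integrator are assumptions. It uses the identity (K/N) Σ_j sin(θ_j − θ_i) = K r sin(ψ − θ_i), where r e^{iψ} is the average of e^{iθ_j} over all oscillators.

```python
import cmath
import math


def order_parameter(phases):
    """Return (r, psi): coherence r in [0, 1] and mean phase psi."""
    z = sum(cmath.exp(1j * theta) for theta in phases) / len(phases)
    return abs(z), cmath.phase(z)


def kuramoto_step(phases, natural_freqs, coupling, dt):
    """One explicit Euler step of the all-to-all Kuramoto model in O(N) time."""
    r, psi = order_parameter(phases)
    return [
        theta + dt * (omega + coupling * r * math.sin(psi - theta))
        for theta, omega in zip(phases, natural_freqs)
    ]
```

Because every oscillator interacts with the others only through the pair (r, psi), one step costs O(N) rather than the O(N^2) of a direct pairwise sum. A value of r close to 1 indicates synchronized phases, and a value close to 0 indicates incoherent phases.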

[LG-33] Describing Nonstationary Data Streams in Frequency Domain

Link: https://arxiv.org/abs/2502.04813
Authors: Joanna Komorniczak
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Concept drift is among the primary challenges faced by the data stream processing methods. The drift detection strategies, designed to counteract the negative consequences of such changes, often rely on analyzing the problem metafeatures. This work presents the Frequency Filtering Metadescriptor, a tool for characterizing the data stream that searches for the informative frequency components visible in the sample’s feature vector. The frequencies are filtered according to their variance across all available data batches. The presented solution is capable of generating a metadescription of the data stream, separating chunks into groups describing specific concepts on its basis, and visualizing the frequencies in the original spatial domain. The experimental analysis compared the proposed solution with two state-of-the-art strategies and with the PCA baseline in the post-hoc concept identification task. The research is followed by the identification of concepts in the real-world data streams. The generalization in the frequency domain adapted in the proposed solution allows capturing the complex feature dependencies as a reduced number of frequency components, while maintaining the semantic meaning of data.

[LG-34] Humans Co-exist So Must Embodied Artificial Agents

Link: https://arxiv.org/abs/2502.04809
Authors: Hannah Kuehn,Joseph La Delfa,Miguel Vasco,Danica Kragic,Iolanda Leite
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Modern embodied artificial agents excel in static, predefined tasks but fall short in dynamic and long-term interactions with humans. On the other hand, humans can adapt and evolve continuously, exploiting the situated knowledge embedded in their environment and other agents, thus contributing to meaningful interactions. We introduce the concept of co-existence for embodied artificial agents and argue that it is a prerequisite for meaningful, long-term interaction with humans. We take inspiration from biology and design theory to understand how human and non-human organisms foster entities that co-exist within their specific niches. Finally, we propose key research directions for the machine learning community to foster co-existing embodied agents, focusing on the principles, hardware and learning methods responsible for shaping them.

[LG-35] An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks

Link: https://arxiv.org/abs/2502.04773
Authors: George Papadopoulos, Andreas Kontogiannis, Foteini Papadopoulou, Chaido Poulianou, Ioannis Koumentis, George Vouros
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Multi-Agent Reinforcement Learning (MARL) has recently emerged as a significant area of research. However, MARL evaluation often lacks systematic diversity, hindering a comprehensive understanding of algorithms’ capabilities. In particular, cooperative MARL algorithms are predominantly evaluated on benchmarks such as SMAC and GRF, which primarily feature team game scenarios without assessing adequately various aspects of agents’ capabilities required in fully cooperative real-world tasks such as multi-robot cooperation and warehouse, resource management, search and rescue, and human-AI cooperation. Moreover, MARL algorithms are mainly evaluated on low dimensional state spaces, and thus their performance on high-dimensional (e.g., image) observations is not well-studied. To fill this gap, this paper highlights the crucial need for expanding systematic evaluation across a wider array of existing benchmarks. To this end, we conduct extensive evaluation and comparisons of well-known MARL algorithms on complex fully cooperative benchmarks, including tasks with images as agents’ observations. Interestingly, our analysis shows that many algorithms, hailed as state-of-the-art on SMAC and GRF, may underperform standard MARL baselines on fully cooperative benchmarks. Finally, towards more systematic and better evaluation of cooperative MARL algorithms, we have open-sourced PyMARLzoo+, an extension of the widely used (E)PyMARL libraries, which addresses an open challenge from [TBG++21], facilitating seamless integration and support with all benchmarks of PettingZoo, as well as Overcooked, PressurePlate, Capture Target and Box Pushing.

[LG-36] Shapley Value Approximation Based on k-Additive Games

Link: https://arxiv.org/abs/2502.04763
Authors: Guilherme Dean Pelegrina, Patrick Kolpaczki, Eyke Hüllermeier
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Abstract:The Shapley value is the prevalent solution for fair division problems in which a payout is to be divided among multiple agents. By adopting a game-theoretic view, the idea of fair division and the Shapley value can also be used in machine learning to quantify the individual contribution of features or data points to the performance of a predictive model. Despite its popularity and axiomatic justification, the Shapley value suffers from a computational complexity that scales exponentially with the number of entities involved, and hence requires approximation methods for its reliable estimation. We propose SVA$k_{\text{ADD}}$, a novel approximation method that fits a $k$-additive surrogate game. By taking advantage of $k$-additivity, we are able to elicit the exact Shapley values of the surrogate game and then use these values as estimates for the original fair division problem. The efficacy of our method is evaluated empirically and compared to competing methods.
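
To make the exponential cost mentioned in the abstract concrete, here is a minimal sketch of the exact Shapley value computed by enumerating every coalition. It is not the paper's SVA$k_{\text{ADD}}$ method; the function name `shapley_values` and the toy game `toy_value` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by brute-force enumeration of coalitions.

    `value` maps a frozenset of players to a real payout. The loop visits
    every subset of the other players, so the cost grows as 2^(n-1) per player.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for size in range(n):
            # Standard Shapley weight |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of p when joining coalition s
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


# Hypothetical toy game: a coalition earns 1 only if it contains both "a" and "b".
def toy_value(coalition):
    return 1.0 if {"a", "b"} <= coalition else 0.0


phi = shapley_values(["a", "b", "c"], toy_value)
print(phi)
```

In this toy game the payout is split equally between "a" and "b", while "c" contributes nothing, and the values sum to the payout of the full coalition. Approximation methods such as the one in the paper aim to avoid the full enumeration shown here.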

[LG-37] Learning Universal Multi-level Market Irrationality Factors to Improve Stock Return Forecasting KDD2025

Link: https://arxiv.org/abs/2502.04737
Authors: Chen Yang, Jingyuan Wang, Xiaohan Jiang, Junjie Wu
Categories: Machine Learning (cs.LG)
Comments: KDD2025

Abstract:Recent years have witnessed the encounter of deep learning and quantitative trading, which has achieved great success in stock investment. Numerous deep learning-based models have been developed for forecasting stock returns, leveraging the powerful representation capabilities of neural networks to identify patterns and factors influencing stock prices. These models can effectively capture general patterns in the market, such as stock price trends, volume-price relationships, and time variations. However, the impact of special irrationality factors – such as market sentiment, speculative behavior, market manipulation, and psychological biases – has not been fully considered in existing deep stock forecasting models due to their relative abstraction as well as the lack of explicit labels and data description. To fill this gap, we propose UMI, a Universal multi-level Market Irrationality factor model to enhance stock return forecasting. The UMI model learns factors that can reflect irrational behaviors in the market at both the individual stock and overall market levels. At the stock level, UMI constructs an estimated rational price for each stock, which is cointegrated with the stock’s actual price. The discrepancy between the actual and the rational prices serves as a factor to indicate stock-level irrational events. Additionally, we define market-level irrational behaviors as anomalous synchronous fluctuations of stocks within a market. Using two self-supervised representation learning tasks, i.e., sub-market comparative learning and market synchronism prediction, the UMI model incorporates market-level irrationalities into a market representation vector, which is then used as the market-level irrationality factor.

[LG-38] Singing Voice Conversion with Accompaniment Using Self-Supervised Representation-Based Melody Features ICASSP2025

Link: https://arxiv.org/abs/2502.04722
Authors: Wei Chen, Binzhu Sha, Jing Yang, Zhuo Wang, Fan Fan, Zhiyong Wu
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted by ICASSP2025

Abstract:Melody preservation is crucial in singing voice conversion (SVC). However, in many scenarios, audio is often accompanied with background music (BGM), which can cause audio distortion and interfere with the extraction of melody and other key features, significantly degrading SVC performance. Previous methods have attempted to address this by using more robust neural network-based melody extractors, but their performance drops sharply in the presence of complex accompaniment. Other approaches involve performing source separation before conversion, but this often introduces noticeable artifacts, leading to a significant drop in conversion quality and increasing the user’s operational costs. To address these issues, we introduce a novel SVC method that uses self-supervised representation-based melody features to improve melody modeling accuracy in the presence of BGM. In our experiments, we compare the effectiveness of different self-supervised learning (SSL) models for melody extraction and explore for the first time how SSL benefits the task of melody extraction. The experimental results demonstrate that our proposed SVC model significantly outperforms existing baseline methods in terms of melody accuracy and shows higher similarity and naturalness in both subjective and objective evaluations across noisy and clean audio environments.

[LG-39] Symbolic Regression of Data-Driven Reduced Order Model Closures for Under-Resolved Convection-Dominated Flows

Link: https://arxiv.org/abs/2502.04703
Authors: Simone Manti, Ping-Hsuan Tsai, Alessandro Lucantonio, Traian Iliescu
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:

Abstract:Data-driven closures correct the standard reduced order models (ROMs) to increase their accuracy in under-resolved, convection-dominated flows. There are two types of data-driven ROM closures in current use: (i) structural, with simple ansatzes (e.g., linear or quadratic); and (ii) machine learning-based, with neural network ansatzes. We propose a novel symbolic regression (SR) data-driven ROM closure strategy, which combines the advantages of current approaches and eliminates their drawbacks. As a result, the new data-driven SR closures yield ROMs that are interpretable, parsimonious, accurate, generalizable, and robust. To compare the data-driven SR-ROM closures with the structural and machine learning-based ROM closures, we consider the data-driven variational multiscale ROM framework and two under-resolved, convection-dominated test problems: the flow past a cylinder and the lid-driven cavity flow at Reynolds numbers Re = 10000, 15000, and 20000. This numerical investigation shows that the new data-driven SR-ROM closures yield more accurate and robust ROMs than the structural and machine learning ROM closures.

[LG-40] Nearly Tight Bounds for Cross-Learning Contextual Bandits with Graphical Feedback

Link: https://arxiv.org/abs/2502.04678
Authors: Ruiyuan Huang, Zengfeng Huang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The cross-learning contextual bandit problem with graphical feedback has recently attracted significant attention. In this setting, there is a contextual bandit with a feedback graph over the arms, and pulling an arm reveals the loss for all neighboring arms in the feedback graph across all contexts. Initially proposed by Han et al. (2024), this problem has broad applications in areas such as bidding in first price auctions, and explores a novel frontier in the feedback structure of bandit problems. A key theoretical question is whether an algorithm with $\widetilde{O}(\sqrt{\alpha T})$ regret exists, where $\alpha$ represents the independence number of the feedback graph. This question is particularly interesting because it concerns whether an algorithm can achieve a regret bound entirely independent of the number of contexts and matching the minimax regret of vanilla graphical bandits. Previous work has demonstrated that such an algorithm is impossible for adversarial contexts, but the question remains open for stochastic contexts. In this work, we affirmatively answer this open question by presenting an algorithm that achieves the minimax $\widetilde{O}(\sqrt{\alpha T})$ regret for cross-learning contextual bandits with graphical feedback and stochastic contexts. Notably, although that question is open even for stochastic bandits, we directly solve the strictly stronger adversarial bandit version of the problem.

[LG-41] Implicit Bias of SignGD and Adam on Multiclass Separable Data

Link: https://arxiv.org/abs/2502.04664
Authors: Chen Fan, Mark Schmidt, Christos Thrampoulidis
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In the optimization of overparameterized models, different gradient-based methods can achieve zero training error yet converge to distinctly different solutions inducing different generalization properties. While a decade of research on implicit optimization bias has illuminated this phenomenon in various settings, even the foundational case of linear classification with separable data still has important open questions. We resolve a fundamental gap by characterizing the implicit bias of both Adam and Sign Gradient Descent in multi-class cross-entropy minimization: we prove that their iterates converge to solutions that maximize the margin with respect to the classifier matrix’s max-norm and characterize the rate of convergence. We extend our results to general p-norm normalized steepest descent algorithms and to other multi-class losses.

[LG-42] Adversarially-Robust TD Learning with Markovian Data: Finite-Time Rates and Fundamental Limits AISTATS2025

Link: https://arxiv.org/abs/2502.04662
Authors: Sreejeet Maity, Aritra Mitra
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: Accepted to AISTATS 2025

Abstract:One of the most basic problems in reinforcement learning (RL) is policy evaluation: estimating the long-term return, i.e., value function, corresponding to a given fixed policy. The celebrated Temporal Difference (TD) learning algorithm addresses this problem, and recent work has investigated finite-time convergence guarantees for this algorithm and variants thereof. However, these guarantees hinge on the reward observations being always generated from a well-behaved (e.g., sub-Gaussian) true reward distribution. Motivated by harsh, real-world environments where such an idealistic assumption may no longer hold, we revisit the policy evaluation problem from the perspective of adversarial robustness. In particular, we consider a Huber-contaminated reward model where an adversary can arbitrarily corrupt each reward sample with a small probability $\epsilon$. Under this observation model, we first show that the adversary can cause the vanilla TD algorithm to converge to any arbitrary value function. We then develop a novel algorithm called Robust-TD and prove that its finite-time guarantees match those of vanilla TD with linear function approximation up to a small $O(\epsilon)$ term that captures the effect of corruption. We complement this result with a minimax lower bound, revealing that such an additive corruption-induced term is unavoidable. To our knowledge, these results are the first of their kind in the context of adversarial robustness of stochastic approximation schemes driven by Markov noise. The key new technical tool that enables our results is an analysis of the Median-of-Means estimator with corrupted, time-correlated data that might be of independent interest to the literature on robust statistics.
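
The Median-of-Means estimator named in the abstract can be sketched in a few lines. This is the classic textbook version for a plain list of samples, not the paper's Robust-TD algorithm or its analysis for time-correlated data; the function name `median_of_means` and the corrupted reward list are hypothetical.

```python
import statistics


def median_of_means(samples, num_blocks):
    """Split samples into equal blocks, average each block, return the median.

    A few arbitrarily corrupted samples can spoil only the blocks they fall
    into, so the median of the block means stays close to the true mean.
    Leftover samples beyond num_blocks * block_size are ignored.
    """
    block_size = len(samples) // num_blocks
    if block_size == 0:
        raise ValueError("need at least one sample per block")
    block_means = [
        statistics.fmean(samples[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    return statistics.median(block_means)


# Hypothetical rewards: the true reward is 1.0, one sample is corrupted.
rewards = [1.0] * 9 + [1000.0]
robust_estimate = median_of_means(rewards, num_blocks=5)
naive_estimate = statistics.fmean(rewards)
print(robust_estimate, naive_estimate)
```

With one corrupted sample out of ten, the plain average is pulled far from 1.0, while the median of the five block means is unaffected because only one block contains the outlier.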

[LG-43] End-to-End Learning Framework for Solving Non-Markovian Optimal Control

Link: https://arxiv.org/abs/2502.04649
Authors: Xiaole Zhang, Peiyu Zhang, Xiongye Xiao, Shixuan Li, Vasileios Tzoumas, Vijay Gupta, Paul Bogdan
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:Integer-order calculus often falls short in capturing the long-range dependencies and memory effects found in many real-world processes. Fractional calculus addresses these gaps via fractional-order integrals and derivatives, but fractional-order dynamical systems pose substantial challenges in system identification and optimal control due to the lack of standard control methodologies. In this paper, we theoretically derive the optimal control via linear quadratic regulator (LQR) for fractional-order linear time-invariant (FOLTI) systems and develop an end-to-end deep learning framework based on this theoretical foundation. Our approach establishes a rigorous mathematical model, derives analytical solutions, and incorporates deep learning to achieve data-driven optimal control of FOLTI systems. Our key contributions include: (i) proposing an innovative system identification method control strategy for FOLTI systems, (ii) developing the first end-to-end data-driven learning framework, Fractional-Order Learning for Optimal Control (FOLOC), that learns control policies from observed trajectories, and (iii) deriving a theoretical analysis of sample complexity to quantify the number of samples required for accurate optimal control in complex real-world problems. Experimental results indicate that our method accurately approximates fractional-order system behaviors without relying on Gaussian noise assumptions, pointing to promising avenues for advanced optimal control.

[LG-44] LATTEO: A Framework to Support Learning Asynchronously Tempered with Trusted Execution and Obfuscation

Link: https://arxiv.org/abs/2502.04601
Authors: Abhinav Kumar, George Torres, Noah Guzinski, Gaurav Panwar, Reza Tourani, Satyajayant Misra, Marcin Spoczynski, Mona Vij, Nageen Himayat
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:The privacy vulnerabilities of the federated learning (FL) paradigm, primarily caused by gradient leakage, have prompted the development of various defensive measures. Nonetheless, these solutions have predominantly been crafted for and assessed in the context of synchronous FL systems, with minimal focus on asynchronous FL. This gap arises in part due to the unique challenges posed by the asynchronous setting, such as the lack of coordinated updates, increased variability in client participation, and the potential for more severe privacy risks. These concerns have stymied the adoption of asynchronous FL. In this work, we first demonstrate the privacy vulnerabilities of asynchronous FL through a novel data reconstruction attack that exploits gradient updates to recover sensitive client data. To address these vulnerabilities, we propose a privacy-preserving framework that combines a gradient obfuscation mechanism with Trusted Execution Environments (TEEs) for secure asynchronous FL aggregation at the network edge. To overcome the limitations of conventional enclave attestation, we introduce a novel data-centric attestation mechanism based on Multi-Authority Attribute-Based Encryption. This mechanism enables clients to implicitly verify TEE-based aggregation services, effectively handle on-demand client participation, and scale seamlessly with an increasing number of asynchronous connections. Our gradient obfuscation mechanism reduces the structural similarity index of data reconstruction by 85% and increases reconstruction error by 400%, while our framework improves attestation efficiency by lowering average latency by up to 1500% compared to RA-TLS, without additional overhead.

[LG-45] Overcoming Fake Solutions in Semi-Dual Neural Optimal Transport: A Smoothing Approach for Learning the Optimal Transport Plan

Link: https://arxiv.org/abs/2502.04583
Authors: Jaemoo Choi, Jaewoong Choi, Dohyun Kwon
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 10 figures

Abstract:We address the convergence problem in learning the Optimal Transport (OT) map, where the OT Map refers to a map from one distribution to another while minimizing the transport cost. Semi-dual Neural OT, a widely used approach for learning OT Maps with neural networks, often generates fake solutions that fail to transfer one distribution to another accurately. We identify a sufficient condition under which the max-min solution of Semi-dual Neural OT recovers the true OT Map. Moreover, to address cases when this sufficient condition is not satisfied, we propose a novel method, OTP, which learns both the OT Map and the Optimal Transport Plan, representing the optimal coupling between two distributions. Under sharp assumptions on the distributions, we prove that our model eliminates the fake solution issue and correctly solves the OT problem. Our experiments show that the OTP model recovers the optimal transport map where existing methods fail and outperforms current OT-based models in image-to-image translation tasks. Notably, the OTP model can learn stochastic transport maps when deterministic OT Maps do not exist, such as one-to-many tasks like colorization.

[LG-46] Learning Semantics-aware Search Operators for Genetic Programming GECCO2025

Link: https://arxiv.org/abs/2502.04568
Authors: Piotr Wyrwiński, Krzysztof Krawiec
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Submitted to GECCO 2025

Abstract:Fitness landscapes in test-based program synthesis are known to be extremely rugged, with even minimal modifications of programs often leading to fundamental changes in their behavior and, consequently, fitness values. Relying on fitness as the only guidance in iterative search algorithms like genetic programming is thus unnecessarily limiting, especially when combined with purely syntactic search operators that are agnostic about their impact on program behavior. In this study, we propose a semantics-aware search operator that steers the search towards candidate programs that are valuable not only actually (high fitness) but also only potentially, i.e. are likely to be turned into high-quality solutions even if their current fitness is low. The key component of the method is a graph neural network that learns to model the interactions between program instructions and processed data, and produces a saliency map over graph nodes that represents possible search decisions. When applied to a suite of symbolic regression benchmarks, the proposed method outperforms conventional tree-based genetic programming and the ablated variant of the method.

[LG-47] Private Federated Learning In Real World Application – A Case Study

Link: https://arxiv.org/abs/2502.04565
Authors: An Ji, Bortik Bandyopadhyay, Congzheng Song, Natarajan Krishnaswami, Prabal Vashish, Rigel Smiroldo, Isabel Litton, Sayantan Mahinder, Mona Chitnis, Andrew W Hill
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This paper presents an implementation of machine learning model training using private federated learning (PFL) on edge devices. We introduce a novel framework that uses PFL to address the challenge of training a model using users’ private data. The framework ensures that user data remain on individual devices, with only essential model updates transmitted to a central server for aggregation with privacy guarantees. We detail the architecture of our app selection model, which incorporates a neural network with attention mechanisms and ambiguity handling through uncertainty management. Experiments conducted through off-line simulations and on device training demonstrate the feasibility of our approach in real-world scenarios. Our results show the potential of PFL to improve the accuracy of an app selection model by adapting to changes in user behavior over time, while adhering to privacy standards. The insights gained from this study are important for industries looking to implement PFL, offering a robust strategy for training a predictive model directly on edge devices while ensuring user data privacy.

[LG-48] Mixture of neural operator experts for learning boundary conditions and model selection

Link: https://arxiv.org/abs/2502.04562
Authors: Dwyer Deighan, Jonas A. Actor, Ravi G. Patel
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
Comments:

Abstract:While Fourier-based neural operators are best suited to learning mappings between functions on periodic domains, several works have introduced techniques for incorporating nontrivial boundary conditions. However, all previously introduced methods have restrictions that limit their applicability. In this work, we introduce an alternative approach to imposing boundary conditions inspired by volume penalization from numerical methods and Mixture of Experts (MoE) from machine learning. By introducing competing experts, the approach additionally allows for model selection. To demonstrate the method, we combine a spatially conditioned MoE with the Fourier-based Modal Operator Regression for Physics (MOR-Physics) neural operator and recover a nonlinear operator on a disk and quarter disk. Next, we extract a large eddy simulation (LES) model from direct numerical simulation (DNS) of channel flow and show the domain decomposition provided by our approach. Finally, we train our LES model with Bayesian variational inference and obtain posterior predictive samples of flow far past the DNS simulation time horizon.

[LG-49] Speeding up Speculative Decoding via Approximate Verification

Link: https://arxiv.org/abs/2502.04557
Authors: Meiyu Zhong, Noel Teku, Ravi Tandon
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:

Abstract:Speculative Decoding (SD) is a recently proposed technique for faster inference using Large Language Models (LLMs). SD operates by using a smaller draft LLM for autoregressively generating a sequence of tokens and a larger target LLM for parallel verification to ensure statistical consistency. However, periodic parallel calls to the target LLM for verification prevent SD from achieving even lower latencies. We propose SPRINTER, which utilizes a low-complexity verifier trained to predict if tokens generated from a draft LLM would be accepted by the target LLM. By performing approximate sequential verification, SPRINTER does not require verification by the target LLM, which is only invoked when a token is deemed unacceptable. This reduces the number of calls to the larger LLM and can achieve further speedups. We present a theoretical analysis of SPRINTER, examining the statistical properties of the generated tokens, as well as the expected reduction in latency as a function of the verifier. We evaluate SPRINTER on several datasets and model pairs, demonstrating that approximate verification can still maintain high quality generation while further reducing latency. For instance, on the Wiki-Summaries dataset, SPRINTER achieves a 1.7x latency speedup and requires 8.3x fewer flops relative to SD, while still generating high-quality responses when using GPT2-Small and GPT2-XL as draft/target models.
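
For background, the exact verification step that standard speculative decoding uses (and that SPRINTER approximates with a learned verifier) can be sketched as follows. This is a minimal illustration of the usual accept/resample rule over a single token position, not the paper's method; the function name `verification_rule` and the two toy distributions are hypothetical.

```python
def verification_rule(p_target, q_draft):
    """Standard speculative-decoding check for one token position.

    A token x proposed by the draft model is accepted with probability
    min(1, p(x) / q(x)). On rejection, a replacement token is drawn from the
    residual distribution proportional to max(0, p(x) - q(x)).
    Returns (acceptance probabilities per token, residual distribution).
    """
    # Tokens with q == 0 are never proposed by the draft; 1.0 is a placeholder.
    accept = [1.0 if q == 0 else min(1.0, p / q) for p, q in zip(p_target, q_draft)]
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(residual)
    if total > 0:
        residual = [r / total for r in residual]
    else:
        # p and q coincide, so rejection never occurs; fall back to p.
        residual = list(p_target)
    return accept, residual


# Hypothetical next-token distributions over a 3-token vocabulary.
p = [0.5, 0.3, 0.2]  # target model
q = [0.2, 0.5, 0.3]  # draft model
accept, residual = verification_rule(p, q)
print(accept, residual)
```

Here token 0 is always accepted because the target assigns it more mass than the draft, while tokens 1 and 2 are accepted only part of the time. Computing `p_target` is what requires the call to the large model, which is the cost an approximate verifier tries to avoid.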

[LG-50] Mechanisms of Projective Composition of Diffusion Models

Link: https://arxiv.org/abs/2502.04549
Authors: Arwen Bradley, Preetum Nakkiran, David Berthelot, James Thornton, Joshua M. Susskind
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 7 figures. The first two authors contributed equally

Abstract:We study the theoretical foundations of composition in diffusion models, with a particular focus on out-of-distribution extrapolation and length-generalization. Prior work has shown that composing distributions via linear score combination can achieve promising results, including length-generalization in some cases (Du et al., 2023; Liu et al., 2022). However, our theoretical understanding of how and why such compositions work remains incomplete. In fact, it is not even entirely clear what it means for composition to “work”. This paper starts to address these fundamental gaps. We begin by precisely defining one possible desired result of composition, which we call projective composition. Then, we investigate: (1) when linear score combinations provably achieve projective composition, (2) whether reverse-diffusion sampling can generate the desired composition, and (3) the conditions under which composition fails. Finally, we connect our theoretical analysis to prior empirical observations where composition has either worked or failed, for reasons that were unclear at the time.

[LG-51] Discovering Physics Laws of Dynamical Systems via Invariant Function Learning

Link: https://arxiv.org/abs/2502.04495
Authors: Shurui Gui, Xiner Li, Shuiwang Ji
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We consider learning underlying laws of dynamical systems governed by ordinary differential equations (ODE). A key challenge is how to discover intrinsic dynamics across multiple environments while circumventing environment-specific mechanisms. Unlike prior work, we tackle more complex environments where changes extend beyond function coefficients to entirely different function forms. For example, we demonstrate the discovery of ideal pendulum’s natural motion $\alpha^2 \sin\theta_t$ by observing pendulum dynamics in different environments, such as the damped environment $\alpha^2 \sin(\theta_t) - \rho \omega_t$ and powered environment $\alpha^2 \sin(\theta_t) + \rho \frac{\omega_t}{\left|\omega_t\right|}$. Here, we formulate this problem as an invariant function learning task and propose a new method, known as Disentanglement of Invariant Functions (DIF), that is grounded in causal analysis. We propose a causal graph and design an encoder-decoder hypernetwork that explicitly disentangles invariant functions from environment-specific dynamics. The discovery of invariant functions is guaranteed by our information-based principle that enforces the independence between extracted invariant functions and environments. Quantitative comparisons with meta-learning and invariant learning baselines on three ODE systems demonstrate the effectiveness and efficiency of our method. Furthermore, symbolic regression explanation results highlight the ability of our framework to uncover intrinsic laws.

[LG-52] Provable Sample-Efficient Transfer Learning Conditional Diffusion Models via Representation Learning

Link: https://arxiv.org/abs/2502.04491
Authors: Ziheng Cheng, Tianyu Xie, Shiyue Zhang, Cheng Zhang
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Abstract:While conditional diffusion models have achieved remarkable success in various applications, they require abundant data to train from scratch, which is often infeasible in practice. To address this issue, transfer learning has emerged as an essential paradigm in small data regimes. Despite its empirical success, the theoretical underpinnings of transfer learning conditional diffusion models remain unexplored. In this paper, we take the first step towards understanding the sample efficiency of transfer learning conditional diffusion models through the lens of representation learning. Inspired by practical training procedures, we assume that there exists a low-dimensional representation of conditions shared across all tasks. Our analysis shows that with a well-learned representation from source tasks, the sample complexity of target tasks can be reduced substantially. In addition, we investigate the practical implications of our theoretical results in several real-world applications of conditional diffusion models. Numerical experiments are also conducted to verify our results.

[LG-53] The ML Supply Chain in the Era of Software 2.0: Lessons Learned from Hugging Face

Link: https://arxiv.org/abs/2502.04484
Authors: Trevor Stalnaker, Nathan Wintersgill, Oscar Chaparro, Laura A. Heymann, Massimiliano Di Penta, Daniel M German, Denys Poshyvanyk
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:

Abstract:The last decade has seen widespread adoption of Machine Learning (ML) components in software systems. This has occurred in nearly every domain, from natural language processing to computer vision. These ML components range from relatively simple neural networks to complex and resource-intensive large language models. However, despite this widespread adoption, little is known about the supply chain relationships that produce these models, which can have implications for compliance and security. In this work, we conduct an extensive analysis of 760,460 models and 175,000 datasets mined from the popular model-sharing site Hugging Face. First, we evaluate the current state of documentation in the Hugging Face supply chain, report real-world examples of shortcomings, and offer actionable suggestions for improvement. Next, we analyze the underlying structure of the extant supply chain. Finally, we explore the current licensing landscape against what was reported in prior work and discuss the unique challenges posed in this domain. Our results motivate multiple research avenues, including the need for better license management for ML models/datasets, better support for model documentation, and automated inconsistency checking and validation. We make our research infrastructure and dataset available to facilitate future research.

[LG-54] Identifying Flaky Tests in Quantum Code: A Machine Learning Approach

Link: https://arxiv.org/abs/2502.04471
Authors: Khushdeep Kaur, Dongchan Kim, Ainaz Jamshidi, Lei Zhang
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 8 pages, 1 figure, accepted by Q-SANER 2025

Abstract:Testing and debugging quantum software pose significant challenges due to the inherent complexities of quantum mechanics, such as superposition and entanglement. One challenge is indeterminacy, a fundamental characteristic of quantum systems, which increases the likelihood of flaky tests in quantum programs. To the best of our knowledge, there is a lack of comprehensive studies on quantum flakiness in the existing literature. In this paper, we present a novel machine learning platform that leverages multiple machine learning models to automatically detect flaky tests in quantum programs. Our evaluation shows that the extreme gradient boosting and decision tree-based models outperform other models (i.e., random forest, k-nearest neighbors, and support vector machine), achieving the highest F1 score and Matthews Correlation Coefficient in a balanced dataset and an imbalanced dataset, respectively. Furthermore, we expand the currently limited dataset for researchers interested in quantum flaky tests. In the future, we plan to explore the development of unsupervised learning techniques to detect and classify quantum flaky tests more effectively. These advancements aim to improve the reliability and robustness of quantum software testing.

[LG-55] Iterative Importance Fine-tuning of Diffusion Models

链接: https://arxiv.org/abs/2502.04468
作者: Alexander Denker,Shreyas Padhy,Francisco Vargas,Johannes Hertrich
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Probability (math.PR)
备注:

点击查看摘要

Abstract:Diffusion models are an important tool for generative modelling, serving as effective priors in applications such as imaging and protein design. A key challenge in applying diffusion models for downstream tasks is efficiently sampling from resulting posterior distributions, which can be addressed using the $h$-transform. This work introduces a self-supervised algorithm for fine-tuning diffusion models by estimating the $h$-transform, enabling amortised conditional sampling. Our method iteratively refines the $h$-transform using a synthetic dataset resampled with path-based importance weights. We demonstrate the effectiveness of this framework on class-conditional sampling and reward fine-tuning for text-to-image diffusion models.

[LG-56] Learning low-dimensional representations of ensemble forecast fields using autoencoder-based methods

链接: https://arxiv.org/abs/2502.04409
作者: Jieyu Chen,Kevin Höhlein,Sebastian Lerch
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:

点击查看摘要

Abstract:Large-scale numerical simulations often produce high-dimensional gridded data that is challenging to process for downstream applications. A prime example is numerical weather prediction, where atmospheric processes are modeled using discrete gridded representations of the physical variables and dynamics. Uncertainties are assessed by running the simulations multiple times, yielding ensembles of simulated fields as a high-dimensional stochastic representation of the forecast distribution. The high dimensionality and large volume of ensemble datasets pose major computing challenges for subsequent forecasting stages. Data-driven dimensionality reduction techniques could help to reduce the data volume before further processing by learning meaningful and compact representations. However, existing dimensionality reduction methods are typically designed for deterministic and single-valued inputs, and thus cannot handle ensemble data from multiple randomized simulations. In this study, we propose novel dimensionality reduction approaches specifically tailored to the format of ensemble forecast fields. We present two alternative frameworks, which yield low-dimensional representations of ensemble forecasts while respecting their probabilistic character. The first approach derives a distribution-based representation of an input ensemble by applying standard dimensionality reduction techniques in a member-by-member fashion and merging the member representations into a joint parametric distribution model. The second approach achieves a similar representation by encoding all members jointly using a tailored variational autoencoder. We evaluate and compare both approaches in a case study using 10 years of temperature and wind speed forecasts over Europe. The approaches preserve key spatial and statistical characteristics of the ensemble and enable probabilistic reconstructions of the forecast fields.

[LG-57] XMTC: Explainable Early Classification of Multivariate Time Series in Reach-to-Grasp Hand Kinematics

链接: https://arxiv.org/abs/2502.04398
作者: Reyhaneh Sabbagh Gol,Dimitar Valkov,Lars Linsen
类目: Machine Learning (cs.LG); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Hand kinematics can be measured in Human-Computer Interaction (HCI) in order to predict the user’s intention in a reach-to-grasp action. Using multiple hand sensors, multivariate time series data are captured. Given a number of possible actions on a number of objects, the goal is to classify the multivariate time series data, where the class should be predicted as early as possible. Many machine-learning methods have been developed for such classification tasks, where different approaches produce favorable solutions on different data sets. We therefore employ an ensemble approach that includes and weights different approaches. To make the classification trustworthy, we present the XMTC tool, which incorporates coordinated multiple-view visualizations to analyze the predictions. Temporal accuracy plots, confusion matrix heatmaps, temporal confidence heatmaps, and partial dependence plots allow for the identification of the best trade-off between early prediction and prediction quality, the detection and analysis of challenging classification conditions, and the investigation of the prediction evolution in an overview and detail manner. We apply XMTC to real-world HCI data in multiple scenarios and show that good classification predictions can be achieved early on with our classifier, as well as which conditions are easy to distinguish, which multivariate time series measurements impose challenges, and which features have the most impact.

[LG-58] Discrete GCBF Proximal Policy Optimization for Multi-agent Safe Optimal Control ICLR2025

链接: https://arxiv.org/abs/2502.03640
作者: Songyuan Zhang,Oswin So,Mitchell Black,Chuchu Fan
类目: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
备注: 31 pages, 15 figures, accepted by the thirteenth International Conference on Learning Representations (ICLR 2025)

点击查看摘要

Abstract:Control policies that can achieve high task performance and satisfy safety constraints are desirable for any system, including multi-agent systems (MAS). One promising technique for ensuring the safety of MAS is distributed control barrier functions (CBF). However, it is difficult to design distributed CBF-based policies for MAS that can tackle unknown discrete-time dynamics, partial observability, changing neighborhoods, and input constraints, especially when a distributed high-performance nominal policy that can achieve the task is unavailable. To tackle these challenges, we propose DGPPO, a new framework that simultaneously learns both a discrete graph CBF which handles neighborhood changes and input constraints, and a distributed high-performance safe policy for MAS with unknown discrete-time dynamics. We empirically validate our claims on a suite of multi-agent tasks spanning three different simulation engines. The results suggest that, compared with existing methods, our DGPPO framework obtains policies that achieve high task performance (matching baselines that ignore the safety constraints), and high safety rates (matching the most conservative baselines), with a constant set of hyperparameters across all environments.

[LG-59] Sketchy Moment Matching: Toward Fast and Provable Data Selection for Finetuning NEURIPS2024

链接: https://arxiv.org/abs/2407.06120
作者: Yijun Dong,Hoang Phan,Xiang Pan,Qi Lei
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
备注: NeurIPS 2024

点击查看摘要

Abstract:We revisit data selection in a modern context of finetuning from a fundamental perspective. Extending the classical wisdom of variance minimization in low dimensions to high-dimensional finetuning, our generalization analysis unveils the importance of additionally reducing bias induced by low-rank approximation. Inspired by the variance-bias tradeoff in high dimensions from the theory, we introduce Sketchy Moment Matching (SkMM), a scalable data selection scheme with two stages. (i) First, the bias is controlled using gradient sketching that explores the finetuning parameter space for an informative low-dimensional subspace $\mathcal{S}$; (ii) then the variance is reduced over $\mathcal{S}$ via moment matching between the original and selected datasets. Theoretically, we show that gradient sketching is fast and provably accurate: selecting $n$ samples by reducing variance over $\mathcal{S}$ preserves the fast-rate generalization $O(\dim(\mathcal{S})/n)$, independent of the parameter dimension. Empirically, we concretize the variance-bias balance via synthetic experiments and demonstrate the effectiveness of SkMM for finetuning in real vision tasks.

[LG-60] SMT 2.0: A Surrogate Modeling Toolbox with a focus on Hierarchical and Mixed Variables Gaussian Processes

链接: https://arxiv.org/abs/2305.13998
作者: Paul Saves,Remi Lafage,Nathalie Bartoli,Youssef Diouane,Jasper Bussemaker,Thierry Lefebvre,John T. Hwang,Joseph Morlier,Joaquim R. R. A. Martins
类目: Machine Learning (cs.LG); Mathematical Software (cs.MS); Optimization and Control (math.OC); Computation (stat.CO)
备注: https://doi.org/10.1016/j.advengsoft.2023.103571

点击查看摘要

Abstract:The Surrogate Modeling Toolbox (SMT) is an open-source Python package that offers a collection of surrogate modeling methods, sampling techniques, and a set of sample problems. This paper presents SMT 2.0, a major new release of SMT that introduces significant upgrades and new features to the toolbox. This release adds the capability to handle mixed-variable surrogate models and hierarchical variables. These types of variables are becoming increasingly important in several surrogate modeling applications. SMT 2.0 also improves SMT by extending sampling methods, adding new surrogate models, and computing variance and kernel derivatives for Kriging. This release also includes new functions to handle noisy data and to use multifidelity data. To the best of our knowledge, SMT 2.0 is the first open-source surrogate library to propose surrogate models for hierarchical and mixed inputs. This open-source software is distributed under the New BSD license.

[LG-61] Distinguishing Cause from Effect with Causal Velocity Models

链接: https://arxiv.org/abs/2502.05122
作者: Johnny Xi,Hugh Dance,Peter Orbanz,Benjamin Bloem-Reddy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Bivariate structural causal models (SCM) are often used to infer causal direction by examining their goodness-of-fit under restricted model classes. In this paper, we describe a parametrization of bivariate SCMs in terms of a causal velocity by viewing the cause variable as time in a dynamical system. The velocity implicitly defines counterfactual curves via the solution of initial value problems where the observation specifies the initial condition. Using tools from measure transport, we obtain a unique correspondence between SCMs and the score function of the generated distribution via its causal velocity. Based on this, we derive an objective function that directly regresses the velocity against the score function, the latter of which can be estimated non-parametrically from observational data. We use this to develop a method for bivariate causal discovery that extends beyond known model classes such as additive or location scale noise, and that requires no assumptions on the noise distributions. When the score is estimated well, the objective is also useful for detecting model non-identifiability and misspecification. We present positive results in simulation and benchmark experiments where many existing methods fail, and perform ablation studies to examine the method’s sensitivity to accurate score estimation.

[LG-62] Refining Integration-by-Parts Reduction of Feynman Integrals with Machine Learning

链接: https://arxiv.org/abs/2502.05121
作者: Matt von Hippel,Matthias Wilhelm
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
备注: 28 pages, 9 figures

点击查看摘要

Abstract:Integration-by-parts reductions of Feynman integrals pose a frequent bottleneck in state-of-the-art calculations in theoretical particle and gravitational-wave physics, and rely on heuristic approaches for selecting integration-by-parts identities, whose quality heavily influences the performance. In this paper, we investigate the use of machine-learning techniques to find improved heuristics. We use funsearch, a genetic programming variant based on code generation by a Large Language Model, in order to explore possible approaches, then use strongly typed genetic programming to zero in on useful solutions. Both approaches manage to re-discover the state-of-the-art heuristics recently incorporated into integration-by-parts solvers, and in one example find a small advance on this state of the art.

[LG-63] Non-linear Quantum Monte Carlo

链接: https://arxiv.org/abs/2502.05094
作者: Jose Blanchet,Yassine Hamoudi,Mario Szegedy,Guanyang Wang
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computation (stat.CO); Machine Learning (stat.ML)
备注: 30 pages

点击查看摘要

Abstract:The mean of a random variable can be understood as a linear functional on the space of probability distributions. Quantum computing is known to provide a quadratic speedup over classical Monte Carlo methods for mean estimation. In this paper, we investigate whether a similar quadratic speedup is achievable for estimating non-linear functionals of probability distributions. We propose a quantum-inside-quantum Monte Carlo algorithm that achieves such a speedup for a broad class of non-linear estimation problems, including nested conditional expectations and stochastic optimization. Our algorithm improves upon the direct application of the quantum multilevel Monte Carlo algorithm introduced by An et al. The existing lower bound indicates that our algorithm is optimal up to polylogarithmic factors. A key innovation of our approach is a new sequence of multilevel Monte Carlo approximations specifically designed for quantum computing, which is central to the algorithm’s improved performance.

[LG-64] Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models

链接: https://arxiv.org/abs/2502.05074
作者: Alexander Atanasov,Blake Bordelon,Jacob A. Zavatone-Veth,Courtney Paquette,Cengiz Pehlevan
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We derive a novel deterministic equivalence for the two-point function of a random matrix resolvent. Using this result, we give a unified derivation of the performance of a wide variety of high-dimensional linear models trained with stochastic gradient descent. This includes high-dimensional linear regression, kernel regression, and random feature models. Our results include previously known asymptotics as well as novel ones.

[LG-65] Noise Sensitivity of Hierarchical Functions and Deep Learning Lower Bounds in General Product Measures

链接: https://arxiv.org/abs/2502.05073
作者: Rupert Li,Elchanan Mossel
类目: Probability (math.PR); Computational Complexity (cs.CC); Machine Learning (cs.LG); Combinatorics (math.CO)
备注: 17 pages

点击查看摘要

Abstract:Recent works explore deep learning’s success by examining functions or data with hierarchical structure. Complementarily, research on gradient descent performance for deep nets has shown that noise sensitivity of functions under independent and identically distributed (i.i.d.) Bernoulli inputs establishes learning complexity bounds. This paper aims to bridge these research streams by demonstrating that functions constructed through repeated composition of non-linear functions are noise sensitive under general product measures.

[LG-66] Gradient-based Explanations for Deep Learning Survival Models

链接: https://arxiv.org/abs/2502.04970
作者: Sophie Hanna Langbein,Niklas Koenen,Marvin N. Wright
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning survival models often outperform classical methods in time-to-event predictions, particularly in personalized medicine, but their “black box” nature hinders broader adoption. We propose a framework for gradient-based explanation methods tailored to survival neural networks, extending their use beyond regression and classification. We analyze the implications of their theoretical assumptions for time-dependent explanations in the survival setting and propose effective visualizations incorporating the temporal dimension. Experiments on synthetic data show that gradient-based methods capture the magnitude and direction of local and global feature effects, including time dependencies. We introduce GradSHAP(t), a gradient-based counterpart to SurvSHAP(t), which outperforms SurvSHAP(t) and SurvLIME in a computational speed vs. accuracy trade-off. Finally, we apply these methods to medical data with multi-modal inputs, revealing relevant tabular features and visual patterns, as well as their temporal dynamics.

[LG-67] Towards Smarter Sensing: 2D Clutter Mitigation in RL-Driven Cognitive MIMO Radar

链接: https://arxiv.org/abs/2502.04967
作者: Adam Umra,Aya Mostafa Ahmed,Aydin Sezgin
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
备注: 6 pages, 8 figures. Submitted to EuCNC 2025

点击查看摘要

Abstract:Motivated by the growing interest in integrated sensing and communication for 6th generation (6G) networks, this paper presents a cognitive Multiple-Input Multiple-Output (MIMO) radar system enhanced by reinforcement learning (RL) for robust multitarget detection in dynamic environments. The system employs a planar array configuration and adapts its transmitted waveforms and beamforming patterns to optimize detection performance in the presence of unknown two-dimensional (2D) disturbances. A robust Wald-type detector is integrated with a SARSA-based RL algorithm, enabling the radar to learn and adapt to complex clutter environments modeled by a 2D autoregressive process. Simulation results demonstrate significant improvements in detection probability compared to omnidirectional methods, particularly for low Signal-to-Noise Ratio (SNR) targets masked by clutter.

[LG-68] Does Unsupervised Domain Adaptation Improve the Robustness of Amortized Bayesian Inference? A Systematic Evaluation

链接: https://arxiv.org/abs/2502.04949
作者: Lasse Elsemüller,Valentin Pratz,Mischa von Krause,Andreas Voss,Paul-Christian Bürkner,Stefan T. Radev
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Neural networks are fragile when confronted with data that significantly deviates from their training distribution. This is true in particular for simulation-based inference methods, such as neural amortized Bayesian inference (ABI), where models trained on simulated data are deployed on noisy real-world observations. Recent robust approaches employ unsupervised domain adaptation (UDA) to match the embedding spaces of simulated and observed data. However, the lack of comprehensive evaluations across different domain mismatches raises concerns about the reliability in high-stakes applications. We address this gap by systematically testing UDA approaches across a wide range of misspecification scenarios in both a controlled and a high-dimensional benchmark. We demonstrate that aligning summary spaces between domains effectively mitigates the impact of unmodeled phenomena or noise. However, the same alignment mechanism can lead to failures under prior misspecifications - a critical finding with practical consequences. Our results underscore the need for careful consideration of misspecification types when using UDA techniques to increase the robustness of ABI in practice.

[LG-69] Explainable and externally validated machine learning for neuropsychiatric diagnosis via electrocardiograms

链接: https://arxiv.org/abs/2502.04918
作者: Juan Miguel Lopez Alcaraz,Ebenezer Oloyede,David Taylor,Wilhelm Haverkamp,Nils Strodthoff
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
备注: 9 pages, 2 figures, source code under this https URL

点击查看摘要

Abstract:Electrocardiogram (ECG) analysis has emerged as a promising tool for identifying physiological changes associated with neuropsychiatric conditions. The relationship between cardiovascular health and neuropsychiatric disorders suggests that ECG abnormalities could serve as valuable biomarkers for more efficient detection, therapy monitoring, and risk stratification. However, the potential of the ECG to accurately distinguish neuropsychiatric conditions, particularly among diverse patient populations, remains underexplored. This study utilized ECG markers and basic demographic data to predict neuropsychiatric conditions using machine learning models, with targets defined through ICD-10 codes. Both internal and external validation were performed using the MIMIC-IV and ECG-View datasets respectively. Performance was assessed using AUROC scores. To enhance model interpretability, Shapley values were applied to provide insights into the contributions of individual ECG features to the predictions. Significant predictive performance was observed for conditions within the neurological and psychiatric groups. For the neurological group, Alzheimer’s disease (G30) achieved an internal AUROC of 0.813 (0.812-0.814) and an external AUROC of 0.868 (0.867-0.868). In the psychiatric group, unspecified dementia (F03) showed an internal AUROC of 0.849 (0.848-0.849) and an external AUROC of 0.862 (0.861-0.863). Discriminative features align with known ECG markers but also provide hints on potentially new markers. ECG offers significant promise for diagnosing and monitoring neuropsychiatric conditions, with robust predictive performance across internal and external cohorts. Future work should focus on addressing potential confounders, such as therapy-related cardiotoxicity, and expanding the scope of ECG applications, including personalized care and early intervention strategies.

[LG-70] Scalable and consistent embedding of probability measures into Hilbert spaces via measure quantization

链接: https://arxiv.org/abs/2502.04907
作者: Erell Gachon,Jérémie Bigot,Elsa Cazelles
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper is focused on statistical learning from data that come as probability measures. In this setting, popular approaches consist in embedding such data into a Hilbert space with either Linearized Optimal Transport or Kernel Mean Embedding. However, the cost of computing such embeddings prohibits their direct use in large-scale settings. We study two methods based on measure quantization for approximating input probability measures with discrete measures of small support size. The first one is based on optimal quantization of each input measure, while the second one relies on mean-measure quantization. We study the consistency of such approximations, and their implications for scalable embeddings of probability measures into a Hilbert space at a low computational cost. We finally illustrate our findings with various numerical experiments.

[LG-71] Any-stepsize Gradient Descent for Separable Data under Fenchel–Young Losses

链接: https://arxiv.org/abs/2502.04889
作者: Han Bao,Shinsaku Sakaue,Yuki Takezawa
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Gradient descent (GD) has been one of the most common optimizers in machine learning. In particular, the loss landscape of a neural network is typically sharpened during the initial phase of training, making the training dynamics hover on the edge of stability. This is beyond our standard understanding of GD convergence in the stable regime, where an arbitrarily chosen stepsize is sufficiently smaller than the edge of stability. Recently, Wu et al. (COLT2024) have shown that GD converges with arbitrary stepsize under linearly separable logistic regression. Although their analysis hinges on the self-bounding property of the logistic loss, which seems to be a cornerstone to establish a modified descent lemma, our pilot study shows that other loss functions without the self-bounding property can make GD converge with arbitrary stepsize. To further understand what property of a loss function matters in GD, we aim to show arbitrary-stepsize GD convergence for a general loss function based on the framework of Fenchel–Young losses. We essentially leverage the classical perceptron argument to derive the convergence rate for achieving $\epsilon$-optimal loss, which is possible for a majority of Fenchel–Young losses. Among typical loss functions, the Tsallis entropy achieves the GD convergence rate $T=\Omega(\epsilon^{-1/2})$, and the Rényi entropy achieves the far better rate $T=\Omega(\epsilon^{-1/3})$. We argue that these better rates are possible because of the separation margin of loss functions, instead of the self-bounding property.
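As a rough illustration of the setting the abstract starts from (GD with a large stepsize on linearly separable logistic regression, the case attributed to Wu et al.), the sketch below runs plain GD on a tiny 2-D dataset. The dataset, the stepsize of 50, and all function names are assumptions chosen for this example; it uses the logistic loss only and does not implement the paper's general Fenchel–Young analysis.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def logistic_loss(w, data):
    # data holds z_i = y_i * x_i, so the margin of example i is <w, z_i>.
    return sum(softplus(-(w[0] * z[0] + w[1] * z[1])) for z in data) / len(data)


def gd(data, stepsize, steps):
    """Plain gradient descent from w = 0; returns final weights and the loss history."""
    w = [0.0, 0.0]
    losses = [logistic_loss(w, data)]
    for _ in range(steps):
        g0 = g1 = 0.0
        for z in data:
            s = sigmoid(-(w[0] * z[0] + w[1] * z[1]))  # weight on example z
            g0 -= s * z[0] / len(data)
            g1 -= s * z[1] / len(data)
        w = [w[0] - stepsize * g0, w[1] - stepsize * g1]
        losses.append(logistic_loss(w, data))
    return w, losses


# Hypothetical separable toy set: w = (1, 1.5) gives every z_i a positive margin.
DATA = [(2.0, -1.0), (-0.5, 1.0), (1.0, 0.5)]
w, losses = gd(DATA, stepsize=50.0, steps=200)
print("initial loss:", losses[0])
print("after 1 step:", losses[1])
print("final loss:  ", losses[-1])
```

In this toy run the loss rises on the first step, because the large step overshoots and misclassifies one point, and then falls well below its starting value. That non-monotone pattern is the kind of behaviour outside the classical descent-lemma regime.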

[LG-72] Statistical Collusion by Collectives on Learning Platforms

链接: https://arxiv.org/abs/2502.04879
作者: Etienne Gauthier,Francis Bach,Michael I. Jordan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
备注: Code available at: this https URL

点击查看摘要

Abstract:As platforms increasingly rely on learning algorithms, collectives may form and seek ways to influence these platforms to align with their own interests. This can be achieved by coordinated submission of altered data. To evaluate the potential impact of such behavior, it is essential to understand the computations that collectives must perform to impact platforms in this way. In particular, collectives need to make a priori assessments of the effect of the collective before taking action, as they may face potential risks when modifying their data. Moreover, they need to develop implementable coordination algorithms based on quantities that can be inferred from observed data. We develop a framework that provides a theoretical and algorithmic treatment of these issues and present experimental results in a product evaluation domain.

[LG-73] Advancing Wasserstein Convergence Analysis of Score-Based Models: Insights from Discretization and Second-Order Acceleration

链接: https://arxiv.org/abs/2502.04849
作者: Yifeng Yu,Lu Yu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
备注:

点击查看摘要

Abstract:Score-based diffusion models have emerged as powerful tools in generative modeling, yet their theoretical foundations remain underexplored. In this work, we focus on the Wasserstein convergence analysis of score-based diffusion models. Specifically, we investigate the impact of various discretization schemes, including Euler discretization, exponential integrators, and midpoint randomization methods. Our analysis provides a quantitative comparison of these discrete approximations, emphasizing their influence on convergence behavior. Furthermore, we explore scenarios where Hessian information is available and propose an accelerated sampler based on the local linearization method. We demonstrate that this Hessian-based approach achieves faster convergence rates of order $\widetilde{\mathcal{O}}\left(\frac{1}{\varepsilon}\right)$, significantly improving upon the standard rate $\widetilde{\mathcal{O}}\left(\frac{1}{\varepsilon^2}\right)$ of vanilla diffusion models, where $\varepsilon$ denotes the target accuracy.
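To make the simplest of the discretization schemes mentioned above concrete, here is a minimal sketch of an Euler (Euler–Maruyama) sampler for the reverse-time SDE. It assumes a 1-D Gaussian target, an Ornstein–Uhlenbeck forward process, and the exact score in closed form standing in for a learned network; these are illustrative assumptions, and the paper's exponential-integrator, midpoint, and Hessian-based samplers are not shown.

```python
import math
import random

MU, SIGMA = 2.0, 0.5   # hypothetical 1-D Gaussian "data" distribution
T, N_STEPS = 5.0, 200  # time horizon and number of Euler steps


def marginal(t: float):
    """Mean and variance at time t of the forward OU process dX = -X dt + sqrt(2) dW."""
    decay = math.exp(-t)
    return MU * decay, SIGMA**2 * decay**2 + 1.0 - decay**2


def score(x: float, t: float) -> float:
    """Exact score of the Gaussian marginal at time t (stands in for a learned network)."""
    mean, var = marginal(t)
    return -(x - mean) / var


def sample(rng: random.Random) -> float:
    """One Euler-discretized trajectory of the reverse SDE dY = (Y + 2 * score) dtau + sqrt(2) dW."""
    h = T / N_STEPS
    y = rng.gauss(0.0, 1.0)  # start from the stationary N(0, 1) prior
    for k in range(N_STEPS):
        t = T - k * h        # forward time matching the current reverse step
        drift = y + 2.0 * score(y, t)
        y += h * drift + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)
    return y


rng = random.Random(0)
draws = [sample(rng) for _ in range(2000)]
mean = sum(draws) / len(draws)
std = math.sqrt(sum((d - mean) ** 2 for d in draws) / len(draws))
print(f"sample mean {mean:.3f} (target {MU}), sample std {std:.3f} (target {SIGMA})")
```

Changing `N_STEPS` changes the step size `h`, which is the quantity that convergence analyses of this kind relate to the target accuracy.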

[LG-74] Coherent Local Explanations for Mathematical Optimization

链接: https://arxiv.org/abs/2502.04840
作者: Daan Otto,Jannis Kurtz,S. Ilker Birbil
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The surge of explainable artificial intelligence methods seeks to enhance transparency and explainability in machine learning models. At the same time, there is a growing demand for explaining decisions taken through complex algorithms used in mathematical optimization. However, current explanation methods do not take into account the structure of the underlying optimization problem, leading to unreliable outcomes. In response to this need, we introduce Coherent Local Explanations for Mathematical Optimization (CLEMO). CLEMO provides explanations for multiple components of optimization models, the objective value and decision variables, which are coherent with the underlying model structure. Our sampling-based procedure can provide explanations for the behavior of exact and heuristic solution algorithms. The effectiveness of CLEMO is illustrated by experiments for the shortest path problem, the knapsack problem, and the vehicle routing problem.

[LG-75] Robust Conformal Outlier Detection under Contaminated Reference Data

链接: https://arxiv.org/abs/2502.04807
作者: Meshi Bashari,Matteo Sesia,Yaniv Romano
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Conformal prediction is a flexible framework for calibrating machine learning predictions, providing distribution-free statistical guarantees. In outlier detection, this calibration relies on a reference set of labeled inlier data to control the type-I error rate. However, obtaining a perfectly labeled inlier reference set is often unrealistic, and a more practical scenario involves access to a contaminated reference set containing a small fraction of outliers. This paper analyzes the impact of such contamination on the validity of conformal methods. We prove that under realistic, non-adversarial settings, calibration on contaminated data yields conservative type-I error control, shedding light on the inherent robustness of conformal methods. This conservativeness, however, typically results in a loss of power. To alleviate this limitation, we propose a novel, active data-cleaning framework that leverages a limited labeling budget and an outlier detection model to selectively annotate data points in the contaminated reference set that are suspected as outliers. By removing only the annotated outliers in this “suspicious” subset, we can effectively enhance power while mitigating the risk of inflating the type-I error rate, as supported by our theoretical analysis. Experiments on real datasets validate the conservative behavior of conformal methods under contamination and show that the proposed data-cleaning strategy improves power without sacrificing validity.
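The conservativeness described above can be seen in a small example using the standard conformal p-value, which ranks a test score among reference scores. The scores below are hypothetical, and larger scores are assumed to mean "more outlying". This is only an illustration of the effect, not the paper's data-cleaning procedure.

```python
def conformal_p_value(test_score: float, reference_scores: list[float]) -> float:
    """Conformal p-value: rank of the test score among reference (assumed inlier) scores."""
    n_geq = sum(1 for s in reference_scores if s >= test_score)
    return (1 + n_geq) / (1 + len(reference_scores))


# Hypothetical nonconformity scores from some outlier detection model.
clean_reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
contaminated_reference = clean_reference + [5.0]  # one unlabeled outlier slipped in

test_score = 0.85
p_clean = conformal_p_value(test_score, clean_reference)
p_contaminated = conformal_p_value(test_score, contaminated_reference)
print(p_clean, p_contaminated)
```

The outlier in the reference set has a high score, so it inflates the p-value of the test point. Larger p-values mean fewer rejections, which keeps the type-I error controlled but costs power.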

[LG-76] A Regularized Newton Method for Nonconvex Optimization with Global and Local Complexity Guarantees

链接: https://arxiv.org/abs/2502.04799
作者: Yuhao Zhou,Jintao Xu,Chenglong Bao,Chao Ding,Jun Zhu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We consider the problem of finding an $\epsilon$-stationary point of a nonconvex function with a Lipschitz continuous Hessian and propose a quadratic regularized Newton method incorporating a new class of regularizers constructed from the current and previous gradients. The method leverages a recently developed linear conjugate gradient approach with a negative curvature monitor to solve the regularized Newton equation. Notably, our algorithm is adaptive, requiring no prior knowledge of the Lipschitz constant of the Hessian, and achieves a global complexity of $O(\epsilon^{-\frac{3}{2}}) + \tilde{O}(1)$ in terms of the second-order oracle calls, and $\tilde{O}(\epsilon^{-\frac{7}{4}})$ for Hessian-vector products, respectively. Moreover, when the iterates converge to a point where the Hessian is positive definite, the method exhibits quadratic local convergence. Preliminary numerical results illustrate the competitiveness of our algorithm.

[LG-77] t-Testing the Waters: Empirically Validating Assumptions for Reliable A/B-Testing

链接: https://arxiv.org/abs/2502.04793
作者: Olivier Jeunen
类目: Methodology (stat.ME); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A/B-tests are a cornerstone of experimental design on the web, with wide-ranging applications and use-cases. The statistical $t$-test comparing differences in means is the most commonly used method for assessing treatment effects, often justified through the Central Limit Theorem (CLT). The CLT guarantees that, as the sample size grows, the sampling distribution of the Average Treatment Effect converges to normality, making the $t$-test valid for sufficiently large sample sizes. When outcome measures are skewed or non-normal, quantifying what “sufficiently large” entails is not straightforward. To ensure that confidence intervals maintain proper coverage and that $p$-values accurately reflect the false positive rate, it is critical to validate this normality assumption. We propose a practical method to test this, by analysing repeatedly resampled A/A-tests. When the normality assumption holds, the resulting $p$-value distribution should be uniform, and this property can be tested using the Kolmogorov-Smirnov test. This provides an efficient and effective way to empirically assess whether the $t$-test’s assumptions are met, and the A/B-test is valid. We demonstrate our methodology and highlight how it helps to identify scenarios prone to inflated Type-I errors. Our approach provides a practical framework to ensure and improve the reliability and robustness of A/B-testing practices.
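The procedure outlined in the abstract (resample A/A-tests, collect $p$-values, test them for uniformity with Kolmogorov-Smirnov) can be sketched with the standard library alone. Several choices here are assumptions made for the example and may differ from the paper: A/A-tests are formed by random half-splits of one population, the $t$-distribution is replaced by its normal approximation (reasonable for large arms), the KS p-value uses the asymptotic Kolmogorov series with a common small-sample correction, and the log-normal outcomes are synthetic.

```python
import math
import random


def welch_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation to the t-distribution)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(z) / math.sqrt(2.0))


def ks_uniform(p_values):
    """Kolmogorov-Smirnov statistic against Uniform(0, 1) and its asymptotic p-value."""
    xs = sorted(p_values)
    n = len(xs)
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    tail = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101))
    return d, min(max(tail, 0.0), 1.0)


def aa_p_values(outcomes, n_resamples, rng):
    """Randomly split the same population in two, many times, and test each split."""
    pvals = []
    for _ in range(n_resamples):
        shuffled = outcomes[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        pvals.append(welch_p_value(shuffled[:half], shuffled[half:]))
    return pvals


rng = random.Random(0)
outcomes = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]  # skewed synthetic metric
pvals = aa_p_values(outcomes, n_resamples=500, rng=rng)
d, ks_p = ks_uniform(pvals)
print(f"KS statistic {d:.3f}, KS p-value {ks_p:.3f}")
```

A small KS p-value would suggest the A/A $p$-values are not uniform, meaning the normality assumption behind the test is doubtful at this sample size. A large one gives no evidence against it.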

[LG-78] Efficient Evaluation of Quantization-Effects in Neural Codecs

Link: https://arxiv.org/abs/2502.04770
Authors: Wolfgang Mack,Ahmed Mustafa,Rafał Łaganowski,Samer Hijazy
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)

Abstract:Neural codecs, comprising an encoder, quantizer, and decoder, enable signal transmission at exceptionally low bitrates. Training these systems requires techniques like the straight-through estimator, soft-to-hard annealing, or statistical quantizer emulation to allow a non-zero gradient across the quantizer. Evaluating the effect of quantization in neural codecs, like the influence of gradient passing techniques on the whole system, is often costly and time-consuming due to training demands and the lack of affordable and reliable metrics. This paper proposes an efficient evaluation framework for neural codecs using simulated data with a defined number of bits and low-complexity neural encoders/decoders to emulate the non-linear behavior in larger networks. Our system is highly efficient in terms of training time and computational and hardware requirements, allowing us to uncover distinct behaviors in neural codecs. We propose a modification to stabilize training with the straight-through estimator based on our findings. We validate our findings against an internal neural audio codec and against the state-of-the-art descript-audio-codec.
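
For readers unfamiliar with the straight-through estimator (STE) mentioned in this abstract, the toy sketch below shows the basic idea on a scalar problem: the forward pass uses hard rounding, while the backward pass pretends the quantizer is the identity. It illustrates the plain STE only; it is not the stabilizing modification the paper proposes, and the function names and data are hypothetical.

```python
def quantize(u):
    """Hard quantizer: round to the nearest integer (zero gradient almost everywhere)."""
    return float(round(u))


def train_with_ste(xs, ys, w=0.3, lr=0.01, steps=200):
    """Fit a scalar gain w so that quantize(w * x) matches y, using the STE."""
    for _ in range(steps):
        grad = 0.0
        for x, y in zip(xs, ys):
            q = quantize(w * x)  # forward pass uses the real quantizer
            # Backward pass: the STE replaces d quantize(u)/du by 1, so the
            # gradient of (q - y)^2 with respect to w is taken as 2 * (q - y) * x.
            grad += 2.0 * (q - y) * x
        w -= lr * grad / len(xs)
    return w


# Hypothetical toy data generated by a gain of 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = train_with_ste(xs, ys)
print(w, [quantize(w * x) for x in xs])
```

Without the STE the true gradient would be zero almost everywhere and `w` would never move from its initial value.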

[LG-79] Differential Privacy of Quantum and Quantum-Inspired-Classical Recommendation Algorithms

Link: https://arxiv.org/abs/2502.04758
Authors: Chenjian Li,Mingsheng Ying
Categories: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 17 pages, 6 figures in total (including appendix)

Abstract:We analyze the DP (differential privacy) properties of the quantum recommendation algorithm and the quantum-inspired-classical recommendation algorithm. We discover that the quantum recommendation algorithm is a privacy curating mechanism on its own, requiring no external noise, which is different from traditional differential privacy mechanisms. In our analysis, a novel perturbation method tailored for SVD (singular value decomposition) and low-rank matrix approximation problems is introduced. Using the perturbation method and random matrix theory, we are able to derive that both the quantum and quantum-inspired-classical algorithms are $\big(\tilde{\mathcal{O}}\big(\frac{1}{n}\big),\, \tilde{\mathcal{O}}\big(\frac{1}{\min\{m,n\}}\big)\big)$-DP under some reasonable restrictions, where $m$ and $n$ are numbers of users and products in the input preference database respectively. Nevertheless, a comparison shows that the quantum algorithm has better privacy preserving potential than the classical one.

[LG-80] Tighter sparse variational Gaussian processes

Link: https://arxiv.org/abs/2502.04750
Authors: Thang D. Bui,Matthew Ashman,Richard E. Turner
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Sparse variational Gaussian process (GP) approximations based on inducing points have become the de facto standard for scaling GPs to large datasets, owing to their theoretical elegance, computational efficiency, and ease of implementation. This paper introduces a provably tighter variational approximation by relaxing the standard assumption that the conditional approximate posterior given the inducing points must match that in the prior. The key innovation is to modify the conditional posterior to have smaller variances than that of the prior at the training points. We derive the collapsed bound for the regression case, describe how to use the proposed approximation in large data settings, and discuss its application to handle orthogonally structured inducing points and GP latent variable models. Extensive experiments on regression benchmarks, classification, and latent variable models demonstrate that the proposed approximation consistently matches or outperforms standard sparse variational GPs while maintaining the same computational cost. An implementation will be made available in all popular GP packages.

[LG-81] PhyloVAE: Unsupervised Learning of Phylogenetic Trees via Variational Autoencoders ICLR2025

Link: https://arxiv.org/abs/2502.04730
Authors: Tianyu Xie,Harry Richman,Jiansi Gao,Frederick A. Matsen IV,Cheng Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
Comments: ICLR 2025. 22 pages, 14 figures

Abstract:Learning informative representations of phylogenetic tree structures is essential for analyzing evolutionary relationships. Classical distance-based methods have been widely used to project phylogenetic trees into Euclidean space, but they are often sensitive to the choice of distance metric and may lack sufficient resolution. In this paper, we introduce phylogenetic variational autoencoders (PhyloVAEs), an unsupervised learning framework designed for representation learning and generative modeling of tree topologies. Leveraging an efficient encoding mechanism inspired by autoregressive tree topology generation, we develop a deep latent-variable generative model that facilitates fast, parallelized topology generation. PhyloVAE combines this generative model with a collaborative inference model based on learnable topological features, allowing for high-resolution representations of phylogenetic tree samples. Extensive experiments demonstrate PhyloVAE’s robust representation learning capabilities and fast generation of phylogenetic tree topologies.

[LG-82] A Meta-learner for Heterogeneous Effects in Difference-in-Differences

Link: https://arxiv.org/abs/2502.04699
Authors: Hui Lan,Haoge Chang,Eleanor Dillon,Vasilis Syrgkanis
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We address the problem of estimating heterogeneous treatment effects in panel data, adopting the popular Difference-in-Differences (DiD) framework under the conditional parallel trends assumption. We propose a novel doubly robust meta-learner for the Conditional Average Treatment Effect on the Treated (CATT), reducing the estimation to a convex risk minimization problem involving a set of auxiliary models. Our framework allows for the flexible estimation of the CATT, when conditioning on any subset of variables of interest using generic machine learning. Leveraging Neyman orthogonality, our proposed approach is robust to estimation errors in the auxiliary models. As a generalization to our main result, we develop a meta-learning approach for the estimation of general conditional functionals under covariate shift. We also provide an extension to the instrumented DiD setting with non-compliance. Empirical results demonstrate the superiority of our approach over existing baselines.

[LG-83] Optimistic Algorithms for Adaptive Estimation of the Average Treatment Effect

Link: https://arxiv.org/abs/2502.04673
Authors: Ojash Neopane,Aaditya Ramdas,Aarti Singh
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 15 pages, 2 Figures

Abstract:Estimation and inference for the Average Treatment Effect (ATE) is a cornerstone of causal inference and often serves as the foundation for developing procedures for more complicated settings. Although traditionally analyzed in a batch setting, recent advances in martingale theory have paved the way for adaptive methods that can enhance the power of downstream inference. Despite these advances, progress in understanding and developing adaptive algorithms remains in its early stages. Existing work either focuses on asymptotic analyses that overlook exploration-exploitation tradeoffs relevant in finite-sample regimes or relies on simpler but suboptimal estimators. In this work, we address these limitations by studying adaptive sampling procedures that take advantage of the asymptotically optimal Augmented Inverse Probability Weighting (AIPW) estimator. Our analysis uncovers challenges obscured by asymptotic approaches and introduces a novel algorithmic design principle reminiscent of optimism in multiarmed bandits. This principled approach enables our algorithm to achieve significant theoretical and empirical gains compared to prior methods. Our findings mark a step forward in advancing adaptive causal inference methods in theory and practice.
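
As background for the AIPW estimator named in this abstract, here is a minimal sketch of the standard (batch) estimator. It does not implement the paper's adaptive sampling procedure; the record layout and the toy numbers are hypothetical, and the propensities and outcome-model predictions are assumed to be supplied by the caller.

```python
def aipw_ate(records):
    """AIPW estimate of the Average Treatment Effect.

    Each record is (a, y, pi, mu1, mu0): treatment indicator (0 or 1),
    observed outcome, propensity P(A=1 | x), and outcome-model predictions
    under treatment and under control.
    """
    total = 0.0
    for a, y, pi, mu1, mu0 in records:
        total += (mu1 - mu0
                  + a * (y - mu1) / pi
                  - (1 - a) * (y - mu0) / (1.0 - pi))
    return total / len(records)


# Toy data with perfect outcome models: the correction terms vanish.
perfect = [(1, 3.0, 0.5, 3.0, 1.0), (0, 1.0, 0.5, 3.0, 1.0)]
# Toy data with trivial outcome models: AIPW reduces to inverse probability weighting.
ipw_only = [(1, 4.0, 0.5, 0.0, 0.0), (0, 2.0, 0.5, 0.0, 0.0)]
print(aipw_ate(perfect), aipw_ate(ipw_only))
```

The estimator stays consistent if either the propensities or the outcome models are correct, which is the double robustness that makes it attractive as a building block.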

[LG-84] Machine-Learning Interatomic Potentials for Long-Range Systems

Link: https://arxiv.org/abs/2502.04668
Authors: Yajie Ji,Jiuyang Liang,Zhenli Xu
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments: 7 pages, 4 figures

Abstract:Machine-learning interatomic potentials have emerged as a revolutionary class of force-field models in molecular simulations, delivering quantum-mechanical accuracy at a fraction of the computational cost and enabling the simulation of large-scale systems over extended timescales. However, they often focus on modeling local environments, neglecting crucial long-range interactions. We propose a Sum-of-Gaussians Neural Network (SOG-Net), a lightweight and versatile framework for integrating long-range interactions into machine learning force fields. The SOG-Net employs a latent-variable learning network that seamlessly bridges short-range and long-range components, coupled with an efficient Fourier convolution layer that incorporates long-range effects. By learning sum-of-Gaussian multipliers across different convolution layers, the SOG-Net adaptively captures diverse long-range decay behaviors while maintaining close-to-linear computational complexity during training and simulation via non-uniform fast Fourier transforms. The method is shown to be effective for a broad range of long-range systems.

[LG-85] Complexity Analysis of Normalizing Constant Estimation: from Jarzynski Equality to Annealed Importance Sampling and beyond

Link: https://arxiv.org/abs/2502.04575
Authors: Wei Guo,Molei Tao,Yongxin Chen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Computation (stat.CO)

Abstract:Given an unnormalized probability density $\pi\propto\mathrm{e}^{-V}$, estimating its normalizing constant $Z=\int_{\mathbb{R}^d}\mathrm{e}^{-V(x)}\,\mathrm{d}x$ or free energy $F=-\log Z$ is a crucial problem in Bayesian statistics, statistical mechanics, and machine learning. It is challenging especially in high dimensions or when $\pi$ is multimodal. To mitigate the high variance of conventional importance sampling estimators, annealing-based methods such as Jarzynski equality and annealed importance sampling are commonly adopted, yet their quantitative complexity guarantees remain largely unexplored. We take a first step toward a non-asymptotic analysis of annealed importance sampling. In particular, we derive an oracle complexity of $\widetilde{O}\left(\frac{d\beta^2\mathcal{A}^2}{\varepsilon^4}\right)$ for estimating $Z$ within $\varepsilon$ relative error with high probability, where $\beta$ is the smoothness of $V$ and $\mathcal{A}$ denotes the action of a curve of probability measures interpolating $\pi$ and a tractable reference distribution. Our analysis, leveraging Girsanov theorem and optimal transport, does not explicitly require isoperimetric assumptions on the target distribution. Finally, to tackle the large action of the widely used geometric interpolation of probability distributions, we propose a new normalizing constant estimation algorithm based on reverse diffusion samplers and establish a framework for analyzing its complexity.
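
To make the object of this analysis concrete, the following is a toy one-dimensional sketch of annealed importance sampling with a geometric interpolation between a standard normal reference and an unnormalized Gaussian target. The target, the temperature schedule, the random-walk Metropolis kernel, and all parameter values are assumptions chosen for illustration; the paper's reverse-diffusion algorithm is not reproduced here.

```python
import math
import random


def log_target(x, mean=1.0, sigma=0.5):
    """log of the unnormalized target exp(-V(x)), with V(x) = (x - mean)^2 / (2 sigma^2)."""
    return -((x - mean) ** 2) / (2.0 * sigma ** 2)


def log_reference(x):
    """log density of the normalized N(0, 1) reference distribution."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def log_annealed(x, beta):
    """Geometric interpolation: reference at beta = 0, target at beta = 1."""
    return (1.0 - beta) * log_reference(x) + beta * log_target(x)


def ais_estimate(n_particles=1000, n_temps=50, mcmc_steps=2, step_size=0.5, seed=0):
    """Estimate Z = integral of exp(-V) by annealed importance sampling."""
    rng = random.Random(seed)
    betas = [k / n_temps for k in range(n_temps + 1)]
    log_weights = []
    for _particle in range(n_particles):
        x = rng.gauss(0.0, 1.0)  # exact sample from the reference
        log_w = 0.0
        for k in range(1, n_temps + 1):
            # Incremental weight: ratio of consecutive annealed densities at x.
            log_w += (betas[k] - betas[k - 1]) * (log_target(x) - log_reference(x))
            # Random-walk Metropolis moves that leave the k-th density invariant.
            for _move in range(mcmc_steps):
                proposal = x + rng.gauss(0.0, step_size)
                log_accept = log_annealed(proposal, betas[k]) - log_annealed(x, betas[k])
                if math.log(1.0 - rng.random()) < log_accept:
                    x = proposal
        log_weights.append(log_w)
    top = max(log_weights)  # subtract the maximum for numerical stability
    return math.exp(top) * sum(math.exp(lw - top) for lw in log_weights) / n_particles


z_true = 0.5 * math.sqrt(2.0 * math.pi)  # closed form for this Gaussian target
z_hat = ais_estimate()
print(f"AIS estimate = {z_hat:.4f}, exact Z = {z_true:.4f}")
```

Because the reference is normalized, the average importance weight estimates $Z$ directly; the number of temperatures and MCMC moves governs the variance, which is the kind of cost the abstract's oracle-complexity bound quantifies.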

[LG-86] Sparsity-Based Interpolation of External, Internal and Swap Regret

Link: https://arxiv.org/abs/2502.04543
Authors: Zhou Lu,Y. Jennifer Sun,Zhiyu Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Equal contribution, alphabetical order

Abstract:Focusing on the expert problem in online learning, this paper studies the interpolation of several performance metrics via $\phi$-regret minimization, which measures the performance of an algorithm by its regret with respect to an arbitrary action modification rule $\phi$. With $d$ experts and $T\gg d$ rounds in total, we present a single algorithm achieving the instance-adaptive $\phi$-regret bound \begin{equation*} \tilde O\left(\min\left\{\sqrt{d-d^{\mathrm{unif}}_\phi+1},\sqrt{d-d^{\mathrm{self}}_\phi}\right\}\cdot\sqrt{T}\right), \end{equation*} where $d^{\mathrm{unif}}_\phi$ is the maximum amount of experts modified identically by $\phi$, and $d^{\mathrm{self}}_\phi$ is the amount of experts that $\phi$ trivially modifies to themselves. By recovering the optimal $O(\sqrt{T\log d})$ external regret bound when $d^{\mathrm{unif}}_\phi=d$, the standard $\tilde O(\sqrt{T})$ internal regret bound when $d^{\mathrm{self}}_\phi=d-1$ and the optimal $\tilde O(\sqrt{dT})$ swap regret bound in the worst case, we improve existing results in the intermediate regimes. In addition, the same algorithm achieves the optimal quantile regret bound, which corresponds to even easier settings of $\phi$ than the external regret. Building on the classical reduction from $\phi$-regret minimization to external regret minimization on stochastic matrices, our main idea is to further convert the latter to online linear regression using Haar-wavelet-inspired matrix features. Then, we apply a particular $L_1$-version of comparator-adaptive online learning algorithms to exploit the sparsity in this regression subroutine.
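
The leading term of the bound in this abstract is easy to evaluate for a concrete modification rule. The sketch below is a reading of the definitions in the abstract, not code from the paper: it represents $\phi$ as a list mapping each expert index to its replacement, takes $d^{\mathrm{unif}}_\phi$ to be the size of the largest group of experts sent to the same replacement (an interpretation made here), and ignores constants and logarithmic factors.

```python
import math
from collections import Counter


def phi_regret_rate(phi, horizon):
    """Return (d_unif, d_self, leading term of the phi-regret bound).

    phi[i] is the expert that rule phi substitutes for expert i.
    Constants and logarithmic factors hidden by the tilde-O are ignored.
    """
    d = len(phi)
    d_unif = max(Counter(phi).values())  # largest group mapped identically
    d_self = sum(1 for i, target in enumerate(phi) if target == i)  # fixed points
    factor = min(math.sqrt(d - d_unif + 1), math.sqrt(d - d_self))
    return d_unif, d_self, factor * math.sqrt(horizon)


external = phi_regret_rate([0, 0, 0, 0], horizon=100)  # constant rule
internal = phi_regret_rate([1, 1, 2, 3], horizon=100)  # only expert 0 is changed
swap = phi_regret_rate([1, 2, 3, 0], horizon=100)      # every expert is changed
print(external, internal, swap)
```

For the constant rule and the single-swap rule the factor in front of $\sqrt{T}$ is 1, while the cyclic rule pays the full $\sqrt{d}$, matching the external, internal, and swap regimes listed in the abstract.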

[LG-87] GenVC: Self-Supervised Zero-Shot Voice Conversion

Link: https://arxiv.org/abs/2502.04519
Authors: Zexin Cai,Henry Li Xinyuan,Ashi Garg,Leibny Paola García-Perera,Kevin Duh,Sanjeev Khudanpur,Matthew Wiesner,Nicholas Andrews
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)

Abstract:Zero-shot voice conversion has recently made substantial progress, but many models still depend on external supervised systems to disentangle speaker identity and linguistic content. Furthermore, current methods often use parallel conversion, where the converted speech inherits the source utterance’s temporal structure, restricting speaker similarity and privacy. To overcome these limitations, we introduce GenVC, a generative zero-shot voice conversion model. GenVC learns to disentangle linguistic content and speaker style in a self-supervised manner, eliminating the need for external models and enabling efficient training on large, unlabeled datasets. Experimental results show that GenVC achieves state-of-the-art speaker similarity while maintaining naturalness competitive with leading approaches. Its autoregressive generation also allows the converted speech to deviate from the source utterance’s temporal structure. This feature makes GenVC highly effective for voice anonymization, as it minimizes the preservation of source prosody and speaker characteristics, enhancing privacy protection.

[LG-88] Analysis of Diffusion Models for Manifold Data

Link: https://arxiv.org/abs/2502.04339
Authors: Anand Jerry George,Rodrigo Veiga,Nicolas Macris
Categories: Statistics Theory (math.ST); Disordered Systems and Neural Networks (cond-mat.dis-nn); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR)

Abstract:We analyze the time reversed dynamics of generative diffusion models. If the exact empirical score function is used in a regime of large dimension and exponentially large number of samples, these models are known to undergo transitions between distinct dynamical regimes. We extend this analysis and compute the transitions for an analytically tractable manifold model where the statistical model for the data is a mixture of lower dimensional Gaussians embedded in higher dimensional space. We compute the so-called speciation and collapse transition times, as a function of the ratio of manifold-to-ambient space dimensions, and other characteristics of the data model. An important tool used in our analysis is the exact formula for the mutual information (or free energy) of Generalized Linear Models.

[LG-89] High-Dimensional Bayesian Optimization Using Both Random and Supervised Embeddings

Link: https://arxiv.org/abs/2502.00854
Authors: Rémy Priem,Youssef Diouane,Nathalie Bartoli,Sylvain Dubreuil,Paul Saves
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Bayesian optimization (BO) is one of the most powerful strategies to solve computationally expensive-to-evaluate blackbox optimization problems. However, BO methods are conventionally used for optimization problems of small dimension because of the curse of dimensionality. In this paper, a high-dimensional optimization method incorporating linear embedding subspaces of small dimension is proposed to efficiently perform the optimization. An adaptive learning strategy for these linear embeddings is carried out in conjunction with the optimization. The resulting BO method, named efficient global optimization coupled with random and supervised embedding (EGORSE), combines in an adaptive way both random and supervised linear embeddings. EGORSE has been compared to state-of-the-art algorithms and tested on academic examples with a number of design variables ranging from 10 to 600. The obtained results show the high potential of EGORSE to solve high-dimensional blackbox optimization problems, in terms of both CPU time and the limited number of calls to the expensive blackbox simulation.

Information Retrieval

[IR-0] Enhancing Health Information Retrieval with RAG by Prioritizing Topical Relevance and Factual Accuracy

Link: https://arxiv.org/abs/2502.04666
Authors: Rishabh Uapadhyay,Marco Viviani
Categories: Information Retrieval (cs.IR)

Abstract:The exponential surge in online health information, coupled with its increasing use by non-experts, highlights the pressing need for advanced Health Information Retrieval models that consider not only topical relevance but also the factual accuracy of the retrieved information, given the potential risks associated with health misinformation. To this aim, this paper introduces a solution driven by Retrieval-Augmented Generation (RAG), which leverages the capabilities of generative Large Language Models (LLMs) to enhance the retrieval of health-related documents grounded in scientific evidence. In particular, we propose a three-stage model: in the first stage, the user’s query is employed to retrieve topically relevant passages with associated references from a knowledge base constituted by scientific literature. In the second stage, these passages, alongside the initial query, are processed by LLMs to generate a contextually relevant rich text (GenText). In the last stage, the documents to be retrieved are evaluated and ranked both from the point of view of topical relevance and factual accuracy by means of their comparison with GenText, either through stance detection or semantic similarity. In addition to calculating factual accuracy, GenText can offer a layer of explainability for it, aiding users in understanding the reasoning behind the retrieval. Experimental evaluation of our model on benchmark datasets and against baseline models demonstrates its effectiveness in enhancing the retrieval of both topically relevant and factually accurate health information, thus presenting a significant step forward in the health misinformation mitigation problem.

Attachment Download

Click to download today's full paper list