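arXiv Daily Papers | 2025-02-05

This post lists the latest papers retrieved from arXiv.org on 2025-02-05. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each day.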

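[Quick Read]: This paper addresses the limitations of sign language translation (SLT) systems in covering gesture variability and supporting long-sequence translation. The key contribution is a Transformer-based architecture that encodes spatio-temporal motion gestures with multiple convolutional and attention mechanisms, preserving both local and long-range spatial information. The approach outperforms baselines on the Colombian Sign Language Translation Dataset (CoL-SLTD) with a BLEU4 of 46.84%, and reaches a BLEU4 of 30.77% on RWTH-PHOENIX-Weather-2014T, which the authors present as evidence of its robustness and effectiveness.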

Link: https://arxiv.org/abs/2502.02587
Authors: Christian Ruiz, Fabio Martinez
Affiliations: Unknown
Categories: Computation and Language (cs.CL)

Abstract:Sign Language Translation (SLT) systems support communication for hearing-impaired people by finding equivalences between signed and spoken languages. This task is, however, challenging due to multiple sign variations, language complexity, and the inherent richness of expressions. Computational approaches have shown the capability to support SLT. Nonetheless, these approaches remain limited in covering gesture variability and supporting long-sequence translation. This paper introduces a Transformer-based architecture that encodes spatio-temporal motion gestures, preserving both local and long-range spatial information through the use of multiple convolutional and attention mechanisms. The proposed approach was validated on the Colombian Sign Language Translation Dataset (CoL-SLTD), outperforming baseline approaches and achieving a BLEU4 of 46.84%. Additionally, the proposed approach was validated on the RWTH-PHOENIX-Weather-2014T (PHOENIX14T), achieving a BLEU4 score of 30.77%, demonstrating its robustness and effectiveness in handling real-world variations.

[NLP-1] A comparison of translation performance between DeepL and Supertext

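[Quick Read]: This paper addresses how machine translation (MT) systems should be evaluated now that strong systems are increasingly built on large language models (LLMs). Segment-level evaluation does not capture how well a system uses extended context, especially on longer texts. The key step is document-level evaluation: professional translators assess segments with full document-level context. While segment-level assessments show no strong preference between the systems in most cases, document-level analysis prefers Supertext over DeepL in three of four language directions, suggesting that context-sensitive evaluation better reflects real-world translation quality.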

Link: https://arxiv.org/abs/2502.02577
Authors: Alex Flückiger, Chantal Amrhein, Tim Graf, Philippe Schläpfer, Florian Schottmann, Samuel Läubli
Affiliations: Unknown
Categories: Computation and Language (cs.CL)

Abstract:As strong machine translation (MT) systems are increasingly based on large language models (LLMs), reliable quality benchmarking requires methods that capture their ability to leverage extended context. This study compares two commercial MT systems – DeepL and Supertext – by assessing their performance on unsegmented texts. We evaluate translation quality across four language directions with professional translators assessing segments with full document-level context. While segment-level assessments indicate no strong preference between the systems in most cases, document-level analysis reveals a preference for Supertext in three out of four language directions, suggesting superior consistency across longer texts. We advocate for more context-sensitive evaluation methodologies to ensure that MT quality assessments reflect real-world usability. We release all evaluation data and scripts for further analysis and reproduction at this https URL.

[NLP-2] Are Language Models Up to Sequential Optimization Problems? From Evaluation to a Hegelian-Inspired Enhancement

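[Quick Read]: This paper addresses the sharp drop in the performance of large language models (LLMs) on complex Sequential Optimization Problems (SOPs). The key solution is ACE, a framework inspired by Hegelian Dialectics that revisits philosophical hypotheses on reasoning to improve LLM performance on SOPs without any retraining or further fine-tuning.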

Link: https://arxiv.org/abs/2502.02573
Authors: Soheil Abbasloo
Affiliations: Microsoft Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities across numerous fields, presenting an opportunity to revolutionize optimization problem-solving, a crucial, ubiquitous, and complex domain. This paper explores the proficiency of LLMs in handling Sequential Optimization Problems (SOPs). We introduce WorldGen, a dynamic framework for generating unseen SOPs with controllable complexities, to evaluate LLM performance. Our initial observations reveal that while LLMs perform well on simple SOPs, their performance significantly degrades with increased complexity. Motivated by this, we revisit philosophical hypotheses on reasoning to enhance LLM performance. Inspired by the influential framework of Hegelian Dialectics, we propose ACE, demonstrating how the performance of LLMs in SOP contexts can be significantly improved without any retraining or further fine-tuning.

[NLP-3] Adaptive Self-improvement LLM Agentic System for ML Library Development

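[Quick Read]: This paper addresses the challenge of using large language models (LLMs) to generate machine learning (ML) libraries written in architecture-specific programming languages (ASPLs). The difficulty comes from the complexity of the task itself and from the scarcity of ASPL code examples, so LLMs must carry out complex reasoning with limited data. The key solution is an adaptive self-improvement agentic system, which the abstract reports as improving results by up to 3.9× over a single-LLM baseline.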

Link: https://arxiv.org/abs/2502.02534
Authors: Genghan Zhang, Weixin Liang, Olivia Hsu, Kunle Olukotun
Affiliations: Unknown
Categories: Computation and Language (cs.CL)

Abstract:ML libraries, often written in architecture-specific programming languages (ASPLs) that target domain-specific architectures, are key to efficient ML systems. However, writing these high-performance ML libraries is challenging because it requires expert knowledge of ML algorithms and the ASPL. Large language models (LLMs), on the other hand, have shown general coding capabilities. However, challenges remain when using LLMs for generating ML libraries using ASPLs because 1) this task is complicated even for experienced human programmers and 2) there are limited code examples because of the esoteric and evolving nature of ASPLs. Therefore, LLMs need complex reasoning with limited data in order to complete this task. To address these challenges, we introduce an adaptive self-improvement agentic system. In order to evaluate the effectiveness of our system, we construct a benchmark of a typical ML library and generate ASPL code with both open and closed-source LLMs on this benchmark. Our results show improvements of up to 3.9× over a baseline single LLM.

[NLP-4] Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies

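[Quick Read]: This paper addresses the complexity of designing multi-agent systems (MAS), in particular the automated optimization of prompts and topologies. The key solution is Multi-Agent System Search (MASS), a framework that explores the MAS design space in three interleaved stages: block-level (local) prompt optimization, workflow topology optimization, and workflow-level (global) prompt optimization, with each stage conditioned on the prompts and topologies optimized in earlier stages. MASS-optimized systems outperform a range of existing alternatives by a substantial margin.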

Link: https://arxiv.org/abs/2502.02533
Authors: Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vulić, Anna Korhonen, Sercan Ö. Arık
Affiliations: Google; University of Cambridge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 11 pages, 7 figures, 1 table (30 pages, 9 figures, 5 tables including references and appendices)

Abstract:Large language models, employed as multiple agents that interact and collaborate with each other, have excelled at solving complex tasks. The agents are programmed with prompts that declare their functionality, along with the topologies that orchestrate interactions across agents. Designing prompts and topologies for multi-agent systems (MAS) is inherently complex. To automate the entire design process, we first conduct an in-depth analysis of the design space aiming to understand the factors behind building effective MAS. We reveal that prompts together with topologies play critical roles in enabling more effective MAS design. Based on the insights, we propose Multi-Agent System Search (MASS), a MAS optimization framework that efficiently exploits the complex MAS design space by interleaving its optimization stages, from local to global, from prompts to topologies, over three stages: 1) block-level (local) prompt optimization; 2) workflow topology optimization; 3) workflow-level (global) prompt optimization, where each stage is conditioned on the iteratively optimized prompts/topologies from former stages. We show that MASS-optimized multi-agent systems outperform a spectrum of existing alternatives by a substantial margin. Based on the MASS-found systems, we finally propose design principles behind building effective multi-agent systems.

[NLP-5] Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search

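[Quick Read]: This paper asks whether search capabilities can be internalized to fundamentally improve the reasoning of a single large language model (LLM). The key solution is Chain-of-Action-Thought (COAT) reasoning with a two-stage training paradigm: a small-scale format tuning stage that internalizes the COAT reasoning format, followed by a large-scale self-improvement stage that uses reinforcement learning. The result is Satori, a 7B LLM trained on open-source models and data, which achieves state-of-the-art performance on mathematical reasoning benchmarks and generalizes well to out-of-domain tasks.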

Link: https://arxiv.org/abs/2502.02508
Authors: Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, Chuang Gan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs’ reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibiting strong generalization to out-of-domain tasks. Code, data, and models will be fully open-sourced.

[NLP-6] Analyzing Similarity Metrics for Data Selection for Language Model Pretraining

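[Quick Read]: This paper analyzes how suitable existing embedding models are for curating language model pretraining data. The key contribution is a framework that quantifies how similarity in embedding space correlates with similarity in pretraining loss across training examples, and how diversifying in embedding space affects pretraining quality. A simple average of per-token embeddings turns out to be competitive with more sophisticated embedding models, likely because the latter are not designed for this purpose. The authors suggest the framework as a foundation for designing embedding models that reason specifically about similarity in pretraining datasets.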

Link: https://arxiv.org/abs/2502.02494
Authors: Dylan Sam, Ayan Chakrabarti, Afshin Rostamizadeh, Srikumar Ramalingam, Gui Citovsky, Sanjiv Kumar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 14 pages

Abstract:Similarity between training examples is used to curate pretraining datasets for language models by many methods – for diversification and to select examples similar to high-quality data. However, similarity is typically measured with off-the-shelf embedding models that are generic or trained for tasks such as retrieval. This paper introduces a framework to analyze the suitability of embedding models specifically for data curation in the language model pretraining setting. We quantify the correlation between similarity in the embedding space to similarity in pretraining loss between different training examples, and how diversifying in the embedding space affects pretraining quality. We analyze a variety of embedding models in our framework, with experiments using the Pile dataset for pretraining a 1.7B parameter decoder-only language model. We find that the embedding models we consider are all useful for pretraining data curation. Moreover, a simple approach of averaging per-token embeddings proves to be surprisingly competitive with more sophisticated embedding models – likely because the latter are not designed specifically for pretraining data curation. Indeed, we believe our analysis and evaluation framework can serve as a foundation for the design of embedding models that specifically reason about similarity in pretraining datasets.
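The "averaging per-token embeddings" baseline mentioned in the abstract can be illustrated with a minimal sketch. The vectors below are made-up toy values and the helper names `mean_pool` and `cosine_similarity` are hypothetical; the paper's actual embedding source, dimensionality, and similarity measure are not stated in this digest, so cosine similarity is an assumption here.

```python
import math


def mean_pool(token_embeddings):
    """Average a list of per-token vectors into one example-level vector."""
    dim = len(token_embeddings[0])
    count = len(token_embeddings)
    return [sum(vec[i] for vec in token_embeddings) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy per-token embeddings for three training examples (illustrative only).
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[2.0, 2.0]]
doc_c = [[1.0, -1.0]]

emb_a = mean_pool(doc_a)
emb_b = mean_pool(doc_b)
emb_c = mean_pool(doc_c)

sim_ab = cosine_similarity(emb_a, emb_b)  # same direction after pooling
sim_ac = cosine_similarity(emb_a, emb_c)  # orthogonal after pooling
```

In a data-selection pipeline, scores like `sim_ab` would be compared across many example pairs, for instance to drop near-duplicates or to pick examples close to a high-quality reference set.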

[NLP-7] Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study NAACL2025

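[Quick Read]: This paper explores how well open-source large language models (LLMs) with fewer than ten billion parameters handle multilingual machine translation (MT), and proposes the Parallel-First Monolingual-Second (PFMS) data mixing strategy to improve translation performance further. Applying PFMS in the continual pretraining stage yields GemmaX2-28, a 9B model with top-tier multilingual translation performance across 28 languages. It outperforms state-of-the-art models such as TowerInstruct and XALMA and is competitive with Google Translate and GPT-4-turbo.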

Link: https://arxiv.org/abs/2502.02481
Authors: Menglong Cui, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang
Affiliations: Xiaomi Inc.
Categories: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025 Main Conference

Abstract:Large language models (LLMs) have shown continuously improving multilingual capabilities, and even small-scale open-source models have demonstrated rapid performance enhancement. In this paper, we systematically explore the abilities of open LLMs with less than ten billion parameters to handle multilingual machine translation (MT) tasks. We conduct comprehensive evaluations on six popular LLMs and find that models like Gemma2-9B exhibit impressive multilingual translation capabilities. We then introduce the Parallel-First Monolingual-Second (PFMS) data mixing strategy in the continual pretraining stage to further enhance the MT performance and present GemmaX2-28, a 9B model achieving top-tier multilingual translation performance across 28 languages. Specifically, GemmaX2-28 consistently outperforms the state-of-the-art (SOTA) models such as TowerInstruct and XALMA and achieves competitive performance with Google Translate and GPT-4-turbo.

[NLP-8] Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation

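[Quick Read]: This paper addresses the lack of a unified framework that integrates retrieval, re-ranking, and retrieval-augmented generation (RAG) for natural language processing (NLP) applications in information retrieval, question answering, and knowledge-based text generation. The key solution is Rankify, a modular open-source toolkit that brings these components into one framework and supports a wide range of retrieval techniques alongside state-of-the-art re-ranking models. Rankify also provides pre-retrieved datasets for benchmarking, together with documentation, an open-source implementation on GitHub, and a PyPI package.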

Link: https://arxiv.org/abs/2502.02464
Authors: Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Mohammed Ali, Adam Jatowt
Affiliations: University of Innsbruck
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Work in Progress

Abstract:Retrieval, re-ranking, and retrieval-augmented generation (RAG) are critical components of modern natural language processing (NLP) applications in information retrieval, question answering, and knowledge-based text generation. However, existing solutions are often fragmented, lacking a unified framework that easily integrates these essential processes. The absence of a standardized implementation, coupled with the complexity of retrieval and re-ranking workflows, makes it challenging for researchers to compare and evaluate different approaches in a consistent environment. While existing toolkits such as Rerankers and RankLLM provide general-purpose reranking pipelines, they often lack the flexibility required for fine-grained experimentation and benchmarking. In response to these challenges, we introduce Rankify, a powerful and modular open-source toolkit designed to unify retrieval, re-ranking, and RAG within a cohesive framework. Rankify supports a wide range of retrieval techniques, including dense and sparse retrievers, while incorporating state-of-the-art re-ranking models to enhance retrieval quality. Additionally, Rankify includes a collection of pre-retrieved datasets to facilitate benchmarking, available at Huggingface (this https URL). To encourage adoption and ease of integration, we provide comprehensive documentation (this http URL), an open-source implementation on GitHub (this https URL), and a PyPI package for effortless installation (this https URL). By providing a unified and lightweight framework, Rankify allows researchers and practitioners to advance retrieval and re-ranking methodologies while ensuring consistency, scalability, and ease of use.

[NLP-9] SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency

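[Quick Read]: This paper addresses the trade-off between training and inference efficiency in multimodal large language models (MLLMs). Two architectures dominate: embedding space alignment (e.g., LLaVA-1.5) is inefficient during inference, while cross-attention space alignment (e.g., Flamingo) is inefficient in training. The authors propose NAAViT (No Attention Among Visual Tokens), a self-attention mechanism that removes attention among visual tokens, and show in a pilot experiment that such attention is highly redundant. Building on this, they introduce SAISA (Self-Attention Input Space Alignment), which aligns visual features directly with the input spaces of NAAViT self-attention blocks and so reduces computational overhead, improving both training and inference efficiency. With the same configuration as LLaVA-1.5, SAISA cuts inference FLOPs by 66% and the training budget by 26% while achieving better accuracy.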

Link: https://arxiv.org/abs/2502.02458
Authors: Qianhao Yuan, Yanjiang Liu, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multimodal Large Language Models (MLLMs) mainly fall into two architectures, each involving a trade-off between training and inference efficiency: embedding space alignment (e.g., LLaVA-1.5) is inefficient during inference, while cross-attention space alignment (e.g., Flamingo) is inefficient in training. In this paper, we compare these two architectures and identify the key factors for building efficient MLLMs. A primary difference between them lies in how attention is applied to visual tokens, particularly in their interactions with each other. To investigate whether attention among visual tokens is necessary, we propose a new self-attention mechanism, NAAViT (No Attention Among Visual Tokens), which eliminates this type of attention. Our pilot experiment on LLaVA-1.5 shows that attention among visual tokens is highly redundant. Based on these insights, we introduce SAISA (Self-Attention Input Space Alignment), a novel architecture that enhances both training and inference efficiency. SAISA directly aligns visual features with the input spaces of NAAViT self-attention blocks, reducing computational overhead in both self-attention blocks and feed-forward networks (FFNs). Using the same configuration as LLaVA-1.5, SAISA reduces inference FLOPs by 66% and training budget by 26%, while achieving superior performance in terms of accuracy. Comprehensive ablation studies further validate the effectiveness of SAISA across various LLMs and visual encoders. The code and model will be publicly available at this https URL.
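One way to picture "no attention among visual tokens" is as an attention mask. The sketch below is only an illustration under stated assumptions, not the paper's implementation: it assumes a causal decoder, assumes each visual token may still attend to itself, and uses a hypothetical helper name `naavit_style_mask`.

```python
def naavit_style_mask(is_visual):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumptions (not taken from the paper): attention is causal, and a visual
    token keeps attention to itself but not to other visual tokens.
    """
    n = len(is_visual)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: only current and earlier positions
            if is_visual[q] and is_visual[k] and q != k:
                continue  # block visual-to-visual attention
            mask[q][k] = True
    return mask


def count_allowed(mask):
    """Number of (query, key) pairs that remain attendable."""
    return sum(sum(row) for row in mask)


# Three visual tokens followed by two text tokens (toy layout).
layout = [True, True, True, False, False]
mask = naavit_style_mask(layout)
allowed = count_allowed(mask)
full_causal = len(layout) * (len(layout) + 1) // 2
```

With many visual tokens and few text tokens, the blocked visual-to-visual pairs dominate the causal attention matrix, which gives some intuition for why removing them could reduce compute.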

[NLP-10] Beyond English: Evaluating Automated Measurement of Moral Foundations in Non-English Discourse with a Chinese Case Study

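[Quick Read]: This paper studies computational methods for measuring moral foundations (MFs) in non-English corpora, using Chinese as a case study. It compares four approaches: applying English resources to machine-translated text, local-language lexicons, multilingual language models, and large language models (LLMs). Machine translation and local lexicons prove insufficient for complex moral assessment, whereas multilingual models and LLMs show reliable cross-language performance with transfer learning, and LLMs are especially data-efficient. The paper also stresses that automated MF assessment needs human-in-the-loop validation, because even the most advanced models may miss cultural nuances.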

Link: https://arxiv.org/abs/2502.02451
Authors: Calvin Yixiang Cheng, Scott A Hale
Affiliations: Oxford Internet Institute, University of Oxford
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: 12 pages, 2 figures, 6 tables

Abstract:This study explores computational approaches for measuring moral foundations (MFs) in non-English corpora. Since most resources are developed primarily for English, cross-linguistic applications of moral foundation theory remain limited. Using Chinese as a case study, this paper evaluates the effectiveness of applying English resources to machine translated text, local language lexicons, multilingual language models, and large language models (LLMs) in measuring MFs in non-English texts. The results indicate that machine translation and local lexicon approaches are insufficient for complex moral assessments, frequently resulting in a substantial loss of cultural information. In contrast, multilingual models and LLMs demonstrate reliable cross-language performance with transfer learning, with LLMs excelling in terms of data efficiency. Importantly, this study also underscores the need for human-in-the-loop validation of automated MF assessment, as the most advanced models may overlook cultural nuances in cross-language measurements. The findings highlight the potential of LLMs for cross-language MF measurements and other complex multilingual deductive coding tasks.

[NLP-11] Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models

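[Quick Read]: This paper addresses the lack of a psychologically grounded value system for large language models (LLMs). The key solution is the Generative Psycho-Lexical Approach (GPLA), a scalable, adaptable, and theoretically informed method for constructing value systems. Using GPLA, the authors propose a psychologically grounded five-factor value system tailored to LLMs and validate it on three benchmarking tasks, where it compares favorably with the canonical Schwartz values.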

Link: https://arxiv.org/abs/2502.02444
Authors: Haoran Ye, Tianze Zhang, Yuhang Xie, Liyuan Zhang, Yuanyi Ren, Xin Zhang, Guojie Song
Affiliations: State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; Yuanpei College, Peking University; School of Psychological and Cognitive Sciences, Peking University; Key Laboratory of Machine Perception (Ministry of Education), Peking University; PKU-Wuhan Institute for Artificial Intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Values are core drivers of individual and collective perception, cognition, and behavior. Value systems, such as Schwartz’s Theory of Basic Human Values, delineate the hierarchy and interplay among these values, enabling cross-disciplinary investigations into decision-making and societal dynamics. Recently, the rise of Large Language Models (LLMs) has raised concerns regarding their elusive intrinsic values. Despite growing efforts in evaluating, understanding, and aligning LLM values, a psychologically grounded LLM value system remains underexplored. This study addresses the gap by introducing the Generative Psycho-Lexical Approach (GPLA), a scalable, adaptable, and theoretically informed method for constructing value systems. Leveraging GPLA, we propose a psychologically grounded five-factor value system tailored for LLMs. For systematic validation, we present three benchmarking tasks that integrate psychological principles with cutting-edge AI priorities. Our results reveal that the proposed value system meets standard psychological criteria, better captures LLM values, improves LLM safety prediction, and enhances LLM alignment, when compared to the canonical Schwartz’s values.

[NLP-12] Activation-Informed Merging of Large Language Models

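[Quick Read]: This paper aims to improve the performance and robustness of models produced by merging multiple fine-tuned large language models (LLMs) while keeping the process computationally efficient. The key solution is Activation-Informed Merging (AIM), which brings information from the LLMs' activation space into the merging process and can be applied on top of any existing merging method. Drawing on principles from continual learning (CL) and model compression, AIM uses a task-agnostic calibration set to prioritize essential weights and preserve critical weights from the base model. Experiments show gains across multiple benchmarks, with up to a 40% increase in benchmark performance.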

Link: https://arxiv.org/abs/2502.02421
Authors: Amin Heyrani Nobari, Kaveh Alimohammadi, Ali ArjomandBigdeli, Akash Srivastava, Faez Ahmed, Navid Azizan
Affiliations: Massachusetts Institute of Technology; Stony Brook University; RedHat AI Innovation; MIT-IBM Watson AI Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Model merging, a method that combines the parameters and embeddings of multiple fine-tuned large language models (LLMs), offers a promising approach to enhance model performance across various tasks while maintaining computational efficiency. This paper introduces Activation-Informed Merging (AIM), a technique that integrates the information from the activation space of LLMs into the merging process to improve performance and robustness. AIM is designed as a flexible, complementary solution that is applicable to any existing merging method. It aims to preserve critical weights from the base model, drawing on principles from continual learning (CL) and model compression. Utilizing a task-agnostic calibration set, AIM selectively prioritizes essential weights during merging. We empirically demonstrate that AIM significantly enhances the performance of merged models across multiple benchmarks. Our findings suggest that considering the activation-space information can provide substantial advancements in the model merging strategies for LLMs with up to 40% increase in benchmark performance.

[NLP-13] Avoiding spurious sharpness minimization broadens applicability of SAM

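[Quick Read]: This paper addresses the poor performance of curvature regularization techniques such as Sharpness Aware Minimization (SAM) in natural language processing (NLP). The key finding is that in NLP settings SAM is dominated by regularization of the logit statistics rather than by improving the geometry of the function itself. Based on this, the authors propose Functional-SAM, which regularizes curvature only through the statistics of the overall function implemented by the neural network and avoids spurious minimization through logit manipulation. They further argue that preconditioning the SAM perturbation also prevents spurious minimization and gives additional improvements when combined with Functional-SAM.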

Link: https://arxiv.org/abs/2502.02407
Authors: Sidak Pal Singh, Hossein Mobahi, Atish Agarwala, Yann Dauphin
Affiliations: Google DeepMind; ETH Zürich
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)

Abstract:Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks. However, we find that SAM performs poorly in domains like natural language processing (NLP), often degrading performance – even with twice the compute budget. We investigate the discrepancy across domains and find that in the NLP setting, SAM is dominated by regularization of the logit statistics – instead of improving the geometry of the function itself. We use this observation to develop an alternative algorithm we call Functional-SAM, which regularizes curvature only through modification of the statistics of the overall function implemented by the neural network, and avoids spurious minimization through logit manipulation. Furthermore, we argue that preconditioning the SAM perturbation also prevents spurious minimization, and when combined with Functional-SAM, it gives further improvements. Our proposed algorithms show improved performance over AdamW and SAM baselines when trained for an equal number of steps, in both fixed-length and Chinchilla-style training settings, at various model scales (including billion-parameter scale). On the whole, our work highlights the importance of more precise characterizations of sharpness in broadening the applicability of curvature regularization to large language models (LLMs).
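For readers unfamiliar with the baseline being modified, here is a minimal sketch of one step of standard SAM on a toy quadratic loss. It shows the original SAM update (perturb the weights along the normalized gradient, then descend using the gradient at the perturbed point), not Functional-SAM or the preconditioned variant, whose details are not given in this digest. The loss, the learning rate, and `rho` are illustrative assumptions.

```python
import math


def loss(w, curvatures):
    """Toy quadratic loss: 0.5 * sum_i c_i * w_i^2."""
    return 0.5 * sum(c * x * x for c, x in zip(curvatures, w))


def grad(w, curvatures):
    """Gradient of the toy quadratic loss."""
    return [c * x for c, x in zip(curvatures, w)]


def sam_step(w, curvatures, lr=0.05, rho=0.05):
    """One standard SAM update on the toy loss."""
    g = grad(w, curvatures)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    # Ascent step: move to the (approximate) worst-case point within radius rho.
    w_adv = [x + rho * gx / norm for x, gx in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point to w.
    g_adv = grad(w_adv, curvatures)
    return [x - lr * gx for x, gx in zip(w, g_adv)]


curvatures = [10.0, 1.0]  # one sharp and one flat direction
w = [1.0, 1.0]
initial_loss = loss(w, curvatures)
for _ in range(50):
    w = sam_step(w, curvatures)
final_loss = loss(w, curvatures)
```

Each SAM step needs two gradient evaluations, which is the reason the abstract compares against baselines at an equal number of steps and notes that SAM can lose even with twice the compute budget.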

[NLP-14] FewTopNER: Integrating Few-Shot Learning with Topic Modeling and Named Entity Recognition in a Multilingual Framework

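[Quick Read]: This paper addresses named entity recognition (NER) in cross-lingual and low-resource scenarios. The key solution is FewTopNER, a framework that integrates few-shot NER with topic-aware contextual modeling. It uses a shared multilingual encoder based on XLM-RoBERTa, augmented with language-specific calibration mechanisms, to produce robust contextual embeddings. The architecture has a prototype-based entity recognition branch that uses a BiLSTM and Conditional Random Fields (CRF) for sequence labeling, and a topic modeling branch that extracts document-level semantic features with hybrid probabilistic and neural methods. A cross-task bridge provides dynamic bidirectional attention and feature fusion between entity and topic representations, which improves entity disambiguation by adding global semantic context. On multilingual benchmarks covering English, French, Spanish, German, and Italian, the framework improves F1 by 2.5-4.0 percentage points and shows better topic coherence.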

Link: https://arxiv.org/abs/2502.02391
Authors: Ibrahim Bouabdallaoui, Fatima Guerouate, Samya Bouhaddour, Chaimae Saadi, Mohammed Sbihi
Affiliations: LASTIMI Laboratory, High School of Technology Salé, Mohammed V University in Rabat, Morocco
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code source: this https URL

Abstract:We introduce FewTopNER, a novel framework that integrates few-shot named entity recognition (NER) with topic-aware contextual modeling to address the challenges of cross-lingual and low-resource scenarios. FewTopNER leverages a shared multilingual encoder based on XLM-RoBERTa, augmented with language-specific calibration mechanisms, to generate robust contextual embeddings. The architecture comprises a prototype-based entity recognition branch, employing BiLSTM and Conditional Random Fields for sequence labeling, and a topic modeling branch that extracts document-level semantic features through hybrid probabilistic and neural methods. A cross-task bridge facilitates dynamic bidirectional attention and feature fusion between entity and topic representations, thereby enhancing entity disambiguation by incorporating global semantic context. Empirical evaluations on multilingual benchmarks across English, French, Spanish, German, and Italian demonstrate that FewTopNER significantly outperforms existing state-of-the-art few-shot NER models. In particular, the framework achieves improvements of 2.5-4.0 percentage points in F1 score and exhibits enhanced topic coherence, as measured by normalized pointwise mutual information. Ablation studies further confirm the critical contributions of the shared encoder and cross-task integration mechanisms to the overall performance. These results underscore the efficacy of incorporating topic-aware context into few-shot NER and highlight the potential of FewTopNER for robust cross-lingual applications in low-resource settings.
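The "prototype-based" part of few-shot entity recognition can be sketched as nearest-prototype classification: average the embeddings of the few labeled examples per class, then assign a query token to the closest class mean. The sketch below covers only that step and leaves out the encoder, BiLSTM, CRF, and topic branch; the 2-D vectors, labels, and function names are hypothetical, and squared Euclidean distance is an assumption.

```python
def build_prototypes(support):
    """Map each label to the mean of its support embeddings."""
    prototypes = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        prototypes[label] = [
            sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)
        ]
    return prototypes


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Toy support set: a few token embeddings per entity type (illustrative only).
support = {
    "PER": [[1.0, 0.0], [0.8, 0.2]],
    "LOC": [[0.0, 1.0], [0.2, 0.8]],
    "O": [[-1.0, -1.0]],
}
prototypes = build_prototypes(support)
pred_person = classify([0.9, 0.1], prototypes)
pred_location = classify([0.1, 0.9], prototypes)
```

Because prototypes are just class means, a new entity type can be added from a handful of examples without retraining, which is what makes the approach attractive in low-resource settings.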

[NLP-15] CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning

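[Quick Read]: This paper addresses a limitation of current large language models (LLMs): most generate the final result from a single query and the model's own reasoning ability, which restricts the search space and the accuracy and comprehensiveness of the output. The key solution is the Chain-of-Associated-Thoughts (CoAT) framework, which combines the Monte Carlo Tree Search (MCTS) algorithm with 'associative memory', a mechanism for dynamically integrating new key information. This significantly expands the LLM search space, letting the framework explore diverse reasoning pathways and update its knowledge base in real time, so that the final output is more accurate and comprehensive.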

Link: https://arxiv.org/abs/2502.02390
Authors: Jianfeng Pan, Senyou Deng, Shaomang Huang
Affiliations: Qihoo 360
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Research on LLM technologies is rapidly emerging, with most of them employing a ‘fast thinking’ approach to inference. Most LLMs generate the final result based solely on a single query and LLM’s reasoning capabilities. However, with the advent of OpenAI-o1, ‘slow thinking’ techniques have garnered increasing attention because its process is closer to the human thought process. Inspired by the human ability to constantly associate and replenish knowledge during thinking, we developed the novel Chain-of-Associated-Thoughts (CoAT) framework, which introduces an innovative synergy between the Monte Carlo Tree Search (MCTS) algorithm and a dynamic mechanism for integrating new key information, termed ‘associative memory’. By combining the structured exploration capabilities of MCTS with the adaptive learning capacity of associative memory, CoAT significantly expands the LLM search space, enabling our framework to explore diverse reasoning pathways and dynamically update its knowledge base in real-time. This allows the framework to not only revisit and refine earlier inferences but also adaptively incorporate evolving information, ensuring that the final output is both accurate and comprehensive. To validate the effectiveness of our framework, we conducted extensive experiments across a range of generative and reasoning tasks. These experiments demonstrated that our framework outperforms conventional inference processes on accuracy, coherence, and diversity. The framework’s ability to iteratively expand its search space while retaining contextually relevant information results.
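The MCTS component can be made concrete with the selection rule commonly used in tree search, UCT. The sketch below shows plain UCT child selection only; how CoAT scores nodes or injects associative memory is not specified in this digest, so the node fields, the exploration constant `c`, and the function names are assumptions for illustration.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.4):
    """UCT score: average value plus an exploration bonus for rarely visited nodes."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    exploitation = total_value / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


def select_child(children, parent_visits, c=1.4):
    """Pick the child reasoning step with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["total_value"], ch["visits"], parent_visits, c),
    )


# Toy candidate reasoning steps under one parent node (illustrative only).
children = [
    {"name": "step_a", "total_value": 6.0, "visits": 10},
    {"name": "step_b", "total_value": 2.0, "visits": 3},
    {"name": "step_c", "total_value": 0.0, "visits": 0},
]
first_choice = select_child(children, parent_visits=13)
second_choice = select_child(children[:2], parent_visits=13)
```

The exploration bonus is what lets the search revisit less-explored reasoning paths instead of committing to the first promising one, which matches the abstract's emphasis on exploring diverse reasoning pathways.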

[NLP-16] STAIR: Improving Safety Alignment with Introspective Reasoning

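[Quick Read]: This paper addresses how to make large language models (LLMs) safe and harmless while preserving their performance. Existing safety alignment methods typically suffer from a safety-performance trade-off and are susceptible to jailbreak attacks, mainly because they rely on direct refusals of malicious queries. The key solution is STAIR, a framework that integrates safety alignment with introspective reasoning: the model identifies safety risks through step-by-step analysis, using self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability, then advances safety alignment through iterative preference optimization on step-level reasoning data generated with the newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). A process reward model trained on this data guides test-time search for better responses. Experiments show that STAIR mitigates harmful outputs while better preserving helpfulness and, with test-time scaling, reaches safety performance comparable to Claude-3.5 against popular jailbreak attacks.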

Link: https://arxiv.org/abs/2502.02384
Authors: Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, Jun Zhu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 8 figures

Abstract:Ensuring the safety and harmlessness of Large Language Models (LLMs) has become as critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and the susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Introspective Reasoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks. Relevant resources in this work are available at this https URL.

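[NLP-17] Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs

[Quick read]: This paper addresses the difficulty of verifying reasoning steps and tracing errors in long mathematical reasoning chains, a difficulty that stems from the verbosity of language models. The key solution is to introduce premise links that restructure conventional linear reasoning chains into Premise Augmented Reasoning Chains (PARC), forming a directed acyclic graph. By explicitly identifying the premises of each step, this structure makes verification more reliable: LLMs can reliably identify premises within complex chains, and error identification accuracy improves by 6% to 16% absolute.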

Link: https://arxiv.org/abs/2502.02362
Authors: Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani Tur
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large language models (LLMs) by enabling detailed step-by-step solutions. However, due to the verbosity of LLMs, the resulting reasoning chains can be long, making it harder to verify the reasoning steps and trace issues resulting from dependencies between the steps that may be farther away in the sequence of steps. Importantly, mathematical reasoning allows each step to be derived from a small set of premises, which are a subset of the preceding steps in the reasoning chain. In this paper, we present a framework that identifies the premises for each step, to improve the evaluation of reasoning. We restructure conventional linear reasoning chains into Premise Augmented Reasoning Chains (PARC) by introducing premise links, resulting in a directed acyclic graph where the nodes are the steps and the edges are the premise links. Through experiments with a PARC-based dataset that we built, namely PERL (Premises and ERrors identification in LLMs), we demonstrate that LLMs can reliably identify premises within complex reasoning chains. In particular, even open-source LLMs achieve 90% recall in premise identification. We also show that PARC helps to identify errors in reasoning chains more reliably. The accuracy of error identification improves by 6% to 16% absolute when step-by-step verification is carried out in PARC under the premises. Our findings highlight the utility of premise-centric representations in addressing complex problem-solving tasks and open new avenues for improving the reliability of LLM-based reasoning evaluations.
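
The premise-link structure described in this abstract can be illustrated with a minimal Python sketch. It assumes each step simply stores the indices of the earlier steps it depends on; the names `Step`, `validate_chain`, `verification_context`, and `ancestors` are hypothetical and are not taken from the paper or the PERL dataset.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One reasoning step plus the indices of the earlier steps it relies on."""
    text: str
    premises: list[int] = field(default_factory=list)


def validate_chain(steps: list[Step]) -> None:
    """Raise ValueError unless every premise link points to an earlier step.

    Backward-only links guarantee that the graph is a directed acyclic graph.
    """
    for i, step in enumerate(steps):
        for p in step.premises:
            if not 0 <= p < i:
                raise ValueError(f"step {i} has invalid premise {p}")


def verification_context(steps: list[Step], i: int) -> list[str]:
    """Return only the premise texts needed to check step i, in chain order."""
    return [steps[p].text for p in sorted(set(steps[i].premises))]


def ancestors(steps: list[Step], i: int) -> set[int]:
    """All steps that step i depends on, directly or transitively."""
    seen: set[int] = set()
    stack = list(steps[i].premises)
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(steps[p].premises)
    return seen


# Toy chain (illustrative data, not from the paper).
chain = [
    Step("x + y = 10"),                # 0: given
    Step("x - y = 2"),                 # 1: given
    Step("2x = 12", premises=[0, 1]),  # 2: add the two equations
    Step("x = 6", premises=[2]),       # 3
    Step("y = 4", premises=[0, 3]),    # 4
]
validate_chain(chain)
```

With this layout, a verifier checking step 4 only needs the two premise texts returned by `verification_context(chain, 4)` rather than the whole chain, while `ancestors` can be used to trace which earlier steps an error might have propagated from.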

[NLP-18] Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking

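[Quick read]: This paper addresses the challenges that multimodal large language models (MLLMs) face in complex visual reasoning, in particular how to strengthen their reasoning while balancing performance and efficiency. The key is the AStar framework, which uses Monte Carlo Tree Search (MCTS) to automatically derive high-level cognitive reasoning patterns from limited data and builds on them a unified reasoning framework that enables efficient inference with minimal tree iterations.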

Link: https://arxiv.org/abs/2502.02339
Authors: Jinyang Wu, Mingkuan Feng, Shuai Zhang, Ruihan Jin, Feihu Che, Zengqi Wen, Jianhua Tao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Multimodal large language models (MLLMs) exhibit impressive capabilities but still face challenges in complex visual reasoning. While recent efforts attempt to enhance MLLMs’ reasoning by incorporating OpenAI o1-like structured thinking through explicit search structures or teacher-guided distillation, they often struggle to balance performance and efficiency. A critical limitation is their heavy reliance on extensive data and search spaces, resulting in low-efficiency implicit insight extraction and data utilization. To address this, we propose AStar, an Automated Structured thinking paradigm for multimodal reasoning via Monte Carlo Tree Search (MCTS). AStar automatically derives high-level cognitive reasoning patterns from limited data using MCTS-powered hierarchical structures. Building on these explicit patterns, we design a unified reasoning framework that seamlessly integrates models’ internal reasoning capabilities and external reasoning guidelines, enabling efficient inference with minimal tree iterations. This novel paradigm strikes a compelling balance between performance and efficiency. Extensive experiments demonstrate AStar’s effectiveness, achieving superior accuracy (54.0%) on the MathVerse benchmark with a 7B backbone, surpassing GPT-4o (50.2%) while maintaining substantial data and computational efficiency.

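[NLP-19] ReSpark: Leveraging Previous Data Reports as References to Generate New Reports with LLMs

[Quick read]: This paper addresses the problem that creating data reports is time-consuming and that LLM-generated reports often fall short of user expectations. The main challenges are communicating the entire analysis logic to large language models (LLMs) effectively, and the difficulty users themselves have in determining a comprehensive analysis logic. To address them, the paper proposes ReSpark, an LLM-based method whose key idea is to use existing data reports as references for creating new ones. ReSpark searches for reports on similar topics, parses them into interdependent segments corresponding to analytical objectives, and executes these segments with the new data. It identifies inconsistencies, customizes the objectives, data transformations, and textual descriptions, and lets users review outputs in real time, insert new objectives, and modify report content.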

Link: https://arxiv.org/abs/2502.02329
Authors: Yuan Tian, Chuhan Zhang, Xiaotong Wang, Sitong Pan, Weiwei Cui, Haidong Zhang, Dazhen Deng, Yingcai Wu
Affiliations: Zhejiang University; Microsoft
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:

Abstract:Creating data reports is time-consuming, as it requires iterative exploration and understanding of data, followed by summarizing the insights. While large language models (LLMs) are powerful tools for data processing and text generation, they often struggle to produce complete data reports that fully meet user expectations. One significant challenge is effectively communicating the entire analysis logic to LLMs. Moreover, determining a comprehensive analysis logic can be mentally taxing for users. To address these challenges, we propose ReSpark, an LLM-based method that leverages existing data reports as references for creating new ones. Given a data table, ReSpark searches for similar-topic reports, parses them into interdependent segments corresponding to analytical objectives, and executes them with new data. It identifies inconsistencies and customizes the objectives, data transformations, and textual descriptions. ReSpark allows users to review real-time outputs, insert new objectives, and modify report content. Its effectiveness was evaluated through comparative and user studies.

[NLP-20] VaiBot: Shuttle Between the Instructions and Parameters

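[Quick read]: This paper addresses the fact that the emergence of instructions and the training of large language models (LLMs) on task data have been treated as separate processes. It proposes VaiBot, a neural network framework that integrates a variational autoencoder (VAE) and a variational information bottleneck (VIB) to uniformly model, learn, and infer both deduction and induction tasks under LLMs. Experiments show that VaiBot is on par with existing baselines in deductive capability while significantly surpassing them in inductive capability, that it can scale up using general instruction-following data, and that it exhibits excellent one-shot induction ability. Finally, synergistically integrating the deductive and inductive processes significantly improves the distribution of the training parameters, allowing VaiBot to outperform baselines on inductive reasoning tasks.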

Link: https://arxiv.org/abs/2502.02315
Authors: Wangtao Sun, Haotian Xu, Huanxuan Liao, Xuanqing Yu, Zhongtao Jiang, Shizhu He, Jun Zhao, Kang Liu
Affiliations: Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Xiaohongshu Inc; Shanghai Artificial Intelligence Laboratory
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:How to interact with LLMs through instructions has been widely studied by researchers. However, previous studies have treated the emergence of instructions and the training of LLMs on task data as separate processes, overlooking the inherent unity between the two. This paper proposes a neural network framework, VaiBot, that integrates VAE and VIB, designed to uniformly model, learn, and infer both deduction and induction tasks under LLMs. Through experiments, we demonstrate that VaiBot performs on par with existing baseline methods in terms of deductive capabilities while significantly surpassing them in inductive capabilities. We also find that VaiBot can scale up using general instruction-following data and exhibits excellent one-shot induction abilities. We finally synergistically integrate the deductive and inductive processes of VaiBot. Through T-SNE dimensionality reduction, we observe that its inductive-deductive process significantly improves the distribution of training parameters, enabling it to outperform baseline methods in inductive reasoning tasks. The code and data for this paper can be found at this https URL.

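[NLP-21] Evalita-LLM: Benchmarking Large Language Models on Italian

[Quick read]: This paper addresses the evaluation of large language models (LLMs) on Italian. The key solution is the Evalita-LLM benchmark, whose distinguishing features are: (i) all tasks are native Italian, avoiding translation issues and potential cultural biases; (ii) in addition to multiple-choice tasks, it includes generative tasks, enabling more natural interaction; (iii) all tasks are evaluated against multiple prompts, which mitigates model sensitivity to specific prompts and allows a fairer and more objective evaluation. The benchmark is built with an iterative methodology in which candidate tasks and candidate prompts are validated against a set of LLMs used for development.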

Link: https://arxiv.org/abs/2502.02289
Authors: Bernardo Magnini, Roberto Zanoli, Michele Resta, Martin Cimmino, Paolo Albano, Marco Madeddu, Viviana Patti
Affiliations: fbk.eu (Fondazione Bruno Kessler); igenius.ai; unito.it (University of Turin)
Categories: Computation and Language (cs.CL)
Comments: 42 pages, 1 figure, 32 tables

Abstract:We describe Evalita-LLM, a new benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks. The distinguishing and innovative features of Evalita-LLM are the following: (i) all tasks are native Italian, avoiding issues of translating from Italian and potential cultural biases; (ii) in addition to well established multiple-choice tasks, the benchmark includes generative tasks, enabling more natural interaction with LLMs; (iii) all tasks are evaluated against multiple prompts, this way mitigating the model sensitivity to specific prompts and allowing a fairer and objective evaluation. We propose an iterative methodology, where candidate tasks and candidate prompts are validated against a set of LLMs used for development. We report experimental results from the benchmark’s development phase, and provide performance statistics for several state-of-the-art LLMs.
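
To make point (iii) concrete, here is a small Python sketch of summarizing one task's scores over several prompts. The abstract does not say which statistics the benchmark reports, so the choice of mean, best, worst, and standard deviation, as well as the name `summarize_across_prompts`, are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def summarize_across_prompts(scores: dict[str, float]) -> dict[str, float]:
    """Summarize one model's score on one task over several prompt variants.

    `scores` maps a prompt identifier to the score obtained with that prompt.
    A large spread indicates that the model is sensitive to prompt wording.
    """
    if not scores:
        raise ValueError("at least one prompt score is required")
    values = list(scores.values())
    return {
        "mean": mean(values),
        "best": max(values),
        "worst": min(values),
        "spread": pstdev(values),
    }


# Illustrative numbers only, not results from the paper.
summary = summarize_across_prompts({"prompt_1": 0.5, "prompt_2": 0.75, "prompt_3": 1.0})
```

Reporting the mean together with the spread, rather than a single-prompt score, is one simple way to keep a lucky or unlucky prompt from dominating a model comparison.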

[NLP-22] Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation

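[Quick read]: This paper presents a comparative analysis of two prominent techniques, fine-tuning with LoRA (Low-Rank Adaptation) and the Retrieval-Augmented Generation (RAG) framework, for doctor-patient chat conversations across multiple datasets of mixed medical domains. The key is a comprehensive evaluation of three models (Llama-2, GPT, and an LSTM) on language quality (perplexity, BLEU score), factual accuracy (fact-checking against medical knowledge bases), adherence to medical guidelines, and overall human judgments (coherence, empathy, safety), together with an investigation of how robustly these approaches handle diverse patient queries and of the potential of integrating domain-specific knowledge.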

Link: https://arxiv.org/abs/2502.02249
Authors: Atharva Mangeshkumar Agrawal, Rutika Pandurang Shinde, Vasanth Kumar Bhukya, Ashmita Chakraborty, Sagar Bharat Shah, Tanmay Shukla, Sree Pradeep Kumar Relangi, Nilesh Mutyam
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages

Abstract:Large language models (LLMs) have shown impressive capabilities in natural language processing tasks, including dialogue generation. This research aims to conduct a novel comparative analysis of two prominent techniques, fine-tuning with LoRA (Low-Rank Adaptation) and the Retrieval-Augmented Generation (RAG) framework, in the context of doctor-patient chat conversations with multiple datasets of mixed medical domains. The analysis involves three state-of-the-art models: Llama-2, GPT, and the LSTM model. Employing real-world doctor-patient dialogues, we comprehensively evaluate the performance of models, assessing key metrics such as language quality (perplexity, BLEU score), factual accuracy (fact-checking against medical knowledge bases), adherence to medical guidelines, and overall human judgments (coherence, empathy, safety). The findings provide insights into the strengths and limitations of each approach, shedding light on their suitability for healthcare applications. Furthermore, the research investigates the robustness of the models in handling diverse patient queries, ranging from general health inquiries to specific medical conditions. The impact of domain-specific knowledge integration is also explored, highlighting the potential for enhancing LLM performance through targeted data augmentation and retrieval strategies.

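[NLP-23] Can You Move These Over There? An LLM-based VR Mover for Supporting Object Manipulation

[Quick read]: This paper addresses how to naturally understand and execute user instructions for manipulating objects in virtual reality (VR). The key is to use a large language model (LLM) to interpret the user's spoken instructions so that objects can be manipulated simply by pointing and speaking, without structured input. The user study shows that this approach improves usability, overall experience, and performance in multi-object manipulation while reducing workload and arm fatigue.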

Link: https://arxiv.org/abs/2502.02201
Authors: Xiangzhi Eric Wang, Zackary P. T. Sin, Ye Jia, Daniel Archer, Wynonna H. Y. Fong, Qing Li, Chen Li
Affiliations: The Hong Kong Polytechnic University; University College London; Heep Yunn School
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments: 64 pages (30 in main text), 22 figures (19 in main text)

Abstract:In our daily lives, we can naturally convey instructions for the spatial manipulation of objects using words and gestures. Transposing this form of interaction into virtual reality (VR) object manipulation can be beneficial. We propose VR Mover, an LLM-empowered solution that can understand and interpret the user’s vocal instruction to support object manipulation. By simply pointing and speaking, the LLM can manipulate objects without structured input. Our user study demonstrates that VR Mover enhances user usability, overall experience and performance on multi-object manipulation, while also reducing workload and arm fatigue. Users prefer the proposed natural interface for broad movements and may complementarily switch to gizmos or virtual hands for finer adjustments. These findings are believed to contribute to design implications for future LLM-based object manipulation interfaces, highlighting the potential for more intuitive and efficient user interactions in VR environments.

[NLP-24] When Dimensionality Hurts: The Role of LLM Embedding Compression for Noisy Regression Tasks

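[Quick read]: This paper investigates the effectiveness of compressing text embeddings in regression tasks based on large language models (LLMs). The key approach is to compress embeddings in a minimally supervised manner using an autoencoder's hidden representation, which mitigates overfitting and improves performance on tasks with a low signal-to-noise ratio, such as financial return prediction. For tasks with high causal dependencies between the input and target data, however, compression reduces performance.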

Link: https://arxiv.org/abs/2502.02199
Authors: Felix Drinkall, Janet B. Pierrehumbert, Stefan Zohren
Affiliations: Department of Engineering Science, University of Oxford; Faculty of Linguistics, University of Oxford
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Comments:

Abstract:Large language models (LLMs) have shown remarkable success in language modelling due to scaling laws found in model size and the hidden dimension of the model’s text representation. Yet, we demonstrate that compressed representations of text can yield better performance in LLM-based regression tasks. In this paper, we compare the relative performance of embedding compression in three different signal-to-noise contexts: financial return prediction, writing quality assessment and review scoring. Our results show that compressing embeddings, in a minimally supervised manner using an autoencoder’s hidden representation, can mitigate overfitting and improve performance on noisy tasks, such as financial return prediction; but that compression reduces performance on tasks that have high causal dependencies between the input and target data. Our results suggest that the success of interpretable compressed representations such as sentiment may be due to a regularising effect.

[NLP-25] Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge

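[Quick read]: This paper addresses updating and modifying factual knowledge in large language models, examining how well existing knowledge editing methods work across languages and what role attention mechanisms play in the process. The key contribution is Mass-Editing Memory with Attention in Transformers (MEMAT), a method that exploits attention mechanisms to achieve significant improvements in all metrics with minimal parameter modifications, while also benefiting languages not included in the training data and showing a high degree of portability.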

Link: https://arxiv.org/abs/2502.02173
Authors: Daniel Tamayo, Aitor Gonzalez-Agirre, Javier Hernando, Marta Villegas
Affiliations: Barcelona Supercomputing Center; Universitat Politècnica de Catalunya
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent research has explored methods for updating and modifying factual knowledge in large language models, often focusing on specific multi-layer perceptron blocks. This study expands on this work by examining the effectiveness of existing knowledge editing methods across languages and delving into the role of attention mechanisms in this process. Drawing from the insights gained, we propose Mass-Editing Memory with Attention in Transformers (MEMAT), a method that achieves significant improvements in all metrics while requiring minimal parameter modifications. MEMAT delivers a remarkable 10% increase in magnitude metrics, benefits languages not included in the training data and also demonstrates a high degree of portability. Our code and data are at this https URL.

[NLP-26] Multilingual Attribute Extraction from News Web Pages

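[Quick read]: This paper addresses the challenge of automatically extracting attributes from news article web pages in multiple languages. The key is a multilingual dataset used to pre-train and fine-tune models so that they handle pages in different languages. Specifically, the authors prepare a dataset of 3,172 marked-up news web pages in six languages (English, German, Russian, Chinese, Korean, and Arabic), fine-tune the pre-trained MarkupLM model, pre-train DOM-LM on multilingual data and fine-tune it on the same dataset, and compare both with existing open-source news data extraction tools, achieving superior extraction metrics.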

Link: https://arxiv.org/abs/2502.02167
Authors: Pavel Bedrin, Maksim Varlamov, Alexander Yatskov
Affiliations: ISP RAS; MSU
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:This paper addresses the challenge of automatically extracting attributes from news article web pages across multiple languages. Recent neural network models have shown high efficacy in extracting information from semi-structured web pages. However, these models are predominantly applied to domains like e-commerce and are pre-trained using English data, complicating their application to web pages in other languages. We prepared a multilingual dataset comprising 3,172 marked-up news web pages across six languages (English, German, Russian, Chinese, Korean, and Arabic) from 161 websites. The dataset is publicly available on GitHub. We fine-tuned the pre-trained state-of-the-art model, MarkupLM, to extract news attributes from these pages and evaluated the impact of translating pages into English on extraction quality. Additionally, we pre-trained another state-of-the-art model, DOM-LM, on multilingual data and fine-tuned it on our dataset. We compared both fine-tuned models to existing open-source news data extraction tools, achieving superior extraction metrics.

[NLP-27] Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing

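[Quick read]: This paper addresses the problem that existing safety alignment methods improve overall safety but fail to ensure safety in specific categories. The study finds that safety alignment can bias a model toward generating negative tokens, leading to rejective responses regardless of the input context. To address this, the paper proposes Token-level Safety-Debiased Inference (TSDI), a learning-free method that estimates and corrects this bias during generation using randomly constructed prompts. The method enhances the model's helpfulness while maintaining safety, improving the trade-off between the two.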

Link: https://arxiv.org/abs/2502.02153
Authors: Thien Q. Tran, Akifumi Wachi, Rei Sato, Takumi Tanabe, Youhei Akimoto
Affiliations: LY Corporation; University of Tsukuba, RIKEN AIP
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 37 pages

Abstract:Safety alignment is an essential research topic for real-world AI applications. Despite the multifaceted nature of safety and trustworthiness in AI, current safety alignment methods often focus on a comprehensive notion of safety. By carefully assessing models from the existing safety-alignment methods, we found that, while they generally improved overall safety performance, they failed to ensure safety in specific categories. Our study first identified the difficulty of eliminating such vulnerabilities without sacrificing the model’s helpfulness. We observed that, while smaller KL penalty parameters, increased training iterations, and dataset cleansing can enhance safety, they do not necessarily improve the trade-off between safety and helpfulness. We discovered that safety alignment could even induce undesired effects and result in a model that prefers generating negative tokens leading to rejective responses, regardless of the input context. To address this, we introduced a learning-free method, Token-level Safety-Debiased Inference (TSDI), to estimate and correct this bias during the generation process using randomly constructed prompts. Our experiments demonstrated that our method could enhance the model’s helpfulness while maintaining safety, thus improving the trade-off Pareto-front.
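
The general idea of estimating a token-level bias from content-free prompts and removing it at generation time can be sketched as follows. This is only one plausible reading of the abstract: the paper's actual estimator may differ, and the function names (`estimate_bias`, `debias`), the averaging rule, and the three-token vocabulary are all assumptions for illustration.

```python
import math


def estimate_bias(logits_for_random_prompts: list[list[float]]) -> list[float]:
    """Average the next-token logits over randomly constructed prompts.

    Because such prompts carry no real content, a token that still scores
    high on average is treated here as systematically over-preferred.
    """
    n = len(logits_for_random_prompts)
    vocab_size = len(logits_for_random_prompts[0])
    return [
        sum(row[t] for row in logits_for_random_prompts) / n
        for t in range(vocab_size)
    ]


def debias(logits: list[float], bias: list[float]) -> list[float]:
    """Subtract the estimated per-token bias from the logits of a real prompt."""
    return [value - b for value, b in zip(logits, bias)]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax, used to inspect the corrected distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary (hypothetical): index 0 = "Sure", 1 = "Sorry", 2 = "The".
bias = estimate_bias([[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]])
corrected = debias([1.0, 2.0, 0.0], bias)
probs = softmax(corrected)
```

In this toy case the refusal-style token at index 1 wins before correction and the token at index 0 wins afterwards. No parameters are updated, which matches the "learning-free" description in the abstract.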

[NLP-28] Risk-Aware Driving Scenario Analysis with Large Language Models

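[Quick read]: This paper addresses how to use large language models (LLMs) to assess whether driving scenarios generated by autonomous driving testing simulators are safety-critical. The key is a novel framework that uses LLMs for risk-aware scenario analysis and that generates new safety-critical scenarios by adversarially modifying existing non-critical ones, whose effectiveness in validating motion planning algorithms is then tested.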

Link: https://arxiv.org/abs/2502.02145
Authors: Yuan Gao, Mattia Piccinini, Johannes Betz
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: IEEE Intelligent Vehicles Symposium 2025

Abstract:Large Language Models (LLMs) can capture nuanced contextual relationships, reasoning, and complex problem-solving. By leveraging their ability to process and interpret large-scale information, LLMs have shown potential to address domain-specific challenges, including those in autonomous driving systems. This paper proposes a novel framework that leverages LLMs for risk-aware analysis of generated driving scenarios. We hypothesize that LLMs can effectively evaluate whether driving scenarios generated by autonomous driving testing simulators are safety-critical. To validate this hypothesis, we conducted an empirical evaluation to assess the effectiveness of LLMs in performing this task. This framework will also provide feedback to generate new safety-critical scenarios by using an adversarial method to modify existing non-critical scenarios and test their effectiveness in validating motion planning algorithms. Code and scenarios are available at: this https URL

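[NLP-29] Topic Modeling in Marathi

[Quick read]: This paper addresses the scarcity of research and applications on topic modeling for Indic languages, which stems from limited resources, diverse linguistic structures, and language-specific challenges. The key finding is that BERTopic, combined with BERT models trained on Indic languages, outperforms LDA in topic coherence and topic diversity.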

Link: https://arxiv.org/abs/2502.02100
Authors: Sanket Shinde, Raviraj Joshi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:While topic modeling in English has become a prevalent and well-explored area, venturing into topic modeling for Indic languages remains relatively rare. The limited availability of resources, diverse linguistic structures, and unique challenges posed by Indic languages contribute to the scarcity of research and applications in this domain. Despite the growing interest in natural language processing and machine learning, there exists a noticeable gap in the comprehensive exploration of topic modeling methodologies tailored specifically for languages such as Hindi, Marathi, Tamil, and others. In this paper, we examine several topic modeling approaches applied to the Marathi language. Specifically, we compare various BERT and non-BERT approaches, including multilingual and monolingual BERT models, using topic coherence and topic diversity as evaluation metrics. Our analysis provides insights into the performance of these approaches for Marathi language topic modeling. The key finding of the paper is that BERTopic, when combined with BERT models trained on Indic languages, outperforms LDA in terms of topic modeling performance.
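
As an illustration of one of the two metrics named in this abstract, the sketch below computes topic diversity under a common definition: the fraction of unique words among the top-k words of all topics. The abstract does not spell out its exact formula, so this definition, the function name `topic_diversity`, and the default `top_k` are assumptions.

```python
def topic_diversity(topics: list[list[str]], top_k: int = 10) -> float:
    """Fraction of unique words among the top_k words of every topic.

    Each topic is a list of words ordered from most to least representative.
    A value of 1.0 means no word is shared between topics; values near 0
    mean the topics largely repeat the same words.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("no topic words given")
    return len(set(top_words)) / len(top_words)


# Illustrative English placeholders; real input would be Marathi topic words.
example_topics = [
    ["cricket", "match", "team"],
    ["cricket", "election", "vote"],
]
diversity = topic_diversity(example_topics, top_k=3)
```

Here five of the six top words are distinct, so the score is 5/6. Topic coherence, the other metric mentioned, needs corpus co-occurrence statistics and is not sketched here.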

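[NLP-30] LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information

[Quick read]: This paper addresses shortcomings in long-form generation for academic writing and repo-level code generation, where current models, including GPT-4o, still perform unsatisfactorily. The key solution is process supervision: Monte Carlo Tree Search is used to collect stepwise preference pairs, a global memory pool maintains consistency, external critiques refine the preference pairs, and step-level DPO (Direct Preference Optimization) is finally applied to improve both length and quality. Experiments show that the method improves long-form generation across various model backbones.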

Link: https://arxiv.org/abs/2502.02095
Authors: Bowen Ping, Jiali Zeng, Fandong Meng, Shuo Wang, Jie Zhou, Shanghang Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Long-form generation is crucial for academic writing papers and repo-level code generation. Despite this, current models, including GPT-4o, still exhibit unsatisfactory performance. Existing methods that utilize preference learning with outcome supervision often fail to provide detailed feedback for extended contexts. This shortcoming can lead to content that does not fully satisfy query requirements, resulting in issues like length deviations, and diminished quality. In this paper, we propose enhancing long-form generation by incorporating process supervision. We employ Monte Carlo Tree Search to gather stepwise preference pairs, utilizing a global memory pool to maintain consistency. To address the issue of suboptimal candidate selection, we integrate external critiques to refine and improve the quality of the preference pairs. Finally, we apply step-level DPO using the collected stepwise preference pairs. Experimental results show that our method improves length and quality on long-form generation benchmarks, with almost lossless performance on general benchmarks across various model backbones.
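
For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. In a step-level setting the "prompt" would be the query plus the steps generated so far, and the chosen and rejected responses would be two candidate next steps; that framing, the function name `dpo_loss`, and the default `beta` are assumptions, not details confirmed by the abstract.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Arguments are log-probabilities of each response under the policy being
    trained and under the frozen reference model. The loss is
    -log(sigmoid(beta * margin)), where the margin measures how much more the
    policy favors the chosen response than the reference model does.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = -beta * margin
    # softplus(z) = log(1 + exp(z)), written in a numerically stable form.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# When the policy already prefers the chosen step more than the reference does,
# the loss drops below log(2).
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Minimizing this quantity over the collected stepwise pairs pushes the policy toward the preferred step at each position, rather than rewarding only the final output.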

[NLP-31] Rethinking stance detection: A theoretically-informed research agenda for user-level inference using language models

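[Quick read]: This paper addresses two gaps in stance detection research: the lack of a theoretical conceptualization of stance and the lack of analysis at the individual (user) level. The key is to explore using pre-trained and large language models (LLMs) to infer user-level attributes, making stance detection models more flexible and accurate. After reviewing related work, the paper proposes a four-point agenda for stance detection research that is theoretically informed, inclusive, and practically impactful.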

Link: https://arxiv.org/abs/2502.02074
Authors: Prasanta Bhattacharya, Hong Zhang, Yiming Cao, Wei Gao, Brandon Siyuan Loh, Joseph J.P. Simons, Liang Ze Wong
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Stance detection has emerged as a popular task in natural language processing research, enabled largely by the abundance of target-specific social media data. While there has been considerable research on the development of stance detection models, datasets, and application, we highlight important gaps pertaining to (i) a lack of theoretical conceptualization of stance, and (ii) the treatment of stance at an individual- or user-level, as opposed to message-level. In this paper, we first review the interdisciplinary origins of stance as an individual-level construct to highlight relevant attributes (e.g., psychological features) that might be useful to incorporate in stance detection models. Further, we argue that recent pre-trained and large language models (LLMs) might offer a way to flexibly infer such user-level attributes and/or incorporate them in modelling stance. To better illustrate this, we briefly review and synthesize the emerging corpus of studies on using LLMs for inferring stance, and specifically on incorporating user attributes in such tasks. We conclude by proposing a four-point agenda for pursuing stance detection research that is theoretically informed, inclusive, and practically impactful.

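[NLP-32] ASCenD-BDS: Adaptable, Stochastic and Context-aware framework for Detection of Bias, Discrimination and Stereotyping

[Quick read]: This paper addresses bias, discrimination, and stereotyping inherent in the deployment and use of large language models (LLMs) across diverse linguistic and sociocultural contexts. The key solution is ASCenD BDS (Adaptable, Stochastic and Context-aware framework for Detection of Bias, Discrimination and Stereotyping), which detects bias, discrimination, and stereotyping across categories such as gender, caste, age, disability, socioeconomic status, and linguistic variation. Whereas existing approaches rely on datasets that yield a finite set of scenarios, the framework's adaptability, stochasticity, and context awareness overcome this limitation, and it can be customized to a particular nation or culture. The paper establishes context awareness for India using content from the Indian Census 2011 and develops more than 800 STEMs, 10 categories, and 31 unique sub-categories to support these features.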

Link: https://arxiv.org/abs/2502.02072
Authors: Rajiv Bahl, Venkatesan N, Parimal Aglawe, Aastha Sarasapalli, Bhavya Kancharla, Chaitanya kolukuluri, Harish Mohite, Japneet Hora, Kiran Kakollu, Rahul Diman, Shubham Kapale, Sri Bhagya Kathula, Vamsikrishna Motru, Yogeshwar Reddy
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 17 pages, 6 Figures and this manuscript will be submitted to Q1,Q2 Journals

Abstract:The rapid evolution of Large Language Models (LLMs) has transformed natural language processing but raises critical concerns about biases inherent in their deployment and use across diverse linguistic and sociocultural contexts. This paper presents a framework named ASCenD BDS (Adaptable, Stochastic and Context-aware framework for Detection of Bias, Discrimination and Stereotyping). The framework presents an approach to detecting bias, discrimination, and stereotyping across various categories such as gender, caste, age, disability, socioeconomic status, linguistic variations, etc., using an approach which is Adaptive, Stochastic and Context-Aware. The existing frameworks rely heavily on usage of datasets to generate scenarios for detection of Bias, Discrimination and Stereotyping. Examples include datasets such as Civil Comments, Wino Gender, WinoBias, BOLD, CrowS Pairs and BBQ. However, such an approach provides point solutions. As a result, these datasets provide a finite number of scenarios for assessment. The current framework overcomes this limitation by having features which enable Adaptability, Stochasticity, and Context Awareness. Context awareness can be customized for any nation or culture or sub-culture (for example an organization’s unique culture). In this paper, context awareness in the Indian context has been established. Content has been leveraged from Indian Census 2011 to have a commonality of categorization. A framework has been developed using Category, Sub-Category, STEM, X-Factor, Synonym to enable the features for Adaptability, Stochasticity and Context awareness. The framework has been described in detail in Section 3. Overall 800 plus STEMs, 10 Categories, 31 unique SubCategories were developed by a team of consultants at Saint Fox Consultancy Private Ltd. The concept has been tested out in SFCLabs as part of product development.

[NLP-33] Robust and Secure Code Watermarking for Large Language Models via ML/Crypto Codesign

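[Quick read]: This paper addresses intellectual property violations and inappropriate misuse of LLM-generated code, proposing the RoSe framework as a solution. The key is to train the watermark insertion and extraction modules end-to-end so that the watermarked code's functionality is unaltered while detectability and robustness are enhanced. Specifically, RoSe uses pre-trained CodeT5 as the insertion backbone to enlarge the search space of code syntactic and variable-rename transformations, yielding high-quality watermarks. In addition, RoSe uses zero-knowledge proofs for secure verification without revealing the underlying signatures, which preserves the system's usability.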

Link: https://arxiv.org/abs/2502.02068
Authors: Ruisi Zhang, Neusha Javidnia, Nojan Sheybani, Farinaz Koushanfar
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces RoSe, the first-of-its-kind ML/Crypto codesign watermarking framework that regulates LLM-generated code to avoid intellectual property rights violations and inappropriate misuse in software development. High-quality watermarks adhering to the detectability-fidelity-robustness tri-objective are limited due to code’s low-entropy nature. Watermark verification, however, often needs to reveal the signature and requires re-encoding new ones for code reuse, potentially compromising the system’s usability. To overcome these challenges, RoSe obtains high-quality watermarks by training the watermark insertion and extraction modules end-to-end to ensure (i) unaltered watermarked code functionality and (ii) enhanced detectability and robustness leveraging pre-trained CodeT5 as the insertion backbone to enlarge the code syntactic and variable rename transformation search space. In the deployment, RoSe uses zero-knowledge proofs for secure verification without revealing the underlying signatures. Extensive evaluations demonstrated RoSe achieves high detection accuracy while preserving the code functionality. RoSe is also robust against attacks and provides efficient secure watermark verification.

[NLP-34] AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement ICRA

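[Quick read]: This paper addresses the situation where an agent faces new tasks and new scenarios but there are insufficient resources (e.g., time or labeled examples) to retrain it. The key solution combines the generic predictions of large language models (LLMs) with prior domain-specific knowledge encoded in a knowledge graph (KG), enabling the agent to adapt quickly to new tasks and scenarios. The robot also solicits and uses human input as needed to refine its existing knowledge, which leads to substantial performance gains over using the LLM output alone.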

Link: https://arxiv.org/abs/2502.02067
Authors: Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna
Affiliations: Robotics Research Center, IIIT Hyderabad; TCS Research, Tata Consultancy Services; School of Informatics, University of Edinburgh
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2025

Abstract:Embodied agents assisting humans are often asked to complete a new task in a new scenario. An agent preparing a particular dish in the kitchen based on a known recipe may be asked to prepare a new dish or to perform cleaning tasks in the storeroom. There may not be sufficient resources, e.g., time or labeled examples, to train the agent for these new situations. Large Language Models (LLMs) trained on considerable knowledge across many domains are able to predict a sequence of abstract actions for such new tasks and scenarios, although it may not be possible for the agent to execute this action sequence due to task-, agent-, or domain-specific constraints. Our framework addresses these challenges by leveraging the generic predictions provided by LLM and the prior domain-specific knowledge encoded in a Knowledge Graph (KG), enabling an agent to quickly adapt to new tasks and scenarios. The robot also solicits and uses human input as needed to refine its existing knowledge. Based on experimental evaluation over cooking and cleaning tasks in simulation domains, we demonstrate that the interplay between LLM, KG, and human input leads to substantial performance gains compared with just using the LLM output.

[NLP-35] Anticipate & Act: Integrating LLMs and Classical Planning for Efficient Task Execution in Household Environments ICRA

Quick read: This paper addresses the inefficiency of household assistive agents that handle one task at a time in multi-task settings. The key idea is to use Large Language Models (LLMs), through a small number of prompts, to anticipate upcoming tasks, and to pass the anticipated tasks as goals to a classical planning system, which computes a sequence of finer-granularity actions that jointly achieves them. Evaluated in the VirtualHome environment, the approach reduces execution time by 31% compared with a system that does not consider upcoming tasks.

Link: https://arxiv.org/abs/2502.02066
Authors: Raghav Arora, Shivam Singh, Karthik Swaminathan, Ahana Datta, Snehasis Banerjee, Brojeshwar Bhowmick, Krishna Murthy Jatavallabhula, Mohan Sridharan, Madhava Krishna
Affiliations: Robotics Research Center, IIIT Hyderabad; TCS Research, Tata Consultancy Services; CSAIL, Massachusetts Institute of Technology; IPAB, University of Edinburgh; TCS Research, India
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2024

Abstract:Assistive agents performing household tasks such as making the bed or cooking breakfast often compute and execute actions that accomplish one task at a time. However, efficiency can be improved by anticipating upcoming tasks and computing an action sequence that jointly achieves these tasks. State-of-the-art methods for task anticipation use data-driven deep networks and Large Language Models (LLMs), but they do so at the level of high-level tasks and/or require many training examples. Our framework leverages the generic knowledge of LLMs through a small number of prompts to perform high-level task anticipation, using the anticipated tasks as goals in a classical planning system to compute a sequence of finer-granularity actions that jointly achieve these goals. We ground and evaluate our framework’s abilities in realistic scenarios in the VirtualHome environment and demonstrate a 31% reduction in execution time compared with a system that does not consider upcoming tasks.

[NLP-36] Efficient Domain Adaptation of Multimodal Embeddings using Constrastive Learning

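Quick read: This paper addresses how to use advanced foundation models effectively for downstream tasks in compute-constrained settings. Current approaches either use pretrained models without task-specific adaptation and get subpar results, or require substantial computational resources for fine-tuning, which is a barrier in such environments. The key idea is to freeze the embeddings from Large Language Models (LLMs) and vision models and to train a small, task-specific nonlinear projection with contrastive learning, adapting the multimodal embeddings to the downstream task without expensive fine-tuning. The method yields significant performance improvements at minimal computational overhead, offering a practical way to use foundation models where resources are scarce.

Link: https://arxiv.org/abs/2502.02048
Authors: Georgios Margaritis, Periklis Petridis, Dimitris J. Bertsimas
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: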

Abstract:Recent advancements in machine learning (ML), natural language processing (NLP), and foundational models have shown promise for real-life applications in critical, albeit compute-constrained fields like healthcare. In such areas, combining foundational models with supervised ML offers potential for automating tasks like diagnosis and treatment planning, but the limited availability of onsite computational resources poses significant challenges to applying these technologies effectively: Current approaches either yield subpar results when using pretrained models without task-specific adaptation, or require substantial computational resources for fine-tuning, which is often a barrier to entry in such environments. This renders them inaccessible in applications where performance and quality standards are high, but computational resources are scarce. To bridge the gap between best-in-class performance and accessibility, we propose a novel method for adapting foundational, multimodal embeddings to downstream tasks, without the need for expensive fine-tuning processes. Our method leverages frozen embeddings from Large Language Models (LLMs) and Vision Models, and uses contrastive learning to train a small, task-specific nonlinear projection that can be used in the downstream task, without having to fine-tune the original foundational models. We show that this efficient procedure leads to significant performance improvements across various downstream tasks, and perhaps more importantly with minimal computational overhead, offering a practical solution for the use of advanced, foundational ML models in resource-constrained settings.
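
A minimal sketch of the idea described in the abstract, assuming an InfoNCE-style contrastive objective and a one-hidden-layer projection head. The abstract does not specify the loss or the architecture, so `project`, `info_nce_loss`, and the `temperature` value are illustrative assumptions rather than the authors' implementation. The frozen embeddings are treated as plain input vectors; only the small projection weights would be trained.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def project(embedding, w1, w2):
    """Hypothetical projection head: tanh hidden layer, then a linear layer."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, embedding))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: anchors[i] should match positives[i] and no other row."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of this anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Toy "frozen" embeddings for two paired items (values are made up).
text_embeddings = [[0.9, 0.1], [0.2, 0.8]]
image_embeddings = [[1.0, 0.0], [0.1, 0.9]]
identity = [[1.0, 0.0], [0.0, 1.0]]

projected_text = [project(e, identity, identity) for e in text_embeddings]
projected_image = [project(e, identity, identity) for e in image_embeddings]
print(info_nce_loss(projected_text, projected_image))
```

In a real setup the projection weights would be updated by gradient descent on this loss while the foundation models stay untouched, which is what keeps the compute cost low.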

[NLP-37] AmaSQuAD: A Benchmark for Amharic Extractive Question Answering

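Quick read: This paper addresses the problems that arise when translating extractive question-answering datasets into low-resource languages, in particular misalignment between translated questions and answers and the presence of multiple answer instances in the translated context. To handle these, the authors use cosine similarity over embeddings from a fine-tuned BERT-based model for Amharic together with the Longest Common Subsequence (LCS) algorithm. They also fine-tune XLM-R on the resulting AmaSQuAD dataset (a translation of SQuAD 2.0 into Amharic): on the AmaSQuAD development set the F1 score rises from 36.55% to 44.41%, and the model also improves F1 and exact-match scores on the human-curated AmQA dataset.

Link: https://arxiv.org/abs/2502.02047
Authors: Nebiyou Daniel Hailemariam, Blessed Guda, Tsegazeab Tefferi
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: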

Abstract:This research presents a novel framework for translating extractive question-answering datasets into low-resource languages, as demonstrated by the creation of the AmaSQuAD dataset, a translation of SQuAD 2.0 into Amharic. The methodology addresses challenges related to misalignment between translated questions and answers, as well as the presence of multiple answer instances in the translated context. For this purpose, we used cosine similarity utilizing embeddings from a fine-tuned BERT-based model for Amharic and Longest Common Subsequence (LCS). Additionally, we fine-tune the XLM-R model on the AmaSQuAD synthetic dataset for Amharic Question-Answering. The results show an improvement in baseline performance, with the fine-tuned model achieving an increase in the F1 score from 36.55% to 44.41% and 50.01% to 57.5% on the AmaSQuAD development dataset. Moreover, the model demonstrates improvement on the human-curated AmQA dataset, increasing the F1 score from 67.80% to 68.80% and the exact match score from 52.50% to 52.66%. The AmaSQuAD dataset is publicly available.
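
To make the alignment step concrete, here is a small sketch of how a Longest Common Subsequence score could be used to locate a translated answer inside a translated context. The sliding-window search and the names `lcs_length` and `best_answer_span` are assumptions for illustration; the abstract names LCS and cosine similarity but does not describe how they are combined.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def best_answer_span(context_tokens, answer_tokens):
    """Hypothetical helper: slide an answer-sized window over the context and
    return (start, end, score) for the first window with the highest LCS score."""
    width = len(answer_tokens)
    best_start, best_score = 0, -1
    for start in range(max(1, len(context_tokens) - width + 1)):
        window = context_tokens[start:start + width]
        score = lcs_length(window, answer_tokens)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_start + width, best_score


context = ["the", "capital", "is", "addis", "ababa", "today"]
answer = ["addis", "ababa"]
print(best_answer_span(context, answer))
```

When several windows tie on the LCS score, which is the "multiple answer instances" case the abstract mentions, an embedding-based cosine similarity could serve as the tie-breaker.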

[NLP-38] Contextual Memory Reweaving in Large Language Models Using Layered Latent State Reconstruction

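Quick read: This paper addresses memory-retention challenges in deep neural architectures, specifically the limited ability to process and recall extended contextual information. As sequence length grows, token dependencies degrade, reducing coherence and factual consistency in longer outputs. The key idea is a structured approach that reweaves latent states captured at different processing layers to reinforce token representations over extended sequences. Concretely, the proposed Contextual Memory Reweaving framework uses a Layered Latent State Reconstruction mechanism to systematically integrate past contextual embeddings without introducing external memory modules. Experiments report improved recall accuracy across a range of sequence lengths, with notable gains in the retention of rarely occurring tokens and in numerical reasoning consistency.

Link: https://arxiv.org/abs/2502.02046
Authors: Frederick Dillon, Gregor Halvorsen, Simon Tattershall, Magnus Rowntree, Gareth Vanderpool
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: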

Abstract:Memory retention challenges in deep neural architectures have ongoing limitations in the ability to process and recall extended contextual information. Token dependencies degrade as sequence length increases, leading to a decline in coherence and factual consistency across longer outputs. A structured approach is introduced to mitigate this issue through the reweaving of latent states captured at different processing layers, reinforcing token representations over extended sequences. The proposed Contextual Memory Reweaving framework incorporates a Layered Latent State Reconstruction mechanism to systematically integrate past contextual embeddings without introducing external memory modules. Experimental results demonstrate improvements in recall accuracy across a range of sequence lengths, with notable gains in the retention of rarely occurring tokens and numerical reasoning consistency. Further analysis of computational efficiency indicates that the additional processing overhead remains within acceptable thresholds, enabling scalability across different model sizes. Evaluations in long-form text generation and ambiguous query resolution highlight the capacity of memory reweaving to enhance continuity and reduce inconsistencies over extended outputs. Attention weight distributions reveal more structured allocation patterns, suggesting that reweaved latent states contribute to improved contextual awareness. The findings establish a framework for refining memory retention mechanisms in language models, addressing long-standing challenges in handling complex, multi-step reasoning tasks.

[NLP-39] M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference

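Quick read: This paper addresses the suboptimal trade-off between inference efficiency and generation fidelity caused by applying static residual transformations to all tokens in auto-regressive generation. Existing methods such as Early Exiting, Skip Decoding, and Mixture-of-Depth modulate the residual transformation based on token-level complexity, but they mainly consider the distance tokens traverse through the model layers and neglect the underlying velocity of residual evolution. The key contribution is Mixture of Multi-rate Residuals (M2R2), a framework that dynamically modulates residual velocity to improve early alignment and thereby inference efficiency.

Link: https://arxiv.org/abs/2502.02040
Authors: Nikhil Bhendawade, Mahyar Najibi, Devang Naik, Irina Belousova
Affiliations: Apple
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: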

Abstract:Residual transformations enhance the representational depth and expressive power of large language models (LLMs). However, applying static residual transformations across all tokens in auto-regressive generation leads to a suboptimal trade-off between inference efficiency and generation fidelity. Existing methods, including Early Exiting, Skip Decoding, and Mixture-of-Depth address this by modulating the residual transformation based on token-level complexity. Nevertheless, these approaches predominantly consider the distance traversed by tokens through the model layers, neglecting the underlying velocity of residual evolution. We introduce Mixture of Multi-rate Residuals (M2R2), a framework that dynamically modulates residual velocity to improve early alignment, enhancing inference efficiency. Evaluations on reasoning oriented tasks such as Koala, Self-Instruct, WizardLM, and MT-Bench show M2R2 surpasses state-of-the-art distance-based strategies, balancing generation quality and speedup. In self-speculative decoding setup, M2R2 achieves up to 2.8x speedups on MT-Bench, outperforming methods like 2-model speculative decoding, Medusa, LookAhead Decoding, and DEED. In Mixture-of-Experts (MoE) architectures, integrating early residual alignment with ahead-of-time expert loading into high-bandwidth memory (HBM) accelerates decoding, reduces expert-switching bottlenecks, and achieves a 2.9x speedup, making it highly effective in resource-constrained environments.

[NLP-40] Fine-tuning Language Models for Recipe Generation: A Comparative Analysis and Benchmark Study

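Quick read: This paper studies recipe generation by fine-tuning various very small language models, focusing on robust evaluation metrics and on comparing language models on this open-ended task. The key contribution is a new evaluation framework that combines traditional NLP metrics with custom domain-specific metrics, in particular recipe-specific measures of content quality and an approach to allergen substitution. The results show that larger models generally do better on standard metrics, but the relationship between model size and recipe quality is more nuanced under domain-specific metrics: SmolLM-360M and SmolLM-1.7B perform comparably despite their size difference, while Phi-2 shows limitations in recipe generation despite its larger parameter count.

Link: https://arxiv.org/abs/2502.02028
Authors: Anneketh Vij, Changhao Liu, Rahul Anil Nair, Theo Ho, Edward Shi, Ayan Bhowmick
Affiliations: University of Southern California
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 8 figures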

Abstract:This research presents an exploration and study of the recipe generation task by fine-tuning various very small language models, with a focus on developing robust evaluation metrics and comparing different language models on the open-ended task of recipe generation. This study presents extensive experiments with multiple model architectures, ranging from T5-small (Raffel et al., 2023) and SmolLM-135M (Allal et al., 2024) to Phi-2 (Research, 2023), implementing both traditional NLP metrics and custom domain-specific evaluation metrics. Our novel evaluation framework incorporates recipe-specific metrics for assessing content quality and introduces an approach to allergen substitution. The results indicate that, while larger models generally perform better on standard metrics, the relationship between model size and recipe quality is more nuanced when considering domain-specific metrics. We find that SmolLM-360M and SmolLM-1.7B demonstrate comparable performance despite their size difference, while Phi-2 shows limitations in recipe generation despite its larger parameter count. Our comprehensive evaluation framework and allergen substitution system provide valuable insights for future work in recipe generation and broader NLG tasks that require domain expertise and safety considerations.

[NLP-41] Layer by Layer: Uncovering Hidden Representations in Language Models

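Quick read: This paper challenges the conventional view that the outputs of Large Language Models (LLMs) should rely on their final layers because earlier layers capture only low-level cues. The key contribution is a unified framework of representation-quality metrics based on information theory, geometry, and invariance to input perturbations. The framework shows how each layer balances information compression against signal preservation and explains why intermediate-layer embeddings often outperform final-layer embeddings on a wide range of downstream tasks. Extensive experiments across model architectures and domains show that intermediate layers consistently provide stronger features, opening new directions for model analysis and optimization.

Link: https://arxiv.org/abs/2502.02013
Authors: Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, Ravid Shwartz-Ziv
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: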

Abstract:From extracting features to generating text, the outputs of large language models (LLMs) typically rely on their final layers, following the conventional wisdom that earlier layers capture only low-level cues. However, our analysis shows that intermediate layers can encode even richer representations, often improving performance on a wide range of downstream tasks. To explain and quantify these hidden-layer properties, we propose a unified framework of representation quality metrics based on information theory, geometry, and invariance to input perturbations. Our framework highlights how each model layer balances information compression and signal preservation, revealing why mid-depth embeddings can exceed the last layer’s performance. Through extensive experiments on 32 text-embedding tasks and comparisons across model architectures (transformers, state-space models) and domains (language, vision), we demonstrate that intermediate layers consistently provide stronger features. These findings challenge the standard focus on final-layer embeddings and open new directions for model analysis and optimization, including strategic use of mid-layer representations for more robust and accurate AI systems.

[NLP-42] Reasoning Bias of Next Token Prediction Training

Quick read: This paper asks how to train Large Language Models (LLMs) effectively for stronger reasoning. It compares two training paradigms: Next Token Prediction (NTP) and Critical Token Prediction (CTP). Although NTP is exposed to noise during training, it surpasses CTP in reasoning ability, which the authors attribute to the regularizing influence of noise on the training dynamics. NTP-trained models generalize better and are more robust across benchmark reasoning datasets and more resilient to perturbations, which highlights the role of NTP in fostering reasoning ability during pretraining.

Link: https://arxiv.org/abs/2502.02007
Authors: Pengxiao Lin, Zhongwang Zhang, Zhi-Qin John Xu
Affiliations: Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University; School of Mathematical Sciences, Shanghai Jiao Tong University; School of Artificial Intelligence, Shanghai Jiao Tong University; Key Laboratory of Marine Intelligent Equipment and System, Ministry of Education, P.R. China; Center for LLM, Institute for Advanced Algorithms Research, Shanghai
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 19 pages, 11 figures

Abstract:Since the inception of Large Language Models (LLMs), the quest to efficiently train them for superior reasoning capabilities has been a pivotal challenge. The dominant training paradigm for LLMs is based on next token prediction (NTP). Alternative methodologies, called Critical Token Prediction (CTP), focus exclusively on specific critical tokens (such as the answer in a Q&A dataset), aiming to reduce the overfitting of extraneous information and noise. Contrary to initial assumptions, our research reveals that despite NTP’s exposure to noise during training, it surpasses CTP in reasoning ability. We attribute this counterintuitive outcome to the regularizing influence of noise on the training dynamics. Our empirical analysis shows that NTP-trained models exhibit enhanced generalization and robustness across various benchmark reasoning datasets, demonstrating greater resilience to perturbations and achieving flatter loss minima. These findings illuminate that NTP is instrumental in fostering reasoning abilities during pretraining, whereas CTP is more effective for finetuning, thereby enriching our comprehension of optimal training strategies in LLM development.
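
The difference between the two objectives can be illustrated with a loss mask. The sketch below assumes that both paradigms use the same per-token negative log-likelihood and differ only in which positions contribute to the average; the function names and the toy log-probabilities are illustrative, not taken from the paper.

```python
def masked_nll(token_log_probs, loss_mask):
    """Mean negative log-likelihood over the positions where loss_mask is True."""
    kept = [-lp for lp, keep in zip(token_log_probs, loss_mask) if keep]
    if not kept:
        raise ValueError("loss_mask selects no tokens")
    return sum(kept) / len(kept)


def ntp_loss(token_log_probs):
    """Next token prediction: every position contributes to the loss."""
    return masked_nll(token_log_probs, [True] * len(token_log_probs))


def ctp_loss(token_log_probs, critical_mask):
    """Critical token prediction: only the marked (e.g. answer) positions contribute."""
    return masked_nll(token_log_probs, critical_mask)


# Toy sequence "Q : 2 + 2 = 4" where only the final answer token is critical.
log_probs = [-1.0, -2.0, -3.0, -0.5]
critical = [False, False, False, True]
print(ntp_loss(log_probs), ctp_loss(log_probs, critical))
```

Under NTP the gradient also flows through the non-answer tokens; this extra signal is what the abstract describes as noise with a regularizing effect.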

[NLP-43] Wavelet-based Positional Representation for Long Context ICLR2025

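Quick read: This paper addresses the difficulty large-scale language models have when extrapolating to sequences beyond the maximum allowable length. Conventional position-encoding mechanisms are limited to positions encountered during training and cannot represent positions in longer sequences effectively. The key contribution is a new position representation that uses wavelet transforms to capture multiple scales (i.e., window sizes) without limiting the model's attention field. The method improves performance in both short and long contexts and allows extrapolation of position information without restricting the attention field.

Link: https://arxiv.org/abs/2502.02004
Authors: Yui Oka, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito
Affiliations: NTT Human Informatics Laboratories, NTT Corporation
Categories: Computation and Language (cs.CL)
Comments: Accepted to ICLR 2025. 28 pages, 11 figures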

Abstract:In the realm of large-scale language models, a significant challenge arises when extrapolating sequences beyond the maximum allowable length. This is because the model’s position embedding mechanisms are limited to positions encountered during training, thus preventing effective representation of positions in longer sequences. We analyzed conventional position encoding methods for long contexts and found the following characteristics. (1) When the representation dimension is regarded as the time axis, Rotary Position Embedding (RoPE) can be interpreted as a restricted wavelet transform using Haar-like wavelets. However, because it uses only a fixed scale parameter, it does not fully exploit the advantages of wavelet transforms, which capture the fine movements of non-stationary signals using multiple scales (window sizes). This limitation could explain why RoPE performs poorly in extrapolation. (2) Previous research as well as our own analysis indicates that Attention with Linear Biases (ALiBi) functions similarly to windowed attention, using windows of varying sizes. However, it has limitations in capturing deep dependencies because it restricts the receptive field of the model. From these insights, we propose a new position representation method that captures multiple scales (i.e., window sizes) by leveraging wavelet transforms without limiting the model’s attention field. Experimental results show that this new method improves the performance of the model in both short and long contexts. In particular, our method allows extrapolation of position information without limiting the model’s attention field.
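
The abstract's remark that ALiBi behaves like windowed attention with windows of varying sizes can be seen from its bias term. The sketch below uses the standard ALiBi formulation for a power-of-two head count (a geometric sequence of per-head slopes and a linear distance penalty); it illustrates the baseline the paper analyzes, not the proposed wavelet-based method, and the function names are illustrative.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8h/n) for h = 1..n (assumes num_heads is a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Causal bias matrix: -slope * distance to past keys, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
# A steep slope penalizes distant keys heavily (a small effective window);
# a shallow slope lets the head attend much further back.
steep_row = alibi_bias(4, slopes[0])[3]
shallow_row = alibi_bias(4, slopes[-1])[3]
print(steep_row, shallow_row)
```

Because the penalty grows without bound with distance, every head eventually stops seeing far-away tokens, which is the receptive-field restriction the abstract points to.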

[NLP-44] Can LLMs Assist Annotators in Identifying Morality Frames? – Case Study on Vaccination Debate on Social Media

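Quick read: This paper addresses the fact that, in NLP, psycholinguistic tasks such as identifying morality frames are costly, time-consuming, and error-prone to annotate by hand because of data scarcity and cognitive load. The key idea is to use large language models (LLMs), which can adapt to new tasks through few-shot learning, to assist human annotators with in-context examples paired with explanations. This improves annotation accuracy and reduces task difficulty and cognitive load, suggesting a promising avenue for human-AI collaboration on complex psycholinguistic tasks.

Link: https://arxiv.org/abs/2502.01991
Authors: Tunazzina Islam, Dan Goldwasser
Affiliations: Purdue University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Comments: Accepted at 17th ACM Web Science Conference 2025 (WebSci’25)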

Abstract:Nowadays, social media is pivotal in shaping public discourse, especially on polarizing issues like vaccination, where diverse moral perspectives influence individual opinions. In NLP, data scarcity and the complexity of psycholinguistic tasks such as identifying morality frames make relying solely on human annotators costly, time-consuming, and prone to inconsistency due to cognitive load. To address these issues, we leverage large language models (LLMs), which are adept at adapting to new tasks through few-shot learning, utilizing a handful of in-context examples coupled with explanations that connect examples to task principles. Our research explores LLMs’ potential to assist human annotators in identifying morality frames within vaccination debates on social media. We employ a two-step process: generating concepts and explanations with LLMs, followed by human evaluation using a “think-aloud” tool. Our study shows that integrating LLMs into the annotation process enhances accuracy, reduces task difficulty, and lowers cognitive load, suggesting a promising avenue for human-AI collaboration in complex psycholinguistic tasks.

[NLP-45] Gradient-Regularized Latent Space Modulation in Large Language Models for Structured Contextual Synthesis

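Quick read: This paper addresses how to generate structured text that stays coherent, stable, and within predefined constraints while preserving semantic fidelity. The key contribution is Gradient-Regularized Latent Space Modulation (GRLSM), which guides text generation by applying structured constraints within the latent space; the gradient-based regularization mitigates abrupt variations in latent representations and improves the structural consistency and logical progression of generated sequences.

Link: https://arxiv.org/abs/2502.01979
Authors: Derek Yotheringhay, Beatrix Nightingale, Maximilian Featherstone, Edmund Worthington, Hugo Ashdown
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: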

Abstract:Generating structured textual content requires mechanisms that enforce coherence, stability, and adherence to predefined constraints while maintaining semantic fidelity. Conventional approaches often rely on rule-based heuristics or fine-tuning strategies that lack flexibility and generalizability across diverse tasks. The incorporation of Gradient-Regularized Latent Space Modulation (GRLSM) introduces a novel paradigm for guiding text generation through the application of structured constraints within the latent space. The integration of gradient-based regularization mitigates abrupt variations in latent representations, ensuring a smoother encoding process that enhances structural consistency and logical progression within generated sequences. Comparative evaluations demonstrate that latent space modulation leads to a reduction in perplexity, increased coherence scores, and improved structural alignment across multiple domains. Stability assessments further indicate that the imposition of spectral norm constraints facilitates more controlled variations in generated text, preserving semantic consistency under input perturbations. Empirical results confirm that structured latent space constraints not only refine the organization of generated outputs but also enhance interpretability through more predictable and reliable synthesis patterns. Performance metrics illustrate that the GRLSM framework substantially reduces structural inconsistencies while preserving the generative flexibility inherent in neural models.

[NLP-46] CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing

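Quick read: This paper addresses the high computational cost of large language model inference, which limits deployment in resource-constrained applications. The key contribution is CITER (Collaborative Inference with Token-level Routing), a framework in which small and large language models (SLMs and LLMs) collaborate efficiently through a token-level routing strategy. CITER routes non-critical tokens to an SLM for efficiency and critical tokens to an LLM for generalization quality. Router training is formulated as policy optimization: the router is rewarded based on both prediction quality and inference cost, so it learns token-level routing scores and makes decisions that account for the current token and the future impact of its decisions. A shortcut is also introduced that significantly reduces the cost of reward estimation, improving the practicality of the approach.

Link: https://arxiv.org/abs/2502.01976
Authors: Wenhao Zheng, Yixiao Chen, Weitong Zhang, Souvik Kundu, Yun Li, Zhengzhong Liu, Eric P. Xing, Hongyi Wang, Huaxiu Yao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments: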

Abstract:Large language models have achieved remarkable success in various tasks but suffer from high computational costs during inference, limiting their deployment in resource-constrained applications. To address this issue, we propose a novel CITER (Collaborative Inference with Token-lEvel Routing) framework that enables efficient collaboration between small and large language models (SLMs & LLMs) through a token-level routing strategy. Specifically, CITER routes non-critical tokens to an SLM for efficiency and routes critical tokens to an LLM for generalization quality. We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation. This allows the router to learn to predict token-level routing scores and make routing decisions based on both the current token and the future impact of its decisions. To further accelerate the reward evaluation process, we introduce a shortcut which significantly reduces the costs of the reward estimation and improves the practicality of our approach. Extensive experiments on five benchmark datasets demonstrate that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications.
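
A toy sketch of the routing decision only, assuming the router has already produced a criticality score per token. The fixed threshold, the score values, and the relative costs are hypothetical; the paper's router is learned through policy optimization, which is not reproduced here.

```python
def route_tokens(router_scores, threshold=0.5):
    """Send tokens scored at or above the threshold to the LLM, the rest to the SLM."""
    return ["LLM" if score >= threshold else "SLM" for score in router_scores]


def inference_cost(routes, slm_cost=1.0, llm_cost=10.0):
    """Total cost of a routed sequence under assumed per-token costs."""
    return sum(llm_cost if route == "LLM" else slm_cost for route in routes)


scores = [0.1, 0.9, 0.4, 0.7]  # hypothetical router outputs for four tokens
routes = route_tokens(scores)
print(routes, inference_cost(routes), inference_cost(["LLM"] * len(scores)))
```

Comparing the routed cost with the all-LLM cost shows where the savings come from: every token the router can safely hand to the SLM is charged at the lower rate.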

[NLP-47] Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning

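Quick read: This paper addresses the problem that, in Supervised Fine-Tuning (SFT), even high-quality samples contain patterns or phrases that are unrelated to the task and therefore redundant or uninformative. The key contribution is a generic token cleaning pipeline that scores each token by examining the influence of model updates on it and then applies a threshold-based separation. Token influence can be measured in a single pass with a fixed reference model or iteratively with self-evolving reference models. The paper analyzes the benefits and limitations of both variants theoretically through error upper bounds, and experiments show consistent improvements across multiple downstream tasks.

Link: https://arxiv.org/abs/2502.01968
Authors: Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, Yang Liu
Affiliations: University of California, Santa Cruz; Northeastern University; Docta.ai; HKUST (Guangzhou); HKBU
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: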

Abstract:Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity. While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens within a sample can vary significantly. After pre-training, even in high-quality samples, patterns or phrases that are not task-related can be redundant or uninformative. Continuing to fine-tune on these patterns may offer limited benefit and even degrade downstream task performance. In this paper, we investigate token quality from a noisy-label perspective and propose a generic token cleaning pipeline for SFT tasks. Our method filters out uninformative tokens while preserving those carrying key task-specific information. Specifically, we first evaluate token quality by examining the influence of model updates on each token, then apply a threshold-based separation. The token influence can be measured in a single pass with a fixed reference model or iteratively with self-evolving reference models. The benefits and limitations of both methods are analyzed theoretically by error upper bounds. Extensive experiments show that our framework consistently improves performance across multiple downstream tasks.
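
A minimal sketch of threshold-based token filtering, assuming that a token's influence is approximated by how much its loss drops between a base model and a reference model. That proxy, the threshold value, and the names `token_influence` and `clean_token_mask` are assumptions for illustration; the abstract only states that quality is evaluated through the influence of model updates on each token.

```python
def token_influence(base_losses, reference_losses):
    """Assumed proxy: per-token loss under the base model minus loss under the reference."""
    return [base - ref for base, ref in zip(base_losses, reference_losses)]


def clean_token_mask(base_losses, reference_losses, threshold=0.5):
    """Keep (True) only tokens whose influence score exceeds the threshold."""
    scores = token_influence(base_losses, reference_losses)
    return [score > threshold for score in scores]


tokens = ["Sure", ",", "the", "answer", "is", "42"]
base = [0.2, 0.1, 0.3, 2.0, 0.5, 3.0]       # made-up per-token losses
reference = [0.2, 0.1, 0.3, 0.5, 0.4, 0.5]  # made-up per-token losses
mask = clean_token_mask(base, reference)
print([tok for tok, keep in zip(tokens, mask) if keep])
```

The resulting mask would be applied to the fine-tuning loss so that filtered tokens contribute nothing; in the iterative variant the reference model itself would be refreshed between rounds.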

[NLP-48] Boundary-Driven Table-Filling with Cross-Granularity Contrastive Learning for Aspect Sentiment Triplet Extraction ICASSP2025

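Quick read: This paper addresses the problem that existing methods for Aspect Sentiment Triplet Extraction (ASTE) focus mainly on word-level interactions and overlook sentence-level representations. This limits the model's ability to capture global context, particularly for multi-word aspect and opinion terms in complex sentences. The paper proposes boundary-driven table-filling with cross-granularity contrastive learning (BTF-CCL) to strengthen the semantic consistency between sentence-level and word-level representations, together with a multi-scale, multi-granularity convolutional method to capture richer semantic information. The approach captures sentence-level context more effectively while staying sensitive to local details.

Link: https://arxiv.org/abs/2502.01942
Authors: Qingling Li, Wushao Wen, Jinghui Qin
Affiliations: Sun Yat-Sen University; Guangdong University of Technology; Guangdong Shuye Intelligent Technology Co., Ltd.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ICASSP 2025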

Abstract:The Aspect Sentiment Triplet Extraction (ASTE) task aims to extract aspect terms, opinion terms, and their corresponding sentiment polarity from a given sentence. It remains one of the most prominent subtasks in fine-grained sentiment analysis. Most existing approaches frame triplet extraction as a 2D table-filling process in an end-to-end manner, focusing primarily on word-level interactions while often overlooking sentence-level representations. This limitation hampers the model’s ability to capture global contextual information, particularly when dealing with multi-word aspect and opinion terms in complex sentences. To address these issues, we propose boundary-driven table-filling with cross-granularity contrastive learning (BTF-CCL) to enhance the semantic consistency between sentence-level representations and word-level representations. By constructing positive and negative sample pairs, the model is forced to learn the associations at both the sentence level and the word level. Additionally, a multi-scale, multi-granularity convolutional method is proposed to capture rich semantic information better. Our approach can capture sentence-level contextual information more effectively while maintaining sensitivity to local details. Experimental results show that the proposed method achieves state-of-the-art performance on public benchmarks according to the F1 score.

[NLP-49] Can LLMs Maintain Fundamental Abilities under KV Cache Compression?

【速读】：本文旨在探究大型语言模型（Large Language Models, LLMs）中的一个被忽视的问题：键值（KV）缓存压缩方法对其基本能力的影响。尽管现有的方法在长上下文基准测试中实现了显著的压缩比，但它们对核心模型能力的影响尚未得到充分研究。为此，本文进行了全面的经验性研究，评估了多种任务中著名的KV缓存压缩方法，包括世界知识、常识推理、算术推理、代码生成、安全性以及长上下文理解等。研究分析表明，KV缓存压缩方法表现出特定任务上的性能下降，特别是在算术推理任务中，性能下降范围为17.4%至43.3%。值得注意的是，DeepSeek R1 Distill模型相较于指令调优模型展现出更强的压缩耐受性，其性能下降仅为9.67%至25.53%。基于对注意力模式和跨任务压缩性能的分析，本文提出了一种名为ShotKV的新压缩方法，该方法在保持shot级别语义连贯性的前提下，区别处理预填充和解码阶段。实证结果显示，在高压缩比条件下，ShotKV在长上下文生成任务中实现了9%至18%的性能提升。因此，本文的关键在于提出了ShotKV这一新型压缩方法以优化KV缓存压缩对LLMs基本能力的影响。

Link: https://arxiv.org/abs/2502.01941
Authors: Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages

Abstract:This paper investigates an under-explored challenge in large language models (LLMs): the impact of KV cache compression methods on LLMs’ fundamental capabilities. While existing methods achieve impressive compression ratios on long-context benchmarks, their effects on core model capabilities remain understudied. We present a comprehensive empirical study evaluating prominent KV cache compression methods across diverse tasks, spanning world knowledge, commonsense reasoning, arithmetic reasoning, code generation, safety, and long-context understanding and generation. Our analysis reveals that KV cache compression methods exhibit task-specific performance degradation. Arithmetic reasoning tasks prove particularly sensitive to aggressive compression, with different methods showing performance drops of 17.4% - 43.3%. Notably, the DeepSeek R1 Distill model exhibits more robust compression tolerance compared to instruction-tuned models, showing only 9.67% - 25.53% performance degradation. Based on our analysis of attention patterns and cross-task compression performance, we propose ShotKV, a novel compression approach that distinctly handles prefill and decoding phases while maintaining shot-level semantic coherence. Empirical results show that ShotKV achieves 9% - 18% performance improvements on long-context generation tasks under aggressive compression ratios.

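[NLP-50] Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs

[Quick read]: This paper addresses the neglect of group differences in algorithmic fairness. Its key contribution is a distinction between descriptive (fact-based), normative (value-based), and correlation (association-based) benchmarks, together with a benchmark suite of eight scenarios, totaling 16k questions, for assessing difference awareness. The results show that existing bias mitigation strategies can backfire on this new fairness dimension, which underscores difference awareness as a distinct dimension of algorithmic fairness.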

Link: https://arxiv.org/abs/2502.01926
Authors: Angelina Wang, Michelle Phan, Daniel E. Ho, Sanmi Koyejo
Affiliations: Stanford University
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:

Abstract:Algorithmic fairness has conventionally adopted a perspective of racial color-blindness (i.e., difference unaware treatment). We contend that in a range of important settings, group difference awareness matters. For example, differentiating between groups may be necessary in legal contexts (e.g., the U.S. compulsory draft applies to men but not women) and harm assessments (e.g., calling a girl a terrorist may be less harmful than calling a Muslim person one). In our work we first introduce an important distinction between descriptive (fact-based), normative (value-based), and correlation (association-based) benchmarks. This distinction is significant because each category requires distinct interpretation and mitigation tailored to its specific characteristics. Then, we present a benchmark suite composed of eight different scenarios for a total of 16k questions that enables us to assess difference awareness. Finally, we show results across ten models that demonstrate difference awareness is a distinct dimension of fairness where existing bias mitigation strategies may backfire.

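[NLP-51] PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling

[Quick read]: This paper studies many-shot jailbreaking, which circumvents the safety alignment of large language models (LLMs) by exploiting their ability to process long input sequences. The key contribution is PANDAS, a technique that modifies the fabricated dialogues with positive affirmations, negative demonstrations, and an optimized adaptive sampling method tailored to the topic of the target prompt, which markedly improves the effectiveness of many-shot jailbreaking. Experiments show that PANDAS clearly outperforms baseline methods in long-context scenarios.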

Link: https://arxiv.org/abs/2502.01925
Authors: Avery Ma, Yangchen Pan, Amir-massoud Farahmand
Affiliations: unknown
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Many-shot jailbreaking circumvents the safety alignment of large language models by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational turns between the user and the model. These fabricated exchanges are randomly sampled from a pool of malicious questions and responses, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with positive affirmations, negative demonstrations, and an optimized adaptive sampling method tailored to the target prompt’s topic. Extensive experiments on AdvBench and HarmBench, using state-of-the-art LLMs, demonstrate that PANDAS significantly outperforms baseline methods in long-context scenarios. Through an attention analysis, we provide insights on how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.

[NLP-52] Conceptual Metaphor Theory as a Prompting Paradigm for Large Language Models

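[Quick read]: This paper aims to improve the performance of large language models (LLMs) on complex reasoning tasks by introducing Conceptual Metaphor Theory (CMT) as a framework. The key idea is to use metaphorical mappings to structure abstract reasoning, which strengthens a model's ability to process and explain complex concepts. CMT-based prompts guide LLMs toward more structured, human-like reasoning patterns. Experiments indicate that CMT prompting significantly improves reasoning accuracy, clarity, and metaphorical coherence, outperforming the baseline models on all evaluated tasks.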

Link: https://arxiv.org/abs/2502.01901
Authors: Oliver Kramer
Affiliations: Universität Oldenburg, Germany; Department of Computer Science; Computational Intelligence Group
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We introduce Conceptual Metaphor Theory (CMT) as a framework for enhancing large language models (LLMs) through cognitive prompting in complex reasoning tasks. CMT leverages metaphorical mappings to structure abstract reasoning, improving models’ ability to process and explain intricate concepts. By incorporating CMT-based prompts, we guide LLMs toward more structured and human-like reasoning patterns. To evaluate this approach, we compare four native models (Llama3.2, Phi3, Gemma2, and Mistral) against their CMT-augmented counterparts on benchmark tasks spanning domain-specific reasoning, creative insight, and metaphor interpretation. Responses were automatically evaluated using the Llama3.3 70B model. Experimental results indicate that CMT prompting significantly enhances reasoning accuracy, clarity, and metaphorical coherence, outperforming baseline models across all evaluated tasks.

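[NLP-53] Training and Evaluating with Human Label Variation: An Empirical Study

[Quick read]: This paper addresses the challenges posed by human label variation (HLV), that is, acknowledging the natural variation in human annotation rather than relying on a single ground truth. Its key contribution is to propose new evaluation metrics and training methods and to meta-evaluate HLV evaluation metrics empirically. The study finds that training on either disaggregated annotations or soft labels often performs best, and that the proposed soft metric correlates best with human preference.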

Link: https://arxiv.org/abs/2502.01891
Authors: Kemal Kurniawan, Meladel Mistica, Timothy Baldwin, Jey Han Lau
Affiliations: The University of Melbourne; MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Human label variation (HLV) challenges the standard assumption that an example has a single ground truth, instead embracing the natural variation in human labelling to train and evaluate models. While various training methods and metrics for HLV have been proposed, there has been no systematic meta-evaluation of HLV evaluation metrics, contributing to the lack of clarity in the best HLV training method. We propose new evaluation metrics and training methods and empirically meta-evaluate HLV evaluation metrics. We find that training on either disaggregated annotations or soft labels often performs best across metrics, and that our proposed soft metric correlates best with human preference.
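
To make the idea of soft labels concrete, here is a minimal sketch, assuming a three-class task and made-up annotator votes. The helper names `soft_label` and `soft_cross_entropy` are illustrative and are not taken from the paper, whose own training methods and metrics may differ.

```python
import math
from collections import Counter

def soft_label(votes, num_classes):
    """Turn one example's annotator votes into a probability distribution."""
    counts = Counter(votes)
    return [counts.get(c, 0) / len(votes) for c in range(num_classes)]

def soft_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Hypothetical item labeled by five annotators (classes 0, 1, 2).
votes = [0, 0, 1, 0, 2]
target = soft_label(votes, num_classes=3)      # three votes for 0, one each for 1 and 2
prediction = [0.7, 0.2, 0.1]                   # hypothetical model output
loss = soft_cross_entropy(target, prediction)
```

Training on disaggregated annotations would instead treat each of the five votes as its own (example, label) pair with an ordinary hard-label loss.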

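[NLP-54] Latent Lexical Projection in Large Language Models: A Novel Approach to Implicit Representation Refinement

[Quick read]: This paper addresses the difficulty traditional embedding techniques have in capturing the internal representation of linguistic structure needed to generate semantically coherent text. The key contribution is Latent Lexical Projection (LLP), a method that refines lexical representations through a structured transformation into a latent space, improving the alignment between input embeddings and their contextual meanings. LLP integrates an optimized projection mechanism into an existing language model architecture, which enables more accurate token selection while maintaining syntactic integrity.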

Link: https://arxiv.org/abs/2502.01882
Authors: Ziad Shaker, Brendan Ashdown, Hugo Fitzalan, Alistair Heathcote, Jocasta Huntington
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Generating semantically coherent text requires a robust internal representation of linguistic structures, which traditional embedding techniques often fail to capture adequately. A novel approach, Latent Lexical Projection (LLP), is introduced to refine lexical representations through a structured transformation into a latent space, thereby enhancing the alignment between input embeddings and their contextual meanings. The method integrates an optimized projection mechanism within an existing language model architecture, enabling more accurate token selection while maintaining syntactic integrity. Evaluations across multiple benchmarks indicate a reduction in perplexity and an increase in BLEU scores, suggesting improvements in predictive accuracy and fluency. The analysis of lexical diversity reveals a more varied vocabulary in generated text, addressing common issues of redundancy and repetitive phrase structures. Further assessments of entropy distributions demonstrate a decline in uncertainty during decoding, reflecting enhanced confidence in word selection. Additionally, long-range dependency retention exhibits measurable gains, with increased classification accuracy at extended token distances. Computational efficiency remains within manageable constraints, despite the added projection mechanism, highlighting the practicality of LLP for integration into existing architectures.

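[NLP-55] SelfCheckAgent: Zero-Resource Hallucination Detection in Generative Large Language Models

[Quick read]: This paper addresses hallucination detection for large language models (LLMs) deployed in real-world applications. It proposes SelfCheckAgent, a framework that integrates three agents: the Symbolic Agent, the Specialized Detection Agent, and the Contextual Consistency Agent. Together these agents provide a multi-dimensional approach to hallucination detection. The key result comes from the Contextual Consistency Agent using Llama 3.1 with Chain-of-Thought (CoT), which scores 93.64% on NonFactual hallucination detection, 70.26% on Factual, and 78.48% on Ranking on the WikiBio dataset. A triangulation strategy further strengthens SelfCheckAgent overall.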

Link: https://arxiv.org/abs/2502.01812
Authors: Diyana Muhammed, Gollam Rabby, Sören Auer
Affiliations: TIB – Leibniz Information Centre for Science and Technology, Hannover, Germany; L3S Research Center, Leibniz University Hannover, Hanover, Germany
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Detecting hallucinations in Large Language Models (LLMs) remains a critical challenge for their reliable deployment in real-world applications. To address this, we introduce SelfCheckAgent, a novel framework integrating three different agents: the Symbolic Agent, the Specialized Detection Agent, and the Contextual Consistency Agent. These agents provide a robust multi-dimensional approach to hallucination detection. Notable results include the Contextual Consistency Agent leveraging Llama 3.1 with Chain-of-Thought (CoT) to achieve outstanding performance on the WikiBio dataset, with NonFactual hallucination detection scoring 93.64%, Factual 70.26%, and Ranking 78.48% respectively. On the AIME dataset, GPT-4o with CoT excels in NonFactual detection with 94.89% but reveals trade-offs in Factual with 30.58% and Ranking with 30.68%, underscoring the complexity of hallucination detection in the complex mathematical domains. The framework also incorporates a triangulation strategy, which increases the strengths of the SelfCheckAgent, yielding significant improvements in real-world hallucination identification. The comparative analysis demonstrates SelfCheckAgent’s applicability across diverse domains, positioning it as a crucial advancement for trustworthy LLMs. These findings highlight the potentiality of consistency-driven methodologies in detecting hallucinations in LLMs.

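[NLP-56] Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging

[Quick read]: This paper addresses the fact that machine learning models trained on a mixture of data domains show very different downstream performance depending on the domain weights. The key contribution is Soup-of-Experts, an architecture that can instantiate a model at test time for any domain weights, at minimal computational cost and without retraining. The architecture consists of a bank of expert parameters that are linearly combined to instantiate one model. The linear combination coefficients are learned as a function of the input domain weights; training samples random domain weights, instantiates the corresponding model, and backpropagates through one batch of data sampled with those weights.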

Link: https://arxiv.org/abs/2502.01804
Authors: Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can instantiate a model at test time for any domain weights with minimal computational cost and without re-training the model. Our architecture consists of a bank of expert parameters, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of the input domain weights. To train this architecture, we sample random domain weights, instantiate the corresponding model, and backprop through one batch of data sampled with these domain weights. We demonstrate how our approach obtains small specialized models on several language modeling tasks quickly. Soup-of-Experts are particularly appealing when one needs to ship many different specialist models quickly under a model size constraint.
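
The instantiation step described in the abstract can be sketched as follows. This is only an illustration under simplifying assumptions: parameters are flat lists of floats, and a fixed `mixing_matrix` (a hypothetical name) stands in for the learned function that maps domain weights to combination coefficients.

```python
def combination_coefficients(domain_weights, mixing_matrix):
    """Map domain weights to one coefficient per expert (assumed linear map)."""
    return [sum(m * w for m, w in zip(row, domain_weights)) for row in mixing_matrix]

def instantiate(expert_bank, coefficients):
    """Build one parameter vector as a linear combination of the experts."""
    size = len(expert_bank[0])
    return [sum(c * expert[i] for c, expert in zip(coefficients, expert_bank))
            for i in range(size)]

expert_bank = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # two experts, three parameters each
mixing_matrix = [[1.0, 0.0], [0.0, 1.0]]            # identity: expert k follows domain k
domain_weights = [0.25, 0.75]

coeffs = combination_coefficients(domain_weights, mixing_matrix)
params = instantiate(expert_bank, coeffs)           # [0.25, 0.75, -0.25]
```

Because instantiation is a single weighted sum over the bank, producing a specialist for new domain weights costs far less than retraining, which is the property the abstract emphasizes.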

[NLP-57] CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition

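[Quick read]: This paper addresses the problem that modern deep learning models achieve high overall performance yet consistently fail on specific subgroups. The key contribution is CTC-DRO, a distributionally robust optimization method for the connectionist temporal classification (CTC) loss. It smooths the group weight update to avoid overemphasizing high-loss groups and uses input length-matched batching to mitigate the scaling of the CTC loss with input length, which lowers both the worst-language error and the average error in multilingual automatic speech recognition (ASR).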

Link: https://arxiv.org/abs/2502.01777
Authors: Martijn Bartelds, Ananjan Nandi, Moussa Koulako Bala Doumbouya, Dan Jurafsky, Tatsunori Hashimoto, Karen Livescu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Modern deep learning models often achieve high overall performance, but consistently fail on specific subgroups. Group distributionally robust optimization (group DRO) addresses this problem by minimizing the worst-group loss, but it fails when group losses misrepresent performance differences between groups. This is common in domains like speech, where the widely used connectionist temporal classification (CTC) loss scales with input length and varies with linguistic and acoustic properties, leading to spurious differences between group losses. We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC’s scaling issues. We evaluate CTC-DRO on the task of multilingual automatic speech recognition (ASR) across five language sets from the ML-SUPERB 2.0 benchmark. CTC-DRO consistently outperforms group DRO and CTC-based baseline models, reducing the worst-language error by up to 65.9% and the average error by up to 47.7%. CTC-DRO can be applied to ASR with minimal computational costs, and offers the potential for reducing group disparities in other domains with similar challenges.
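
For context, the standard group DRO weight update that CTC-DRO modifies can be sketched as below. This shows the baseline exponentiated-gradient update only; the smoothed update of CTC-DRO is not specified in the abstract and is not reproduced here. The losses and step size are made-up values.

```python
import math

def group_dro_step(weights, group_losses, step_size=0.1):
    """One exponentiated-gradient update of the group weights (standard group DRO)."""
    scaled = [w * math.exp(step_size * loss) for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

def weighted_loss(weights, group_losses):
    """Training objective: group losses averaged under the current weights."""
    return sum(w * loss for w, loss in zip(weights, group_losses))

weights = [1 / 3, 1 / 3, 1 / 3]
group_losses = [0.5, 2.0, 1.0]   # hypothetical per-language losses, held fixed
for _ in range(50):
    weights = group_dro_step(weights, group_losses)
objective = weighted_loss(weights, group_losses)
```

With losses held fixed, nearly all weight ends up on the highest-loss group. If that loss is high only because utterances are long, the emphasis is spurious, which is the failure mode the abstract attributes to applying group DRO directly to CTC losses.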

[NLP-58] On Bob Dylan: A Computational Perspective

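[Quick read]: This paper uses a large-scale computational analysis to examine how Bob Dylan's lyrics evolved from 1962 to 2012, testing Cass Sunstein's observation about Dylan's "dishabituating" style. The key step is to use the large language model o3-mini-high to extract concept-to-concept relationships and build directed knowledge graphs that capture the thematic structure of the lyrics. With this approach the author quantifies changes over time in sentiment, metaphorical expression, thematic diversity, and network complexity, finding increased use of metaphor, an evolving sentiment profile, and heightened dishabituation. The work thereby offers a new computational method for analyzing an artist's evolution.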

Link: https://arxiv.org/abs/2502.01772
Authors: Prashant Garg
Affiliations: Imperial College London; Department of Economics and Public Policy
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments:

Abstract:Cass Sunstein’s essay ‘On Bob Dylan’ describes Dylan’s ‘dishabituating’ style – a constant refusal to conform to expectation and a penchant for reinventing his musical and lyrical identity. In this paper, I extend Sunstein’s observations through a large-scale computational analysis of Dylan’s lyrics from 1962 to 2012. Using o3-mini-high (a large language model), I extract concept-to-concept relationships from the lyrics and construct directed knowledge graphs that capture Dylan’s thematic structure. I then quantify shifts in sentiment, metaphorical expression, thematic diversity, and network complexity over time. The results indicate that Dylan’s lyrics increasingly rely on metaphor, display an evolving sentiment profile, and exhibit heightened dishabituation – measured here as a growing variance in the network centrality of key concepts. I also find that references to movement, protest, and mythic imagery fluctuate in ways that align with well-known phases of Dylan’s career, reflecting the dynamic and unpredictable quality of his art. These findings not only deepen our empirical understanding of Sunstein’s thesis but also introduce a novel computational method for analyzing an artist’s evolution, offering broader applicability to the study of cultural and creative change.

[NLP-59] Evaluation of Large Language Models via Coupled Token Generation

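[Quick read]: This paper addresses the inconsistency that randomization introduces when large language models respond to the same prompt. Its key contribution is a causal model for coupled autoregressive generation, which lets different large language models sample responses with the same source of randomness. Using this model, the paper shows that on evaluations based on benchmark datasets, coupled autoregressive generation reaches the same conclusions as vanilla autoregressive generation while requiring provably fewer samples. On evaluations based on human pairwise comparisons, however, coupled and vanilla autoregressive generation can produce different model rankings, even with an infinite number of samples. This suggests that a model's apparent advantage under existing evaluation protocols may not be genuine and may instead be confounded by the randomness inherent in generation. Experiments support the theory: across several knowledge areas, coupled autoregressive generation needs up to 40% fewer samples, and the win rates of pairwise comparisons judged by a strong large language model differ between the two generation modes.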

Link: https://arxiv.org/abs/2502.01754
Authors: Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, Manuel Gomez-Rodriguez
Affiliations: Max Planck Institute for Software Systems
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:State of the art large language models rely on randomization to respond to a prompt. As an immediate consequence, a model may respond differently to the same prompt if asked multiple times. In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning. Our starting point is the development of a causal model for coupled autoregressive generation, which allows different large language models to sample responses with the same source of randomness. Building upon our causal model, we first show that, on evaluations based on benchmark datasets, coupled autoregressive generation leads to the same conclusions as vanilla autoregressive generation but using provably fewer samples. However, we further show that, on evaluations based on (human) pairwise comparisons, coupled and vanilla autoregressive generation can surprisingly lead to different rankings when comparing more than two models, even with an infinite amount of samples. This suggests that the apparent advantage of a model over others in existing evaluation protocols may not be genuine but rather confounded by the randomness inherent to the generation process. To illustrate and complement our theoretical results, we conduct experiments with several large language models from the Llama family. We find that, across multiple knowledge areas from the popular MMLU benchmark dataset, coupled autoregressive generation requires up to 40% fewer samples to reach the same conclusions as vanilla autoregressive generation. Further, using data from the LMSYS Chatbot Arena platform, we find that the win-rates derived from pairwise comparisons by a strong large language model to prompts differ under coupled and vanilla autoregressive generation.
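
One common way to let two models share a source of randomness is Gumbel-max sampling with shared noise. The sketch below uses that technique as an assumption; the abstract does not say which mechanism the paper's causal model uses. The distribution and function names are hypothetical.

```python
import math
import random

def gumbel_noise(rng, size):
    """Draw one standard Gumbel variate per vocabulary entry."""
    return [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in range(size)]

def sample_token(probs, noise):
    """Gumbel-max sampling: argmax over log-probability plus noise."""
    scores = [math.log(p) + g for p, g in zip(probs, noise)]
    return max(range(len(probs)), key=lambda i: scores[i])

def agreement(probs_a, probs_b, coupled, trials=1000, seed=0):
    """Count how often two models emit the same token."""
    rng = random.Random(seed)
    same = 0
    for _ in range(trials):
        noise_a = gumbel_noise(rng, len(probs_a))
        # Coupled: model B reuses model A's noise; vanilla: it draws its own.
        noise_b = noise_a if coupled else gumbel_noise(rng, len(probs_b))
        same += sample_token(probs_a, noise_a) == sample_token(probs_b, noise_b)
    return same

probs = [0.5, 0.3, 0.2]  # hypothetical next-token distribution shared by both models
coupled_hits = agreement(probs, probs, coupled=True)    # identical models always agree
vanilla_hits = agreement(probs, probs, coupled=False)   # agreement only by chance
```

Two identical models always agree under coupling but disagree often under independent sampling, so coupling removes variance that has nothing to do with the models themselves. This is the intuition behind needing fewer samples to compare models.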

[NLP-60] ACECODER: Acing Coder RL via Automated Test-Case Synthesis

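[Quick read]: This paper addresses the limited use of reinforcement learning (RL) in training coder models, which stems mainly from the lack of reliable reward data or reward models in the code domain. It proposes to strengthen code model training through automated large-scale test-case synthesis. The key contribution is a pipeline that generates a large number of (question, test-cases) pairs from existing code data. With these test cases, preference pairs are built from the pass rates of sampled programs and used to train reward models with the Bradley-Terry loss. The approach markedly improves Llama-3.1-8B-Ins and Qwen2.5-Coder-7B-Ins, and reinforcement learning yields further gains on HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4).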

Link: https://arxiv.org/abs/2502.01718
Authors: Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen
Affiliations: University of Waterloo; HKUST; Independent Researcher; Netmind.AI
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 1 figure, 7 tables

Abstract:Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. It shows an average of 10-point improvement for Llama-3.1-8B-Ins and 5-point improvement for Qwen2.5-Coder-7B-Ins through best-of-32 sampling, making the 7B model on par with 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow the R1-style training to start from Qwen2.5-Coder-base directly and show that our RL training can improve model on HumanEval-plus by over 25% and MBPP-plus by 6% for merely 80 optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.
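
The construction of preference pairs from pass rates, and the Bradley-Terry loss used to train a reward model on them, can be sketched as follows. The pairing threshold `min_gap` and the program names are assumptions made for illustration; the abstract does not state the paper's actual pairing criteria.

```python
import math
from itertools import combinations

def pass_rate(results):
    """Fraction of test cases a sampled program passes."""
    return sum(results) / len(results)

def preference_pairs(programs, min_gap=0.4):
    """Pair programs whose pass rates differ by at least `min_gap` (assumed threshold)."""
    pairs = []
    for (name_a, res_a), (name_b, res_b) in combinations(programs.items(), 2):
        gap = pass_rate(res_a) - pass_rate(res_b)
        if abs(gap) >= min_gap:
            pairs.append((name_a, name_b) if gap > 0 else (name_b, name_a))
    return pairs

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen program beats the rejected one."""
    return math.log1p(math.exp(-(reward_chosen - reward_rejected)))

programs = {
    "prog_a": [True, True, True, True],      # pass rate 1.0
    "prog_b": [True, False, False, False],   # pass rate 0.25
    "prog_c": [True, True, True, False],     # pass rate 0.75
}
pairs = preference_pairs(programs)  # (chosen, rejected) tuples
```

Each resulting tuple is (chosen, rejected). A reward model trained with this loss is pushed to score the chosen program above the rejected one, and the loss shrinks as the reward margin grows.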

[NLP-61] Comply: Learning Sentences with Complex Weights inspired by Fruit Fly Olfaction

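[Quick read]: This paper asks whether the performance of biologically inspired neural networks can be improved further. The key contribution is Comply, a model that incorporates positional information through complex weights so that a single-layer neural network can learn sequence representations. Without adding parameters, Comply not only surpasses FlyVec but also performs on par with significantly larger state-of-the-art models.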

Link: https://arxiv.org/abs/2502.01706
Authors: Alexei Figueroa, Justus Westerhoff, Atefi Golzar, Dennis Fast, Benjamin Winter, Felix Alexader Gers, Alexander Löser, Wolfang Nejdl
Affiliations: DATEXIS, Berliner Hochschule für Technik (BHT); Medical AI Analytics & Information GmbH; L3S, Leibniz University Hannover
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted at NICE2025

Abstract:Biologically inspired neural networks offer alternative avenues to model data distributions. FlyVec is a recent example that draws inspiration from the fruit fly’s olfactory circuit to tackle the task of learning word embeddings. Surprisingly, this model performs competitively even against deep learning approaches specifically designed to encode text, and it does so with the highest degree of computational efficiency. We pose the question of whether this performance can be improved further. For this, we introduce Comply. By incorporating positional information through complex weights, we enable a single-layer neural network to learn sequence representations. Our experiments show that Comply not only supersedes FlyVec but also performs on par with significantly larger state-of-the-art models. We achieve this without additional parameters. Comply yields sparse contextual representations of sentences that can be interpreted explicitly from the neuron weights.

[NLP-62] QLESS: A Quantized Approach for Data Valuation and Selection in Large Language Model Fine-Tuning

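[Quick read]: This paper addresses the high computational cost of processing massive datasets when fine-tuning large language models (LLMs). The key contribution is QLESS (Quantized Low-rank Gradient Similarity Search), which combines gradient quantization with the LESS framework for memory-efficient data valuation and selection. QLESS compresses in two steps: it first obtains low-dimensional gradient representations through LoRA-based random projection, then quantizes these gradients to low-bitwidth representations. Experiments on several LLM architectures (LLaMA, Mistral, Qwen) and benchmarks (MMLU, BBH, TyDiQA) show that QLESS matches the data selection performance of LESS while reducing memory usage by up to 16x, and that even 1-bit gradient quantization preserves data valuation quality. These findings position QLESS as a practical, scalable way to identify informative examples under strict memory constraints.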

Link: https://arxiv.org/abs/2502.01703
Authors: Moses Ananta, Muhammad Farid Adilazuarda, Zayd Muhammad Kawakibi Zuhri, Ayu Purwarianti, Alham Fikri Aji
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Fine-tuning large language models (LLMs) is often constrained by the computational costs of processing massive datasets. We propose QLESS (Quantized Low-rank Gradient Similarity Search), which integrates gradient quantization with the LESS framework to enable memory-efficient data valuation and selection. QLESS employs a two-step compression process: first, it obtains low-dimensional gradient representations through LoRA-based random projection; then, it quantizes these gradients to low-bitwidth representations. Experiments on multiple LLM architectures (LLaMA, Mistral, Qwen) and benchmarks (MMLU, BBH, TyDiQA) show that QLESS achieves comparable data selection performance to LESS while reducing memory usage by up to 16x. Even 1-bit gradient quantization preserves data valuation quality. These findings underscore QLESS as a practical, scalable approach to identifying informative examples within strict memory constraints.
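
The two-step compression can be illustrated with a toy sketch: a random ±1 projection followed by 1-bit sign quantization, with similarity computed on the quantized vectors. The projection scheme, the similarity measure, and the gradient values are assumptions chosen for brevity and are not taken from QLESS or LESS.

```python
import random

def random_projection(gradient, out_dim, seed=0):
    """Project onto `out_dim` random ±1 directions (same seed gives the same matrix)."""
    rng = random.Random(seed)
    scale = out_dim ** -0.5
    return [scale * sum(rng.choice((-1.0, 1.0)) * g for g in gradient)
            for _ in range(out_dim)]

def quantize_1bit(vector):
    """Keep only the sign of each coordinate."""
    return [1 if v >= 0 else -1 for v in vector]

def sign_similarity(a, b):
    """Cosine similarity of two ±1 vectors, i.e. normalized agreement."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

# Hypothetical 8-dimensional gradients compressed to 4 sign bits each.
gradient = [4.0, -0.5, 0.25, 1.0, -0.75, 0.5, 0.125, -0.25]
aligned = [2.0 * g for g in gradient]
opposed = [-g for g in gradient]

q = quantize_1bit(random_projection(gradient, out_dim=4))
sim_aligned = sign_similarity(q, quantize_1bit(random_projection(aligned, out_dim=4)))
sim_opposed = sign_similarity(q, quantize_1bit(random_projection(opposed, out_dim=4)))
```

The sign bits discard magnitude but keep direction, so an aligned gradient still scores 1.0 and an opposed one scores -1.0. This gives some intuition for why very low bit widths can preserve a similarity-based ranking.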

[NLP-63] Multimodal Inverse Attention Network with Intrinsic Discriminant Feature Exploitation for Fake News Detection

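[Quick read]: This paper addresses the failure of multimodal fake news detection methods to make full use of modality-specific representations and explicit discrepancy features. The key contribution is the Multimodal Inverse Attention Network (MIAN), a framework that uses a hierarchical learning module and a cross-modal interaction module to capture diverse relationships within and across modalities, and an inverse attention mechanism to extract inconsistency features explicitly, which significantly improves fake news detection performance.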

Link: https://arxiv.org/abs/2502.01699
Authors: Tianlin Zhang, En Yu, Yi Shao, Shuai Li, Sujuan Hou, Jiande Sun
Affiliations: School of Information Science and Engineering, Shandong Normal University; School of Control Science and Engineering, Shandong University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments:

Abstract:Multimodal fake news detection has garnered significant attention due to its profound implications for social security. While existing approaches have contributed to understanding cross-modal consistency, they often fail to leverage modal-specific representations and explicit discrepant features. To address these limitations, we propose a Multimodal Inverse Attention Network (MIAN), a novel framework that explores intrinsic discriminative features based on news content to advance fake news detection. Specifically, MIAN introduces a hierarchical learning module that captures diverse intra-modal relationships through local-to-global and local-to-local interactions, thereby generating enhanced unimodal representations to improve the identification of fake news at the intra-modal level. Additionally, a cross-modal interaction module employs a co-attention mechanism to establish and model dependencies between the refined unimodal representations, facilitating seamless semantic integration across modalities. To explicitly extract inconsistency features, we propose an inverse attention mechanism that effectively highlights the conflicting patterns and semantic deviations introduced by fake news in both intra- and inter-modality. Extensive experiments on benchmark datasets demonstrate that MIAN significantly outperforms state-of-the-art methods, underscoring its pivotal contribution to advancing social security through enhanced multimodal fake news detection.

[NLP-64] BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation

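[Quick read]: This paper addresses how to generate high-quality and diverse synthetic data for training large language models (LLMs). The key contribution is Base-Refine (BARE), a method that combines the diversity of base models with the quality of instruction-tuned models through a two-stage process. With only a few few-shot examples and minimal curation, BARE improves downstream task performance markedly: a 101% improvement over instruct-only data on GSM8K and an 18.4% improvement over state-of-the-art methods on RAFT.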

Link: https://arxiv.org/abs/2502.01697
Authors: Alan Zhu, Parth Asawa, Jared Quincy Davis, Lingjiao Chen, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:As the demand for high-quality data in model training grows, researchers and developers are increasingly generating synthetic data to tune and train LLMs. A common assumption about synthetic data is that sampling from instruct-tuned models is sufficient; however, these models struggle to produce diverse outputs, a key requirement for generalization. Despite various prompting methods, in this work we show that achieving meaningful diversity from instruct-tuned models remains challenging. In contrast, we find base models without post-training exhibit greater diversity, but are less capable at instruction following and hence of lower quality. Leveraging this insight, we propose Base-Refine (BARE), a synthetic data generation method that combines the diversity of base models with the quality of instruct-tuned models through a two-stage process. With minimal few-shot examples and curation, BARE generates diverse and high-quality datasets, improving downstream task performance. We show that fine-tuning with as few as 1,000 BARE-generated samples can reach performance comparable to the best similarly sized models on LiveCodeBench tasks. Furthermore, fine-tuning with BARE-generated data achieves a 101% improvement over instruct-only data on GSM8K and an 18.4% improvement over SOTA methods on RAFT.

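[NLP-65] Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model

[Quick read]: This paper addresses the challenge of reliably extracting structured data from radiology reports, particularly for complex non-English text such as Hebrew. The key contribution is an agent-based, uncertainty-aware approach: a large language model (LLM) with Bayesian Prompt Ensembles (BayesPE) uses six semantically equivalent prompts to estimate uncertainty, and an agent-based decision model integrates the prompt outputs into five confidence levels, which improves the trustworthiness of LLM predictions in medical applications.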

Link: https://arxiv.org/abs/2502.01691
Authors: Hadas Ben-Atya, Naama Gavrielov, Zvi Badash, Gili Focht, Ruth Cytter-Kuint, Talar Hagopian, Dan Turner, Moti Freiman
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reliable extraction of structured data from radiology reports using Large Language Models (LLMs) remains challenging, especially for complex, non-English texts like Hebrew. This study introduces an agent-based uncertainty-aware approach to improve the trustworthiness of LLM predictions in medical applications. We analyzed 9,683 Hebrew radiology reports from Crohn’s disease patients (from 2010 to 2023) across three medical centers. A subset of 512 reports was manually annotated for six gastrointestinal organs and 15 pathological findings, while the remaining reports were automatically annotated using HSMP-BERT. Structured data extraction was performed using Llama 3.1 (Llama 3-8b-instruct) with Bayesian Prompt Ensembles (BayesPE), which employed six semantically equivalent prompts to estimate uncertainty. An Agent-Based Decision Model integrated multiple prompt outputs into five confidence levels for calibrated uncertainty and was compared against three entropy-based models. Performance was evaluated using accuracy, F1 score, precision, recall, and Cohen’s Kappa before and after filtering high-uncertainty cases. The agent-based model outperformed the baseline across all metrics, achieving an F1 score of 0.3967, recall of 0.6437, and Cohen’s Kappa of 0.3006. After filtering high-uncertainty cases (greater than or equal to 0.5), the F1 score improved to 0.4787, and Kappa increased to 0.4258. Uncertainty histograms demonstrated clear separation between correct and incorrect predictions, with the agent-based model providing the most well-calibrated uncertainty estimates. By incorporating uncertainty-aware prompt ensembles and an agent-based decision model, this approach enhances the performance and reliability of LLMs in structured data extraction from radiology reports, offering a more interpretable and trustworthy solution for high-stakes medical applications.
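
A simple entropy-based uncertainty score over the outputs of a prompt ensemble, in the spirit of the entropy-based baselines the abstract mentions, could look like the sketch below. It is not the paper's agent-based decision model. The labels are invented, and the 0.5 cut-off is borrowed from the abstract only as an illustrative value, since the abstract applies it to the paper's own uncertainty scores.

```python
import math
from collections import Counter

def vote_entropy(labels, num_classes=2):
    """Normalized Shannon entropy of the votes (0 = unanimous, 1 = evenly split)."""
    counts = Counter(labels)
    probs = [c / len(labels) for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(num_classes)

def majority_label(labels):
    """Most frequent label among the prompt outputs."""
    return Counter(labels).most_common(1)[0][0]

def keep_prediction(labels, threshold=0.5):
    """Keep a prediction only if its uncertainty is below the threshold."""
    return vote_entropy(labels) < threshold

# Hypothetical outputs of six semantically equivalent prompts for one finding.
unanimous = ["present"] * 6
split = ["present", "absent", "present", "absent", "present", "absent"]
```

In this toy setting a unanimous ensemble has zero uncertainty and is kept, while an evenly split one has maximal uncertainty and is filtered out before metrics are computed.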

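[NLP-66] Automated Extraction of Spatio-Semantic Graphs for Identifying Cognitive Impairment

[Quick read]: This paper addresses the fact that analyses of the linguistic content of picture descriptions, used to assess cognitive-linguistic impairment, usually overlook the participant's visual narrative path. The key contribution is an automated method that estimates spatio-semantic graphs through automated extraction of content information units (CIUs), giving an automatic characterization of the visual semantic path during picture description. Experiments show that the automatically built spatio-semantic graphs effectively differentiate cognitively impaired from unimpaired speakers. Statistical analyses show that features from the automated method yield results comparable to the manual method, with even larger differences between the clinical groups of interest. These results highlight the potential of the automated approach for developing clinical speech models for cognitive impairment assessment.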

链接: https://arxiv.org/abs/2502.01685
作者: Si-Ioi Ng,Pranav S. Ambadi,Kimberly D. Mueller,Julie Liss,Visar Berisha
机构: Arizona State University (亚利桑那州立大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: To appear in ICASSP 2025

点击查看摘要

Abstract:Existing methods for analyzing linguistic content from picture descriptions for assessment of cognitive-linguistic impairment often overlook the participant’s visual narrative path, which typically requires eye tracking to assess. Spatio-semantic graphs are a useful tool for analyzing this narrative path from transcripts alone, however they are limited by the need for manual tagging of content information units (CIUs). In this paper, we propose an automated approach for estimation of spatio-semantic graphs (via automated extraction of CIUs) from the Cookie Theft picture commonly used in cognitive-linguistic analyses. The method enables the automatic characterization of the visual semantic path during picture description. Experiments demonstrate that the automatic spatio-semantic graphs effectively differentiate between cognitively impaired and unimpaired speakers. Statistical analyses reveal that the features derived by the automated method produce comparable results to the manual method, with even greater group differences between clinical groups of interest. These results highlight the potential of the automated approach for extracting spatio-semantic features in developing clinical speech models for cognitive impairment assessment.

[NLP-67] LLM-Powered Benchmark Factory: Reliable, Generic and Efficient

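[Quick read]: This paper addresses the lack of generalizability and reliability in benchmark generators for large language models (LLMs). The key to the solution is an automated, unbiased evaluation framework, under which the authors analyze the strengths and weaknesses of directly prompting LLMs as generic benchmark generators. To improve reliability, they introduce a series of methods that address the identified weaknesses and integrate them into BenchMaker. Experiments across multiple LLMs and tasks show that BenchMaker performs better than or comparably to human-annotated benchmarks and yields highly consistent evaluation results across different LLMs, supporting its generalizability and reliability.

Link: https://arxiv.org/abs/2502.01683
Authors: Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
Affiliations: School of Computer Science and Technology, Beijing Institute of Technology; Xiaohongshu Inc
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: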

Abstract:The rapid advancement of large language models (LLMs) has led to a surge in both model supply and application demands. To facilitate effective matching between them, reliable, generic and efficient benchmark generators are widely needed. However, human annotators are constrained by inefficiency, and current LLM benchmark generators not only lack generalizability but also struggle with limited reliability, as they lack a comprehensive evaluation framework for validation and optimization. To fill this gap, we first propose an automated and unbiased evaluation framework, structured around four dimensions and ten criteria. Under this framework, we carefully analyze the advantages and weaknesses of directly prompting LLMs as generic benchmark generators. To enhance the reliability, we introduce a series of methods to address the identified weaknesses and integrate them as BenchMaker. Experiments across multiple LLMs and tasks confirm that BenchMaker achieves superior or comparable performance to human-annotated benchmarks on all metrics, highlighting its generalizability and reliability. More importantly, it delivers highly consistent evaluation results across 12 LLMs (0.967 Pearson correlation against MMLU-Pro), while taking only 0.005 and 0.38 minutes per sample.

[NLP-68] The exception of humour: Iconicity, Phonemic Surprisal, Memory Recall, and Emotional Associations

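[Quick read]: This paper examines the relationships between humor, phonemic bigram surprisal, emotional valence, and memory recall. Its key finding is that words with negative associations tend to have higher phonemic surprisal and are easier to remember, and that humorous words, although usually associated with positive emotions, likewise show heightened surprisal and enhanced memorability.

Link: https://arxiv.org/abs/2502.01682
Authors: Alexander Kilpatrick, Maria Flaksman
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: 9 pages, 6 tables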

Abstract:This meta-study explores the relationships between humor, phonemic bigram surprisal, emotional valence, and memory recall. Prior research indicates that words with higher phonemic surprisal are more readily remembered, suggesting that unpredictable phoneme sequences promote long-term memory recall. Emotional valence is another well-documented factor influencing memory, with negative experiences and stimuli typically being remembered more easily than positive ones. Building on existing findings, this study highlights that words with negative associations often exhibit greater surprisal and are easier to recall. Humor, however, presents an exception: while associated with positive emotions, humorous words also display heightened surprisal and enhanced memorability.
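
Phonemic bigram surprisal is the negative log probability of each phoneme given the one before it. The sketch below shows one way to compute a per-word average. It rests on several assumptions that are not from the paper: letters stand in for phonemes, the lexicon is a made-up toy list, `#` marks the word start, and add-one smoothing handles unseen bigrams.

```python
import math
from collections import Counter


def train_bigram_model(words):
    """Count context symbols and bigrams over a toy lexicon."""
    contexts, bigrams, symbols = Counter(), Counter(), set()
    for word in words:
        seq = ["#"] + list(word)  # '#' is a hypothetical word-start marker
        symbols.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, len(symbols)


def mean_surprisal(word, model):
    """Average of -log2 P(symbol | previous symbol) with add-one smoothing."""
    if not word:
        raise ValueError("word must be non-empty")
    contexts, bigrams, vocab_size = model
    seq = ["#"] + list(word)
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total += -math.log2(prob)
    return total / len(word)


# Toy lexicon in which "ba" is frequent and "bo" is rare (made-up data).
model = train_bigram_model(["ba", "ba", "ba", "bo"])
common = mean_surprisal("ba", model)
rare = mean_surprisal("bo", model)
```

In this toy lexicon the rarer sequence gets the higher surprisal, which is the quantity the abstract relates to memorability.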

[NLP-69] LIBRA: Measuring Bias of Large Language Model from a Local Context ECIR2025

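[Quick read]: This paper addresses the inherent biases of large language models (LLMs) across cultural contexts and proposes a new evaluation approach that accounts for the irrelevant outputs models produce when they encounter words beyond their knowledge boundaries. The key to the solution is the Local Integrated Bias Recognition and Assessment Framework (LIBRA), used to build a dataset of over 360,000 test cases in the New Zealand context. The paper also proposes the Enhanced Idealized CAT Score (EiCAT), which combines the iCAT score with a beyond knowledge boundary score (bbs) and a distribution divergence-based bias measurement to assess LLM bias more comprehensively. Together, these address two limitations of existing work: a focus on the U.S. cultural context and the assumption that models are familiar with the target social groups.

Link: https://arxiv.org/abs/2502.01679
Authors: Bo Pang, Tingrui Qiao, Caroline Walker, Chris Cunningham, Yun Sing Koh
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Paper accepted by ECIR 2025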

Abstract:Large Language Models (LLMs) have significantly advanced natural language processing applications, yet their widespread use raises concerns regarding inherent biases that may reduce utility or harm for particular social groups. Despite the advancement in addressing LLM bias, existing research has two major limitations. First, existing LLM bias evaluation focuses on the U.S. cultural context, making it challenging to reveal stereotypical biases of LLMs toward other cultures, leading to unfair development and use of LLMs. Second, current bias evaluation often assumes models are familiar with the target social groups. When LLMs encounter words beyond their knowledge boundaries that are unfamiliar in their training data, they produce irrelevant results in the local context due to hallucinations and overconfidence, which are not necessarily indicative of inherent bias. This research addresses these limitations with a Local Integrated Bias Recognition and Assessment Framework (LIBRA) for measuring bias using datasets sourced from local corpora without crowdsourcing. Implementing this framework, we develop a dataset comprising over 360,000 test cases in the New Zealand context. Furthermore, we propose the Enhanced Idealized CAT Score (EiCAT), integrating the iCAT score with a beyond knowledge boundary score (bbs) and a distribution divergence-based bias measurement to tackle the challenge of LLMs encountering words beyond knowledge boundaries. Our results show that the BERT family, GPT-2, and Llama-3 models seldom understand local words in different contexts. While Llama-3 exhibits larger bias, it responds better to different cultural contexts. The code and dataset are available at: this https URL.

[NLP-70] Benchmark on Peer Review Toxic Detection: A Challenging Task with a New Dataset NEURIPS2024

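[Quick read]: This paper addresses the detection of toxic feedback in peer review. The key to the solution is an expert-annotated dataset used to benchmark a range of models, together with a study of how prompt granularity, from coarse to fine-grained instructions, affects performance. Experiments show that state-of-the-art LLMs such as GPT-4 align poorly with human judgments under simple prompts but align better with detailed instructions. In addition, the model's confidence score is a good indicator of how well its predictions align with human judgments.

Link: https://arxiv.org/abs/2502.01676
Authors: Man Luo, Bradley Peterson, Rafael Gan, Hari Ramalingame, Navya Gangrade, Ariadne Dimarogona, Imon Banerjee, Phillip Howard
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Notes: Accepted to WiML workshop @NeurIPS 2024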

Abstract:Peer review is crucial for advancing and improving science through constructive criticism. However, toxic feedback can discourage authors and hinder scientific progress. This work explores an important but underexplored area: detecting toxicity in peer reviews. We first define toxicity in peer reviews across four distinct categories and curate a dataset of peer reviews from the OpenReview platform, annotated by human experts according to these definitions. Leveraging this dataset, we benchmark a variety of models, including a dedicated toxicity detection model, a sentiment analysis model, several open-source large language models (LLMs), and two closed-source LLMs. Our experiments explore the impact of different prompt granularities, from coarse to fine-grained instructions, on model performance. Notably, state-of-the-art LLMs like GPT-4 exhibit low alignment with human judgments under simple prompts but achieve improved alignment with detailed instructions. Moreover, the model’s confidence score is a good indicator of better alignment with human judgments. For example, GPT-4 achieves a Cohen’s Kappa score of 0.56 with human judgments, which increases to 0.63 when using only predictions with a confidence score higher than 95%. Overall, our dataset and benchmarks underscore the need for continued research to enhance toxicity detection capabilities of LLMs. By addressing this issue, our work aims to contribute to a healthy and responsible environment for constructive academic discourse and scientific collaboration.
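
The agreement figures in the abstract are Cohen's Kappa scores, computed once on all predictions and once on high-confidence predictions only. A minimal sketch of that calculation follows, assuming parallel lists of human labels, model labels, and model confidences. The function names and the toy data are hypothetical and are not the paper's evaluation code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)


def kappa_above_confidence(human, model, confidence, threshold=0.95):
    """Kappa restricted to predictions whose confidence exceeds threshold."""
    kept = [(h, m) for h, m, c in zip(human, model, confidence) if c > threshold]
    if not kept:
        raise ValueError("no predictions above the confidence threshold")
    kept_human, kept_model = zip(*kept)
    return cohens_kappa(kept_human, kept_model)


# Toy labels: 1 = toxic, 0 = not toxic (made-up data).
human = [1, 0, 1, 0]
model = [1, 0, 0, 0]
confidence = [0.99, 0.99, 0.60, 0.99]
kappa_all = cohens_kappa(human, model)
kappa_confident = kappa_above_confidence(human, model, confidence)
```

In the toy data the one disagreement is also the one low-confidence prediction, so filtering raises Kappa, which mirrors the direction of the effect the abstract reports.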

[NLP-71] Multilingual State Space Models for Structured Question Answering in Indic Languages

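[Quick read]: This paper addresses the unique challenges that the diversity and complexity of Indic languages pose for natural language processing (NLP), particularly question answering (QA). The key to the solution is applying State Space Models (SSMs), which capture both long-term and short-term dependencies in sequential data and are therefore well suited to the rich morphology, complex syntax, and contextual nuances of Indic languages. The authors evaluate several SSM architectures on datasets covering multiple Indic languages and compare their performance; the results indicate significant improvements in question interpretation, context alignment, and answer generation.

Link: https://arxiv.org/abs/2502.01673
Authors: Arpita Vats, Rahul Raja, Mrinal Mathur, Vinija Jain, Aman Chadha
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: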

Abstract: The diversity and complexity of Indic languages present unique challenges for natural language processing (NLP) tasks, particularly in the domain of question answering (QA). To address these challenges, this paper explores the application of State Space Models (SSMs) to build efficient and contextually aware QA systems tailored for Indic languages. SSMs are particularly suited for this task due to their ability to model long-term and short-term dependencies in sequential data, making them well-equipped to handle the rich morphology, complex syntax, and contextual intricacies characteristic of Indian languages. We evaluated multiple SSM architectures across diverse datasets representing various Indic languages and conducted a comparative analysis of their performance. Our results demonstrate that these models effectively capture linguistic subtleties, leading to significant improvements in question interpretation, context alignment, and answer generation. This work represents the first application of SSMs to question answering tasks in Indic languages, establishing a foundational benchmark for future research in this domain. We propose enhancements to existing SSM frameworks, optimizing their applicability to low-resource settings and multilingual scenarios prevalent in Indic languages.

[NLP-72] Explainable AI for Sentiment Analysis of Human Metapneumovirus (HMPV) Using XLNet

[Quick read]: This paper uses sentiment analysis of social media data to better understand public reactions to Human Metapneumovirus (HMPV). The key to the solution is applying transformer models, particularly XLNet, which reaches 93.50% accuracy in sentiment classification, and using SHAP for explainable AI (XAI) analysis to improve model transparency.

Link: https://arxiv.org/abs/2502.01663
Authors: Md. Shahriar Hossain Apu, Md Saiful Islam, Tanjim Taharat Aurpa
Affiliations: Department of IoT and Robotics Engineering, Bangabandhu Sheikh Mujibur Rahman Digital University, Bangladesh; Department of Educational Technology and Engineering, Bangabandhu Sheikh Mujibur Rahman Digital University, Bangladesh; Department of Data Science and Engineering, Bangabandhu Sheikh Mujibur Rahman Digital University, Bangladesh
Categories: Computation and Language (cs.CL)
Notes:

Abstract:In 2024, the outbreak of Human Metapneumovirus (HMPV) in China, which later spread to the UK and other countries, raised significant public concern. While HMPV typically causes mild symptoms, its effects on vulnerable individuals prompted health authorities to emphasize preventive measures. This paper explores how sentiment analysis can enhance our understanding of public reactions to HMPV by analyzing social media data. We apply transformer models, particularly XLNet, achieving 93.50% accuracy in sentiment classification. Additionally, we use explainable AI (XAI) through SHAP to improve model transparency.

[NLP-73] Speculative Ensemble: Fast Large Language Model Ensemble via Speculation

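[Quick read]: This paper addresses the high computational cost of ensembling large language models (LLMs). The key to the solution is Speculative Ensemble, a framework that accelerates LLM ensembles without sacrificing performance. It is inspired by speculative decoding, in which a small proposal model generates tokens sequentially and a larger target model verifies them in parallel. The method rests on two insights: (1) the verification distribution can be the ensemble distribution of the proposal and target models, and (2) alternating each model as proposer and verifier further improves efficiency. The method generalizes to ensembles of n models, and the authors prove theoretically that Speculative Ensemble is never slower than a standard ensemble and is typically faster.

Link: https://arxiv.org/abs/2502.01662
Authors: Jiale Fu, Yuchu Jiang, Junkai Chen, Jiaming Fan, Xin Geng, Xu Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: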

Abstract: Ensemble methods enhance Large Language Models (LLMs) by combining multiple models but suffer from high computational costs. In this paper, we introduce Speculative Ensemble (SE), a novel framework that accelerates LLM ensembles without sacrificing performance, inspired by Speculative Decoding, where a small proposal model generates tokens sequentially and a larger target model verifies them in parallel. Our approach builds on two key insights: (1) the verification distribution can be the ensemble distribution of both the proposal and target models, and (2) alternating each model as the proposer and verifier can further enhance efficiency. We generalize this method to ensembles with n models and theoretically prove that SE is never slower than a standard ensemble, typically achieving faster speed. Extensive experiments demonstrate speed improvements of 1.11x-2.23x over standard ensemble techniques without compromising generation quality. Our code is available at this https URL.
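
Insight (1) can be illustrated with the standard accept/reject step from speculative sampling, using the ensemble distribution as the verification target. This is a single-token sketch under strong assumptions: two models, fixed toy distributions over three tokens, and an equal-weight average as the ensemble. It is not the authors' implementation, it omits multi-token drafts and the alternating proposer/verifier scheme, and all names are hypothetical.

```python
import random


def ensemble(dist_a, dist_b, weight=0.5):
    """Weighted average of two next-token distributions."""
    return [weight * a + (1 - weight) * b for a, b in zip(dist_a, dist_b)]


def speculative_step(proposal, target, rng):
    """Draw one token distributed as `target` while sampling from `proposal`.

    Returns (token, accepted). The draft token is accepted with probability
    min(1, target / proposal); on rejection a token is resampled from the
    normalized residual max(0, target - proposal).
    """
    token = rng.choices(range(len(proposal)), weights=proposal)[0]
    if rng.random() < min(1.0, target[token] / proposal[token]):
        return token, True
    residual = [max(0.0, t - p) for t, p in zip(target, proposal)]
    token = rng.choices(range(len(residual)), weights=residual)[0]
    return token, False


rng = random.Random(0)
proposal = [0.7, 0.2, 0.1]     # toy distribution from the proposing model
other_model = [0.1, 0.3, 0.6]  # toy distribution from the second model
target = ensemble(proposal, other_model)

counts = [0, 0, 0]
for _ in range(20000):
    token, _accepted = speculative_step(proposal, target, rng)
    counts[token] += 1
frequencies = [c / 20000 for c in counts]
```

The empirical frequencies approach the ensemble distribution even though every draft comes from the proposal model alone, which is the property that lets verification target the ensemble.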

[NLP-74] Large Language Models Accuracy in Emulating Human Experts Evaluation of Public Sentiments about Heated Tobacco Products on Social Media

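[Quick read]: This paper evaluates how accurately Large Language Models (LLMs) replicate human sentiment analysis of social media messages about alternative tobacco products, specifically heated tobacco products (HTPs). The study uses GPT-3.5 and GPT-4 Turbo to classify 1,000 messages in total from Facebook and Twitter and compares each model's majority label with human evaluations. GPT-4 Turbo outperforms GPT-3.5, with fewer misclassifications, particularly for anti-HTP and pro-HTP messages. GPT-4 Turbo can therefore serve as a supporting tool that makes sentiment analysis more efficient, but its accuracy varies across sentiment categories, which should be taken into account.

Link: https://arxiv.org/abs/2502.01658
Authors: Kwanho Kim, Soojong Kim
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Notes: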

Abstract: Sentiment analysis of alternative tobacco products on social media is important for tobacco control research. Large Language Models (LLMs) can help streamline the labor-intensive human sentiment analysis process. This study examined the accuracy of LLMs in replicating human sentiment evaluation of social media messages about heated tobacco products (HTPs). The research used GPT-3.5 and GPT-4 Turbo to classify 500 Facebook and 500 Twitter messages, including anti-HTPs, pro-HTPs, and neutral messages. The models evaluated each message up to 20 times, and their majority label was compared to human evaluators. Results showed that GPT-3.5 accurately replicated human sentiment 61.2% of the time for Facebook messages and 57.0% for Twitter messages. GPT-4 Turbo performed better, with 81.7% accuracy for Facebook and 77.0% for Twitter. Using three response instances, GPT-4 Turbo achieved 99% of the accuracy of twenty instances. GPT-4 Turbo also had higher accuracy for anti- and pro-HTPs messages compared to neutral ones. Misclassifications by GPT-3.5 often involved anti- or pro-HTPs messages being labeled as neutral or irrelevant, while GPT-4 Turbo showed improvements across all categories. In conclusion, LLMs can be used for sentiment analysis of HTP-related social media messages, with GPT-4 Turbo reaching around 80% accuracy compared to human experts. However, there’s a risk of misrepresenting overall sentiment due to differences in accuracy across sentiment categories.
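
The majority-label comparison described in the abstract can be sketched as follows. The sketch assumes each message has a list of labels from repeated model runs and one human label; treating ties as disagreement is an assumption made here, not something the abstract specifies, and the data is made up.

```python
from collections import Counter


def majority_label(labels):
    """Most frequent label across repeated runs, or None on a tie."""
    counts = Counter(labels)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else None


def majority_accuracy(model_runs, human_labels):
    """Share of messages whose majority model label matches the human label."""
    matches = sum(
        majority_label(runs) == human
        for runs, human in zip(model_runs, human_labels)
    )
    return matches / len(human_labels)


# Toy data: three repeated runs per message (made-up labels).
model_runs = [
    ["anti", "anti", "neutral"],
    ["pro", "neutral", "neutral"],
    ["pro", "pro", "pro"],
    ["anti", "pro", "neutral"],  # three-way tie, counted as a miss
]
human_labels = ["anti", "pro", "pro", "anti"]
accuracy = majority_accuracy(model_runs, human_labels)
```

Using an odd number of runs makes two-way ties impossible, which is one practical reason to vote over three instances rather than two or four.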

[NLP-75] Efficiently Integrate Large Language Models with Visual Perception: A Survey from the Training Paradigm Perspective

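[Quick read]: This paper fills a gap in existing surveys, which give little attention to how training paradigms for integrating Large Language Models (LLMs) with vision modalities have evolved and to their parameter-efficiency considerations. From the training-paradigm perspective, it categorizes and reviews 34 Vision Large Language Models (VLLMs) from top conferences, journals, and highly cited Arxiv papers, with a particular focus on parameter efficiency. It introduces LLM architectures and parameter-efficient learning methods, discusses vision encoders and a comprehensive taxonomy of modality integrators, then reviews three training paradigms and their efficiency considerations and summarizes benchmarks in the VLLM field. Finally, it compares the experimental results of representative models, including a replication of the Direct Adaptation experiment, to analyze their effectiveness in terms of parameter efficiency.

Link: https://arxiv.org/abs/2502.01524
Authors: Xiaorui Ma, Haoran Xie, S. Joe Qin
Affiliations: Lingnan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 28 pages, 3 figures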

Abstract:The integration of vision-language modalities has been a significant focus in multimodal learning, traditionally relying on Vision-Language Pretrained Models. However, with the advent of Large Language Models (LLMs), there has been a notable shift towards incorporating LLMs with vision modalities. Following this, the training paradigms for incorporating vision modalities into LLMs have evolved. Initially, the approach was to integrate the modalities through pretraining the modality integrator, named Single-stage Tuning. It has since branched out into methods focusing on performance enhancement, denoted as Two-stage Tuning, and those prioritizing parameter efficiency, referred to as Direct Adaptation. However, existing surveys primarily address the latest Vision Large Language Models (VLLMs) with Two-stage Tuning, leaving a gap in understanding the evolution of training paradigms and their unique parameter-efficient considerations. This paper categorizes and reviews 34 VLLMs from top conferences, journals, and highly cited Arxiv papers, focusing on parameter efficiency during adaptation from the training paradigm perspective. We first introduce the architecture of LLMs and parameter-efficient learning methods, followed by a discussion on vision encoders and a comprehensive taxonomy of modality integrators. We then review three training paradigms and their efficiency considerations, summarizing benchmarks in the VLLM field. To gain deeper insights into their effectiveness in parameter efficiency, we compare and discuss the experimental results of representative models, among which the experiment of the Direct Adaptation paradigm is replicated. Providing insights into recent developments and practical uses, this survey is a vital guide for researchers and practitioners navigating the efficient integration of vision modalities into LLMs.

Computer Vision

[CV-0] Articulate AnyMesh: Open-Vocabulary 3D Articulated Objects Modeling

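[Quick read]: This paper addresses the difficulty of modeling 3D articulated objects, in particular the limited capability of existing methods that depend on a small set of handcrafted articulated object categories (such as cabinets and drawers). The key to the solution is Articulate Anymesh, a framework that converts any rigid 3D mesh into its articulated counterpart in an open-vocabulary manner. Using advanced Vision-Language Models and visual prompting techniques, it extracts semantic information to segment object parts and construct functional joints. This significantly expands the coverage of existing 3D articulated object datasets and supports learning new articulated object manipulation skills in simulation that can then be transferred to real robotic systems.

Link: https://arxiv.org/abs/2502.02590
Authors: Xiaowen Qiu, Jincheng Yang, Yian Wang, Zhehuan Chen, Yufei Wang, Tsun-Hsuan Wang, Zhou Xian, Chuang Gan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: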

Abstract: 3D articulated objects modeling has long been a challenging problem, since it requires capturing both accurate surface geometries and semantically meaningful and spatially precise structures, parts, and joints. Existing methods heavily depend on training data from a limited set of handcrafted articulated object categories (e.g., cabinets and drawers), which restricts their ability to model a wide range of articulated objects in an open-vocabulary context. To address these limitations, we propose Articulate Anymesh, an automated framework that is able to convert any rigid 3D mesh into its articulated counterpart in an open-vocabulary manner. Given a 3D mesh, our framework utilizes advanced Vision-Language Models and visual prompting techniques to extract semantic information, allowing for both the segmentation of object parts and the construction of functional joints. Our experiments show that Articulate Anymesh can generate large-scale, high-quality 3D articulated objects, including tools, toys, mechanical devices, and vehicles, significantly expanding the coverage of existing 3D articulated object datasets. Additionally, we show that these generated assets can facilitate the acquisition of new articulated object manipulation skills in simulation, which can then be transferred to a real robotic system. Our GitHub website is this https URL.

[CV-1] COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation

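[Quick read]: This paper addresses the lack of detailed, scene-comprehensive descriptions in existing image-text datasets by introducing the COCONut-PanCap dataset for panoptic segmentation and grounded image captioning. The key to the solution is fine-grained, region-level captions grounded in panoptic segmentation masks, which keep the captions consistent and detailed and support improved training of vision-language models (VLMs) and text-to-image generative models.

Link: https://arxiv.org/abs/2502.02589
Authors: Xueqing Deng, Qihang Yu, Ali Athar, Chenglin Yang, Linjie Yang, Xiaojie Jin, Xiaohui Shen, Liang-Chieh Chen
Affiliations: ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: project website: this https URL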

Abstract:This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.

[CV-2] Calibrated Multi-Preference Optimization for Aligning Diffusion Models

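[Quick read]: This paper addresses the scalability limits of aligning text-to-image (T2I) diffusion models through preference optimization, which stem from the high cost of manual data collection. Current methods consider only pairwise preference distributions, so they fail to exploit the rich information in reward models, do not generalize to multi-preference scenarios, and struggle with inconsistencies between rewards. The key to the solution is Calibrated Preference Optimization (CaPO), which aligns T2I diffusion models using the general preference from multiple reward models, without human-annotated data. CaPO consists of a reward calibration method that approximates the general preference by computing expected win-rates, and a frontier-based pair selection method that manages the multi-preference distribution. The diffusion model is then fine-tuned with a regression loss to match the difference between the calibrated rewards of each selected pair. Experiments show that CaPO outperforms Direct Preference Optimization (DPO) in both single- and multi-reward settings.

Link: https://arxiv.org/abs/2502.02588
Authors: Kyungmin Lee, Xiaohang Li, Qifei Wang, Junfeng He, Junjie Ke, Ming-Hsuan Yang, Irfan Essa, Jinwoo Shin, Feng Yang, Yinxiao Li
Affiliations: Google DeepMind; KAIST; Google Research; Google; Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: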

Abstract:Aligning text-to-image (T2I) diffusion models with preference optimization is valuable for human-annotated datasets, but the heavy cost of manual data collection limits scalability. Using reward models offers an alternative, however, current preference optimization methods fall short in exploiting the rich information, as they only consider pairwise preference distribution. Furthermore, they lack generalization to multi-preference scenarios and struggle to handle inconsistencies between rewards. To address this, we present Calibrated Preference Optimization (CaPO), a novel method to align T2I diffusion models by incorporating the general preference from multiple reward models without human annotated data. The core of our approach involves a reward calibration method to approximate the general preference by computing the expected win-rate against the samples generated by the pretrained models. Additionally, we propose a frontier-based pair selection method that effectively manages the multi-preference distribution by selecting pairs from Pareto frontiers. Finally, we use regression loss to fine-tune diffusion models to match the difference between calibrated rewards of a selected pair. Experimental results show that CaPO consistently outperforms prior methods, such as Direct Preference Optimization (DPO), in both single and multi-reward settings validated by evaluation on T2I benchmarks, including GenEval and T2I-Compbench.
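
Two ideas from the abstract, win-rate calibration and Pareto-frontier selection, can be sketched in a few lines. This is not CaPO itself. It assumes the win-rate is the empirical share of reference samples a reward beats (ties count as half) and that each candidate has one calibrated score per reward model. All names and numbers are hypothetical.

```python
def win_rate(reward, reference_rewards):
    """Share of reference samples that `reward` beats; ties count as half."""
    wins = sum(
        1.0 if reward > ref else 0.5 if reward == ref else 0.0
        for ref in reference_rewards
    )
    return wins / len(reference_rewards)


def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(points):
    """Indices of points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Calibrate one raw reward against rewards of reference samples (toy values).
calibrated = win_rate(0.7, [0.1, 0.5, 0.9, 0.7])

# Calibrated scores of four candidates under two reward models (toy values).
candidates = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.4), (0.1, 0.9)]
frontier = pareto_frontier(candidates)
```

Win-rates put every reward model on the same 0 to 1 scale, which is what makes the per-reward scores comparable when checking dominance.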

[CV-3] Revisiting Expected Possession Value in Football: Introducing a Benchmark U-Net Architecture and Reward and Risk for Passes

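[Quick read]: This paper establishes the first Expected Possession Value (EPV) benchmark and proposes an improved EPV model for football analytics. It introduces the OJN-Pass-EPV benchmark, which quantitatively compares the quality of EPV models. After failing to replicate the results of Fernández et al. (2021), the authors propose a new architecture based on U-net-type convolutional neural networks, which achieves good model loss and Expected Calibration Error. They also improve the pass model by incorporating ball height and adopting a new dual-component pass value model covering reward and risk. The resulting EPV model correctly identifies the higher-value state in 78% of the game state pairs in the OJN-Pass-EPV benchmark.

Link: https://arxiv.org/abs/2502.02565
Authors: Thijs Overmeer, Tim Janssen, Wim P.M. Nuijten
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: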

Abstract:This paper introduces the first Expected Possession Value (EPV) benchmark and a new and improved EPV model for football. Through the introduction of the OJN-Pass-EPV benchmark, we present a novel method to quantitatively assess the quality of EPV models by using pairs of game states with given relative EPVs. Next, we attempt to replicate the results of Fernández et al. (2021) using a dataset containing Dutch Eredivisie and World Cup matches. Following our failure to do so, we propose a new architecture based on U-net-type convolutional neural networks, achieving good results in model loss and Expected Calibration Error. Finally, we present an improved pass model that incorporates ball height and contains a new dual-component pass value model that analyzes reward and risk. The resulting EPV model correctly identifies the higher value state in 78% of the game state pairs in the OJN-Pass-EPV benchmark, demonstrating its ability to accurately assess goal-scoring potential. Our findings can help assess the quality of EPV models, improve EPV predictions, help assess potential reward and risk of passing decisions, and improve player and team performance.
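
Expected Calibration Error, one of the metrics the abstract reports, compares predicted probabilities with observed frequencies inside probability bins. The sketch below is a common binary-outcome variant with ten equal-width bins. The binning choice and the toy data are assumptions, not details taken from the paper.

```python
def expected_calibration_error(probabilities, outcomes, num_bins=10):
    """Weighted mean gap between predicted probability and observed frequency.

    `probabilities` are predicted chances of the positive outcome and
    `outcomes` are the matching 0/1 results.
    """
    bins = [[] for _ in range(num_bins)]
    for prob, outcome in zip(probabilities, outcomes):
        index = min(int(prob * num_bins), num_bins - 1)  # prob == 1.0 goes last
        bins[index].append((prob, outcome))

    total = len(probabilities)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        error += (len(members) / total) * abs(mean_prob - frequency)
    return error


# Toy predictions that are slightly under-confident at both ends.
ece = expected_calibration_error([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
# Toy predictions of 0.5 that come true exactly half the time.
ece_calibrated = expected_calibration_error([0.5, 0.5], [1, 0])
```

A lower value means predicted probabilities track observed frequencies more closely; zero means they match within every bin.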

[CV-4] Learning the RoPEs: Better 2D and 3D Position Encodings with STRING

[Quick read]: This paper addresses how to keep exact translation invariance for token coordinates of arbitrary dimensionality while maintaining a low computational cost. The key is STRING (Separable Translationally Invariant Position Encodings), a position encoding that extends Rotary Position Encodings through a unifying theoretical framework and provides exact translation invariance at low computational overhead. It is especially suited to robotics, where efficient 3D token representation is needed.

Link: https://arxiv.org/abs/2502.02562
Authors: Connor Schenck, Isaac Reid, Mithun George Jacob, Alex Bewley, Joshua Ainslie, David Rendleman, Deepali Jain, Mohit Sharma, Avinava Dubey, Ayzaan Wahid, Sumeet Singh, Rene Wagner, Tianli Ding, Chuyuan Fu, Arunkumar Byravan, Jake Varley, Alexey Gritsenko, Matthias Minderer, Dmitry Kalashnikov, Jonathan Tompson, Vikas Sindhwani, Krzysztof Choromanski
Affiliations: Google Research; University of Cambridge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Machine Learning (stat.ML)
Notes: Videos of STRING-based robotics controllers can be found here: this https URL

Abstract:We introduce STRING: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides exact translation invariance, including token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision Transformers with RGB(-D) inputs (color plus optional depth), showing substantial gains, e.g. in open-vocabulary object detection and for robotics controllers. We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods.
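
The translation invariance that STRING preserves is easiest to see in plain Rotary Position Encodings, which STRING extends. The sketch below is ordinary RoPE on a single 2D feature pair with 1D positions, not STRING itself. The frequency value and the vectors are arbitrary choices for illustration.

```python
import math


def rotate(vector, position, frequency=0.5):
    """Rotate a 2D feature pair by an angle proportional to its position."""
    angle = position * frequency
    x, y = vector
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    )


def attention_score(query, key, query_pos, key_pos):
    """Dot product of the position-rotated query and key."""
    qx, qy = rotate(query, query_pos)
    kx, ky = rotate(key, key_pos)
    return qx * kx + qy * ky


query, key = (1.0, 2.0), (0.5, -1.0)

# Shifting both positions by the same offset leaves the score unchanged.
score_near = attention_score(query, key, 3, 1)
score_shifted = attention_score(query, key, 10, 8)

# Changing the relative offset changes the score.
score_other_offset = attention_score(query, key, 3, 2)
```

Because rotations compose, the score depends only on the difference between the two positions, which is the exact translation invariance the abstract refers to.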

[CV-5] Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation

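[Quick read]: This paper addresses open-vocabulary 3D scene understanding by introducing a new data generation pipeline and training framework. The key to the solution is meeting three requirements: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. Using state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models, the authors build an automatic pipeline that generates high-quality 3D mask-text pairs, and use it to create Mosaic3D-5.6M, a dataset of over 30K annotated scenes and 5.6M mask-text pairs. On this data they train Mosaic3D, a model combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. The approach achieves state-of-the-art results on benchmarks including ScanNet200, Matterport3D, and ScanNet++.

Link: https://arxiv.org/abs/2502.02548
Authors: Junha Lee, Chunghyun Park, Jaesung Choe, Yu-Chiang Frank Wang, Jan Kautz, Minsu Cho, Chris Choy
Affiliations: NVIDIA; POSTECH
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: project page: this https URL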

Abstract:We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models, we develop an automatic pipeline that generates high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. Building upon this data, we propose Mosaic3D, a foundation model combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation tasks including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large-scale training data.

[CV-6] Uncertainty Quantification for Collaborative Object Detection Under Adversarial Attacks

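[Quick read]: This paper addresses the robustness and uncertainty quantification of Collaborative Object Detection (COD) systems under adversarial attacks. The key to the solution is the Trusted Uncertainty Quantification in Collaborative Perception framework (TUQCP), which combines adversarial training with uncertainty quantification to strengthen the adversarial robustness of existing COD models. Specifically, TUQCP first applies adversarial training by adding perturbations to the information shared by randomly selected agents, then provides output uncertainty estimates through a learning-based module and calibrates them with conformal prediction, mitigating the impact of adversarial attacks. The framework applies to early and intermediate collaboration COD models as well as single-agent object detection models. On the V2X-Sim dataset, TUQCP improves object detection accuracy by 80.41% over the baselines under the same adversarial attacks.

Link: https://arxiv.org/abs/2502.02537
Authors: Huiqun Huang, Cong Chen, Jean-Philippe Monteuuis, Jonathan Petit, Fei Miao
Affiliations: University of Connecticut; Qualcomm
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: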

Abstract: Collaborative Object Detection (COD) and collaborative perception can integrate data or features from various entities, and improve object detection accuracy compared with individual perception. However, adversarial attacks pose a potential threat to the deep learning COD models, and introduce high output uncertainty. With unknown attack models, it becomes even more challenging to improve COD resiliency and quantify the output uncertainty for highly dynamic perception scenes such as autonomous vehicles. In this study, we propose the Trusted Uncertainty Quantification in Collaborative Perception framework (TUQCP). TUQCP leverages both adversarial training and uncertainty quantification techniques to enhance the adversarial robustness of existing COD models. More specifically, TUQCP first adds perturbations to the shared information of randomly selected agents during object detection collaboration by adversarial training. TUQCP then alleviates the impacts of adversarial attacks by providing output uncertainty estimation through a learning-based module and uncertainty calibration through conformal prediction. Our framework works for early and intermediate collaboration COD models and single-agent object detection models. We evaluate TUQCP on V2X-Sim, a comprehensive collaborative perception dataset for autonomous driving, and demonstrate an 80.41% improvement in object detection accuracy compared to the baselines under the same adversarial attacks. TUQCP demonstrates the importance of uncertainty quantification to COD under adversarial attacks.
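
The conformal prediction step can be illustrated with generic split conformal calibration for a scalar prediction, such as one bounding-box coordinate. How TUQCP applies it inside the detector is not described in the abstract, so this is a textbook sketch under that assumption. The residuals and function names are hypothetical.

```python
import math


def conformal_quantile(calibration_scores, alpha=0.1):
    """Score threshold giving roughly (1 - alpha) coverage on new samples.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score and
    returns infinity when the calibration set is too small for that rank.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_interval(prediction, quantile):
    """Symmetric interval around a point prediction."""
    return (prediction - quantile, prediction + quantile)


# Absolute errors |true - predicted| on a held-out calibration set (toy data).
residuals = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
threshold = conformal_quantile(residuals, alpha=0.2)
interval = prediction_interval(100.0, threshold)
```

The finite-sample correction (n + 1) is what gives split conformal its coverage guarantee when calibration and test samples are exchangeable; adversarial perturbations can break that exchangeability, which is one reason to pair calibration with adversarial training.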

[CV-7] Diff9D: Diffusion-Based Domain-Generalized Category-Level 9-DoF Object Pose Estimation

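【Quick read】: This paper addresses domain generalization in nine-degrees-of-freedom (9-DoF) object pose and size estimation, in particular the reliance of category-level methods on large amounts of manually labeled real-world training data. The key idea is a diffusion-based, domain-generalized, category-level 9-DoF object pose estimation method. By exploiting the latent generalization ability of diffusion models, the model is trained only on rendered synthetic data and still generalizes to real-world scenes. The method requires no 3D shape priors and uses the Denoising Diffusion Implicit Model (DDIM) to run the reverse diffusion process in as few as 3 steps, achieving near real-time performance.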

Link: https://arxiv.org/abs/2502.02525
Authors: Jian Liu, Wei Sun, Hui Yang, Pengchao Deng, Chongpei Liu, Nicu Sebe, Hossein Rahmani, Ajmal Mian
Affiliations: National Engineering Research Center for Robot Visual Perception and Control Technology, College of Electrical and Information Engineering, School of Robotics, and the State Key Laboratory of Advanced Design and Manufacturing for Vehicle Body, Hunan University, China; Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, China; Department of Information Engineering and Computer Science, University of Trento, Italy; School of Computing and Communications, Lancaster University, United Kingdom; Department of Computer Science, The University of Western Australia, Australia
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 17 pages, 13 figures

Abstract:Nine-degrees-of-freedom (9-DoF) object pose and size estimation is crucial for enabling augmented reality and robotic manipulation. Category-level methods have received extensive research attention due to their potential for generalization to intra-class unknown objects. However, these methods require manual collection and labeling of large-scale real-world training data. To address this problem, we introduce a diffusion-based paradigm for domain-generalized category-level 9-DoF object pose estimation. Our motivation is to leverage the latent generalization ability of the diffusion model to address the domain generalization challenge in object pose estimation. This entails training the model exclusively on rendered synthetic data to achieve generalization to real-world scenes. We propose an effective diffusion model to redefine 9-DoF object pose estimation from a generative perspective. Our model does not require any 3D shape priors during training or inference. By employing the Denoising Diffusion Implicit Model, we demonstrate that the reverse diffusion process can be executed in as few as 3 steps, achieving near real-time performance. Finally, we design a robotic grasping system comprising both hardware and software components. Through comprehensive experiments on two benchmark datasets and the real-world robotic system, we show that our method achieves state-of-the-art domain generalization performance. Our code will be made public at this https URL.
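
To make the "as few as 3 steps" claim concrete, here is a minimal sketch of a deterministic DDIM reverse pass over a three-step schedule. It is not the Diff9D model: `ddim_sample`, the cosine-style noise schedule, the three-number stand-in for a pose vector, and the oracle noise predictor are all assumptions used only to show the update rule.

```python
import math

def ddim_sample(x_T, eps_model, alphas_bar, timesteps):
    """Deterministic DDIM reverse process (eta = 0) over a short schedule.

    x_T        : list of floats, the noisy starting sample
    eps_model  : callable (x, t) -> predicted noise, same length as x
    alphas_bar : cumulative signal rates, alphas_bar[t] in (0, 1)
    timesteps  : decreasing list of visited timesteps, e.g. [999, 666, 333]
    """
    x = list(x_T)
    for i, t in enumerate(timesteps):
        a_t = alphas_bar[t]
        # After the last visited step we jump to the clean sample (alpha_bar = 1).
        a_prev = alphas_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = eps_model(x, t)
        # Predict the clean sample, then re-noise it to the previous level.
        x0_pred = [(xi - math.sqrt(1 - a_t) * ei) / math.sqrt(a_t)
                   for xi, ei in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1 - a_prev) * ei
             for x0, ei in zip(x0_pred, eps)]
    return x

# Toy check with an oracle noise predictor that knows the true vector.
T = 1000
alphas_bar = [math.cos(0.5 * math.pi * (t + 1) / (T + 1)) ** 2 for t in range(T)]
true_pose = [0.3, -1.2, 0.8]   # toy stand-in for a 9-DoF pose vector
noise = [0.5, -0.7, 1.1]
a_T = alphas_bar[T - 1]
x_T = [math.sqrt(a_T) * p + math.sqrt(1 - a_T) * n for p, n in zip(true_pose, noise)]

def oracle_eps(x, t):
    a = alphas_bar[t]
    return [(xi - math.sqrt(a) * p) / math.sqrt(1 - a) for xi, p in zip(x, true_pose)]

recovered = ddim_sample(x_T, oracle_eps, alphas_bar, timesteps=[999, 666, 333])
```

With a perfect noise predictor the three-step pass recovers the clean vector; a learned predictor only approximates this, which is why step count trades off against accuracy in practice.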

[CV-8] Privacy Attacks on Image AutoRegressive Models

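【Quick read】: This paper examines the privacy risks of image autoregressive (IAR) models. It finds that although IARs outperform diffusion models (DMs) in image quality and generation speed, they are considerably more vulnerable to membership inference attacks (MIAs). The key contribution is a new MIA that detects training images with a much higher success rate and shows experimentally that IARs leak more information. The results suggest that adopting techniques from DMs, such as per-token probability modeling using diffusion, could help mitigate the privacy risks of IARs.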

Link: https://arxiv.org/abs/2502.02514
Authors: Antoni Kowalczuk, Jan Dubiński, Franziska Boenisch, Adam Dziedzic
Affiliations: CISPA Helmholtz Center for Information Security; Warsaw University of Technology, IDEAS NCBR
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code: this https URL

Abstract:Image autoregressive (IAR) models have surpassed diffusion models (DMs) in both image quality (FID: 1.48 vs. 1.58) and generation speed. However, their privacy risks remain largely unexplored. To address this, we conduct a comprehensive privacy analysis comparing IARs to DMs. We develop a novel membership inference attack (MIA) that achieves a significantly higher success rate in detecting training images (TPR@FPR=1%: 86.38% for IARs vs. 4.91% for DMs). Using this MIA, we perform dataset inference (DI) and find that IARs require as few as six samples to detect dataset membership, compared to 200 for DMs, indicating higher information leakage. Additionally, we extract hundreds of training images from an IAR (e.g., 698 from VAR-d30). Our findings highlight a fundamental privacy-utility trade-off: while IARs excel in generation quality and speed, they are significantly more vulnerable to privacy attacks. This suggests that incorporating techniques from DMs, such as per-token probability modeling using diffusion, could help mitigate IARs’ privacy risks. Our code is available at this https URL.
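
The headline metric, TPR@FPR=1%, is easy to misread, so a small sketch of how such a number is typically computed may help. The scoring convention (higher means "more likely a training member"), the function name, and the toy scores below are assumptions for illustration, not the paper's attack.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.01):
    """True-positive rate at a fixed false-positive rate (target_fpr < 1).

    The threshold is chosen on non-member scores so that at most
    target_fpr of them are (wrongly) flagged as members.
    """
    ranked = sorted(nonmember_scores, reverse=True)
    allowed_false_positives = int(target_fpr * len(ranked))
    # Scores strictly above this threshold are flagged as members.
    threshold = ranked[allowed_false_positives]
    true_positives = sum(score > threshold for score in member_scores)
    return true_positives / len(member_scores)

# Toy scores: 100 non-members spread over [0, 0.99], four members.
nonmembers = [i / 100 for i in range(100)]
members = [0.995, 0.985, 0.5, 0.99]
tpr = tpr_at_fpr(members, nonmembers, target_fpr=0.01)
```

Fixing the false-positive rate first is what makes the 86.38% vs. 4.91% comparison in the abstract meaningful: both attacks are allowed the same small budget of false alarms.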

[CV-9] Unified Spatial-Temporal Edge-Enhanced Graph Networks for Pedestrian Trajectory Prediction

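【Quick read】: This paper targets a limitation of existing spatial-temporal (ST) methods for pedestrian trajectory prediction, which model the temporal dependencies of individuals and the spatial interactions among pedestrians separately. These methods overlook the direct influence of interactions among different pedestrians across time steps (high-order cross-time interactions), which limits their ability to capture ST inter-dependencies and hurts prediction performance. The paper proposes UniEdge, whose key idea is a unified ST graph data structure that reduces high-order cross-time interactions to first-order relationships, so that ST inter-dependencies can be learned in a single step without the information loss caused by multi-step aggregation. UniEdge also introduces the Edge-to-Edge-Node-to-Node Graph Convolution (E2E-N2N-GCN), a dual-graph network that jointly models explicit node-to-node social interactions and implicit edge-to-edge influence propagation, and uses a transformer encoder-based predictor for global modeling of temporal correlation.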

Link: https://arxiv.org/abs/2502.02504
Authors: Ruochen Li, Tanqiu Qiao, Stamos Katsigiannis, Zhanxing Zhu, Hubert P. H. Shum
Affiliations: Durham University; University of Southampton
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Pedestrian trajectory prediction aims to forecast future movements based on historical paths. Spatial-temporal (ST) methods often separately model spatial interactions among pedestrians and temporal dependencies of individuals. They overlook the direct impacts of interactions among different pedestrians across various time steps (i.e., high-order cross-time interactions). This limits their ability to capture ST inter-dependencies and hinders prediction performance. To address these limitations, we propose UniEdge with three major designs. Firstly, we introduce a unified ST graph data structure that simplifies high-order cross-time interactions into first-order relationships, enabling the learning of ST inter-dependencies in a single step. This avoids the information loss caused by multi-step aggregation. Secondly, traditional GNNs focus on aggregating pedestrian node features, neglecting the propagation of implicit interaction patterns encoded in edge features. We propose the Edge-to-Edge-Node-to-Node Graph Convolution (E2E-N2N-GCN), a novel dual-graph network that jointly models explicit N2N social interactions among pedestrians and implicit E2E influence propagation across these interaction patterns. Finally, to overcome the limited receptive fields and challenges in capturing long-range dependencies of auto-regressive architectures, we introduce a transformer encoder-based predictor that enables global modeling of temporal correlation. UniEdge outperforms state-of-the-arts on multiple datasets, including ETH, UCY, and SDD.

[CV-10] Graph-based Document Structure Analysis ICLR2025

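【Quick read】: This paper addresses the fact that traditional Document Layout Analysis (DLA) methods parse documents only superficially and fail to capture the complex spatial and logical relations between document elements. It proposes a graph-based Document Structure Analysis (gDSA) task, in which a model must not only detect document elements but also generate a graph that represents the spatial and logical relations between them, enabling a holistic and intuitive understanding of the document. To support the task, the paper builds the GraphDoc dataset, with 80K document images and 4.13M relation annotations, and proposes a Document Relation Graph Generator (DRGG) to solve gDSA.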

Link: https://arxiv.org/abs/2502.02501
Authors: Yufan Chen, Ruiping Liu, Junwei Zheng, Di Wen, Kunyu Peng, Jiaming Zhang, Rainer Stiefelhagen
Affiliations: CV:HCI Lab, Karlsruhe Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2025. Project page: this https URL

Abstract:When reading a document, glancing at the spatial layout of a document is an initial step to understand it roughly. Traditional document layout analysis (DLA) methods, however, offer only a superficial parsing of documents, focusing on basic instance detection and often failing to capture the nuanced spatial and logical relations between instances. These limitations hinder DLA-based models from achieving a gradually deeper comprehension akin to human reading. In this work, we propose a novel graph-based Document Structure Analysis (gDSA) task. This task requires that the model not only detect document elements but also generate spatial and logical relations in the form of a graph structure, allowing documents to be understood in a holistic and intuitive manner. For this new task, we construct a relation graph-based document structure analysis dataset (GraphDoc) with 80K document images and 4.13M relation annotations, enabling training models to complete multiple tasks like reading order, hierarchical structures analysis, and complex inter-element relation inference. Furthermore, a document relation graph generator (DRGG) is proposed to address the gDSA task, which achieves 57.6% mAP_g@0.5 as a strong benchmark baseline on this novel task and dataset. We hope this graphical representation of document structure can mark an innovative advancement in document structure analysis and understanding. The new dataset and code will be made publicly available at this https URL.

[CV-11] VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

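【Quick read】: This paper addresses the difficulty generative video models have in capturing real-world motion, dynamics, and physics. The limitation stems from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. The paper introduces VideoJAM, a framework that instills an effective motion prior by encouraging the model to learn a joint appearance-motion representation. During training, VideoJAM extends the objective to predict both the generated pixels and their corresponding motion; during inference, it introduces Inner-Guidance, which uses the model's own motion prediction as a dynamic guidance signal to steer generation toward coherent motion. The approach can be applied to any video model with minimal adaptations, without modifying the training data or scaling the model.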

Link: https://arxiv.org/abs/2502.02492
Authors: Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior to video generators, by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model’s own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: this https URL

[CV-12] A Self-Supervised Framework for Improved Generalisability in Ultrasound B-mode Image Segmentation

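【Quick read】: This paper addresses the performance limits of ultrasound (US) image segmentation caused by heavy annotation requirements and hard-to-obtain datasets. The key contribution is a contrastive self-supervised learning (SSL) approach for B-mode US images that incorporates a novel Relation Contrastive Loss (RCL). RCL distinguishes positive from negative sample pairs through a learnable metric, which encourages the model to learn distinct features. The work also proposes spatial and frequency-based augmentation strategies to further improve representation learning on US images. The method significantly outperforms traditional supervised segmentation on three public breast US datasets, particularly when data is limited.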

Link: https://arxiv.org/abs/2502.02489
Authors: Edward Ellis, Andrew Bulpitt, Nasim Parsa, Michael F Byrne, Sharib Ali
Affiliations: School of Computer Science, Faculty of Engineering and Physical Sciences, University of Leeds; Satisfai Health
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12

Abstract:Ultrasound (US) imaging is clinically invaluable due to its noninvasive and safe nature. However, interpreting US images is challenging, requires significant expertise, and time, and is often prone to errors. Deep learning offers assistive solutions such as segmentation. Supervised methods rely on large, high-quality, and consistently labeled datasets, which are challenging to curate. Moreover, these methods tend to underperform on out-of-distribution data, limiting their clinical utility. Self-supervised learning (SSL) has emerged as a promising alternative, leveraging unlabeled data to enhance model performance and generalisability. We introduce a contrastive SSL approach tailored for B-mode US images, incorporating a novel Relation Contrastive Loss (RCL). RCL encourages learning of distinct features by differentiating positive and negative sample pairs through a learnable metric. Additionally, we propose spatial and frequency-based augmentation strategies for the representation learning on US images. Our approach significantly outperforms traditional supervised segmentation methods across three public breast US datasets, particularly in data-limited scenarios. Notable improvements on the Dice similarity metric include a 4% increase on 20% and 50% of the BUSI dataset, nearly 6% and 9% improvements on 20% and 50% of the BrEaST dataset, and 6.4% and 3.7% improvements on 20% and 50% of the UDIAT dataset, respectively. Furthermore, we demonstrate superior generalisability on the out-of-distribution UDIAT dataset with performance boosts of 20.6% and 13.6% compared to the supervised baseline using 20% and 50% of the BUSI and BrEaST training data, respectively. Our research highlights that domain-inspired SSL can improve US segmentation, especially under data-limited conditions.
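
The improvements above are reported on the Dice similarity metric, Dice = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. A minimal sketch of that computation follows; the flat 0/1 mask representation and the empty-mask convention are assumptions for illustration.

```python
def dice_score(pred_mask, true_mask):
    """Dice similarity between two binary masks given as flat 0/1 sequences."""
    intersection = sum(p and t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy example: 3 predicted pixels, 3 true pixels, 2 of them shared.
pred = [1, 1, 0, 0, 1, 0]
true = [1, 0, 0, 1, 1, 0]
score = dice_score(pred, true)
```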

[CV-13] Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives

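【Quick read】: This paper aims to give autonomous systems a holistic understanding of video streams of human activities, covering concept correlation, abstraction of knowledge across tasks, and skill transfer in multi-task settings. The key contribution is Hier-EgoPack, which extends EgoPack with a novel hierarchical architecture for reasoning across different temporal granularities, equipped with a graph neural network (GNN) layer designed specifically for multi-granularity reasoning. This broadens EgoPack's applicability to more downstream tasks, and experiments on Ego4D benchmarks demonstrate its effectiveness on both clip-level and frame-level reasoning.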

Link: https://arxiv.org/abs/2502.02487
Authors: Simone Alberto Peirone, Francesca Pistilli, Antonio Alliegro, Tatiana Tommasi, Giuseppe Averta
Affiliations: Politecnico di Torino, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage at this https URL

Abstract:Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen soon, everything all at once. To endow autonomous systems with such a holistic perception, learning how to correlate concepts, abstract knowledge across diverse tasks, and leverage tasks synergies when learning novel skills is essential. A significant step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, essential for efficiently learning new skills. In this paper, we introduce Hier-EgoPack, which advances EgoPack by enabling reasoning also across diverse temporal granularities, which expands its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning equipped with a GNN layer specifically designed to tackle the challenges of multi-granularity reasoning effectively. We evaluate our approach on multiple Ego4d benchmarks involving both clip-level and frame-level reasoning, demonstrating how our hierarchical unified architecture effectively solves these diverse tasks simultaneously.

[CV-14] Mind the Gap: Evaluating Patch Embeddings from General-Purpose and Histopathology Foundation Models for Cell Segmentation and Classification

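【Quick read】: This paper investigates how general-purpose foundation models compare with histopathology-specific foundation models on specialized tasks such as cell analysis. It does so by comparing multi-level patch embeddings from both model families on cell instance segmentation and classification. The setup is an encoder-decoder architecture with a consistent decoder and various encoders: convolutional, vision transformer (ViT), and hybrid encoders pre-trained on ImageNet-22K or LVD-142M represent general-purpose foundation models, and are compared against ViT encoders from UNI, Virchow2, and Prov-GigaPath, which were trained on large collections of histopathology whole-slide images. All encoders stay frozen during training so that their pre-trained feature extraction capabilities can be assessed. Using the PanNuke and CoNIC datasets and the newly introduced Nissl-stained CytoDArk0 dataset, the paper evaluates instance-level detection, segmentation accuracy, and cell-type classification.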

Link: https://arxiv.org/abs/2502.02471
Authors: Valentina Vadori, Antonella Peruffo, Jean-Marie Graïc, Livio Finos, Enrico Grisan
Affiliations: London South Bank University; University of Padova
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Recent advancements in foundation models have transformed computer vision, driving significant performance improvements across diverse domains, including digital histopathology. However, the advantages of domain-specific histopathology foundation models over general-purpose models for specialized tasks such as cell analysis remain underexplored. This study investigates the representation learning gap between these two categories by analyzing multi-level patch embeddings applied to cell instance segmentation and classification. We implement an encoder-decoder architecture with a consistent decoder and various encoders. These include convolutional, vision transformer (ViT), and hybrid encoders pre-trained on ImageNet-22K or LVD-142M, representing general-purpose foundation models. These are compared against ViT encoders from the recently released UNI, Virchow2, and Prov-GigaPath foundation models, trained on patches extracted from hundreds of thousands of histopathology whole-slide images. The decoder integrates patch embeddings from different encoder depths via skip connections to generate semantic and distance maps. These maps are then post-processed to create instance segmentation masks where each label corresponds to an individual cell and to perform cell-type classification. All encoders remain frozen during training to assess their pre-trained feature extraction capabilities. Using the PanNuke and CoNIC histopathology datasets, and the newly introduced Nissl-stained CytoDArk0 dataset for brain cytoarchitecture studies, we evaluate instance-level detection, segmentation accuracy, and cell-type classification. This study provides insights into the comparative strengths and limitations of general-purpose vs. histopathology foundation models, offering guidance for model selection in cell-focused histopathology and brain cytoarchitecture analysis workflows.

[CV-15] High-Fidelity Human Avatars from Laptop Webcams using Edge Compute

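【Quick read】: This paper addresses generating high-fidelity, animatable human avatars from consumer-grade laptop webcams, focusing on the challenges of low image quality and limited on-device compute. The key contribution is a novel method based on 3D morphable models, landmark detection, photo-realistic texture GANs, and differentiable rendering, which leverages the neural compute capabilities of mobile chips to build an automatic system that works under these constraints.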

Link: https://arxiv.org/abs/2502.02468
Authors: Akash Haridas, Imran N. Junejo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 6 figures, 1 table

Abstract:Applications of generating photo-realistic human avatars are many, however, high-fidelity avatar generation traditionally required expensive professional camera rigs and artistic labor, but recent research has enabled constructing them automatically from smartphones with RGB and IR sensors. However, these new methods still rely on the presence of high-resolution cameras on modern smartphones and often require offloading the processing to powerful servers with GPUs. Modern applications such as video conferencing call for the ability to generate these avatars from consumer-grade laptop webcams using limited compute available on-device. In this work, we develop a novel method based on 3D morphable models, landmark detection, photo-realistic texture GANs, and differentiable rendering to tackle the problem of low webcam image quality and edge computation. We build an automatic system to generate high-fidelity animatable avatars under these limitations, leveraging the neural compute capabilities of mobile chips.

[CV-16] Towards Consistent and Controllable Image Synthesis for Face Editing

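【Quick read】: This paper addresses the difficulty existing diffusion-based face editing methods have in manipulating fine-grained attributes while keeping unchanged attributes consistent. To tackle these issues and make face image editing more convenient, it proposes RigFace, which combines Stable-Diffusion models with crude 3D face models to control the lighting, facial expression, and head pose of a portrait photo. RigFace consists of a Spatial Attribute Encoder that provides precise and decoupled conditions for background, pose, expression, and lighting; an Identity Encoder that transfers identity features to the denoising UNet of a pre-trained Stable-Diffusion model; and an Attribute Rigger that injects these conditions into the denoising UNet. The method achieves comparable or even superior identity preservation and photorealism compared with existing face editing models.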

Link: https://arxiv.org/abs/2502.02465
Authors: Mengting Wei, Tuomas Varanka, Yante Li, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao
Affiliations: Center for Machine Vision and Signal Analysis, Faculty of Information Technology and Electrical Engineering, University of Oulu, Oulu, FI-90014, Finland; Key Laboratory of Child Development and Learning Science of Ministry of Education, School of Biological Sciences and Medical Engineering, Southeast University, Nanjing 210096, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current face editing methods mainly rely on GAN-based techniques, but recent focus has shifted to diffusion-based models due to their success in image reconstruction. However, diffusion models still face challenges in manipulating fine-grained attributes and preserving consistency of attributes that should remain unchanged. To address these issues and facilitate more convenient editing of face images, we propose a novel approach that leverages the power of Stable-Diffusion models and crude 3D face models to control the lighting, facial expression and head pose of a portrait photo. We observe that this task essentially involves combinations of target background, identity and different face attributes. We aim to sufficiently disentangle the control of these factors to enable high-quality face editing. Specifically, our method, coined as RigFace, contains: 1) A Spatial Attribute Encoder that provides precise and decoupled conditions of background, pose, expression and lighting; 2) An Identity Encoder that transfers identity features to the denoising UNet of a pre-trained Stable-Diffusion model; 3) An Attribute Rigger that injects those conditions into the denoising UNet. Our model achieves comparable or even superior performance in both identity preservation and photorealism compared to existing face editing models.

[CV-17] IMDPrompter: Adapting SAM to Image Manipulation Detection by Cross-View Automated Prompt Learning

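【Quick read】: This paper addresses the challenges of applying the Segment Anything Model (SAM) to image manipulation detection, chiefly its reliance on manual prompts and the difficulty of supporting cross-dataset generalization with single-view information. It proposes IMDPrompter, a cross-view prompt learning paradigm built on SAM. The key components are automated prompt design, Cross-view Feature Perception, Optimal Prompt Selection, and Cross-View Prompt Consistency, which together facilitate cross-view perceptual learning and guide SAM to generate accurate masks. Experiments validate the effectiveness of the method.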

Link: https://arxiv.org/abs/2502.02454
Authors: Quan Zhang, Yuxin Qi, Xi Tang, Jinwei Fang, Xi Lin, Ke Zhang, Chun Yuan
Affiliations: Tsinghua University; Shanghai Jiao Tong University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Using extensive training data from SA-1B, the Segment Anything Model (SAM) has demonstrated exceptional generalization and zero-shot capabilities, attracting widespread attention in areas such as medical image segmentation and remote sensing image segmentation. However, its performance in the field of image manipulation detection remains largely unexplored and unconfirmed. There are two main challenges in applying SAM to image manipulation detection: a) reliance on manual prompts, and b) the difficulty of single-view information in supporting cross-dataset generalization. To address these challenges, we develop a cross-view prompt learning paradigm called IMDPrompter based on SAM. Benefiting from the design of automated prompts, IMDPrompter no longer relies on manual guidance, enabling automated detection and localization. Additionally, we propose components such as Cross-view Feature Perception, Optimal Prompt Selection, and Cross-View Prompt Consistency, which facilitate cross-view perceptual learning and guide SAM to generate accurate masks. Extensive experimental results from five datasets (CASIA, Columbia, Coverage, IMD2020, and NIST16) validate the effectiveness of our proposed method.

[CV-18] Personalization Toolkit: Training Free Personalization of Large Vision Language Models

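【Quick read】: This paper addresses the reliance of existing Large Vision Language Model (LVLM) personalization on time-consuming test-time training. The key contribution is a training-free approach that uses pre-trained vision foundation models to extract distinct features, retrieval-augmented generation (RAG) techniques to recognize instances in the visual input, and visual prompting methods, enabling flexible and efficient personalization without extensive retraining.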

Link: https://arxiv.org/abs/2502.02452
Authors: Soroush Seifi, Vaggelis Dorovatas, Daniel Olmeda Reino, Rahaf Aljundi
Affiliations: Toyota Motor Europe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Vision Language Models (LVLMs) have significant potential to deliver personalized assistance by adapting to individual users’ unique needs and preferences. Personalization of LVLMs is an emerging area that involves customizing models to recognize specific object instances and provide tailored responses. However, existing approaches rely on time-consuming test-time training for each user and object, rendering them impractical. This paper proposes a novel, training-free approach to LVLM personalization by leveraging pre-trained vision foundation models to extract distinct features, retrieval-augmented generation (RAG) techniques to recognize instances in the visual input, and visual prompting methods. Our model-agnostic vision toolkit enables flexible and efficient personalization without extensive retraining. We demonstrate state-of-the-art results, outperforming conventional training-based approaches and establish a new standard for LVLM personalization.
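
The retrieval step of such a training-free pipeline can be pictured as a nearest-neighbour lookup over stored reference embeddings. The sketch below is only a guess at the general shape, not the paper's toolkit: the embeddings would come from a pre-trained vision foundation model, and `retrieve_instance`, the object names, the similarity threshold, and the prompt text are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve_instance(query_embedding, reference_bank, min_similarity=0.8):
    """Return the best-matching personal object name, or None if nothing is close."""
    best_name, best_sim = None, min_similarity
    for name, embedding in reference_bank.items():
        sim = cosine(query_embedding, embedding)
        if sim >= best_sim:
            best_name, best_sim = name, sim
    return best_name

# Hypothetical reference bank built once per user, with no model training.
bank = {"my_mug": [0.9, 0.1, 0.0], "my_dog": [0.1, 0.8, 0.6]}
match = retrieve_instance([0.85, 0.2, 0.05], bank)
# The retrieved name can then be placed in the text prompt given to the LVLM.
prompt_hint = f"The highlighted object is <{match}>." if match else ""
```

Adding a new personal object only means adding one entry to the bank, which is the practical appeal of a retrieval-based design over per-user fine-tuning.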

[CV-19] TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes

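【Quick read】: This paper addresses spatio-temporal video understanding in complex roadside traffic scenarios by introducing the TUMTraffic-VideoQA dataset and benchmark together with a baseline model. The dataset comprises 1,000 videos with 85,000 multiple-choice QA pairs, 2,300 object captioning annotations, and 5,700 object grounding annotations, covering diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatio-temporal object expressions, it unifies three tasks (multiple-choice video question answering, referred object captioning, and spatio-temporal object grounding) within one cohesive evaluation framework. The paper also introduces the TUMTraffic-Qwen baseline model, enhanced with visual token sampling strategies, to provide insight into the challenges of fine-grained spatio-temporal reasoning.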

Link: https://arxiv.org/abs/2502.02449
Authors: Xingcheng Zhou, Konstantinos Larintzakis, Hao Guo, Walter Zimmer, Mingyu Liu, Hu Cao, Jiajie Zhang, Venkatnarayanan Lakshminarasimhan, Leah Strand, Alois C. Knoll
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present TUMTraffic-VideoQA, a novel dataset and benchmark designed for spatio-temporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatio-temporal object expressions, TUMTraffic-VideoQA unifies three essential tasks (multiple-choice video question answering, referred object captioning, and spatio-temporal object grounding) within a cohesive evaluation framework. We further introduce the TUMTraffic-Qwen baseline model, enhanced with visual token sampling strategies, providing valuable insights into the challenges of fine-grained spatio-temporal reasoning. Extensive experiments demonstrate the dataset’s complexity, highlight the limitations of existing models, and position TUMTraffic-VideoQA as a robust foundation for advancing research in intelligent transportation systems. The dataset and benchmark are publicly available to facilitate further exploration.

[CV-20] Extending SEEDS to a Supervoxel Algorithm for Medical Image Analysis

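【Quick read】: This paper extends the SEEDS superpixel algorithm from 2D images to 3D volumes, producing 3D SEEDS, a faster, better, and open-source supervoxel algorithm for medical image analysis. Compared with the widely used SLIC supervoxel algorithm on 13 segmentation tasks across 10 organs, 3D SEEDS generates supervoxels 10 times faster, improves the achievable Dice score by 6.5%, and reduces the under-segmentation error by 0.16%.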

Link: https://arxiv.org/abs/2502.02409
Authors: Chenhui Zhao, Yan Jiang, Todd C. Hollon
Affiliations: University of Michigan; Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech report

Abstract:In this work, we extend the SEEDS superpixel algorithm from 2D images to 3D volumes, resulting in 3D SEEDS, a faster, better, and open-source supervoxel algorithm for medical image analysis. We compare 3D SEEDS with the widely used supervoxel algorithm SLIC on 13 segmentation tasks across 10 organs. 3D SEEDS accelerates supervoxel generation by a factor of 10, improves the achievable Dice score by +6.5%, and reduces the under-segmentation error by -0.16%. The code is available at this https URL

[CV-21] LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models

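【Quick read】: This paper addresses the high memory demands and the need for distributed computation that arise in the cross-attention layers of multimodal large language models (MLLMs) when they process large visual inputs, as in video understanding. Existing distributed attention mechanisms incur significant communication overhead, which makes cross-attention layers a bottleneck for efficient training and inference. The key contribution is LV-XAttn, a distributed, exact cross-attention mechanism with minimal communication overhead. LV-XAttn keeps the large key-value blocks local to each GPU and exchanges the smaller query blocks across GPUs, and it adds an efficient activation recomputation technique to support longer visual contexts. A theoretical analysis shows that LV-XAttn can achieve speedups for a wide range of models.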

Link: https://arxiv.org/abs/2502.02406
Authors: Tzu-Tao Chang, Shivaram Venkataraman
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Abstract:Cross-attention is commonly adopted in multimodal large language models (MLLMs) for integrating visual information into the language backbone. However, in applications with large visual inputs, such as video understanding, processing a large number of visual tokens in cross-attention layers leads to high memory demands and often necessitates distributed computation across multiple GPUs. Existing distributed attention mechanisms face significant communication overheads, making cross-attention layers a critical bottleneck for efficient training and inference of MLLMs. To address this, we propose LV-XAttn, a distributed, exact cross-attention mechanism with minimal communication overhead. We observe that in applications involving large visual inputs the size of the query block is typically much smaller than that of the key-value blocks. Thus, in LV-XAttn we keep the large key-value blocks locally on each GPU and exchange smaller query blocks across GPUs. We also introduce an efficient activation recomputation technique enabling support for longer visual context. We theoretically analyze the communication benefits of LV-XAttn and show that it can achieve speedups for a wide range of models. Our evaluations with mPLUG-Owl3 and OpenFlamingo models find that LV-XAttn achieves up to 5.58× end-to-end speedup compared to existing approaches.
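
One way to see why attention can stay exact when a query visits key-value blocks that live on different GPUs is the standard online-softmax merge sketched below. This is an illustrative single-process stand-in, not LV-XAttn's kernels or communication schedule: each shard plays the role of a key-value block resident on one GPU, and all function names are assumptions.

```python
import math

def full_attention(q, keys, values):
    """Reference softmax attention for a single query vector."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / denom for j in range(dim)]

def sharded_attention(q, kv_shards):
    """Visit each key-value shard with the same query and merge partial results.

    Keeps a running max, running denominator, and unnormalised output, so the
    merged result equals full attention without concatenating the shards.
    """
    run_max, run_denom, run_out = float("-inf"), 0.0, None
    for keys, values in kv_shards:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        new_max = max(run_max, max(scores))
        scale = math.exp(run_max - new_max)  # rescales what was accumulated so far
        weights = [math.exp(s - new_max) for s in scores]
        if run_out is None:
            run_out = [0.0] * len(values[0])
        run_out = [scale * o + sum(w * v[j] for w, v in zip(weights, values))
                   for j, o in enumerate(run_out)]
        run_denom = scale * run_denom + sum(weights)
        run_max = new_max
    return [o / run_denom for o in run_out]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]
reference = full_attention(q, keys, values)
merged = sharded_attention(q, [(keys[:2], values[:2]), (keys[2:], values[2:])])
```

Only the query and three small running statistics need to travel between shards, which matches the abstract's observation that moving query blocks is cheaper than moving key-value blocks when the visual input is large.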

[CV-22] MaintaAvatar: A Maintainable Avatar Based on Neural Radiance Fields by Continual Learning AAAI2025

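【Quick read】: This paper addresses how to update a virtual digital avatar as a subject's appearance and pose keep changing in real-world scenarios, while retaining the ability to render the subject's earlier appearances. The key contribution is MaintaAvatar, a maintainable avatar based on neural radiance fields that uses a Global-Local Joint Storage Module and a Pose Distillation Module to counter the forgetting caused by learning new appearances and poses, thereby avoiding degraded rendering of past appearances, in particular color bleeding and incorrect human body poses.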

Link: https://arxiv.org/abs/2502.02372
Authors: Shengbo Gu, Yu-Kun Qiu, Yu-Ming Tang, Ancong Wu, Wei-Shi Zheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAAI 2025. 9 pages

Abstract:The generation of a virtual digital avatar is a crucial research topic in the field of computer vision. Many existing works utilize Neural Radiance Fields (NeRF) to address this issue and have achieved impressive results. However, previous works assume the images of the training person are available and fixed while the appearances and poses of a subject could constantly change and increase in real-world scenarios. How to update the human avatar but also maintain the ability to render the old appearance of the person is a practical challenge. One trivial solution is to combine the existing virtual avatar models based on NeRF with continual learning methods. However, there are some critical issues in this approach: learning new appearances and poses can cause the model to forget past information, which in turn leads to a degradation in the rendering quality of past appearances, especially color bleeding issues, and incorrect human body poses. In this work, we propose a maintainable avatar (MaintaAvatar) based on neural radiance fields by continual learning, which resolves the issues by utilizing a Global-Local Joint Storage Module and a Pose Distillation Module. Overall, our model requires only limited data collection to quickly fine-tune the model while avoiding catastrophic forgetting, thus achieving a maintainable virtual avatar. The experimental results validate the effectiveness of our MaintaAvatar model.

[CV-23] Field Matching: an Electrostatic Paradigm to Generate and Transfer Data

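[Quick Read]: This paper targets generative modeling and distribution transfer tasks. The key to the solution is Electrostatic Field Matching (EFM), a method inspired by the physics of an electrical capacitor: the source and target distributions are placed on the two capacitor plates and assigned positive and negative charges, respectively, and a neural network approximator learns the capacitor's electrostatic field. Samples are then moved along the learned field lines from one plate to the other, which maps one distribution onto the other. The authors justify theoretically that the method yields the distribution transfer and demonstrate it in experiments on toy and image data.

Link: https://arxiv.org/abs/2502.02367
Authors: Alexander Kolesov, Manukhov Stepan, Vladimir V. Palyulin, Alexander Korotin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: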

Abstract:We propose Electrostatic Field Matching (EFM), a novel method that is suitable for both generative modeling and distribution transfer tasks. Our approach is inspired by the physics of an electrical capacitor. We place source and target distributions on the capacitor plates and assign them positive and negative charges, respectively. We then learn the electrostatic field of the capacitor using a neural network approximator. To map the distributions to each other, we start at one plate of the capacitor and move the samples along the learned electrostatic field lines until they reach the other plate. We theoretically justify that this approach provably yields the distribution transfer. In practice, we demonstrate the performance of our EFM in toy and image data experiments.
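
The capacitor picture can be made concrete with a small NumPy sketch. It is an illustration only: the field is computed directly from point charges instead of being learned by a neural network, the plate gap, step size, and function names are hypothetical, and 2D data is embedded in 3D so that the plates sit at z = 0 and z = 1.

```python
import numpy as np

def coulomb_field(x, pos_charges, neg_charges, eps=1e-9):
    """Field at points x (n, D) from unit point charges in D dimensions."""
    dim = x.shape[1]

    def contribution(charges, sign):
        diff = x[:, None, :] - charges[None, :, :]                 # (n, m, D)
        dist = np.linalg.norm(diff, axis=-1, keepdims=True) + eps
        return sign * (diff / dist**dim).mean(axis=1)              # r / |r|^D law

    return contribution(pos_charges, 1.0) + contribution(neg_charges, -1.0)

def transport(x, pos_charges, neg_charges, plate_gap=1.0, step=0.01, max_steps=10_000):
    """Euler steps along the normalized field until the last coordinate reaches the far plate."""
    x = x.copy()
    for _ in range(max_steps):
        active = x[:, -1] < plate_gap
        if not active.any():
            break
        e = coulomb_field(x[active], pos_charges, neg_charges)
        x[active] += step * e / np.linalg.norm(e, axis=1, keepdims=True)
    return x

rng = np.random.default_rng(0)
source = np.c_[rng.normal(size=(50, 2)), np.zeros(50)]        # positive plate, z = 0
target = np.c_[rng.normal(size=(50, 2)) + 2.0, np.ones(50)]   # negative plate, z = 1
start = source[:5] + np.array([0.0, 0.0, 0.01])               # nudge off the plate
moved = transport(start, source, target, max_steps=50)
```

Between the plates the z-component of this field is positive everywhere, so samples can only drift from the source plate toward the target plate.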

[CV-24] MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm

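[Quick Read]: This paper addresses limitations of existing approaches to human motion generation and editing, which tend to offer isolated, task-specific solutions that are inefficient and impractical. Earlier attempts to unify motion-related tasks by conditioning on different modalities lack editing capabilities and fine-grained control, and do not facilitate knowledge sharing across tasks. To address this, the paper proposes a new paradigm, Motion-Condition-Motion, which formulates diverse tasks in a unified way through three concepts: source motion, condition, and target motion. Building on this paradigm, it proposes a unified framework, MotionLab, which uses rectified flows to learn the mapping from source motion to target motion, guided by the specified condition. MotionLab's key components are: 1) the MotionFlow Transformer, which enhances conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding, which ensures time synchronization between source and target motion; 3) Task Specified Instruction Modulation; and 4) Motion Curriculum Learning, for effective multi-task learning and knowledge sharing across tasks.

Link: https://arxiv.org/abs/2502.02358
Authors: Ziyan Guo, Zeyu Hu, Na Zhao, De Wen Soh
Affiliation: Singapore University of Technology and Design; LightSpeed Studios
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: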

Abstract:Human motion generation and editing are key components of computer graphics and vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some efforts have aimed to unify motion-related tasks, these methods simply use different modalities as conditions to guide motion generation. Consequently, they lack editing capabilities, fine-grained control, and fail to facilitate knowledge sharing across tasks. To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. In MotionLab, we introduce the 1) MotionFlow Transformer to enhance conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding to guarantee the time synchronization between source motion and target motion; 3) Task Specified Instruction Modulation; and 4) Motion Curriculum Learning for effective multi-task learning and knowledge sharing across tasks. Notably, our MotionLab demonstrates promising generalization capabilities and inference efficiency across multiple benchmarks for human motion. Our code and additional video results are available at: this https URL.
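
As a rough illustration of the rectified-flow ingredient mentioned in the abstract, the sketch below builds the straight-line interpolation between a source and a target motion and integrates a velocity field with Euler steps. The array layout (batch, frames, features), the function names, and the oracle velocity used in the demo are assumptions for illustration; MotionLab's actual network and conditioning are not reproduced here.

```python
import numpy as np

def rectified_flow_pair(source, target, t):
    """Return the interpolated state x_t and the velocity a model would regress."""
    t = np.asarray(t, dtype=float).reshape(-1, *([1] * (source.ndim - 1)))
    x_t = (1.0 - t) * source + t * target     # straight path from source to target
    velocity = target - source                # constant along that path
    return x_t, velocity

def euler_sample(source, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = source.copy()
    for i in range(steps):
        t = np.full(x.shape[0], i / steps)
        x = x + velocity_fn(x, t) / steps
    return x

source = np.zeros((2, 4, 6))   # hypothetical source motion: 2 clips, 4 frames, 6 features
target = np.ones((2, 4, 6))    # hypothetical target motion
x_mid, v = rectified_flow_pair(source, target, [0.5, 0.5])
# A trained, condition-aware network would replace this oracle velocity.
edited = euler_sample(source, lambda x, t: target - source)
```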

[CV-25] Transfer Risk Map: Mitigating Pixel-level Negative Transfer in Medical Segmentation

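[Quick Read]: This paper addresses negative transfer when transfer learning is applied to medical image segmentation. Existing methods focus on classification or regression tasks and ignore the non-uniform negative transfer risk across image regions. The key to the solution is a weighted fine-tuning method: a transferability-guided transfer risk map quantifies the transfer hardness of each pixel and its potential risk of negative transfer, and a map-weighted loss function, normalized by image foreground size, counters class imbalance. The method improves target-task performance by 4.37% on FeTS2021 and 1.81% on iSeg2019, avoiding negative transfer across modalities and tasks.

Link: https://arxiv.org/abs/2502.02340
Authors: Shutong Duan, Jingyun Yang, Yang Tan, Guoqing Zhang, Yang Li, Xiao-Ping Zhang
Affiliation: Tsinghua Shenzhen International Graduate School, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: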

Abstract:How to mitigate negative transfer in transfer learning is a long-standing and challenging issue, especially in the application of medical image segmentation. Existing methods for reducing negative transfer focus on classification or regression tasks, ignoring the non-uniform negative transfer risk in different image regions. In this work, we propose a simple yet effective weighted fine-tuning method that directs the model’s attention towards regions with significant transfer risk for medical semantic segmentation. Specifically, we compute a transferability-guided transfer risk map to quantify the transfer hardness for each pixel and the potential risks of negative transfer. During the fine-tuning phase, we introduce a map-weighted loss function, normalized with image foreground size to counter class imbalance. Extensive experiments on brain segmentation datasets show our method significantly improves the target task performance, with gains of 4.37% on FeTS2021 and 1.81% on iSeg2019, avoiding negative transfer across modalities and tasks. Meanwhile, a 2.9% gain under a few-shot scenario validates the robustness of our approach.
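
One way to read "map-weighted loss, normalized with image foreground size" is sketched below. The exact loss in the paper is not given in the abstract, so the binary cross-entropy, the per-image normalization, and the name `map_weighted_loss` are all assumptions.

```python
import numpy as np

def map_weighted_loss(prob, label, risk_map, eps=1e-7):
    """Risk-weighted binary cross-entropy, divided by each image's foreground size.

    prob, label, risk_map: arrays of shape (batch, height, width).
    """
    prob = np.clip(prob, eps, 1.0 - eps)
    ce = -(label * np.log(prob) + (1.0 - label) * np.log(1.0 - prob))
    foreground = np.maximum(label.sum(axis=(1, 2)), 1.0)   # avoid division by zero
    return ((risk_map * ce).sum(axis=(1, 2)) / foreground).mean()

label = np.zeros((1, 4, 4))
label[0, :2, :2] = 1.0                      # 4 foreground pixels
prob = np.full((1, 4, 4), 0.5)              # maximally uncertain prediction
uniform_risk = np.ones((1, 4, 4))
baseline = map_weighted_loss(prob, label, uniform_risk)
```

Pixels with a higher value in `risk_map` contribute proportionally more to the loss, which is the mechanism the abstract describes for directing attention toward regions with significant transfer risk.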

[CV-26] Geometric Neural Process Fields

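[Quick Read]: This paper addresses Neural Field (NeF) generalization, where a model must adapt efficiently to new signals given only a few observations. To tackle this, it proposes Geometric Neural Process Fields (G-NPF), a probabilistic framework for neural radiance fields that explicitly captures uncertainty. The key to the solution is to formulate NeF generalization as a probabilistic problem and to introduce a set of geometric bases that encode spatial structure and facilitate inference of NeF function distributions. On top of these bases, a hierarchical latent variable model lets G-NPF integrate structural information across multiple spatial levels and effectively parameterize INR functions, improving generalization to novel scenes and unseen signals.

Link: https://arxiv.org/abs/2502.02338
Authors: Wenzhe Yin, Zehao Xiao, Jiayi Shen, Yunlu Chen, Cees G. M. Snoek, Jan-Jakob Sonke, Efstratios Gavves
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: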

Abstract:This paper addresses the challenge of Neural Field (NeF) generalization, where models must efficiently adapt to new signals given only a few observations. To tackle this, we propose Geometric Neural Process Fields (G-NPF), a probabilistic framework for neural radiance fields that explicitly captures uncertainty. We formulate NeF generalization as a probabilistic problem, enabling direct inference of NeF function distributions from limited context observations. To incorporate structural inductive biases, we introduce a set of geometric bases that encode spatial structure and facilitate the inference of NeF function distributions. Building on these bases, we design a hierarchical latent variable model, allowing G-NPF to integrate structural information across multiple spatial levels and effectively parameterize INR functions. This hierarchical approach improves generalization to novel scenes and unseen signals. Experiments on novel-view synthesis for 3D scenes, as well as 2D image and 1D signal regression, demonstrate the effectiveness of our method in capturing uncertainty and leveraging structural information for improved generalization.

[CV-27] Event-aided Semantic Scene Completion

[Quick Read]: This paper addresses the limitations of RGB-based scene understanding for autonomous driving in difficult conditions such as motion blur, poor lighting, and adverse weather. To this end, it proposes EvSSC, a new event-aided framework for 3D semantic scene completion. The key is the Event-aided Lifting Module (ELM), which fuses 2D RGB and event features and lifts them into 3D space, enhancing view transformation and the robustness of 3D volume construction and improving prediction accuracy across scenarios.

Link: https://arxiv.org/abs/2502.02334
Authors: Shangwei Guo, Hao Shi, Song Wang, Xiaoting Yin, Kailun Yang, Kaiwei Wang
Affiliation: State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University; School of Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University; College of Computer Science and Technology, Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: The established datasets and codebase will be made publicly at this https URL

Abstract:Autonomous driving systems rely on robust 3D scene understanding. Recent advances in Semantic Scene Completion (SSC) for autonomous driving underscore the limitations of RGB-based approaches, which struggle under motion blur, poor lighting, and adverse weather. Event cameras, offering high dynamic range and low latency, address these challenges by providing asynchronous data that complements RGB inputs. We present DSEC-SSC, the first real-world benchmark specifically designed for event-aided SSC, which includes a novel 4D labeling pipeline for generating dense, visibility-aware labels that adapt dynamically to object motion. Our proposed RGB-Event fusion framework, EvSSC, introduces an Event-aided Lifting Module (ELM) that effectively bridges 2D RGB-Event features to 3D space, enhancing view transformation and the robustness of 3D volume construction across SSC models. Extensive experiments on DSEC-SSC and simulated SemanticKITTI-E demonstrate that EvSSC is adaptable to both transformer-based and LSS-based SSC architectures. Notably, evaluations on SemanticKITTI-C demonstrate that EvSSC achieves consistently improved prediction accuracy across five degradation modes and both In-domain and Out-of-domain settings, achieving up to a 52.5% relative improvement in mIoU when the image sensor partially fails. Additionally, we quantitatively and qualitatively validate the superiority of EvSSC under motion blur and extreme weather conditions, where autonomous driving is challenged. The established datasets and our codebase will be made publicly at this https URL.

[CV-28] Improving Generalization Ability for 3D Object Detection by Learning Sparsity-invariant Features ICRA2025

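[Quick Read]: This paper addresses the substantial performance drop of 3D object detectors in unseen domains. The key to the solution is to selectively subsample the source data to a specific density and to use a teacher-student framework together with feature content alignment (FCA) and graph-based embedding relationship alignment (GERA), so that the detector generalizes from a single source domain to target domains with different sensor configurations and scene distributions, learning domain-agnostic features and improving generalization.

Link: https://arxiv.org/abs/2502.02322
Authors: Hsin-Cheng Lu, Chung-Yi Lin, Winston H. Hsu
Affiliation: National Taiwan University; Mobile Drive Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to ICRA 2025. Code is available at this https URL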

Abstract:In autonomous driving, 3D object detection is essential for accurately identifying and tracking objects. Despite the continuous development of various technologies for this task, a significant drawback is observed in most of them: they experience substantial performance degradation when detecting objects in unseen domains. In this paper, we propose a method to improve the generalization ability for 3D object detection on a single domain. We primarily focus on generalizing from a single source domain to target domains with distinct sensor configurations and scene distributions. To learn sparsity-invariant features from a single source domain, we selectively subsample the source data to a specific beam, using confidence scores determined by the current detector to identify the density that holds utmost importance for the detector. Subsequently, we employ the teacher-student framework to align the Bird’s Eye View (BEV) features for different point cloud densities. We also utilize feature content alignment (FCA) and graph-based embedding relationship alignment (GERA) to instruct the detector to be domain-agnostic. Extensive experiments demonstrate that our method exhibits superior generalization capabilities compared to other baselines. Furthermore, our approach even outperforms certain domain adaptation methods that can access the target domain data.
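
The beam subsampling step can be pictured with a short sketch. It assumes that every LiDAR point carries an integer beam (ring) index and that a sparser sensor is emulated by keeping every k-th beam; the confidence-guided choice of density described in the abstract is not modelled, and the function name is hypothetical.

```python
import numpy as np

def subsample_beams(points, beam_ids, keep_every):
    """Keep only the points whose beam index is a multiple of keep_every."""
    mask = (beam_ids % keep_every) == 0
    return points[mask], beam_ids[mask]

rng = np.random.default_rng(0)
points = rng.normal(size=(6400, 3))           # hypothetical 64-beam scan, 100 points per beam
beam_ids = np.repeat(np.arange(64), 100)
sparse_points, sparse_ids = subsample_beams(points, beam_ids, keep_every=2)   # emulate 32 beams
```

A teacher that sees `points` and a student that sees `sparse_points` could then be trained to produce matching BEV features, which is the alignment the abstract refers to.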

[CV-29] Review of Demographic Bias in Face Recognition

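[Quick Read]: This paper addresses bias in face recognition (FR) systems arising from demographic differences such as race, ethnicity, and gender. Its key contribution is a systematic analysis of the primary causes of these biases, the relevant datasets, assessment metrics, and mitigation strategies, giving a structured approach to understanding and addressing this complex issue and emphasizing the urgent need for fair and trustworthy FR systems.

Link: https://arxiv.org/abs/2502.02309
Authors: Ketan Kotwal, Sebastien Marcel
Affiliation: Idiap Research Institute, University of Lausanne
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: under review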

Abstract:Demographic bias in face recognition (FR) has emerged as a critical area of research, given its impact on fairness, equity, and reliability across diverse applications. As FR technologies are increasingly deployed globally, disparities in performance across demographic groups – such as race, ethnicity, and gender – have garnered significant attention. These biases not only compromise the credibility of FR systems but also raise ethical concerns, especially when these technologies are employed in sensitive domains. This review consolidates extensive research efforts providing a comprehensive overview of the multifaceted aspects of demographic bias in FR. We systematically examine the primary causes, datasets, assessment metrics, and mitigation approaches associated with demographic disparities in FR. By categorizing key contributions in these areas, this work provides a structured approach to understanding and addressing the complexity of this issue. Finally, we highlight current advancements and identify emerging challenges that need further investigation. This article aims to provide researchers with a unified perspective on the state-of-the-art while emphasizing the critical need for equitable and trustworthy FR systems.

[CV-30] UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training

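[Quick Read]: This paper addresses the poor generalization of current gaze estimation models across diverse data domains. The key to the solution is UniGaze, which performs self-supervised pre-training on large-scale, in-the-wild facial datasets, using a Masked Autoencoder (MAE) with a Vision Transformer (ViT) backbone to learn feature representations suited to downstream gaze estimation. This significantly improves generalization across multiple data domains and reduces reliance on costly labeled data.

Link: https://arxiv.org/abs/2502.02307
Authors: Jiawei Qin, Xucong Zhang, Yusuke Sugano
Affiliation: Institute of Industrial Science, The University of Tokyo; Delft University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: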

Abstract:Despite decades of research on data collection and model architectures, current gaze estimation models face significant challenges in generalizing across diverse data domains. While recent advances in self-supervised pre-training have shown remarkable potential for improving model generalization in various vision tasks, their effectiveness in gaze estimation remains unexplored due to the geometric nature of the gaze regression task. We propose UniGaze, which leverages large-scale, in-the-wild facial datasets through self-supervised pre-training for gaze estimation. We carefully curate multiple facial datasets that capture diverse variations in identity, lighting, background, and head poses. By directly applying Masked Autoencoder (MAE) pre-training on normalized face images with a Vision Transformer (ViT) backbone, our UniGaze learns appropriate feature representations within the specific input space required by downstream gaze estimation models. Through comprehensive experiments using challenging cross-dataset evaluation and novel protocols, including leave-one-dataset-out and joint-dataset settings, we demonstrate that UniGaze significantly improves generalization across multiple data domains while minimizing reliance on costly labeled data. The source code and pre-trained models will be released upon acceptance.
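
The core of MAE pre-training is hiding most image patches and encoding only the visible ones. The sketch below shows that masking step in NumPy; the 75% mask ratio is the commonly used MAE default and is an assumption here, as are the patch counts and the function name.

```python
import numpy as np

def random_masking(patches, mask_ratio, rng):
    """Split patches (batch, num_patches, dim) into a visible subset and a boolean mask."""
    n, num_patches, _ = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))
    noise = rng.random((n, num_patches))                    # one random score per patch
    ids_keep = np.sort(np.argsort(noise, axis=1)[:, :num_keep], axis=1)
    visible = np.take_along_axis(patches, ids_keep[:, :, None], axis=1)
    mask = np.ones((n, num_patches), dtype=bool)            # True = hidden from the encoder
    np.put_along_axis(mask, ids_keep, False, axis=1)
    return visible, mask, ids_keep

rng = np.random.default_rng(0)
patches = rng.normal(size=(2, 196, 32))    # e.g. a 14x14 grid of patch embeddings per face image
visible, mask, ids_keep = random_masking(patches, mask_ratio=0.75, rng=rng)
```

The encoder would process `visible` only, and a lightweight decoder would be trained to reconstruct the pixels at positions where `mask` is true.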

[CV-31] GP-GS: Gaussian Processes for Enhanced Gaussian Splatting

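[Quick Read]: This paper addresses the loss of reconstruction quality in 3D Gaussian Splatting caused by its reliance on sparse Structure-from-Motion (SfM) point clouds. The key to the solution is a new framework, Gaussian Processes Gaussian Splatting (GP-GS), in which a multi-output Gaussian Process model performs adaptive, uncertainty-guided densification of sparse SfM point clouds. Specifically, a dynamic sampling and filtering pipeline uses GP-based predictions to infer new candidate points from the input 2D pixels and depth maps, and uses uncertainty estimates to prune high-variance predictions, ensuring geometric consistency and producing dense point clouds. The densified point clouds provide high-quality initial 3D Gaussians, which improves reconstruction performance.

Link: https://arxiv.org/abs/2502.02283
Authors: Zhihao Guo, Jingxuan Su, Shenglin Wang, Jinlong Fan, Jing Zhang, Liangxiu Han, Peng Wang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 11 figures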

Abstract:3D Gaussian Splatting has emerged as an efficient photorealistic novel view synthesis method. However, its reliance on sparse Structure-from-Motion (SfM) point clouds consistently compromises the scene reconstruction quality. To address these limitations, this paper proposes a novel 3D reconstruction framework Gaussian Processes Gaussian Splatting (GP-GS), where a multi-output Gaussian Process model is developed to achieve adaptive and uncertainty-guided densification of sparse SfM point clouds. Specifically, we propose a dynamic sampling and filtering pipeline that adaptively expands the SfM point clouds by leveraging GP-based predictions to infer new candidate points from the input 2D pixels and depth maps. The pipeline utilizes uncertainty estimates to guide the pruning of high-variance predictions, ensuring geometric consistency and enabling the generation of dense point clouds. The densified point clouds provide high-quality initial 3D Gaussians to enhance reconstruction performance. Extensive experiments conducted on synthetic and real-world datasets across various scales validate the effectiveness and practicality of the proposed framework.
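
A minimal sketch of uncertainty-guided densification follows. It is a simplification under stated assumptions: an RBF kernel shared by all outputs stands in for the paper's multi-output GP, inputs are 2D pixel coordinates only (no depth), and the variance threshold and function names are hypothetical.

```python
import numpy as np

def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel between two sets of 2D inputs."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale**2)

def gp_predict(x_train, y_train, x_new, noise=1e-4):
    """Posterior mean (m, outputs) and variance (m,) of a zero-mean GP with unit prior variance."""
    k_tt = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    k_nt = rbf(x_new, x_train)
    mean = k_nt @ np.linalg.solve(k_tt, y_train)
    var = 1.0 - np.einsum("ij,ji->i", k_nt, np.linalg.solve(k_tt, k_nt.T))
    return mean, var

def densify(x_train, y_train, candidates, max_var):
    """Predict 3D points for candidate pixels and prune high-variance predictions."""
    mean, var = gp_predict(x_train, y_train, candidates)
    keep = var <= max_var
    return candidates[keep], mean[keep]

rng = np.random.default_rng(0)
pixels = rng.random((50, 2))                      # normalized pixel coordinates of SfM points
points_3d = rng.normal(size=(50, 3))              # their (synthetic) 3D positions
candidates = np.vstack([pixels[0] + 1e-3, [5.0, 5.0]])   # one well supported, one far from all data
kept_pixels, new_points = densify(pixels, points_3d, candidates, max_var=0.5)
```

The candidate far from every observed pixel has posterior variance close to the prior and is pruned, while the candidate next to an SfM point is kept.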

[CV-32] Survey of Quantization Techniques for On-Device Vision-based Crack Detection

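[Quick Read]: This paper addresses how to run efficient deep learning models on resource-constrained devices for vision-based crack detection. The key to the solution is quantization-aware training (QAT), which stays close to floating-point accuracy while keeping resource usage efficient; for example, MobileNetV2x0.5 reaches an F1-score of 0.8376 with Torch-QAT. This enables real-time, low-power crack detection and improves the safety, scalability, and cost-efficiency of structural health monitoring (SHM) applications.

Link: https://arxiv.org/abs/2502.02269
Authors: Yuxuan Zhang, Luciano Sebastian Martinez-Rau, Quynh Nguyen Phuong Vu, Bengt Oelmann, Sebastian Bader
Affiliation: Mid Sweden University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by IEEE International Instrumentation and Measurement Technology Conference (I2MTC) 2025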

Abstract:Structural Health Monitoring (SHM) ensures the safety and longevity of infrastructure by enabling timely damage detection. Vision-based crack detection, combined with UAVs, addresses the limitations of traditional sensor-based SHM methods but requires the deployment of efficient deep learning models on resource-constrained devices. This study evaluates two lightweight convolutional neural network models, MobileNetV1x0.25 and MobileNetV2x0.5, across TensorFlow, PyTorch, and Open Neural Network Exchange platforms using three quantization techniques: dynamic quantization, post-training quantization (PTQ), and quantization-aware training (QAT). Results show that QAT consistently achieves near-floating-point accuracy, such as an F1-score of 0.8376 for MBNV2x0.5 with Torch-QAT, while maintaining efficient resource usage. PTQ significantly reduces memory and energy consumption but suffers from accuracy loss, particularly in TensorFlow. Dynamic quantization preserves accuracy but faces deployment challenges on PyTorch. By leveraging QAT, this work enables real-time, low-power crack detection on UAVs, enhancing safety, scalability, and cost-efficiency in SHM applications, while providing insights into balancing accuracy and efficiency across different platforms for autonomous inspections.
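
All three quantization techniques compared in the abstract rest on mapping floating-point values to 8-bit integers. The sketch below shows a generic per-tensor affine int8 scheme in NumPy; it does not reproduce the TensorFlow, PyTorch, or ONNX toolchains used in the study, and the function names are hypothetical.

```python
import numpy as np

def quantize_int8(x):
    """Affine (asymmetric) per-tensor quantization of a float array to int8."""
    lo, hi = min(x.min(), 0.0), max(x.max(), 0.0)      # the range must contain zero
    scale = (hi - lo) / 255.0 or 1.0                   # fall back to 1.0 for an all-zero tensor
    zero_point = int(round(-128 - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return (q.astype(np.float64) - zero_point) * scale

def fake_quantize(x):
    """Quantize then dequantize, so a float model sees int8 rounding error."""
    return dequantize(*quantize_int8(x))

rng = np.random.default_rng(0)
weights = rng.normal(size=1000)
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
```

Post-training quantization applies such a mapping once after training, whereas quantization-aware training inserts a `fake_quantize`-style operation in the forward pass so the weights can adapt to the rounding error.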

[CV-33] UNIP: Rethinking Pre-trained Attention Patterns for Infrared Semantic Segmentation ICLR2025

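[Quick Read]: This paper addresses the drop in semantic segmentation performance when there is a large domain gap between the pre-training and fine-tuning data modalities. Its key insight comes from analyzing how different pre-training methods perform on infrared semantic segmentation: the analysis reveals three typical attention patterns (local, hybrid, and global) and shows that the hybrid pattern is crucial for semantic segmentation because it attends to both nearby and foreground elements. Building on these insights, the paper proposes UNIP (UNified Infrared Pre-training), a framework that uses the hybrid-attention distillation NMI-HAD as the pre-training target, the large-scale mixed dataset InfMix for pre-training, and the last-layer feature pyramid network LL-FPN for fine-tuning. Experiments show that UNIP outperforms other pre-training methods by up to 13.5% in average mIoU on three infrared segmentation tasks, and that UNIP-S matches MAE-L at only 1/10 of the computational cost.

Link: https://arxiv.org/abs/2502.02257
Authors: Tao Zhang, Jinyong Wen, Zhen Chen, Kun Ding, Shiming Xiang, Chunhong Pan
Affiliation: MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Automation Science and Electrical Engineering, Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025. 27 pages, 13 figures, 21 tables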

Abstract:Pre-training techniques significantly enhance the performance of semantic segmentation tasks with limited training data. However, the efficacy under a large domain gap between pre-training (e.g. RGB) and fine-tuning (e.g. infrared) remains underexplored. In this study, we first benchmark the infrared semantic segmentation performance of various pre-training methods and reveal several phenomena distinct from the RGB domain. Next, our layerwise analysis of pre-trained attention maps uncovers that: (1) There are three typical attention patterns (local, hybrid, and global); (2) Pre-training tasks notably influence the pattern distribution across layers; (3) The hybrid pattern is crucial for semantic segmentation as it attends to both nearby and foreground elements; (4) The texture bias impedes model generalization in infrared tasks. Building on these insights, we propose UNIP, a UNified Infrared Pre-training framework, to enhance the pre-trained model performance. This framework uses the hybrid-attention distillation NMI-HAD as the pre-training target, a large-scale mixed dataset InfMix for pre-training, and a last-layer feature pyramid network LL-FPN for fine-tuning. Experimental results show that UNIP outperforms various pre-training methods by up to 13.5% in average mIoU on three infrared segmentation tasks, evaluated using fine-tuning and linear probing metrics. UNIP-S achieves performance on par with MAE-L while requiring only 1/10 of the computational cost. Furthermore, UNIP significantly surpasses state-of-the-art (SOTA) infrared or RGB segmentation methods and demonstrates broad potential for application in other modalities, such as RGB and depth. Our code is available at this https URL.

[CV-34] Rotation-Adaptive Point Cloud Domain Generalization via Intricate Orientation Learning

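[Quick Read]: This paper addresses the vulnerability of 3D point cloud analysis to unpredictable rotations, that is, orientation-aware 3D domain generalization. To meet this challenge, it proposes a rotation-adaptive domain generalization framework. The key to the solution is an iterative learning process that exploits intricate samples: it identifies the most challenging rotation for each point cloud and constructs an intricate orientation set by optimizing these orientations. It then applies an orientation-aware contrastive learning framework that combines an orientation consistency loss with a margin separation loss, enabling effective learning of categorically discriminative and generalizable features with rotation consistency.

Link: https://arxiv.org/abs/2502.02247
Authors: Bangzhen Liu, Chenxi Zheng, Xuemiao Xu, Cheng Xu, Huaidong Zhang, Shengfeng He
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, supplementary included, early accepted by TPAMI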

Abstract:The vulnerability of 3D point cloud analysis to unpredictable rotations poses an open yet challenging problem: orientation-aware 3D domain generalization. Cross-domain robustness and adaptability of 3D representations are crucial but not easily achieved through rotation augmentation. Motivated by the inherent advantages of intricate orientations in enhancing generalizability, we propose an innovative rotation-adaptive domain generalization framework for 3D point cloud analysis. Our approach aims to alleviate orientational shifts by leveraging intricate samples in an iterative learning process. Specifically, we identify the most challenging rotation for each point cloud and construct an intricate orientation set by optimizing intricate orientations. Subsequently, we employ an orientation-aware contrastive learning framework that incorporates an orientation consistency loss and a margin separation loss, enabling effective learning of categorically discriminative and generalizable features with rotation consistency. Extensive experiments and ablations conducted on 3D cross-domain benchmarks firmly establish the state-of-the-art performance of our proposed approach in the context of orientation-aware 3D domain generalization.

[CV-35] Mask-informed Deep Contrastive Incomplete Multi-view Clustering

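[Quick Read]: This paper addresses the difficulty of integrating knowledge across views in multi-view clustering (MvC) when samples are missing in specific views. The key to the solution is Mask-informed Deep Contrastive Incomplete Multi-view Clustering (Mask-IMvC), which introduces a mask-informed fusion network that aggregates incomplete multi-view information while treating each sample's observation status across views as a mask, reducing the adverse effects of missing values. In addition, a prior knowledge-assisted contrastive learning loss injects neighborhood information of samples from different views to strengthen the aggregated view-common representation.

Link: https://arxiv.org/abs/2502.02234
Authors: Zhenglai Li, Yuqi Shi, Xiao He, Chang Tang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: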

Abstract:Multi-view clustering (MvC) utilizes information from multiple views to uncover the underlying structures of data. Despite significant advancements in MvC, mitigating the impact of missing samples in specific views on the integration of knowledge from different views remains a critical challenge. This paper proposes a novel Mask-informed Deep Contrastive Incomplete Multi-view Clustering (Mask-IMvC) method, which elegantly identifies a view-common representation for clustering. Specifically, we introduce a mask-informed fusion network that aggregates incomplete multi-view information while considering the observation status of samples across various views as a mask, thereby reducing the adverse effects of missing values. Additionally, we design a prior knowledge-assisted contrastive learning loss that boosts the representation capability of the aggregated view-common representation by injecting neighborhood information of samples from different views. Finally, extensive experiments are conducted to demonstrate the superiority of the proposed Mask-IMvC method over state-of-the-art approaches across multiple MvC datasets, both in complete and incomplete scenarios.
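
The role of the observation mask in fusion can be shown with a very small sketch. It assumes the simplest possible aggregation, a masked average of per-view embeddings, which is a stand-in for the paper's fusion network; the array shapes and the function name are hypothetical.

```python
import numpy as np

def masked_fusion(view_embeddings, mask):
    """Average each sample's embeddings over the views in which it was observed.

    view_embeddings: (num_views, num_samples, dim); mask: (num_samples, num_views), 1 = observed.
    """
    w = mask.T[:, :, None].astype(float)                  # (num_views, num_samples, 1)
    observed = np.where(w > 0, view_embeddings, 0.0)      # ignore missing entries, even NaN
    total = np.maximum(w.sum(axis=0), 1.0)                # number of observed views per sample
    return observed.sum(axis=0) / total

embeddings = np.array([
    [[1.0, 1.0], [2.0, 2.0]],          # view 0
    [[3.0, 3.0], [np.nan, np.nan]],    # view 1: sample 1 is missing
])
mask = np.array([[1, 1], [1, 0]])
common = masked_fusion(embeddings, mask)
```

Sample 0 is averaged over both views, while sample 1 falls back to its single observed view instead of being contaminated by the missing entry.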

[CV-36] A Robust Remote Photoplethysmography Method

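[Quick Read]: This paper addresses the drop in heart-rate measurement accuracy of remote photoplethysmography (rPPG) under non-ideal conditions, such as subject movement and changes in illumination. The key to the solution is a method that computes heart rate with a combination of mathematical transforms, which is more resistant to distortions and has minimal hardware requirements. It performs best with a camera whose infrared filter has been removed, although an unmodified camera can also be used. In experiments, the average mean absolute error was 1.95 beats per minute, better than previous work.

Link: https://arxiv.org/abs/2502.02229
Authors: Alexey Protopopov
Affiliation: Joint Stock Research and Production Company Kryptonite
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, 1 table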

Abstract:Remote photoplethysmography (rPPG) is a method for measuring a subject's heart rate remotely using a camera. Factors such as subject movement, ambient light level, makeup etc. complicate such measurements by distorting the observed pulse. Recent works on this topic have proposed a variety of approaches for accurately measuring heart rate in humans; however, these methods were tested in ideal conditions, where the subject does not make significant movements and all measurements are taken at the same level of illumination. In more realistic conditions these methods suffer from decreased accuracy. The study proposes a more robust method that is less susceptible to distortions and has minimal hardware requirements. The proposed method uses a combination of mathematical transforms to calculate the subject's heart rate. It performs best when used with a camera that has been modified by removing its infrared filter, although using an unmodified camera is also possible. The method was tested on 26 videos taken from 19 volunteers of varying gender and age. The obtained results were compared to reference data and the average mean absolute error was found to be at 1.95 beats per minute, which is noticeably better than the results from previous works. The remote photoplethysmography method proposed in the present article is more resistant to distortions than methods from previous publications and thus allows one to remotely and accurately measure the subject's heart rate without imposing any significant limitations on the subject's behavior.
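
The abstract does not say which transforms the method combines. As a generic baseline for comparison, the sketch below estimates heart rate from the dominant frequency of a per-frame intensity trace using a Fourier transform; the 0.7 to 4.0 Hz search band, the Hann window, and the function name are assumptions and are not taken from the paper.

```python
import numpy as np

def estimate_heart_rate(signal, fps, lo_hz=0.7, hi_hz=4.0):
    """Return beats per minute from the strongest spectral peak in the pulse band."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                      # remove the constant offset
    windowed = signal * np.hanning(len(signal))          # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs >= lo_hz) & (freqs <= hi_hz)           # roughly 42 to 240 beats per minute
    return 60.0 * freqs[band][np.argmax(spectrum[band])]

fps = 30.0
t = np.arange(600) / fps                                 # 20 seconds of video
pulse = np.sin(2 * np.pi * 1.2 * t)                      # synthetic 72 bpm pulse
drift = 5.0 * np.sin(2 * np.pi * 0.1 * t)                # slow illumination change
bpm = estimate_heart_rate(pulse + drift, fps)
```

Restricting the search to a plausible pulse band is what keeps the slow, much stronger illumination drift from being mistaken for the heartbeat in this toy example.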

[CV-37] Exploring the latent space of diffusion models directly through singular value decomposition

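[Quick Read]: This paper explores the latent space of diffusion models (DMs) to enable controllable, faithful image editing. Its key idea is to study the latent space directly via Singular Value Decomposition (SVD), which reveals three useful properties that can be used to control generation results without any data collection while preserving the identity fidelity of the generated images. Based on these findings, the authors propose a novel image editing framework that can learn arbitrary attributes from text prompts.

Link: https://arxiv.org/abs/2502.02225
Authors: Li Wang, Boyan Gao, Yanran Li, Zhao Wang, Xiaosong Yang, David A. Clifton, Jun Xiao
Affiliation: Zhejiang University; University of Oxford; University of Bedfordshire; Bournemouth University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: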

Abstract:Despite the groundbreaking success of diffusion models in generating high-fidelity images, their latent space remains relatively under-explored, even though it holds significant promise for enabling versatile and interpretable image editing capabilities. The complicated denoising trajectory and high dimensionality of the latent space make it extremely challenging to interpret. Existing methods mainly explore the feature space of U-Net in Diffusion Models (DMs) instead of the latent space itself. In contrast, we directly investigate the latent space via Singular Value Decomposition (SVD) and discover three useful properties that can be used to control generation results without the requirement of data collection while maintaining the identity fidelity of generated images. Based on these properties, we propose a novel image editing framework that is capable of learning arbitrary attributes from one pair of latent codes destined by text prompts in Stable Diffusion Models. To validate our approach, extensive experiments are conducted to demonstrate its effectiveness and flexibility in image editing. We will release our code soon to foster further research and applications in this area.
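
To make "investigating the latent space via SVD" concrete, the sketch below decomposes each channel of a latent tensor and rescales its leading singular values. The three properties the paper reports are not described in the abstract, so this particular edit is purely hypothetical; the 4x64x64 shape matches the latent size Stable Diffusion uses for 512x512 images, and a random tensor stands in for a real latent code.

```python
import numpy as np

def scale_singular_values(latent, gains):
    """Multiply the leading singular values of every channel of latent (C, H, W) by gains."""
    u, s, vt = np.linalg.svd(latent, full_matrices=False)   # batched over the channel axis
    s = s.copy()
    s[..., :len(gains)] *= gains
    return (u * s[..., None, :]) @ vt                        # recompose U diag(s) V^T

rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 64, 64))       # stand-in for a Stable Diffusion latent code
edited = scale_singular_values(latent, np.array([1.5]))      # amplify the top component
```

With all gains set to one the function returns the input unchanged, which is a quick check that the decomposition and recomposition are consistent.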

[CV-38] InterLCM: Low-Quality Images as Intermediate States of Latent Consistency Models for Effective Blind Face Restoration ICLR2025

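[Quick Read]: This paper addresses the poor semantic consistency and optimization difficulty that diffusion priors introduce in blind face restoration (BFR). The key to the solution is InterLCM, which leverages the latent consistency model (LCM) for its superior semantic consistency and efficiency. InterLCM treats low-quality images as an intermediate state of the LCM and starts from earlier LCM steps, balancing fidelity and quality. It also integrates a perceptual loss during training to improve restoration quality, particularly in real-world scenarios. To mitigate structural and semantic uncertainties, InterLCM adds a Visual Module that extracts visual features and a Spatial Encoder that captures spatial details, improving the fidelity of restored images.

Link: https://arxiv.org/abs/2502.02215
Authors: Senmao Li, Kai Wang, Joost van de Weijer, Fahad Shahbaz Khan, Chun-Le Guo, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICLR2025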

Abstract:Diffusion priors have been used for blind face restoration (BFR) by fine-tuning diffusion models (DMs) on restoration datasets to recover low-quality images. However, the naive application of DMs presents several key limitations. (i) The diffusion prior has inferior semantic consistency (e.g., ID, structure and color), increasing the difficulty of optimizing the BFR model; (ii) the reliance on hundreds of denoising iterations prevents effective cooperation with perceptual losses, which is crucial for faithful restoration. Observing that the latent consistency model (LCM) learns consistency noise-to-data mappings on the ODE-trajectory and therefore shows more semantic consistency in the subject identity, structural information and color preservation, we propose InterLCM to leverage the LCM for its superior semantic consistency and efficiency to counter the above issues. Treating low-quality images as the intermediate state of LCM, InterLCM achieves a balance between fidelity and quality by starting from earlier LCM steps. LCM also allows the integration of perceptual loss during training, leading to improved restoration quality, particularly in real-world scenarios. To mitigate structural and semantic uncertainties, InterLCM incorporates a Visual Module to extract visual features and a Spatial Encoder to capture spatial details, enhancing the fidelity of restored images. Extensive experiments demonstrate that InterLCM outperforms existing approaches in both synthetic and real-world datasets while also achieving faster inference speed.

[CV-39] Exploiting Ensemble Learning for Cross-View Isolated Sign Language Recognition WWW2025

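[Quick read]: This paper addresses Cross-View Isolated Sign Language Recognition (CV-ISLR). Traditional ISLR datasets mostly capture signing from a frontal view, whereas real-world camera angles vary, so a model must understand gestures from multiple angles, which makes cross-view recognition challenging. The key solution is an ensemble learning strategy built on a multi-dimensional Video Swin Transformer model, which improves robustness and generalization across views. The solution ranked 3rd in both the RGB-based and RGB-D-based ISLR tracks. The authors state that the code is available at the link given in the abstract.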

Link: https://arxiv.org/abs/2502.02196
Authors: Fei Wang, Kun Li, Yiqi Nie, Zhangling Duan, Peng Zou, Zhiliang Wu, Yuwei Wang, Yanyan Wei
Affiliations: Hefei University of Technology; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Anhui University; CCAI, Zhejiang University; Hangzhou; Anhui Agricultural University; Anhui Province Key Laboratory of Industry Safety and Emergency Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 3rd Place in Cross-View Isolated Sign Language Recognition Challenge at WWW 2025

Abstract:In this paper, we present our solution to the Cross-View Isolated Sign Language Recognition (CV-ISLR) challenge held at WWW 2025. CV-ISLR addresses a critical issue in traditional Isolated Sign Language Recognition (ISLR), where existing datasets predominantly capture sign language videos from a frontal perspective, while real-world camera angles often vary. To accurately recognize sign language from different viewpoints, models must be capable of understanding gestures from multiple angles, making cross-view recognition challenging. To address this, we explore the advantages of ensemble learning, which enhances model robustness and generalization across diverse views. Our approach, built on a multi-dimensional Video Swin Transformer model, leverages this ensemble strategy to achieve competitive performance. Finally, our solution ranked 3rd in both the RGB-based ISLR and RGB-D-based ISLR tracks, demonstrating the effectiveness in handling the challenges of cross-view recognition. The code is available at: this https URL.
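
The ensemble idea in this abstract can be illustrated with a small sketch. It is not the authors' pipeline: the function names, the probability-averaging rule, and the optional per-model weights are assumptions chosen for illustration, since the abstract does not say how the individual model outputs are combined.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_per_model, weights=None):
    """Average class probabilities across models (hypothetical fusion rule).

    logits_per_model: array of shape (n_models, n_samples, n_classes).
    weights: optional per-model weights of length n_models.
    Returns the predicted class per sample and the averaged probabilities.
    """
    probs = softmax(np.asarray(logits_per_model, dtype=float))
    avg = np.average(probs, axis=0, weights=weights)
    return avg.argmax(axis=-1), avg

# Two models disagree on one sample; the more confident model wins the average.
logits = np.array([[[2.0, 0.0, 0.0]], [[0.0, 3.0, 0.0]]])
pred, avg = ensemble_predict(logits)
```

Averaging probabilities rather than raw logits keeps each model's contribution on the same scale; weighting lets a model trained on a more relevant view count for more.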

[CV-40] ShapeShifter: 3D Variations Using Multiscale and Sparse Point-Voxel Diffusion

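[Quick read]: This paper addresses the lack of geometric detail, the long training times, and the large resource requirements of existing 3D generative methods. The key to the solution is combining sparse voxel grids with point, normal, and color sampling in a multiscale neural architecture that can be trained efficiently and in parallel, which better captures the fine details of the original input and handles more general surface types.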

Link: https://arxiv.org/abs/2502.02187
Authors: Nissim Maruani, Wang Yifan, Matthew Fisher, Pierre Alliez, Mathieu Desbrun
Affiliations: Inria, Université Côte d’Azur; Adobe Research; Inria, Université Côte d’Azur; Inria Saclay - Ecole Polytechnique
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper proposes ShapeShifter, a new 3D generative model that learns to synthesize shape variations based on a single reference model. While generative methods for 3D objects have recently attracted much attention, current techniques often lack geometric details and/or require long training times and large resources. Our approach remedies these issues by combining sparse voxel grids and point, normal, and color sampling within a multiscale neural architecture that can be trained efficiently and in parallel. We show that our resulting variations better capture the fine details of their original input and can handle more general types of surfaces than previous SDF-based methods. Moreover, we offer interactive generation of 3D shape variants, allowing more human control in the design loop if needed.

[CV-41] Sequence models for continuous cell cycle stage prediction from brightfield images

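[Quick read]: This paper addresses the prediction of continuous Fucci signals from non-fluorescent brightfield imaging in order to monitor cell cycle dynamics. The key to the solution is using sequence models, specifically causal state space models and bidirectional transformer models, which significantly outperform single-frame and fixed-frame approaches and can predict visually imperceptible cell cycle transitions such as G1/S within 1 h resolution.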

Link: https://arxiv.org/abs/2502.02182
Authors: Louis-Alexandre Leger, Maxine Leonardi, Andrea Salati, Felix Naef, Martin Weigert
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Understanding cell cycle dynamics is crucial for studying biological processes such as growth, development and disease progression. While fluorescent protein reporters like the Fucci system allow live monitoring of cell cycle phases, they require genetic engineering and occupy additional fluorescence channels, limiting broader applicability in complex experiments. In this study, we conduct a comprehensive evaluation of deep learning methods for predicting continuous Fucci signals using non-fluorescence brightfield imaging, a widely available label-free modality. To that end, we generated a large dataset of 1.3 M images of dividing RPE1 cells with full cell cycle trajectories to quantitatively compare the predictive performance of distinct model categories including single time-frame models, causal state space models and bidirectional transformer models. We show that both causal and transformer-based models significantly outperform single- and fixed frame approaches, enabling the prediction of visually imperceptible transitions like G1/S within 1h resolution. Our findings underscore the importance of sequence models for accurate predictions of cell cycle dynamics and highlight their potential for label-free imaging.

[CV-42] VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation

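[Quick read]: This paper addresses the high computational cost of Vision-Language-Action (VLA) models in robotic tasks, which require real-time decisions to respond quickly to environmental changes. The key solution is VLA-Cache, an efficient VLA model with a token-selection mechanism that compares the visual input at each step with the previous step's input, adaptively identifies the visual tokens that changed least, and reuses the computation for those unchanged tokens through the KV cache, significantly improving efficiency.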

Link: https://arxiv.org/abs/2502.02175
Authors: Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, Chang Xu
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Vision-Language-Action (VLA) models can process instructions and visual perception to directly generate actions as output in an end-to-end fashion, owing to their strong multi-modal reasoning capabilities. While the performance of VLA models is promising, their computational cost can be substantial. This raises a challenge for applying them to robotics tasks, which require real-time decision-making to respond quickly to environmental changes. Since robotic control involves sequential decision-making, the visual input often exhibits minimal variation between successive steps. A natural idea is to reuse the computational results of unchanged visual tokens from the last step. Motivated by this idea, we propose VLA-Cache, an efficient vision-language-action model. VLA-Cache incorporates a token-selection mechanism that compares the visual input at each step with the input from the previous step, adaptively identifying visual tokens with minimal changes. The computational results for these unchanged tokens are then reused in subsequent steps via KV-cache, thereby significantly improving the efficiency of the VLA-Cache model. Experimental results in both simulation (e.g., the LIBERO benchmark and SIMPLER) and on a real-world robot validate that VLA-Cache can achieve practical acceleration with minimal sacrifice in success rate.
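
The token-selection step described in this abstract can be sketched as below. This is a minimal illustration and not the VLA-Cache implementation: the function name `select_static_tokens`, the patch size, the use of cosine similarity on raw pixel patches, and the threshold are all assumptions, because the abstract does not specify how the change between steps is measured.

```python
import numpy as np

def select_static_tokens(prev_frame, curr_frame, patch=4, threshold=0.99):
    """Return a boolean mask over patches whose content barely changed.

    Frames are (H, W) or (H, W, C) arrays with H and W divisible by `patch`.
    Patches are ordered row by row, like tokens in a ViT-style patch grid.
    """
    def to_patches(frame):
        f = np.asarray(frame, dtype=float)
        if f.ndim == 2:
            f = f[..., None]
        h, w, c = f.shape
        f = f.reshape(h // patch, patch, w // patch, patch, c)
        return f.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

    a, b = to_patches(prev_frame), to_patches(curr_frame)
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    return num / den >= threshold  # True: candidate for reusing cached results

# Only the top-left patch changes between the two steps.
prev = np.ones((8, 8))
curr = prev.copy()
curr[:4, :4] = -1.0
mask = select_static_tokens(prev, curr)
```

In a real model, the entries marked `True` would be the tokens whose cached keys and values are reused, while the remaining tokens are recomputed.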

[CV-43] EditIQ: Automated Cinematic Editing of Static Wide-Angle Videos via Dialogue Interpretation and Saliency Cues

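[Quick read]: This paper addresses automated cinematic editing of scenes captured by a single stationary, high-resolution, wide-angle camera. The key solution is the EditIQ framework, which generates multiple virtual camera shots (rushes) to emulate a multi-camera crew and uses an automated editing algorithm to select and transition between them so as to present the most vivid scene content. The process combines a large language model (LLM)-based dialogue understanding module, which analyzes conversational flow, with visual saliency prediction, which identifies meaningful scene elements and the corresponding shots. The method formulates cinematic editing as an energy minimization problem over shot selection, so that the edited video stays cinematically coherent and smooth to watch while remaining aesthetically and visually compelling.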

Link: https://arxiv.org/abs/2502.02172
Authors: Rohit Girmaji, Bhav Beri, Ramanathan Subramanian, Vineet Gandhi
Affiliations: CVIT, IIIT Hyderabad, India; University of Canberra, Australia; IIIT Hyderabad, India
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: Accepted at 30th International Conference on Intelligent User Interfaces (IUI 25)

Abstract:We present EditIQ, a completely automated framework for cinematically editing scenes captured via a stationary, large field-of-view and high-resolution camera. From the static camera feed, EditIQ initially generates multiple virtual feeds, emulating a team of cameramen. These virtual camera shots termed rushes are subsequently assembled using an automated editing algorithm, whose objective is to present the viewer with the most vivid scene content. To understand key scene elements and guide the editing process, we employ a two-pronged approach: (1) a large language model (LLM)-based dialogue understanding module to analyze conversational flow, coupled with (2) visual saliency prediction to identify meaningful scene elements and camera shots therefrom. We then formulate cinematic video editing as an energy minimization problem over shot selection, where cinematic constraints determine shot choices, transitions, and continuity. EditIQ synthesizes an aesthetically and visually compelling representation of the original narrative while maintaining cinematic coherence and a smooth viewing experience. Efficacy of EditIQ against competing baselines is demonstrated via a psychophysical study involving twenty participants on the BBC Old School dataset plus eleven theatre performance videos. Video samples from EditIQ can be found at this https URL.
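
Energy minimization over shot selection, as mentioned in this abstract, can be sketched with dynamic programming. The energy below is a simplified assumption: a per-frame cost for each shot plus a constant penalty for every cut. The actual EditIQ energy terms and cinematic constraints are not given in the abstract, and the name `select_shots` is hypothetical.

```python
import numpy as np

def select_shots(unary_cost, switch_penalty=1.0):
    """Pick one shot per time step by minimising a simple editing energy.

    unary_cost: (T, K) array, cost of showing shot k at time t (lower is better).
    switch_penalty: cost added whenever consecutive time steps use different shots.
    Returns the list of chosen shot indices, one per time step.
    """
    unary = np.asarray(unary_cost, dtype=float)
    T, K = unary.shape
    pairwise = switch_penalty * (1.0 - np.eye(K))
    cost = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + pairwise      # total[i, j]: reach shot j from shot i
        back[t] = total.argmin(axis=0)
        cost = total.min(axis=0) + unary[t]
    path = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):             # backtrack the optimal sequence
        path.append(int(back[t][path[-1]]))
    return path[::-1]

costs = [[0.0, 1.0], [0.6, 0.4], [0.0, 1.0]]
smooth_edit = select_shots(costs, switch_penalty=1.0)
jumpy_edit = select_shots(costs, switch_penalty=0.0)
```

With a cut penalty, the marginally better shot at the middle step is not worth two cuts, so the edit stays on shot 0; without the penalty, the sequence simply follows the per-frame minimum.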

[CV-44] DeepForest: Sensing Into Self-Occluding Volumes of Vegetation With Aerial Imaging

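[Quick read]: This paper addresses the difficulty of using remote sensing to penetrate dense vegetation layers and obtain below-canopy 3D vegetation data. The key to the solution is scanning focal stacks by synthetic-aperture imaging with drones and using pre-trained 3D convolutional neural networks to reduce out-of-focus signal contributions, enabling sensing deep into large, strongly occluded vegetation volumes. The resulting reflectance stacks provide a low-frequency representation of the vegetation volume, and combining stacks from multiple spectral channels gives insight into plant health, growth, and environmental conditions throughout the volume.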

Link: https://arxiv.org/abs/2502.02171
Authors: Mohamed Youssef, Jian Peng, Oliver Bimber
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Access to below-canopy volumetric vegetation data is crucial for understanding ecosystem dynamics. We address the long-standing limitation of remote sensing to penetrate deep into dense canopy layers. LiDAR and radar are currently considered the primary options for measuring 3D vegetation structures, while cameras can only extract the reflectance and depth of top layers. Using conventional, high-resolution aerial images, our approach allows sensing deep into self-occluding vegetation volumes, such as forests. It is similar in spirit to the imaging process of wide-field microscopy, but can handle much larger scales and strong occlusion. We scan focal stacks by synthetic-aperture imaging with drones and reduce out-of-focus signal contributions using pre-trained 3D convolutional neural networks. The resulting volumetric reflectance stacks contain low-frequency representations of the vegetation volume. Combining multiple reflectance stacks from various spectral channels provides insights into plant health, growth, and environmental conditions throughout the entire vegetation volume.

[CV-45] Progressive Correspondence Regenerator for Robust 3D Registration

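[Quick read]: This paper addresses the challenge of obtaining enough high-quality correspondences for 3D registration under high outlier ratios. Existing methods mostly rely on outlier removal, which under extreme outlier ratios either fails to identify the accurate correspondences or selects too few correct ones to support robust registration. The paper proposes Regor, a progressive correspondence regenerator. In each iteration it generates local-region correspondences through prior-guided local grouping and generalized mutual matching, corrects (rather than removes) local correspondences with a center-aware three-point consistency, and then applies global correspondence refinement, progressively producing a large number of high-quality correspondences. The method significantly outperforms existing outlier removal techniques and obtains 10 times more correct correspondences, enabling robust registration even with weak features.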

Link: https://arxiv.org/abs/2502.02163
Authors: Guiyu Zhao, Sheng Ao, Ye Zhang, Kai Xu, Yulan Guo
Affiliations: Beijing Institute of Technology; Sun Yat-sen University; National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Obtaining enough high-quality correspondences is crucial for robust registration. Existing correspondence refinement methods mostly follow the paradigm of outlier removal, which either fails to correctly identify the accurate correspondences under extreme outlier ratios, or selects too few correct correspondences to support robust registration. To address this challenge, we propose a novel approach named Regor, which is a progressive correspondence regenerator that generates higher-quality matches while remaining sufficiently robust to numerous outliers. In each iteration, we first apply prior-guided local grouping and generalized mutual matching to generate the local region correspondences. A powerful center-aware three-point consistency is then presented to achieve local correspondence correction, instead of removal. Further, we employ global correspondence refinement to obtain accurate correspondences from a global perspective. Through progressive iterations, this process yields a large number of high-quality correspondences. Extensive experiments on both indoor and outdoor datasets demonstrate that the proposed Regor significantly outperforms existing outlier removal techniques. More critically, our approach obtains 10 times more correct correspondences than outlier removal methods. As a result, our method is able to achieve robust registration even with weak features. The code will be released.
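
For readers unfamiliar with mutual matching, the sketch below shows the plain mutual-nearest-neighbour rule on feature descriptors. It is only a baseline illustration: the abstract's "generalized mutual matching" is not defined there, so this code does not claim to reproduce it, and the function name is hypothetical.

```python
import numpy as np

def mutual_nearest_neighbors(feat_src, feat_tgt):
    """Return (i, j) index pairs where source i and target j pick each other.

    feat_src: (N, D) descriptors of the source point cloud.
    feat_tgt: (M, D) descriptors of the target point cloud.
    """
    src = np.asarray(feat_src, dtype=float)
    tgt = np.asarray(feat_tgt, dtype=float)
    # Pairwise Euclidean distances between all source and target descriptors.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    nn_src_to_tgt = dist.argmin(axis=1)   # best target for each source point
    nn_tgt_to_src = dist.argmin(axis=0)   # best source for each target point
    idx = np.arange(len(src))
    keep = nn_tgt_to_src[nn_src_to_tgt] == idx
    return np.stack([idx[keep], nn_src_to_tgt[keep]], axis=1)

src = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
tgt = [[0.1, 0.0], [5.0, 5.1]]
pairs = mutual_nearest_neighbors(src, tgt)
```

Source point 1 also prefers target 0, but target 0 prefers source point 0, so that one-sided match is dropped.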

[CV-46] On the Guidance of Flow Matching

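[Quick read]: This paper addresses guidance for flow matching, which is more general than, and therefore substantially different from, guidance for its predecessor, diffusion models. The key contribution is the first framework of general guidance for flow matching, from which the authors derive a family of guidance techniques applicable to general flow matching: a new training-free, asymptotically exact guidance; new training losses for training-based guidance; and two classes of approximate guidance that cover classical gradient guidance methods as special cases. A theoretical analysis gives practical guidelines for choosing a method in different scenarios, and experiments validate the effectiveness of the proposed guidance methods and the correctness of the framework.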

Link: https://arxiv.org/abs/2502.02150
Authors: Ruiqi Feng, Tailin Wu, Chenglei Yu, Wenhao Deng, Peiyan Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 35 pages, 7 figures

Abstract:Flow matching has shown state-of-the-art performance in various generative tasks, ranging from image generation to decision-making, where guided generation is pivotal. However, the guidance of flow matching is more general than and thus substantially different from that of its predecessor, diffusion models. Therefore, the challenge in guidance for general flow matching remains largely underexplored. In this paper, we propose the first framework of general guidance for flow matching. From this framework, we derive a family of guidance techniques that can be applied to general flow matching. These include a new training-free asymptotically exact guidance, novel training losses for training-based guidance, and two classes of approximate guidance that cover classical gradient guidance methods as special cases. We theoretically investigate these different methods to give a practical guideline for choosing suitable methods in different scenarios. Experiments on synthetic datasets, image inverse problems, and offline reinforcement learning demonstrate the effectiveness of our proposed guidance methods and verify the correctness of our flow matching guidance framework. Code to reproduce the experiments can be found at this https URL.

[CV-47] DOC-Depth: A novel approach for dense depth ground truth generation

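[Quick read]: This paper addresses fully dense, accurate depth estimation in large-scale dynamic environments. The key is DOC-Depth, a new approach that generates dense depth maps from LiDAR sensors. It first reconstructs a consistent dense 3D environment using LiDAR odometry and then handles dynamic-object occlusions automatically with DOC, the authors' dynamic object classification method. The approach is efficient, easy to deploy, fast, and scalable, allowing the creation of datasets unbounded in size and time.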

Link: https://arxiv.org/abs/2502.02144
Authors: Simon de Moreau, Mathias Corsia, Hassan Bouchiba, Yasser Almehio, Andrei Bursuc, Hafid El-Idrissi, Fabien Moutarde
Affiliations: Mines Paris – PSL; Valeo; Exwayz; Valeo AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Preprint. Code and dataset available on the project page : this https URL

Abstract:Accurate depth information is essential for many computer vision applications. Yet, no available dataset recording method allows for fully dense accurate depth estimation in a large-scale dynamic environment. In this paper, we introduce DOC-Depth, a novel, efficient and easy-to-deploy approach for dense depth generation from any LiDAR sensor. After reconstructing a consistent dense 3D environment using LiDAR odometry, we address dynamic object occlusions automatically thanks to DOC, our state-of-the-art dynamic object classification method. Additionally, DOC-Depth is fast and scalable, allowing for the creation of unbounded datasets in terms of size and time. We demonstrate the effectiveness of our approach on the KITTI dataset, improving its density from 16.1% to 71.2% and release this new fully dense depth annotation, to facilitate future research in the domain. We also showcase results using various LiDAR sensors and in multiple environments. All software components are publicly available for the research community.
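
The density figures quoted in this abstract (16.1% versus 71.2%) can be read as the share of image pixels that carry a depth value. The sketch below computes that share under explicit assumptions: a pixel counts as annotated when its value is finite and differs from an `invalid_value` marker (0 here). The exact definition used by the authors is not stated in the abstract.

```python
import numpy as np

def depth_density(depth_map, invalid_value=0.0):
    """Fraction of pixels with a valid depth value (assumed definition).

    depth_map: 2D array of depths; pixels equal to `invalid_value`, NaN,
    or infinite are treated as missing.
    """
    depth = np.asarray(depth_map, dtype=float)
    valid = np.isfinite(depth) & (depth != invalid_value)
    return float(valid.mean())

# Toy 2x2 depth map with two annotated pixels out of four.
sparse = np.array([[0.0, 1.5], [np.nan, 3.0]])
density = depth_density(sparse)
```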

[CV-48] BRIDLE: Generalized Self-supervised Learning with Quantization

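[Quick read]: This paper addresses limitations of self-supervised learning frameworks that rely on vector quantization (VQ) with a single codebook, as used for audio signals: a single codebook may not capture complex, multifaceted signals, and inefficient codebook utilization leaves code vectors underused. The paper proposes BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder). Its key innovation is introducing residual quantization (RQ), which uses multiple hierarchical codebooks to discretize the latent space at fine granularity and improve representation quality. BRIDLE also uses an interleaved training procedure between the encoder and the tokenizer. Experiments show state-of-the-art results on audio understanding tasks and competitive performance on image and video classification, with consistent improvements over traditional VQ methods.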

Link: https://arxiv.org/abs/2502.02118
Authors: Hoang M. Nguyen, Satya N. Shukla, Qiang Zhang, Hanchao Yu, Sreya D. Roy, Taipeng Tian, Lingjiong Zhu, Yuchen Liu
Affiliations: Florida State University; Meta
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Self-supervised learning has been a powerful approach for learning meaningful representations from unlabeled data across various domains, reducing the reliance on large labeled datasets. Inspired by BERT’s success in capturing deep bidirectional contexts in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks face challenges, notably their dependence on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. In addition, inefficiencies in codebook utilization lead to underutilized code vectors. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process, and is generalized for pretraining with audio, image, and video. Using multiple hierarchical codebooks, RQ enables fine-grained discretization in the latent space, enhancing representation quality. BRIDLE involves an interleaved training procedure between the encoder and tokenizer. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image classification and video classification tasks, showing consistent improvements over traditional VQ methods in downstream performance.
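
Residual quantization, the core idea named in this abstract, can be sketched in a few lines. The code shows the generic technique only: each codebook quantizes whatever the previous stages left over. The codebooks here are fixed toy arrays, and nothing about BRIDLE's training, codebook sizes, or tokenizer is implied.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Quantise a vector with a sequence of codebooks (generic RQ sketch).

    x: vector of shape (dim,).
    codebooks: list of arrays, each of shape (n_codes, dim).
    Returns the chosen code index per stage and the reconstruction.
    """
    residual = np.asarray(x, dtype=float).copy()
    recon = np.zeros_like(residual)
    codes = []
    for cb in codebooks:
        cb = np.asarray(cb, dtype=float)
        # Nearest code vector to what is still unexplained.
        idx = int(np.linalg.norm(cb - residual, axis=1).argmin())
        codes.append(idx)
        recon += cb[idx]
        residual -= cb[idx]
    return codes, recon

coarse = [[1.0, -1.0], [-1.0, 1.0]]
fine = [[0.25, 0.0], [0.0, 0.25], [0.2, 0.1]]
codes, recon = residual_quantize([1.2, -0.9], [coarse, fine])
```

A single coarse codebook would leave an error of (0.2, 0.1) in this example; the second stage removes it, which is the sense in which stacked codebooks give a finer discretization than one codebook alone.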

[CV-49] VerteNet – A Multi-Context Hybrid CNN Transformer for Accurate Vertebral Landmark Localization in Lateral Spine DXA Images

[Quick read]: This paper addresses accurate Vertebral Landmark Localization (VLL) on Dual Energy X-ray Absorptiometry (DXA) Lateral Spine Images (LSIs), which is used to detect spinal conditions such as kyphosis and lordosis and to assess Abdominal Aortic Calcification (AAC). The key solution is VerteNet, a hybrid CNN-Transformer model with a novel dual-resolution attention mechanism, Dual Resolution Self-Attention (DRSA) and Dual Resolution Cross-Attention (DRCA), that captures the diverse frequencies in DXA images. A Multi-Context Feature Fusion Block (MCFB) efficiently integrates these features. VerteNet is trained on 620 DXA LSIs from various machines and achieves better results than existing methods.

Link: https://arxiv.org/abs/2502.02097
Authors: Zaid Ilyas, Arooba Maqsood, Afsah Saleem, Erchuan Zhang, David Suter, Parminder Raina, Jonathan M. Hodgson, John T. Schousboe, William D. Leslie, Joshua R. Lewis, Syed Zulqarnain Gilani
Affiliations: Center for AI & ML, School of Science, and Nutrition and Health Innovation Research Institute, Edith Cowan University, Australia; Computer Science and Software Engineering, The University of Western Australia; School of Science, Sun Yat-sen University, China; School of Medical and Health Sciences, and Nutrition and Health Innovation Research Institute, Edith Cowan University, Australia; Park Nicollet Clinic and HealthPartners Institute, Minneapolis, USA; Department of Medicine and Radiology, University of Manitoba, Canada; Department of Health Sciences, McMaster University, Canada
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages with 7 figures

Abstract:Lateral Spine Image (LSI) analysis is important for medical diagnosis, treatment planning, and detailed spinal health assessments. Although modalities like Computed Tomography and Digital X-ray Imaging are commonly used, Dual Energy X-ray Absorptiometry (DXA) is often preferred due to lower radiation exposure, seamless capture, and cost-effectiveness. Accurate Vertebral Landmark Localization (VLL) on LSIs is important to detect spinal conditions like kyphosis and lordosis, as well as assessing Abdominal Aortic Calcification (AAC) using Inter-Vertebral Guides (IVGs). Nonetheless, few automated VLL methodologies have concentrated on DXA LSIs. We present VerteNet, a hybrid CNN-Transformer model featuring a novel dual-resolution attention mechanism in self and cross-attention domains, referred to as Dual Resolution Self-Attention (DRSA) and Dual Resolution Cross-Attention (DRCA). These mechanisms capture the diverse frequencies in DXA images by operating at two different feature map resolutions. Additionally, we design a Multi-Context Feature Fusion Block (MCFB) that efficiently integrates the features using DRSA and DRCA. We train VerteNet on 620 DXA LSIs from various machines and achieve superior results compared to existing methods. We also design an algorithm that utilizes VerteNet’s predictions in estimating the Region of Interest (ROI) to detect potential abdominal aorta cropping, where inadequate soft tissue hinders calcification assessment. Additionally, we present a small proof-of-concept study to show that IVGs generated from VLL information can improve inter-reader correlation in AAC scoring, addressing two key areas of disagreement in expert AAC-24 scoring: IVG placement and quality control for full abdominal aorta assessment. The code for this work can be found at this https URL.

[CV-50] Dual-Flow: Transferable Multi-Target Instance-Agnostic Attacks via In-the-wild Cascading Flow Optimization

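[Quick read]: This paper addresses the low transfer success rate of multi-target, instance-agnostic adversarial attacks in black-box scenarios. It proposes a novel Dual-Flow framework that uses Cascading Distribution Shift Training to develop an adversarial velocity function. Dual-Flow significantly improves the transferability of multi-target generative attacks; for example, it raises the success rate from Inception-v3 to ResNet-152 by 34.58%.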

Link: https://arxiv.org/abs/2502.02096
Authors: Yixiao Chen, Shikun Sun, Jianshu Li, Ruoyu Li, Zhe Li, Junliang Xing
Affiliations: Tsinghua University; Ant Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Adversarial attacks are widely used to evaluate model robustness, and in black-box scenarios, the transferability of these attacks becomes crucial. Existing generator-based attacks have excellent generalization and transferability due to their instance-agnostic nature. However, when training generators for multi-target tasks, the success rate of transfer attacks is relatively low due to the limitations of the model’s capacity. To address these challenges, we propose a novel Dual-Flow framework for multi-target instance-agnostic adversarial attacks, utilizing Cascading Distribution Shift Training to develop an adversarial velocity function. Extensive experiments demonstrate that Dual-Flow significantly improves transferability over previous multi-target generative attacks. For example, it increases the success rate from Inception-v3 to ResNet-152 by 34.58%. Furthermore, our attack method shows substantially stronger robustness against defense mechanisms, such as adversarially trained models.

[CV-51] Efficient Dynamic Scene Editing via 4D Gaussian-based Static-Dynamic Separation

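[Quick read]: This paper addresses the poor scalability of existing 4D dynamic scene editing methods along the temporal dimension: they must edit thousands of 2D images and update the entire scene with additional training loops, so editing a single dynamic scene takes several hours. The proposed method uses a 4D Gaussian representation that models a dynamic scene by combining static 3D Gaussians with a Hexplane-based deformation field, and edits only the static 3D Gaussians, the minimal but sufficient component for visual editing. To correct misalignment between the edited 3D Gaussians and the deformation field, it adds a refinement stage based on a score distillation mechanism. Experiments show that the method reduces editing time by more than half while achieving high editing quality that better follows user instructions.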

Link: https://arxiv.org/abs/2502.02091
Authors: JooHyun Kwon, Hanbyel Cho, Junmo Kim
Affiliations: DGIST; KAIST; KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent 4D dynamic scene editing methods require editing thousands of 2D images used for dynamic scene synthesis and updating the entire scene with additional training loops, resulting in several hours of processing to edit a single dynamic scene. Therefore, these methods are not scalable with respect to the temporal dimension of the dynamic scene (i.e., the number of timesteps). In this work, we propose an efficient dynamic scene editing method that is more scalable in terms of temporal dimension. To achieve computational efficiency, we leverage a 4D Gaussian representation that models a 4D dynamic scene by combining static 3D Gaussians with a Hexplane-based deformation field, which handles dynamic information. We then perform editing solely on the static 3D Gaussians, which is the minimal but sufficient component required for visual editing. To resolve the misalignment between the edited 3D Gaussians and the deformation field potentially resulting from the editing process, we additionally conducted a refinement stage using a score distillation mechanism. Extensive editing results demonstrate that our method is efficient, reducing editing time by more than half compared to existing methods, while achieving high editing quality that better follows user instructions.

[CV-52] IPO: Iterative Preference Optimization for Text-to-Video Generation

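[Quick read]: This paper addresses the fact that the generation quality of video foundation models does not yet meet application requirements. The key solution is an Iterative Preference Optimization (IPO) strategy that improves generated video quality by incorporating human feedback. IPO uses a critic model for pairwise ranking or point-wise scoring, and optimizes the video foundation model with these preference signals to improve subject consistency, motion smoothness, aesthetic quality, and so on. IPO builds the critic model on a multimodal large language model so that it can assign preference labels automatically without retraining or relabeling, enabling efficient multi-round iterative optimization.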

Link: https://arxiv.org/abs/2502.02088
Authors: Xiaomeng Yang, Zhiyu Tan, Xuecheng Nie, Hao Li
Affiliations: Fudan University; Shanghai Academy of Artificial Intelligence for Science; MT Lab, Meitu Inc., Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Video foundation models have achieved significant advancement with the help of network upgrades as well as model scale-up. However, they still struggle to meet the requirements of applications because of unsatisfactory generation quality. To solve this problem, we propose to align video foundation models with human preferences from the perspective of post-training in this paper. Consequently, we introduce an Iterative Preference Optimization strategy to enhance generated video quality by incorporating human feedback. Specifically, IPO exploits a critic model to judge video generations for pairwise ranking as in Direct Preference Optimization or point-wise scoring as in Kahneman-Tversky Optimization. Given this, IPO optimizes video foundation models with guidance of signals from preference feedback, which helps improve generated video quality in subject consistency, motion smoothness and aesthetic quality, etc. In addition, IPO incorporates the critic model with the multi-modality large language model, which enables it to automatically assign preference labels without need of retraining or relabeling. In this way, IPO can efficiently perform multi-round preference optimization in an iterative manner, without the need for tedious manual labeling. Comprehensive experiments demonstrate that the proposed IPO can effectively improve the video generation quality of a pretrained model and help a model with only 2B parameters surpass the one with 5B parameters. Besides, IPO achieves new state-of-the-art performance on VBench benchmark. We will release our source codes, models as well as dataset to advance future research and applications.
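
The pairwise-ranking objective this abstract refers to, Direct Preference Optimization, has a compact standard form that the sketch below evaluates. The log-probabilities are taken as given inputs; how IPO obtains such quantities for a video generator, and the value of `beta`, are not stated in the abstract and are assumptions here.

```python
import numpy as np

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_*: log-probabilities of the preferred ("win") and rejected ("lose")
    samples under the model being trained; ref_logp_* under the frozen
    reference model. beta scales how strongly preferences are enforced.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), computed stably as log(1 + exp(-margin)).
    return float(np.logaddexp(0.0, -margin))

# The loss falls when the model favours the preferred sample more than the reference does.
aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)
misaligned = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

When the trained model and the reference agree exactly, the margin is zero and the loss equals log 2, which is a convenient sanity check.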

[CV-53] Improving Power Plant CO2 Emission Estimation with Deep Learning and Satellite/Simulated Data

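[Quick read]: This paper addresses accurate quantification of carbon dioxide (CO2) emissions from power plants, particularly in data-scarce regions. The key solution expands the available dataset by integrating NO2 data from Sentinel-5P, generating continuous XCO2 maps, and incorporating real OCO-2/3 satellite observations for over 71 power plants in data-scarce regions. A customized U-Net model that handles diverse spatio-temporal resolutions is used for emission rate estimation. These methods significantly improve emission rate accuracy, enabling near real-time, precise quantification of major CO2 sources to support environmental protection initiatives and inform regulatory frameworks.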

Link: https://arxiv.org/abs/2502.02083
Authors: Dibyabha Deb, Kamal Das
Affiliations: Manipal Institute of Technology, India; IBM Research, India
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:CO2 emissions from power plants, as significant super emitters, contribute substantially to global warming. Accurate quantification of these emissions is crucial for effective climate mitigation strategies. While satellite-based plume inversion offers a promising approach, challenges arise from data limitations and the complexity of atmospheric conditions. This study addresses these challenges by (a) expanding the available dataset through the integration of NO2 data from Sentinel-5P, generating continuous XCO2 maps, and incorporating real satellite observations from OCO-2/3 for over 71 power plants in data-scarce regions; and (b) employing a customized U-Net model capable of handling diverse spatio-temporal resolutions for emission rate estimation. Our results demonstrate significant improvements in emission rate accuracy compared to previous methods. By leveraging this enhanced approach, we can enable near real-time, precise quantification of major CO2 emission sources, supporting environmental protection initiatives and informing regulatory frameworks.

[CV-54] Position Paper: Building Trust in Synthetic Data for Clinical AI

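[Quick read]: This paper addresses the trust problem that limits the clinical adoption of synthetic medical data. Its key point is empirical evidence, from brain tumor segmentation, that the quality, diversity, and proportion of synthetic data directly affect trust in clinical AI models, which gives direction for improving the deployment and acceptance of synthetic data-driven AI systems.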

Link: https://arxiv.org/abs/2502.02076
Authors: Krishan Agyakari Raja Babu, Supriti Mulay, Om Prabhu, Mohanasankar Sivaprakasam
Affiliations: Indian Institute of Technology Madras; All India Institute of Medical Sciences
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 8 figures (including sub-figures)

Abstract:Deep generative models and synthetic medical data have shown significant promise in addressing key challenges in healthcare, such as privacy concerns, data bias, and the scarcity of realistic datasets. While research in this area has grown rapidly and demonstrated substantial theoretical potential, its practical adoption in clinical settings remains limited. Despite the benefits synthetic data offers, questions surrounding its reliability and credibility persist, leading to a lack of trust among clinicians. This position paper argues that fostering trust in synthetic medical data is crucial for its clinical adoption. It aims to spark a discussion on the viability of synthetic medical data in clinical practice, particularly in the context of current advancements in AI. We present empirical evidence from brain tumor segmentation to demonstrate that the quality, diversity, and proportion of synthetic data directly impact trust in clinical AI models. Our findings provide insights to improve the deployment and acceptance of synthetic data-driven AI systems in real-world clinical workflows.

[CV-55] LoRA-TTT: Low-Rank Test-Time Training for Vision-Language Models

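[Quick read]: This paper addresses distribution shift between the training and test sets for vision-language models (VLMs). The key solution is LoRA-TTT, a Test-Time Training (TTT) method that applies Low-Rank Adaptation (LoRA) only to the image encoder and updates only the LoRA parameters at test time. The method retains the model's initial generalization capability and, together with a highly efficient reconstruction loss, achieves substantial performance gains without increasing memory consumption or runtime.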

Link: https://arxiv.org/abs/2502.02069
Authors: Yuto Kojima, Jiarui Xu, Xueyan Zou, Xiaolong Wang
Affiliations: UC San Diego
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid advancements in vision-language models (VLMs), such as CLIP, have intensified the need to address distribution shifts between training and testing datasets. Although prior Test-Time Training (TTT) techniques for VLMs have demonstrated robust performance, they predominantly rely on tuning text prompts, a process that demands substantial computational resources and is heavily dependent on entropy-based loss. In this paper, we propose LoRA-TTT, a novel TTT method that leverages Low-Rank Adaptation (LoRA), applied exclusively to the image encoder of VLMs. By introducing LoRA and updating only its parameters during test time, our method offers a simple yet effective TTT approach, retaining the model’s initial generalization capability while achieving substantial performance gains with minimal memory and runtime overhead. Additionally, we introduce a highly efficient reconstruction loss tailored for TTT. Our method can adapt to diverse domains by combining these two losses, without increasing memory consumption or runtime. Extensive experiments on two benchmarks, covering 15 datasets, demonstrate that our method improves the zero-shot top-1 accuracy of CLIP-ViT-B/16 by an average of 5.79% on the OOD benchmark and 1.36% on the fine-grained benchmark, efficiently surpassing test-time prompt tuning, without relying on any external models or cache.
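
The low-rank adaptation this abstract builds on can be sketched for a single linear layer. This shows the generic LoRA parameterization only (frozen weight plus a scaled product of two small matrices); the class name, rank, scaling, and initialization are illustrative assumptions and do not describe which layers LoRA-TTT adapts or how it trains them.

```python
import numpy as np

class LoRALinear:
    """Linear layer with a frozen weight and a trainable low-rank update."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = np.asarray(weight, dtype=float)     # frozen, shape (out, in)
        out_dim, in_dim = self.weight.shape
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))  # trainable
        self.B = np.zeros((out_dim, rank))                    # trainable, starts at zero
        self.scale = alpha / rank

    def __call__(self, x):
        # Equivalent to x @ (W + scale * B @ A).T without forming the full update.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(np.eye(3), rank=1)
x = np.ones((2, 3))
before = layer(x)          # identical to the frozen layer because B is zero
layer.B[:] = 1.0           # stand-in for a test-time update of the LoRA parameters
after = layer(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained one exactly, which is one reason low-rank adapters are a natural fit when the goal is to keep the model's original behaviour while adapting at test time.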

[CV-56] CASIM: Composite Aware Semantic Injection for Text to Motion Generation

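[Quick Read]: This paper addresses the challenge of using textual information effectively for conditional text-to-motion generation. Current methods rely mainly on fixed-length text embeddings (e.g., CLIP) for global semantic injection and struggle to capture the composite nature of human motion, which leads to suboptimal motion quality and controllability. The key solution is the Composite Aware Semantic Injection Mechanism (CASIM), which comprises a composite-aware semantic encoder and a text-motion aligner that learns the dynamic correspondence between text and motion tokens. CASIM is model- and representation-agnostic and integrates readily with both autoregressive and diffusion-based methods. Experiments show that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores, and qualitative analyses show its advantage over fixed-length semantic injection.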

Link: https://arxiv.org/abs/2502.02063
Authors: Che-Jui Chang,Qingze Tony Liu,Honglu Zhou,Vladimir Pavlovic,Mubbasir Kapadia
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Abstract:Recent advances in generative modeling and tokenization have driven significant progress in text-to-motion generation, leading to enhanced quality and realism in generated motions. However, effectively leveraging textual information for conditional motion generation remains an open challenge. We observe that current approaches, primarily relying on fixed-length text embeddings (e.g., CLIP) for global semantic injection, struggle to capture the composite nature of human motion, resulting in suboptimal motion quality and controllability. To address this limitation, we propose the Composite Aware Semantic Injection Mechanism (CASIM), comprising a composite-aware semantic encoder and a text-motion aligner that learns the dynamic correspondence between text and motion tokens. Notably, CASIM is model and representation-agnostic, readily integrating with both autoregressive and diffusion-based methods. Experiments on HumanML3D and KIT benchmarks demonstrate that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores across state-of-the-art methods. Qualitative analyses further highlight the superiority of our composite-aware approach over fixed-length semantic injection, enabling precise motion control from text prompts and stronger generalization to unseen text inputs.

[CV-57] RAPID: Robust and Agile Planner Using Inverse Reinforcement Learning for Vision-Based Drone Navigation

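[Quick Read]: This paper addresses high-speed vision-based drone navigation in cluttered environments, and in particular the inherent limitations of behavior cloning (BC) and reinforcement learning (RL). The key solution is a framework based on inverse reinforcement learning (IRL). IRL reduces the number of interactions with the simulation environment and improves the handling of high-dimensional spaces while preserving the robustness of RL policies. By combining an expert dataset with a learner dataset gathered from the agent's interactions with the simulator, the framework learns a robust reward function and policy across diverse states, so the trained planner can be applied directly to real-world scenarios without additional training or tuning.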

Link: https://arxiv.org/abs/2502.02054
Authors: Minwoo Kim,Geunsik Bae,Jinwoo Lee,Woojae Shin,Changseung Kim,Myong-Yol Choi,Heejung Shin,Hyondong Oh
Affiliation: Ulsan National Institute of Science and Technology
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 11 figures, 58 references, and appendix is included

Abstract:This paper introduces a learning-based visual planner for agile drone flight in cluttered environments. The proposed planner generates collision-free waypoints in milliseconds, enabling drones to perform agile maneuvers in complex environments without building separate perception, mapping, and planning modules. Learning-based methods, such as behavior cloning (BC) and reinforcement learning (RL), demonstrate promising performance in visual navigation but still face inherent limitations. BC is susceptible to compounding errors due to limited expert imitation, while RL struggles with reward function design and sample inefficiency. To address these limitations, this paper proposes an inverse reinforcement learning (IRL)-based framework for high-speed visual navigation. By leveraging IRL, it is possible to reduce the number of interactions with simulation environments and improve capability to deal with high-dimensional spaces while preserving the robustness of RL policies. A motion primitive-based path planning algorithm collects an expert dataset with privileged map data from diverse environments, ensuring comprehensive scenario coverage. By leveraging both the acquired expert and learner dataset gathered from the agent’s interactions with the simulation environments, a robust reward function and policy are learned across diverse states. While the proposed method is trained in a simulation environment only, it can be directly applied to real-world scenarios without additional training or tuning. The performance of the proposed method is validated in both simulation and real-world environments, including forests and various structures. The trained policy achieves an average speed of 7 m/s and a maximum speed of 8.8 m/s in real flight experiments. To the best of our knowledge, this is the first work to successfully apply an IRL framework for high-speed visual navigation of drones.

[CV-58] MORPH-LER: Log-Euclidean Regularization for Population-Aware Image Registration

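[Quick Read]: This paper addresses the failure of spatial transformations in medical image analysis to incorporate population-level morphological statistics. Conventional smoothness regularizers for image registration cannot integrate population statistics and therefore produce anatomically inconsistent transformations, while inverse consistency regularizers promote geometric consistency but lack integration of population morphometrics. The key idea of MORPH-LER is a Log-Euclidean regularization framework that learns population morphometrics from spatial transformations to guide and regularize registration networks, ensuring anatomically plausible deformations. The framework includes a bottleneck autoencoder that computes the principal logarithm of deformation fields via iterative square-root predictions, creating a linearized latent space that respects diffeomorphic properties and enforces inverse consistency. MORPH-LER thus produces smooth, meaningful deformation fields and offers two main contributions: (1) a data-driven regularization strategy that incorporates population-level anatomical statistics to enhance transformation validity, and (2) a linearized latent space that yields compact, interpretable deformation fields for efficient population morphometrics analysis.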

Link: https://arxiv.org/abs/2502.02029
Authors: Mokshagna Sai Teja Karanam,Krithika Iyer,Sarang Joshi,Shireen Elhabian
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Spatial transformations that capture population-level morphological statistics are critical for medical image analysis. Commonly used smoothness regularizers for image registration fail to integrate population statistics, leading to anatomically inconsistent transformations. Inverse consistency regularizers promote geometric consistency but lack population morphometrics integration. Regularizers that constrain deformation to low-dimensional manifold methods address this. However, they prioritize reconstruction over interpretability and neglect diffeomorphic properties, such as group composition and inverse consistency. We introduce MORPH-LER, a Log-Euclidean regularization framework for population-aware unsupervised image registration. MORPH-LER learns population morphometrics from spatial transformations to guide and regularize registration networks, ensuring anatomically plausible deformations. It features a bottleneck autoencoder that computes the principal logarithm of deformation fields via iterative square-root predictions. It creates a linearized latent space that respects diffeomorphic properties and enforces inverse consistency. By integrating a registration network with a diffeomorphic autoencoder, MORPH-LER produces smooth, meaningful deformation fields. The framework offers two main contributions: (1) a data-driven regularization strategy that incorporates population-level anatomical statistics to enhance transformation validity and (2) a linearized latent space that enables compact and interpretable deformation fields for efficient population morphometrics analysis. We validate MORPH-LER across two families of deep learning-based registration networks, demonstrating its ability to produce anatomically accurate, computationally efficient, and statistically meaningful transformations on the OASIS-1 brain imaging dataset.
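The idea of obtaining a principal logarithm from repeated square roots can be illustrated on a small matrix, which is a much simpler setting than deformation fields. The sketch below assumes the classical identity log(M) = 2^k · log(M^(1/2^k)) together with the first-order approximation log(X) ≈ X − I for X close to the identity; in MORPH-LER the square roots are predicted by a network, whereas here a closed-form 2×2 square root stands in for that step. The function names are hypothetical.

```python
import math


def sqrt2x2(m):
    """Principal square root of a 2x2 matrix with positive determinant and
    trace + 2*sqrt(det) > 0, using the closed form (M + s*I) / t."""
    (a, b), (c, d) = m
    s = math.sqrt(a * d - b * c)        # s = sqrt(det M)
    t = math.sqrt(a + d + 2.0 * s)      # t = sqrt(trace M + 2 s)
    return [[(a + s) / t, b / t], [c / t, (d + s) / t]]


def log_by_square_roots(m, k=16):
    """Approximate log(M) as 2^k * (M^(1/2^k) - I)."""
    root = m
    for _ in range(k):                  # k successive square roots
        root = sqrt2x2(root)
    scale = 2.0 ** k
    return [[scale * (root[0][0] - 1.0), scale * root[0][1]],
            [scale * root[1][0], scale * (root[1][1] - 1.0)]]


theta = 0.5
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
log_rotation = log_by_square_roots(rotation)
# The exact logarithm of this rotation is [[0, -theta], [theta, 0]].
```

The logarithm lives in a linear space, which is what makes averaging and other statistics well defined there; this is the sense in which a Log-Euclidean latent space is "linearized".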

[CV-59] From Fog to Failure: How Dehazing Can Harm Clear Image Object Detection

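[Quick Read]: This paper examines the challenges of integrating dehazing based on human visual cues into object detection. The key solution is a multi-stage framework: a lightweight detector first identifies regions of interest (RoIs), these regions are then enhanced by spatial attention-based dehazing, and a heavier model performs the final detection. Although effective in foggy conditions, the approach unexpectedly degrades performance on clear images.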

Link: https://arxiv.org/abs/2502.02027
Authors: Ashutosh Kumar,Aman Chadha
Affiliation: Rochester Institute of Technology; Amazon Gen AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2410.01225

Abstract:This study explores the challenges of integrating human visual cue-based dehazing into object detection, given the selective nature of human perception. While human vision adapts dynamically to environmental conditions, computational dehazing does not always enhance detection uniformly. We propose a multi-stage framework where a lightweight detector identifies regions of interest (RoIs), which are then enhanced via spatial attention-based dehazing before final detection by a heavier model. Though effective in foggy conditions, this approach unexpectedly degrades the performance on clear images. We analyze this phenomenon, investigate possible causes, and offer insights for designing hybrid pipelines that balance enhancement and detection. Our findings highlight the need for selective preprocessing and challenge assumptions about universal benefits from cascading transformations.

[CV-60] Multi-illuminant Color Constancy via Multi-scale Illuminant Estimation and Fusion

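[Quick Read]: This paper addresses the fact that existing multi-illuminant color constancy methods, which remove local color casts through pixel-wise illuminant estimation, neglect the impact of image scale. The key idea is to represent the illuminant map as a linear combination of components estimated from multi-scale images, and to use a tri-branch convolutional network to estimate multi-grained illuminant distribution maps from those images. The multi-grained maps are merged adaptively by an attentional illuminant fusion module, which the authors report achieves state-of-the-art performance.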

Link: https://arxiv.org/abs/2502.02021
Authors: Hang Luo,Rongwei Li,Jinxing Liang
Affiliation: School of Computer Science and Artificial Intelligence, Wuhan Textile University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 10 pages, 4 figures, this manuscript is under the consideration of Optics Express

Abstract:Multi-illuminant color constancy methods aim to eliminate local color casts within an image through pixel-wise illuminant estimation. Existing methods mainly employ deep learning to establish a direct mapping between an image and its illumination map, which neglects the impact of image scales. To alleviate this problem, we represent an illuminant map as the linear combination of components estimated from multi-scale images. Furthermore, we propose a tri-branch convolutional network to estimate multi-grained illuminant distribution maps from multi-scale images. These multi-grained illuminant maps are merged adaptively with an attentional illuminant fusion module. Comprehensive experimental analysis and evaluation demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
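The "linear combination of components estimated from multi-scale images" can be pictured as a per-pixel weighted sum. The sketch below is an assumption-laden illustration, not the paper's fusion module: it supposes that three per-scale illuminant maps have already been resampled to a common resolution and that some network has produced one attention score per scale and pixel; softmax weighting is a hypothetical choice.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_illuminants(maps, logits):
    """Fuse per-scale illuminant maps into one map.

    maps[s][p] is the RGB illuminant estimated at scale s for pixel p;
    logits[s][p] is a (hypothetical) attention score for scale s at pixel p.
    """
    num_scales, num_pixels = len(maps), len(maps[0])
    fused = []
    for p in range(num_pixels):
        weights = softmax([logits[s][p] for s in range(num_scales)])
        fused.append([sum(weights[s] * maps[s][p][c] for s in range(num_scales))
                      for c in range(3)])
    return fused


# Made-up values: three scales, two pixels.
maps = [
    [[0.9, 0.8, 0.7], [0.5, 0.5, 0.5]],
    [[0.6, 0.6, 0.6], [0.4, 0.5, 0.6]],
    [[0.3, 0.4, 0.5], [0.3, 0.5, 0.7]],
]
logits = [[0.0, 50.0], [0.0, 0.0], [0.0, 0.0]]
fused = fuse_illuminants(maps, logits)
```

With equal scores the first pixel receives the plain average of the three scales, while the second pixel is dominated by the scale whose score is much larger.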

[CV-61] One Diffusion Step to Real-World Super-Resolution via Flow Trajectory Distillation

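[Quick Read]: This paper addresses the fact that existing one-step diffusion methods are constrained by the performance of the teacher model, so that a poor teacher leads to image artifacts. The key solution is FluxSR, a one-step diffusion technique based on flow matching models that uses the state-of-the-art diffusion model FLUX.1-dev as both the teacher model and the base model. Specifically, Flow Trajectory Distillation (FTD) distills a multi-step flow matching model into a one-step real-world image super-resolution (Real-ISR) model, and a TV-LPIPS perceptual loss together with an Attention Diversification Loss (ADL) regularization term reduces token similarity in the transformer, thereby eliminating high-frequency artifacts.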

Link: https://arxiv.org/abs/2502.01993
Authors: Jianze Li,Jiezhang Cao,Yong Guo,Wenbo Li,Yulun Zhang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models (DMs) have significantly advanced the development of real-world image super-resolution (Real-ISR), but the computational cost of multi-step diffusion models limits their application. One-step diffusion models generate high-quality images in a single sampling step, greatly reducing computational overhead and inference latency. However, most existing one-step diffusion methods are constrained by the performance of the teacher model, where poor teacher performance results in image artifacts. To address this limitation, we propose FluxSR, a novel one-step diffusion Real-ISR technique based on flow matching models. We use the state-of-the-art diffusion model FLUX.1-dev as both the teacher model and the base model. First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR model. Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss and introduce Attention Diversification Loss (ADL) as a regularization term to reduce token similarity in the transformer, thereby eliminating high-frequency artifacts. Comprehensive experiments demonstrate that our method outperforms existing one-step diffusion-based Real-ISR methods. The code and model will be released at this https URL.

[CV-62] Rethinking Timesteps Samplers and Prediction Types

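[Quick Read]: This paper addresses the difficulty of training diffusion models with limited resources. Through experiments, it identifies a major factor behind training failure: the training loss varies significantly across timesteps. In addition, different prediction types for x_0 vary in effectiveness depending on the task and the timestep. The paper hypothesizes that a mixed-prediction approach, which identifies the most accurate x_0 prediction type, could help overcome these limitations, particularly for high-resolution tasks.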

Link: https://arxiv.org/abs/2502.01990
Authors: Bin Xie,Gady Agam
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models suffer from the huge consumption of time and resources to train. For example, diffusion models need hundreds of GPUs to train for several weeks for a high-resolution generative task to meet the requirements of an extremely large number of iterations and a large batch size. Training diffusion models becomes a millionaire’s game. With limited resources that only fit a small batch size, training a diffusion model always fails. In this paper, we investigate the key reasons behind the difficulties of training diffusion models with limited resources. Through numerous experiments and demonstrations, we identified a major factor: the significant variation in the training losses across different timesteps, which can easily disrupt the progress made in previous iterations. Moreover, different prediction types of x_0 exhibit varying effectiveness depending on the task and timestep. We hypothesize that using a mixed-prediction approach to identify the most accurate x_0 prediction type could potentially serve as a breakthrough in addressing this issue. In this paper, we outline several challenges and insights, with the hope of inspiring further research aimed at tackling the limitations of training diffusion models with constrained resources, particularly for high-resolution tasks.
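To make "prediction types of x_0" concrete, the sketch below shows how an x_0 estimate is recovered from two common network outputs under the standard variance-preserving forward process x_t = sqrt(ᾱ)·x_0 + sqrt(1−ᾱ)·ε. The ε-prediction and v-prediction conversions are standard identities; whether these are exactly the prediction types compared in the paper is an assumption, and the function names are hypothetical.

```python
import math


def x0_from_eps(x_t, eps_pred, alpha_bar):
    """x_0 estimate when the network predicts the noise eps."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - s * e) / a for xt, e in zip(x_t, eps_pred)]


def x0_from_v(x_t, v_pred, alpha_bar):
    """x_0 estimate when the network predicts v = sqrt(ab)*eps - sqrt(1-ab)*x_0."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * xt - s * v for xt, v in zip(x_t, v_pred)]


def eps_error_gain(alpha_bar):
    """Factor by which an error in the eps prediction is scaled in x_0."""
    return math.sqrt((1.0 - alpha_bar) / alpha_bar)


def v_error_gain(alpha_bar):
    """Factor by which an error in the v prediction is scaled in x_0."""
    return math.sqrt(1.0 - alpha_bar)


# Build a consistent (x_t, eps, v) triple from made-up values and invert it.
alpha_bar = 0.3
x0 = [0.2, -0.5]
eps = [1.0, -0.3]
a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
x_t = [a * x + s * e for x, e in zip(x0, eps)]
v = [a * e - s * x for x, e in zip(x0, eps)]
recovered_from_eps = x0_from_eps(x_t, eps, alpha_bar)
recovered_from_v = x0_from_v(x_t, v, alpha_bar)
```

The two gain functions show one reason accuracy can depend on the timestep: at very noisy timesteps (small ᾱ) an error in the ε prediction is amplified strongly in the implied x_0, while an error in the v prediction is not.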

[CV-63] DCT-Mamba3D: Spectral Decorrelation and Spatial-Spectral Feature Extraction for Hyperspectral Image Classification

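[Quick Read]: This paper addresses challenges in hyperspectral image classification that arise from spectral redundancy and complex spatial-spectral dependencies. The key solution is the DCT-Mamba3D framework, which comprises: (1) a 3D spectral-spatial decorrelation module that applies 3D discrete cosine transform basis functions to reduce spectral and spatial redundancy and enhance feature clarity across dimensions; (2) a 3D-Mamba module that uses a bidirectional state-space model to capture intricate spatial-spectral dependencies; and (3) a global residual enhancement module that stabilizes feature representation and improves robustness and convergence.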

Link: https://arxiv.org/abs/2502.01986
Authors: Weijia Cao,Xiaofei Yang,Yicong Zhou,Zheng Zhang
Affiliation: Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China; Guangzhou University, Guangzhou, China; University of Macau, Macau, China; Harbin Institute of Technology, Shenzhen, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Hyperspectral image classification presents challenges due to spectral redundancy and complex spatial-spectral dependencies. This paper proposes a novel framework, DCT-Mamba3D, for hyperspectral image classification. DCT-Mamba3D incorporates: (1) a 3D spectral-spatial decorrelation module that applies 3D discrete cosine transform basis functions to reduce both spectral and spatial redundancy, enhancing feature clarity across dimensions; (2) a 3D-Mamba module that leverages a bidirectional state-space model to capture intricate spatial-spectral dependencies; and (3) a global residual enhancement module that stabilizes feature representation, improving robustness and convergence. Extensive experiments on benchmark datasets show that our DCT-Mamba3D outperforms the state-of-the-art methods in challenging scenarios such as the same object in different spectra and different objects in the same spectra.
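The decorrelating effect of the discrete cosine transform can be seen on a single pixel's spectrum. The sketch below uses a one-dimensional orthonormal DCT-II along the spectral axis only, which is a deliberate simplification of the 3D basis functions the abstract mentions; the spectral values are made up for illustration.

```python
import math


def dct_ortho(signal):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                                  for i, x in enumerate(signal)))
    return coeffs


# Made-up smooth reflectance curve over 8 adjacent bands (highly redundant).
spectrum = [0.50, 0.52, 0.55, 0.57, 0.60, 0.62, 0.63, 0.65]
coeffs = dct_ortho(spectrum)

energy = sum(c * c for c in coeffs)
low_band_share = sum(c * c for c in coeffs[:2]) / energy
```

Because neighboring bands are strongly correlated, almost all of the signal energy ends up in the first few coefficients, so later layers see a compact, less redundant representation.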

[CV-64] AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs

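[Quick Read]: This paper addresses a limitation of existing datasets for user interface (UI) understanding: they provide either large-scale context-free element annotations or contextualized functional descriptions at a much smaller scale. The key solution is the AutoGUI pipeline, which uses large language models (LLMs) to infer an element's functionality by comparing UI content before and after a simulated interaction with that element, enabling automatic annotation of detailed functionality descriptions at scale. LLM-aided rejection and verification further improve annotation quality by eliminating invalid and incorrect annotations without human labor. The pipeline is used to build the AutoGUI-704k dataset, which features multi-resolution, multi-device screenshots and diverse data domains, and it achieves annotation correctness comparable to trained human annotators.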

Link: https://arxiv.org/abs/2502.01977
Authors: Hongxin Li,Jingfan Chen,Jingran Su,Yuntao Chen,Qing Li,Zhaoxiang Zhang
Affiliation: University of Chinese Academy of Sciences (UCAS); New Laboratory of Pattern Recognition (NLPR), CASIA; CAIR, HKISI, CAS; The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract:User interface understanding with vision-language models has received much attention due to its potential for enabling next-generation software automation. However, existing UI datasets either only provide large-scale context-free element annotations or contextualized functional descriptions for elements at a much smaller scale. In this work, we propose the AutoGUI pipeline for automatically annotating UI elements with detailed functionality descriptions at scale. Specifically, we leverage large language models (LLMs) to infer element functionality by comparing the UI content changes before and after simulated interactions with specific UI elements. To improve annotation quality, we propose LLM-aided rejection and verification, eliminating invalid and incorrect annotations without human labor. We construct an AutoGUI-704k dataset using the proposed pipeline, featuring multi-resolution, multi-device screenshots, diverse data domains, and detailed functionality annotations that have never been provided by previous datasets. Human evaluation shows that the AutoGUI pipeline achieves annotation correctness comparable to trained human annotators. Extensive experimental results show that our AutoGUI-704k dataset remarkably enhances the UI grounding capabilities of VLMs, exhibits significant scaling effects, and outperforms existing web pre-training data types. We envision AutoGUI as a scalable pipeline for generating massive data to build GUI-oriented VLMs. The AutoGUI dataset can be viewed at this anonymous URL: this https URL.

[CV-65] Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration

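[Quick Read]: This paper addresses object hallucination in Large Vision-Language Models (LVLMs), where responses are not factually aligned with the visual content. It observes that the correlation between attention and spatial position differs across LVLMs, so existing remedies that reorder visual tokens generalize poorly. The key solution is a training-free method, Uniform Attention Calibration (UAC), which estimates the bias from a single meaningless input image and applies a calibration matrix to rectify attention imbalances. To reduce the bias further, a fine-tuning method, Dynamic Attention Calibration (DAC), uses a plug-and-play module to enforce consistent outputs wherever the object is located in the image. Experiments show that UAC and DAC significantly reduce object hallucination while improving multimodal alignment.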

Link: https://arxiv.org/abs/2502.01969
Authors: Younan Zhu,Linwei Tao,Minjing Dong,Chang Xu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Vision-Language Models (LVLMs) exhibit impressive multimodal reasoning capabilities but remain highly susceptible to object hallucination, where models generate responses that are not factually aligned with the visual content. Recent works attribute this issue to an inherent bias of LVLMs where the vision token attention map has a fixed correlation with spatial position, and propose to mitigate this issue by reordering visual tokens. However, we find that different LVLMs exhibit different correlations between attention and spatial position, which makes the existing solution difficult to generalize to other LVLMs. To address this issue, we first introduce a training-free solution, Uniform Attention Calibration (UAC), that estimates the bias from a single meaningless input image and applies a calibration matrix to rectify attention imbalances. To further alleviate the bias, we relax the assumption of a single meaningless input in UAC and introduce a fine-tuning solution, Dynamic Attention Calibration (DAC), that enforces consistent outputs wherever the object is located in the image via a plug-and-play module. Comprehensive experiments across multiple benchmarks demonstrate that UAC and DAC significantly reduce object hallucination while improving general multimodal alignment. Our methods achieve state-of-the-art performance across diverse LVLM architectures on various metrics.
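One way to picture a calibration matrix estimated from a meaningless image is sketched below. The abstract does not give the exact form of UAC's matrix, so this is an assumption: each spatial position receives a weight equal to the mean blank-image attention divided by the attention at that position, and calibrated maps are renormalized to sum to one. All names and numbers are hypothetical.

```python
def calibration_matrix(blank_attention):
    """Weights that flatten the attention observed on a meaningless image.

    Assumes every entry of blank_attention is strictly positive.
    """
    cells = [v for row in blank_attention for v in row]
    mean = sum(cells) / len(cells)
    return [[mean / v for v in row] for row in blank_attention]


def calibrate(attention, weights):
    """Reweight an attention map position by position, then renormalize."""
    scaled = [[a * w for a, w in zip(a_row, w_row)]
              for a_row, w_row in zip(attention, weights)]
    total = sum(v for row in scaled for v in row)
    return [[v / total for v in row] for row in scaled]


# Made-up attention over a 2x2 grid of vision tokens for a blank image:
# the model favors the bottom-right position regardless of content.
blank = [[0.05, 0.10], [0.25, 0.60]]
weights = calibration_matrix(blank)
flattened = calibrate(blank, weights)

# Attention on a real image, corrected with the same weights.
real = [[0.40, 0.10], [0.10, 0.40]]
corrected = calibrate(real, weights)
```

After calibration the blank image yields uniform attention, and on a real image the positional preference is discounted so that content rather than position drives the weights.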

[CV-66] Memory Efficient Transformer Adapter for Dense Predictions ICLR2025

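[Quick Read]: This paper addresses the way the inference speed of current Vision Transformer (ViT) adapter methods is hindered by inefficient memory access operations such as standard normalization and frequent reshaping. The key solution is META, a simple and fast ViT adapter that improves memory efficiency and reduces memory time consumption by cutting down on these operations. META uses a memory-efficient adapter block in which layer normalization is shared between the self-attention and feed-forward network layers, reducing the model's reliance on normalization operations. It also employs cross-shaped self-attention to reduce frequent reshaping, and adds a lightweight convolutional branch that strengthens local inductive biases, which is particularly useful for dense prediction tasks. Finally, the adapter block is formulated in a cascaded manner to compute diverse head features, enriching the variety of feature representations.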

Link: https://arxiv.org/abs/2502.01962
Authors: Dong Zhang,Rui Yan,Pingcheng Dong,Kwang-Ting Cheng
Affiliation: The Hong Kong University of Science and Technology; InnoHK AI Chip Center for Smart Emerging Systems; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted by ICLR 2025

Abstract:While current Vision Transformer (ViT) adapter methods have shown promising accuracy, their inference speed is implicitly hindered by inefficient memory access operations, e.g., standard normalization and frequent reshaping. In this work, we propose META, a simple and fast ViT adapter that can improve the model’s memory efficiency and decrease memory time consumption by reducing the inefficient memory access operations. Our method features a memory-efficient adapter block that enables the common sharing of layer normalization between the self-attention and feed-forward network layers, thereby reducing the model’s reliance on normalization operations. Within the proposed block, the cross-shaped self-attention is employed to reduce the model’s frequent reshaping operations. Moreover, we augment the adapter block with a lightweight convolutional branch that can enhance local inductive biases, particularly beneficial for the dense prediction tasks, e.g., object detection, instance segmentation, and semantic segmentation. The adapter block is finally formulated in a cascaded manner to compute diverse head features, thereby enriching the variety of feature representations. Empirically, extensive evaluations on multiple representative datasets validate that META substantially enhances the predicted quality, while achieving a new state-of-the-art accuracy-efficiency trade-off. Theoretically, we demonstrate that META exhibits superior generalization capability and stronger adaptability.

[CV-67] Hierarchical Consensus Network for Multiview Feature Learning AAAI2025

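[Quick Read]: This paper addresses the difficulty of learning view-consistency features in multiview feature learning. It proposes the Hierarchical Consensus Network (HCN), whose key idea is to capture hierarchical consensus across views through three consensus indices (classifying consensus, coding consensus, and global consensus), so that the information within each view is better integrated and more comprehensive, discriminative features are obtained.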

Link: https://arxiv.org/abs/2502.01961
Authors: Chengwei Xia,Chaoxi Niu,Kun Zhan
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: AAAI 2025 accepted paper

Abstract:Multiview feature learning aims to learn discriminative features by integrating the distinct information in each view. However, most existing methods still face significant challenges in learning view-consistency features, which are crucial for effective multiview learning. Motivated by the theories of CCA and contrastive learning in multiview feature learning, we propose the hierarchical consensus network (HCN) in this paper. The HCN derives three consensus indices for capturing the hierarchical consensus across views, which are classifying consensus, coding consensus, and global consensus, respectively. Specifically, classifying consensus reinforces class-level correspondence between views from a CCA perspective, while coding consensus closely resembles contrastive learning and reflects contrastive comparison of individual instances. Global consensus aims to extract consensus information from two perspectives simultaneously. By enforcing the hierarchical consensus, the information within each view is better integrated to obtain more comprehensive and discriminative features. The extensive experimental results obtained on four multiview datasets demonstrate that the proposed method significantly outperforms several state-of-the-art methods.

[CV-68] MATCNN: Infrared and Visible Image Fusion Method Based on Multi-scale CNN with Attention Transformer

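[Quick Read]: This paper addresses the shortcomings of existing image fusion methods in extracting multi-scale local features and preserving global features. The key solution is a cross-modal image fusion method based on a multi-scale convolutional neural network with attention Transformer (MATCNN). MATCNN uses a multi-scale fusion module (MSFM) to extract local features at different scales and a global feature extraction module (GFEM) to extract global features. An information mask labels pertinent details in the images so that a larger share of the salient information in infrared images and the background textures in visible images is preserved in the fused image. In addition, a new optimization algorithm uses the mask to guide feature extraction by integrating content, structural similarity index measurement, and global feature losses.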

Link: https://arxiv.org/abs/2502.01959
Authors: Jingjing Liu,Li Zhang,Xiaoyang Zeng,Wanquan Liu,Jianhua Zhang
Affiliation: Shanghai Key Laboratory of Chips and Systems for Intelligent Connected Vehicle, School of Microelectronics, Shanghai University, China; State Key Laboratory of Integrated Chips and Systems, Fudan University, China; School of Intelligent Systems Engineering, Sun Yat-sen University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While attention-based approaches have shown considerable progress in enhancing image fusion and addressing the challenges posed by long-range feature dependencies, their efficacy in capturing local features is compromised by the lack of diverse receptive field extraction techniques. To overcome the shortcomings of existing fusion methods in extracting multi-scale local features and preserving global features, this paper proposes a novel cross-modal image fusion approach based on a multi-scale convolutional neural network with attention Transformer (MATCNN). MATCNN utilizes the multi-scale fusion module (MSFM) to extract local features at different scales and employs the global feature extraction module (GFEM) to extract global features. Combining the two reduces the loss of detail features and improves the ability of global feature representation. Simultaneously, an information mask is used to label pertinent details within the images, aiming to enhance the proportion of preserving significant information in infrared images and background textures in visible images in fused images. Subsequently, a novel optimization algorithm is developed, leveraging the mask to guide feature extraction through the integration of content, structural similarity index measurement, and global feature loss. Quantitative and qualitative evaluations are conducted across various datasets, revealing that MATCNN effectively highlights infrared salient targets, preserves additional details in visible images, and achieves better fusion results for cross-modal images. The code of MATCNN will be available at this https URL.

[CV-69] LAYOUTDREAMER: Physics-guided Layout for Text-to-3D Compositional Scene Generation

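[Quick Read]: This paper addresses three main problems in text-guided 3D scene generation: (i) difficulty capturing complex relationships between multiple objects described in the text; (ii) inability to generate physically plausible scene layouts; and (iii) lack of controllability and extensibility in compositional scenes. The key solution is the LayoutDreamer framework, which uses 3D Gaussian Splatting (3DGS) to produce high-quality, physically consistent compositional scenes. Given a text prompt, it converts the prompt into a directed scene graph and adaptively adjusts the density and layout of the initial compositional 3D Gaussians. Dynamic camera adjustments based on the training focal point then ensure entity-level generation quality. Finally, directed dependencies extracted from the scene graph are used to tailor physical and layout energy terms, ensuring both realism and flexibility.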

Link: https://arxiv.org/abs/2502.01949
Authors: Yang Zhou,Zongjin He,Qixuan Li,Chao Wang
Affiliation: Shanghai University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Abstract:Recently, the field of text-guided 3D scene generation has garnered significant attention. High-quality generation that aligns with physical realism and high controllability is crucial for practical 3D scene applications. However, existing methods face fundamental limitations: (i) difficulty capturing complex relationships between multiple objects described in the text, (ii) inability to generate physically plausible scene layouts, and (iii) lack of controllability and extensibility in compositional scenes. In this paper, we introduce LayoutDreamer, a framework that leverages 3D Gaussian Splatting (3DGS) to facilitate high-quality, physically consistent compositional scene generation guided by text. Specifically, given a text prompt, we convert it into a directed scene graph and adaptively adjust the density and layout of the initial compositional 3D Gaussians. Subsequently, dynamic camera adjustments are made based on the training focal point to ensure entity-level generation quality. Finally, by extracting directed dependencies from the scene graph, we tailor physical and layout energy to ensure both realism and flexibility. Comprehensive experiments demonstrate that LayoutDreamer outperforms other compositional scene generation methods in generation quality and semantic alignment. Specifically, it achieves state-of-the-art (SOTA) performance in the multiple objects generation metric of T3Bench.

[CV-70] HeRCULES: Heterogeneous Radar Dataset in Complex Urban Environment for Multi-session Radar SLAM ICRA2025

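[Quick Read]: This paper addresses the fact that existing radar datasets typically cover only a single radar type, which limits algorithms to that specific kind. The key contribution is the HeRCULES dataset, a multi-modal dataset that combines heterogeneous radars (4D radar and spinning radar) with FMCW LiDAR, IMU, GPS, and cameras. It supports localization, mapping, and place recognition research, and covers diverse weather and lighting conditions as well as a range of urban traffic scenarios, enabling comprehensive analysis across varied environments.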

Link: https://arxiv.org/abs/2502.01946
Authors: Hanjun Kim,Minwoo Jung,Chiyun Noh,Sangwoo Jung,Hyunho Song,Wooseong Yang,Hyesu Jang,Ayoung Kim
Affiliation: SNU (Seoul National University)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 2025 IEEE International Conference on Robotics and Automation (ICRA 2025)

Abstract:Recently, radars have been widely featured in robotics for their robustness in challenging weather conditions. Two commonly used radar types are spinning radars and phased-array radars, each offering distinct sensor characteristics. Existing datasets typically feature only a single type of radar, leading to the development of algorithms limited to that specific kind. In this work, we highlight that combining different radar types offers complementary advantages, which can be leveraged through a heterogeneous radar dataset. Moreover, this new dataset fosters research in multi-session and multi-robot scenarios where robots are equipped with different types of radars. In this context, we introduce the HeRCULES dataset, a comprehensive, multi-modal dataset with heterogeneous radars, FMCW LiDAR, IMU, GPS, and cameras. This is the first dataset to integrate 4D radar and spinning radar alongside FMCW LiDAR, offering unparalleled localization, mapping, and place recognition capabilities. The dataset covers diverse weather and lighting conditions and a range of urban traffic scenarios, enabling a comprehensive analysis across various environments. The sequence paths with multiple revisits and ground truth pose for each sensor enhance its suitability for place recognition research. We expect the HeRCULES dataset to facilitate odometry, mapping, place recognition, and sensor fusion research. The dataset and development tools are available at this https URL.

[CV-71] DAMO: Data- and Model-aware Alignment of Multi-modal LLMs

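[Quick Read]: This paper addresses the imbalanced responsiveness of existing Direct Preference Optimization (DPO) methods to data of varying hardness: they tend to overfit easy-to-distinguish data while underfitting hard-to-distinguish data. The key solution is Data- and Model-aware DPO (DAMO), which dynamically adjusts the optimization process through two strategies: a data-aware strategy that incorporates data hardness, and a model-aware strategy that integrates real-time model responses. Combining the two lets the model adapt effectively to data of varying hardness, significantly improving trustworthiness as well as effectiveness on general tasks.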

Link: https://arxiv.org/abs/2502.01943
Authors: Jinda Lu,Junkang Wu,Jinghan Li,Xiaojun Jia,Shuo Wang,YiFan Zhang,Junfeng Fang,Xiang Wang,Xiangnan He
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Direct Preference Optimization (DPO) has shown effectiveness in aligning multi-modal large language models (MLLM) with human preferences. However, existing methods exhibit an imbalanced responsiveness to the data of varying hardness, tending to overfit on the easy-to-distinguish data while underfitting on the hard-to-distinguish data. In this paper, we propose Data- and Model-aware DPO (DAMO) to dynamically adjust the optimization process from two key aspects: (1) a data-aware strategy that incorporates data hardness, and (2) a model-aware strategy that integrates real-time model responses. By combining the two strategies, DAMO enables the model to effectively adapt to data with varying levels of hardness. Extensive experiments on five benchmarks demonstrate that DAMO not only significantly enhances the trustworthiness, but also improves the effectiveness over general tasks. For instance, on the Object HalBench, our DAMO-7B reduces response-level and mentioned-level hallucination by 90.0% and 95.3%, respectively, surpassing the performance of GPT-4V.
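For readers unfamiliar with the objective being adjusted, the sketch below implements the standard per-pair DPO loss, -log σ(β · margin), where the margin compares policy and reference log-probabilities of the chosen and rejected responses. The abstract does not state how DAMO modifies the optimization, so the per-pair β values in the example are purely hypothetical placeholders showing where a hardness- or response-dependent adjustment could enter.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Arguments are sequence log-probabilities under the policy and the
    frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), written in a numerically stable form.
    return math.log1p(math.exp(-abs(z))) + max(-z, 0.0)


# Made-up log-probabilities for two preference pairs.
pairs = [
    {"policy_chosen": -1.0, "policy_rejected": -3.0, "ref_chosen": -2.0, "ref_rejected": -2.0},
    {"policy_chosen": -2.1, "policy_rejected": -2.0, "ref_chosen": -2.0, "ref_rejected": -2.0},
]
# Hypothetical per-pair beta values; how DAMO sets them is not given in the abstract.
betas = [0.1, 0.3]
losses = [dpo_loss(beta=b, **pair) for pair, b in zip(pairs, betas)]
```

The first pair already has a positive margin and a loss below log 2, while the second pair has a slightly negative margin and a loss above log 2; a scheme that changes β per pair changes how strongly each kind of pair drives the update.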

[CV-72] Toward a Low-Cost Perception System in Autonomous Vehicles: A Spectrum Learning Approach

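[Quick read]: This paper addresses the sparsity of depth maps in Autonomous Driving (AD) and Autonomous Vehicles (AVs). The key to the solution is a novel pixel positional encoding algorithm, inspired by Bartlett's spatial spectrum estimation technique, that transforms radar depth maps and RGB images into a unified spatial-spectrum subspace so that learning can exploit their similarities and differences. The method uses high-resolution camera images to train radar depth map generative models, which overcomes the limitations of conventional radar detectors in complex vehicular environments and sharpens the radar output.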

Link: https://arxiv.org/abs/2502.01940
Authors: Mohammed Alsakabi,Aidan Erickson,John M. Dolan,Ozan K. Tonguz
Affiliation: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:We present a cost-effective new approach for generating denser depth maps for Autonomous Driving (AD) and Autonomous Vehicles (AVs) by integrating the images obtained from deep neural network (DNN) 4D radar detectors with conventional camera RGB images. Our approach introduces a novel pixel positional encoding algorithm inspired by Bartlett’s spatial spectrum estimation technique. This algorithm transforms both radar depth maps and RGB images into a unified pixel image subspace called the Spatial Spectrum, facilitating effective learning based on their similarities and differences. Our method effectively leverages high-resolution camera images to train radar depth map generative models, addressing the limitations of conventional radar detectors in complex vehicular environments, thus sharpening the radar output. We develop spectrum estimation algorithms tailored for radar depth maps and RGB images, a comprehensive training framework for data-driven generative models, and a camera-radar deployment scheme for AV operation. Our results demonstrate that our approach also outperforms the state-of-the-art (SOTA) by 27.95% in terms of Unidirectional Chamfer Distance (UCD).
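
The abstract reports its gain in terms of Unidirectional Chamfer Distance (UCD). The sketch below shows one common form of that metric: the mean distance from each point in a source set to its nearest neighbor in a target set. Whether the paper uses plain or squared distances, and which point set plays the source role, is not stated in the abstract, so both choices here are assumptions.

```python
import math


def unidirectional_chamfer_distance(source, target):
    """Mean Euclidean distance from each source point to its nearest target point."""
    if not source or not target:
        raise ValueError("both point sets must be non-empty")
    total = 0.0
    for p in source:
        # Nearest-neighbor distance from p to the target set (brute force).
        total += min(math.dist(p, q) for q in target)
    return total / len(source)


predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0)]
print(unidirectional_chamfer_distance(predicted, reference))
print(unidirectional_chamfer_distance(reference, predicted))
```

The two calls give different values, which is what makes the metric unidirectional: points in the target set that have no nearby source point are not penalized.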

[CV-73] PATCH: a deep learning method to assess heterogeneity of artistic practice in historical paintings

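[Quick read]: This paper addresses the problem of identifying artists' practice regimes in the creation of historical paintings. Because historical documentation is sparse, no external training data ("ground truth") is available, and traditional methods struggle to identify the contributions of different artists within a work. The key to the solution is a new machine learning method called Pairwise Assignment Training for Classifying Heterogeneity (PATCH). It achieves unsupervised results by supervised means without any external training data, identifies individual artistic practice regimes, and outperforms both simple statistical procedures and unsupervised machine learning methods.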

Link: https://arxiv.org/abs/2502.01912
Authors: Andrew Van Horn,Lauryn Smith,Mahamad Mahmoud,Michael McMaster,Clara Pinchbeck,Ina Martin,Andrew Lininger,Anthony Ingrisano,Adam Lowe,Carlos Bayod,Elizabeth Bolman,Kenneth Singer,Michael Hinczewski
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: main text: 16 pages, 6 figures; SI: 7 pages, 3 figures

Abstract:The history of art has seen significant shifts in the manner in which artworks are created, making understanding of creative processes a central question in technical art history. In the Renaissance and Early Modern period, paintings were largely produced by master painters directing workshops of apprentices who often contributed to projects. The masters varied significantly in artistic and managerial styles, meaning different combinations of artists and implements might be seen both between masters and within workshops or even individual canvases. Information on how different workshops were managed and the processes by which artworks were created remains elusive. Machine learning methods have potential to unearth new information about artists’ creative processes by extending the analysis of brushwork to a microscopic scale. Analysis of workshop paintings, however, presents a challenge in that documentation of the artists and materials involved is sparse, meaning external examples are not available to train networks to recognize their contributions. Here we present a novel machine learning approach we call pairwise assignment training for classifying heterogeneity (PATCH) that is capable of identifying individual artistic practice regimes with no external training data, or “ground truth.” The method achieves unsupervised results by supervised means, and outperforms both simple statistical procedures and unsupervised machine learning methods. We apply this method to two historical paintings by the Spanish Renaissance master, El Greco: The Baptism of Christ and Christ on the Cross with Landscape, and our findings regarding the former potentially challenge previous work that has assigned the painting to workshop members. Further, the results of our analyses create a measure of heterogeneity of artistic practice that can be used to characterize artworks across time and space.

[CV-74] Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models

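[Quick read]: This paper addresses the failure of large vision-and-language models (LVLMs) to adequately distinguish visual from textual information when processing multi-modal inputs. The key is Decomposed Attention (D-Attn), which decomposes the 1-D causal self-attention so that visual and textual embeddings are processed differently. By diagonalizing visual-to-visual self-attention, it reduces computation from \(\mathcal{O}(|V|^2)\) to \(\mathcal{O}(|V|)\) without compromising performance. D-Attn also debiases positional encodings in textual-to-visual cross-attention, further enhancing visual understanding. Finally, an \(\alpha\)-weighting strategy merges visual and textual information, maximally preserving the capabilities of the pre-trained large language model (LLM) with minimal modifications.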

Link: https://arxiv.org/abs/2502.01906
Authors: Chia-Wen Kuo,Sijie Zhu,Fan Chen,Xiaohui Shen,Longyin Wen
Affiliation: ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large vision-and-language models (LVLMs) typically treat visual and textual embeddings as homogeneous inputs to a large language model (LLM). However, these inputs are inherently different: visual inputs are multi-dimensional and contextually rich, often pre-encoded by models like CLIP, while textual inputs lack this structure. In this paper, we propose Decomposed Attention (D-Attn), a novel method that processes visual and textual embeddings differently by decomposing the 1-D causal self-attention in LVLMs. After the attention decomposition, D-Attn diagonalizes visual-to-visual self-attention, reducing computation from \(\mathcal{O}(|V|^2)\) to \(\mathcal{O}(|V|)\) for \(|V|\) visual embeddings without compromising performance. Moreover, D-Attn debiases positional encodings in textual-to-visual cross-attention, further enhancing visual understanding. Finally, we introduce an \(\alpha\)-weighting strategy to merge visual and textual information, maximally preserving the pre-trained LLM’s capabilities with minimal modifications. Extensive experiments and rigorous analyses validate the effectiveness of D-Attn, demonstrating significant improvements on multiple image benchmarks while significantly reducing computational costs. Code, data, and models will be publicly available.

[CV-75] INTACT: Inducing Noise Tolerance through Adversarial Curriculum Training for LiDAR-based Safety-Critical Perception and Autonomy

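[Quick read]: This paper addresses the insufficient robustness of deep neural networks (DNNs) to noise and sparsity in LiDAR data for safety-critical perception tasks. The key to the solution is the INTACT framework, which combines meta-learning with adversarial curriculum training (ACT). The meta-learning phase equips a teacher network with task-agnostic priors so that it can generate robust saliency maps identifying critical data regions. The ACT phase uses these saliency maps to progressively expose a student network to increasingly complex noise patterns, ensuring targeted perturbation and improved noise resilience. The framework overcomes the limitations of conventional training strategies and offers a scalable, efficient solution for resource-constrained safety-critical systems.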

Link: https://arxiv.org/abs/2502.01896
Authors: Nastaran Darabi,Divake Kumar,Sina Tayebati,Amit Ranjan Trivedi
Affiliation: University of Illinois at Chicago (UIC)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:In this work, we present INTACT, a novel two-phase framework designed to enhance the robustness of deep neural networks (DNNs) against noisy LiDAR data in safety-critical perception tasks. INTACT combines meta-learning with adversarial curriculum training (ACT) to systematically address challenges posed by data corruption and sparsity in 3D point clouds. The meta-learning phase equips a teacher network with task-agnostic priors, enabling it to generate robust saliency maps that identify critical data regions. The ACT phase leverages these saliency maps to progressively expose a student network to increasingly complex noise patterns, ensuring targeted perturbation and improved noise resilience. INTACT’s effectiveness is demonstrated through comprehensive evaluations on object detection, tracking, and classification benchmarks using diverse datasets, including KITTI, Argoverse, and ModelNet40. Results indicate that INTACT improves model robustness by up to 20% across all tasks, outperforming standard adversarial and curriculum training methods. This framework not only addresses the limitations of conventional training strategies but also offers a scalable and efficient solution for real-world deployment in resource-constrained safety-critical systems. INTACT’s principled integration of meta-learning and adversarial training establishes a new paradigm for noise-tolerant 3D perception in safety-critical applications. INTACT improved KITTI Multiple Object Tracking Accuracy (MOTA) by 9.6% (64.1% - 75.1%) and by 12.4% under Gaussian noise (52.5% - 73.7%). Similarly, KITTI mean Average Precision (mAP) rose from 59.8% to 69.8% (50% point drop) and 49.3% to 70.9% (Gaussian noise), highlighting the framework’s ability to enhance deep learning model resilience in safety-critical object tracking scenarios.

[CV-76] SimBEV: A Synthetic Multi-Task Multi-Sensor Driving Data Generation Tool and Dataset

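[Quick read]: This paper addresses the fact that existing datasets do not fully support the bird's-eye view (BEV) representation and that creating new datasets is time-consuming. To solve this, it proposes SimBEV, an extensively configurable and scalable randomized synthetic data generation tool that incorporates information from multiple sources to capture accurate BEV ground truth data, supports a comprehensive array of sensors, and enables a variety of perception tasks including BEV segmentation and 3D object detection.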

Link: https://arxiv.org/abs/2502.01894
Authors: Goodarz Mehr,Azim Eskandarian
Affiliation: Virginia Commonwealth University; Virginia Tech
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Bird’s-eye view (BEV) perception for autonomous driving has garnered significant attention in recent years, in part because BEV representation facilitates the fusion of multi-sensor data. This enables a variety of perception tasks including BEV segmentation, a concise view of the environment that can be used to plan a vehicle’s trajectory. However, this representation is not fully supported by existing datasets, and creation of new datasets can be a time-consuming endeavor. To address this problem, in this paper we introduce SimBEV, an extensively configurable and scalable randomized synthetic data generation tool that incorporates information from multiple sources to capture accurate BEV ground truth data, supports a comprehensive array of sensors, and enables a variety of perception tasks including BEV segmentation and 3D object detection. We use SimBEV to create the SimBEV dataset, a large collection of annotated perception data from diverse driving scenarios.

[CV-77] Geometric Framework for 3D Cell Segmentation Correction

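[Quick read]: This paper addresses oversegmentation in 3D cellular image segmentation caused by errors in the segmentation of 2D layers. The key to the solution is an interpretable geometric framework that corrects the 2D segmentation results using geometric information from adjacent layers. The framework uses both geometric (layer-to-layer, 2D) and topological (3D shape) features, and applies binary classification to decide whether neighboring cells should be stitched.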

Link: https://arxiv.org/abs/2502.01890
Authors: Peter Chen,Bryan Chang,Olivia Annette Creasey,Julie Beth Sneddon,Yining Liu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 16 figures

Abstract:3D cellular image segmentation methods are commonly divided into non-2D-based and 2D-based approaches, the latter reconstructing 3D shapes from the segmentation results of 2D layers. However, errors in 2D results often propagate, leading to oversegmentations in the final 3D results. To tackle this issue, we introduce an interpretable geometric framework that addresses the oversegmentations by correcting the 2D segmentation results based on geometric information from adjacent layers. Leveraging both geometric (layer-to-layer, 2D) and topological (3D shape) features, we use binary classification to determine whether neighboring cells should be stitched. We develop a pre-trained classifier on public plant cell datasets and validate its performance on animal cell datasets, confirming its effectiveness in correcting oversegmentations under the transfer learning setting. Furthermore, we demonstrate that our framework can be extended to correcting oversegmentation on non-2D-based methods. A clear pipeline is provided for end-users to build the pre-trained model to any labeled dataset.
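
To make the idea of a layer-to-layer geometric feature concrete, here is a hypothetical sketch that computes the overlap (IoU) between two 2D cell masks from adjacent layers. The abstract does not list the features the framework actually uses, and it uses a trained binary classifier rather than a fixed rule. The names `mask_iou` and `should_stitch` and the threshold of 0.5 are assumptions for illustration only.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two masks given as iterables of (row, col) pixels."""
    a, b = set(mask_a), set(mask_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def should_stitch(mask_a, mask_b, threshold=0.5):
    """Stand-in decision rule (NOT the paper's classifier): stitch on high overlap."""
    return mask_iou(mask_a, mask_b) >= threshold


# Two 2x2 cell masks in adjacent layers, shifted by one column.
layer_k = {(0, 0), (0, 1), (1, 0), (1, 1)}
layer_k_plus_1 = {(0, 1), (1, 1), (0, 2), (1, 2)}
print(mask_iou(layer_k, layer_k_plus_1), should_stitch(layer_k, layer_k_plus_1))
```

In the framework described above, a feature like this would be one input among several to the binary classifier, alongside the topological (3D shape) features.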

[CV-78] Explaining Automatic Image Assessment

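[Quick read]: This paper addresses the limitations of traditional manual labeling and classification for explaining aesthetic assessment models. These methods require a complex manual labeling process and are limited in dataset size. The key to the solution is to train neural networks on different versions of the same dataset, which makes it possible to visualize dataset trends and automatically categorize visual aesthetic features. The models adapted to each specific modality are evaluated with existing and novel metrics to capture and visualize aesthetic features and trends.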

Link: https://arxiv.org/abs/2502.01873
Authors: Max Lisaius,Scott Wehrwein
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Previous work in aesthetic categorization and explainability utilizes manual labeling and classification to explain aesthetic scores. These methods require a complex labeling process and are limited in size. Our proposed approach attempts to explain aesthetic assessment models through visualizing dataset trends and automatic categorization of visual aesthetic features through training neural networks on different versions of the same dataset. By evaluating the models adapted to each specific modality using existing and novel metrics, we can capture and visualize aesthetic features and trends.

[CV-79] Reliability-Driven LiDAR-Camera Fusion for Robust 3D Object Detection

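[Quick read]: This paper addresses the performance degradation of LiDAR-camera fusion for 3D object detection in autonomous driving, particularly when a sensor malfunctions or one modality fails. The key to the solution is ReliFusion, a new framework operating in the bird's-eye view (BEV) space that integrates three core components: the Spatio-Temporal Feature Aggregation (STFA) module, which captures dependencies across frames to stabilize predictions over time; the Reliability module, which quantifies the dependability of each modality under challenging conditions; and the Confidence-Weighted Mutual Cross-Attention (CW-MCA) module, which dynamically balances information from the LiDAR and camera modalities based on those confidence scores.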

Link: https://arxiv.org/abs/2502.01856
Authors: Reza Sadeghian,Niloofar Hooshyaripour,Chris Joslin,WonSook Lee
Affiliation: School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada; School of Information Technology, Carleton University, Ottawa, Canada
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Accurate and robust 3D object detection is essential for autonomous driving, where fusing data from sensors like LiDAR and camera enhances detection accuracy. However, sensor malfunctions such as corruption or disconnection can degrade performance, and existing fusion models often struggle to maintain reliability when one modality fails. To address this, we propose ReliFusion, a novel LiDAR-camera fusion framework operating in the bird’s-eye view (BEV) space. ReliFusion integrates three key components: the Spatio-Temporal Feature Aggregation (STFA) module, which captures dependencies across frames to stabilize predictions over time; the Reliability module, which assigns confidence scores to quantify the dependability of each modality under challenging conditions; and the Confidence-Weighted Mutual Cross-Attention (CW-MCA) module, which dynamically balances information from LiDAR and camera modalities based on these confidence scores. Experiments on the nuScenes dataset show that ReliFusion significantly outperforms state-of-the-art methods, achieving superior robustness and accuracy in scenarios with limited LiDAR fields of view and severe sensor malfunctions.
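
The idea of balancing two modalities by their confidence scores can be illustrated with a much simpler stand-in than the CW-MCA module: a confidence-weighted average of two feature vectors. This is not the paper's cross-attention mechanism. The function `fuse_features` and its normalization scheme are assumptions made for illustration.

```python
def fuse_features(lidar_feat, camera_feat, lidar_conf, camera_conf):
    """Confidence-weighted average of two equally sized feature vectors.

    Simplified stand-in for confidence-based fusion; the confidences are
    assumed to be non-negative scalars, one per modality.
    """
    if len(lidar_feat) != len(camera_feat):
        raise ValueError("feature vectors must have the same length")
    if lidar_conf < 0 or camera_conf < 0:
        raise ValueError("confidences must be non-negative")
    total = lidar_conf + camera_conf
    if total == 0:
        raise ValueError("at least one modality must have positive confidence")
    # Normalize the confidences so that the two weights sum to one.
    w_lidar = lidar_conf / total
    w_camera = camera_conf / total
    return [w_lidar * l + w_camera * c for l, c in zip(lidar_feat, camera_feat)]


# Camera judged unreliable (confidence 0): the fused feature equals the LiDAR feature.
print(fuse_features([1.0, 2.0], [3.0, 4.0], lidar_conf=1.0, camera_conf=0.0))
# Both modalities equally trusted: the fused feature is the plain average.
print(fuse_features([1.0, 2.0], [3.0, 4.0], lidar_conf=1.0, camera_conf=1.0))
```

The first call mimics the failure scenario in the abstract: when one modality is assigned zero confidence, its features no longer influence the fused result.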

[CV-80] Learning Fine-to-Coarse Cuboid Shape Abstraction

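[Quick read]: This paper abstracts 3D objects with simple geometric primitives such as cuboids in order to infer structural information from complex 3D shapes, which benefits 3D shape understanding, structural analysis, and geometric modeling. It proposes a novel fine-to-coarse unsupervised learning approach for abstracting collections of 3D shapes. The key is an architecture that reduces the number of primitives during training from hundreds (fine reconstruction) to only a few (coarse abstraction), while optimizing the reconstruction error and adhering to a user-specified number of primitives per shape. An abstraction loss increasingly penalizes redundant primitives, and a reconstruction loss accounts for volume preservation as well as surface approximation. Together they represent 3D shapes more precisely with fewer cuboid primitives than previous cuboid-based shape abstraction techniques.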

Link: https://arxiv.org/abs/2502.01855
Authors: Gregor Kobsik,Morten Henkel,Yanjiang He,Victor Czech,Tim Elsner,Isaak Lim,Leif Kobbelt
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 10 pages, 6 figures, 4 tables

Abstract:The abstraction of 3D objects with simple geometric primitives like cuboids makes it possible to infer structural information from complex geometry. It is important for 3D shape understanding, structural analysis and geometric modeling. We introduce a novel fine-to-coarse unsupervised learning approach to abstract collections of 3D shapes. Our architectural design allows us to reduce the number of primitives from hundreds (fine reconstruction) to only a few (coarse abstraction) during training. This allows our network to optimize the reconstruction error and adhere to a user-specified number of primitives per shape while simultaneously learning a consistent structure across the whole collection of data. We achieve this through our abstraction loss formulation which increasingly penalizes redundant primitives. Furthermore, we introduce a reconstruction loss formulation to account not only for surface approximation but also volume preservation. Combining both contributions allows us to represent 3D shapes more precisely with fewer cuboid primitives than previous work. We evaluate our method on collections of man-made and humanoid shapes comparing with previous state-of-the-art learning methods on commonly used benchmarks. Our results confirm an improvement over previous cuboid-based shape abstraction techniques. Furthermore, we demonstrate our cuboid abstraction in downstream tasks like clustering, retrieval, and partial symmetry detection.

[CV-81] Foundation Model-Based Apple Ripeness and Size Estimation for Selective Harvesting

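[Quick read]: This paper addresses the high cost, low efficiency, and potential safety hazards of manual apple harvesting, and proposes a new foundation model-based framework for efficient estimation of apple ripeness and size. The key to the solution is a new dataset, the Fuji-Ripeness-Size Dataset, containing 4,027 images and 16,257 annotated apples, combined with Grounding-DINO, a language-model-based object detector, for robust apple detection and ripeness classification, together with a size estimation algorithm selected for having the lowest error and variation.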

Link: https://arxiv.org/abs/2502.01850
Authors: Keyi Zhu,Jiajia Li,Kaixiang Zhang,Chaaran Arunachalam,Siddhartha Bhattacharya,Renfu Lu,Zhaojian Li
Affiliation: Michigan State University; United States Department of Agriculture Agricultural Research Service
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Harvesting is a critical task in the tree fruit industry, demanding extensive manual labor and substantial costs, and exposing workers to potential hazards. Recent advances in automated harvesting offer a promising solution by enabling efficient, cost-effective, and ergonomic fruit picking within tight harvesting windows. However, existing harvesting technologies often indiscriminately harvest all visible and accessible fruits, including those that are unripe or undersized. This study introduces a novel foundation model-based framework for efficient apple ripeness and size estimation. Specifically, we curated two public RGBD-based Fuji apple image datasets, integrating expanded annotations for ripeness (“Ripe” vs. “Unripe”) based on fruit color and image capture dates. The resulting comprehensive dataset, Fuji-Ripeness-Size Dataset, includes 4,027 images and 16,257 annotated apples with ripeness and size labels. Using Grounding-DINO, a language-model-based object detector, we achieved robust apple detection and ripeness classification, outperforming other state-of-the-art models. Additionally, we developed and evaluated six size estimation algorithms, selecting the one with the lowest error and variation for optimal performance. The Fuji-Ripeness-Size Dataset and the apple detection and size estimation algorithms are made publicly available, which provides valuable benchmarks for future studies in automated and selective harvesting.
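
The abstract does not describe its six size estimation algorithms. As a hedged illustration of how size can be recovered from RGBD data at all, the sketch below applies the standard pinhole camera relation: physical width equals pixel width times depth divided by focal length in pixels. The function name, the use of the bounding-box width as the apple's extent, and the example camera values are assumptions.

```python
def apple_diameter_mm(bbox_width_px, depth_mm, focal_length_px):
    """Estimate physical width from pixel width using the pinhole camera model.

    Assumes the bounding-box width spans the apple's diameter and that
    depth_mm is the distance from the camera to the apple along the optical axis.
    """
    if bbox_width_px < 0:
        raise ValueError("bbox_width_px must be non-negative")
    if depth_mm <= 0 or focal_length_px <= 0:
        raise ValueError("depth_mm and focal_length_px must be positive")
    return bbox_width_px * depth_mm / focal_length_px


# Hypothetical values: an 80-pixel-wide detection, 600 mm away, focal length 600 px.
print(apple_diameter_mm(bbox_width_px=80, depth_mm=600.0, focal_length_px=600.0))
```

An estimate like this is sensitive to depth noise and to occlusion shrinking the bounding box. That sensitivity is one plausible reason to compare several estimators by error and variation, as the abstract reports doing.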

[CV-82] UVGS: Reimagining Unstructured 3D Gaussian Splatting using UV Mapping

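[Quick read]: This paper addresses the challenges of generating 3D Gaussian Splatting (3DGS) representations of 3D objects and scenes, which stem from their discrete, unstructured, and permutation-invariant nature. The key to the solution is the UVGS representation: spherical mapping transforms 3DGS into a structured 2D representation, and a multi-branch network compresses its heterogeneous features into a lower-dimensional shared feature space. UVGS can then be treated as typical RGB images, so existing 2D generative models such as latent diffusion models can be used directly to model 3DGS, and the approach scales by increasing the 2D UV resolution.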

Link: https://arxiv.org/abs/2502.01846
Authors: Aashish Rai,Dilin Wang,Mihir Jain,Nikolaos Sarafianos,Arthur Chen,Srinath Sridhar,Aayush Prakash
Affiliation: Brown University; Meta Reality Labs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:3D Gaussian Splatting (3DGS) has demonstrated superior quality in modeling 3D objects and scenes. However, generating 3DGS remains challenging due to their discrete, unstructured, and permutation-invariant nature. In this work, we present a simple yet effective method to overcome these challenges. We utilize spherical mapping to transform 3DGS into a structured 2D representation, termed UVGS. UVGS can be viewed as multi-channel images, with feature dimensions as a concatenation of Gaussian attributes such as position, scale, color, opacity, and rotation. We further find that these heterogeneous features can be compressed into a lower-dimensional (e.g., 3-channel) shared feature space using a carefully designed multi-branch network. The compressed UVGS can be treated as typical RGB images. Remarkably, we discover that typical VAEs trained with latent diffusion models can directly generalize to this new representation without additional training. Our novel representation makes it effortless to leverage foundational 2D models, such as diffusion models, to directly model 3DGS. Additionally, one can simply increase the 2D UV resolution to accommodate more Gaussians, making UVGS a scalable solution compared to typical 3D backbones. This approach immediately unlocks various novel generation applications of 3DGS by inherently utilizing the already developed superior 2D generation capabilities. In our experiments, we demonstrate various unconditional, conditional generation, and inpainting applications of 3DGS based on diffusion models, which were previously non-trivial.
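
The spherical mapping step mentioned in the abstract can be sketched as follows: each Gaussian's 3D position is converted to two spherical angles, which are then quantized to a pixel in a UV image. The abstract does not specify the angle conventions, the image resolution, or how several Gaussians landing on the same pixel are handled, so those choices below are assumptions.

```python
import math


def position_to_uv_pixel(x, y, z, width, height):
    """Map a 3D position to a (row, col) pixel of a width x height UV image.

    Assumed convention: azimuth from atan2(y, x) sets the column, and the
    polar angle measured from the +z axis sets the row.
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("the origin has no defined direction")
    azimuth = math.atan2(y, x)   # in [-pi, pi]
    polar = math.acos(z / r)     # in [0, pi]
    u = (azimuth + math.pi) / (2.0 * math.pi)  # normalized to [0, 1]
    v = polar / math.pi                        # normalized to [0, 1]
    # Clamp so that u == 1 or v == 1 still falls inside the image.
    col = min(int(u * width), width - 1)
    row = min(int(v * height), height - 1)
    return row, col


print(position_to_uv_pixel(1.0, 0.0, 0.0, width=8, height=4))
print(position_to_uv_pixel(0.0, 0.0, -1.0, width=8, height=4))
```

Each pixel would then store the attributes of the Gaussian mapped to it (position, scale, color, opacity, rotation) as channels. That is consistent with the abstract's remark that raising the UV resolution accommodates more Gaussians.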

[CV-83] Texture Image Synthesis Using Spatial GAN Based on Vision Transformers

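[Quick read]: This paper addresses the challenge of synthesizing complex textures, in particular the limitations of traditional methods such as tiling and patch-based techniques. The key to the solution is ViT-SGAN, a new hybrid model that fuses Vision Transformers (ViTs) with a Spatial Generative Adversarial Network (SGAN). By incorporating specialized texture descriptors such as mean-variance (mu, sigma) and textons into the self-attention mechanism of ViTs, the model captures complex spatial dependencies more effectively, which improves the quality of the generated textures and surpasses state-of-the-art models, especially for regular and irregular textures.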

Link: https://arxiv.org/abs/2502.01842
Authors: Elahe Salari,Zohreh Azimifar
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at the 2nd International Conference on Artificial Intelligence and Software Engineering (AI-SOFT), Shiraz University, Shiraz, Iran, 2024

Abstract:Texture synthesis is a fundamental task in computer vision, whose goal is to generate visually realistic and structurally coherent textures for a wide range of applications, from graphics to scientific simulations. While traditional methods like tiling and patch-based techniques often struggle with complex textures, recent advancements in deep learning have transformed this field. In this paper, we propose ViT-SGAN, a new hybrid model that fuses Vision Transformers (ViTs) with a Spatial Generative Adversarial Network (SGAN) to address the limitations of previous methods. By incorporating specialized texture descriptors such as mean-variance (mu, sigma) and textons into the self-attention mechanism of ViTs, our model achieves superior texture synthesis. This approach enhances the model’s capacity to capture complex spatial dependencies, leading to improved texture quality that is superior to state-of-the-art models, especially for regular and irregular textures. Comparison experiments with metrics such as FID, IS, SSIM, and LPIPS demonstrate the substantial improvement of ViT-SGAN, which underlines its efficiency in generating diverse realistic textures.

[CV-84] Low Resource Video Super-resolution using Memory and Residual Deformable Convolutions

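[Quick read]: This paper addresses the trade-off between high computational demands and output quality that Transformer-based video super-resolution (VSR) models face when deployed on resource-constrained devices. The key to the solution is a novel lightweight, parameter-efficient deep residual deformable convolution network. It enhances feature utilization through residual connections and uses deformable convolution for precise frame alignment, handling motion dynamics effectively. It also introduces a single memory tensor that captures information accrued from past frames to improve motion estimation across frames. With only 2.3 million parameters, the model achieves an SSIM of 0.9175 on the REDS4 dataset, surpassing existing lightweight models and many heavyweight models in both accuracy and resource efficiency.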

Link: https://arxiv.org/abs/2502.01816
Authors: Kavitha Viswanathan,Shashwat Pathak,Piyush Bharambe,Harsh Choudhary,Amit Sethi
Affiliation: IIT Bombay
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Transformer-based video super-resolution (VSR) models have set new benchmarks in recent years, but their substantial computational demands make most of them unsuitable for deployment on resource-constrained devices. Achieving a balance between model complexity and output quality remains a formidable challenge in VSR. Although lightweight models have been introduced to address this issue, they often struggle to deliver state-of-the-art performance. We propose a novel lightweight, parameter-efficient deep residual deformable convolution network for VSR. Unlike prior methods, our model enhances feature utilization through residual connections and employs deformable convolution for precise frame alignment, addressing motion dynamics effectively. Furthermore, we introduce a single memory tensor to capture information accrued from the past frames and improve motion estimation across frames. This design enables an efficient balance between computational cost and reconstruction quality. With just 2.3 million parameters, our model achieves state-of-the-art SSIM of 0.9175 on the REDS4 dataset, surpassing existing lightweight and many heavy models in both accuracy and resource efficiency. Architectural insights from our model pave the way for real-time VSR on streaming data.

[CV-85] PolyhedronNet: Representation Learning for Polyhedra with Surface-attributed Graph

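[Quick read]: This paper addresses surface modeling in representation learning for 3D polyhedral objects. Most existing methods focus on the vertex sequence and neglect the complex surface modeling that matters for real-world polyhedra. The key to the solution is the PolyhedronNet framework, which introduces a surface-attributed graph to model the vertices, edges, faces, and their geometric interrelationships within a polyhedron. The graph is then broken down into local rigid representations, so that the position of each local region relative to the remaining regions is learned without loss of geometric information. Finally, the PolyhedronGNN module hierarchically aggregates the local rigid representations into a global representation that minimizes information loss while maintaining rotation and translation invariance.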

Link: https://arxiv.org/abs/2502.01814
Authors: Dazhou Yu,Genpei Zhang,Liang Zhao
Affiliation: Emory University; University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Ubiquitous geometric objects can be precisely and efficiently represented as polyhedra. The transformation of a polyhedron into a vector, known as polyhedra representation learning, is crucial for manipulating these shapes with mathematical and statistical tools for tasks like classification, clustering, and generation. Recent years have witnessed significant strides in this domain, yet most efforts focus on the vertex sequence of a polyhedron, neglecting the complex surface modeling crucial in real-world polyhedral objects. This study proposes PolyhedronNet, a general framework tailored for learning representations of 3D polyhedral objects. We propose the concept of the surface-attributed graph to seamlessly model the vertices, edges, faces, and their geometric interrelationships within a polyhedron. To effectively learn the representation of the entire surface-attributed graph, we first propose to break it down into local rigid representations to effectively learn each local region’s relative positions against the remaining regions without geometric information loss. Subsequently, we propose PolyhedronGNN to hierarchically aggregate the local rigid representation via intra-face and inter-face geometric message passing modules, to obtain a global representation that minimizes information loss while maintaining rotation and translation invariance. Our experimental evaluations on four distinct datasets, encompassing both classification and retrieval tasks, substantiate PolyhedronNet’s efficacy in capturing comprehensive and informative representations of 3D polyhedral objects. Code and data are available at this https URL.

[CV-86] AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis

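[Quick read]: This paper addresses the insufficient zero-shot performance of vision-language models in underwater scene understanding. The key to the solution is AquaticCLIP, a novel contrastive language-image pre-training model that uses a large-scale underwater image-text paired dataset without ground-truth annotations and an unsupervised learning framework to align images and texts. A prompt-guided vision encoder progressively aggregates patch features, a vision-guided mechanism enhances the language encoder, and a contrastive pretraining loss aligns the visual and textual modalities. This improves zero-shot performance across multiple underwater computer vision tasks while also improving robustness and interpretability.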

Link: https://arxiv.org/abs/2502.01785
Authors: Basit Alawode,Iyyakutti Iyappan Ganapathi,Sajid Javed,Naoufel Werghi,Mohammed Bennamoun,Arif Mahmood
Affiliation: Khalifa University of Science and Technology; Information Technology University, Pakistan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The preservation of aquatic biodiversity is critical in mitigating the effects of climate change. Aquatic scene understanding plays a pivotal role in aiding marine scientists in their decision-making processes. In this paper, we introduce AquaticCLIP, a novel contrastive language-image pre-training model tailored for aquatic scene understanding. AquaticCLIP presents a new unsupervised learning framework that aligns images and texts in aquatic environments, enabling tasks such as segmentation, classification, detection, and object counting. By leveraging our large-scale underwater image-text paired dataset without the need for ground-truth annotations, our model enriches existing vision-language models in the aquatic domain. For this purpose, we construct a 2 million underwater image-text paired dataset using heterogeneous resources, including YouTube, Netflix, NatGeo, etc. To fine-tune AquaticCLIP, we propose a prompt-guided vision encoder that progressively aggregates patch features via learnable prompts, while a vision-guided mechanism enhances the language encoder by incorporating visual context. The model is optimized through a contrastive pretraining loss to align visual and textual modalities. AquaticCLIP achieves notable performance improvements in zero-shot settings across multiple underwater computer vision tasks, outperforming existing methods in both robustness and interpretability. Our model sets a new benchmark for vision-language applications in underwater environments. The code and dataset for AquaticCLIP are publicly available on GitHub at xxx.

[CV-87] VILP: Imitation Learning with Latent Video Planning

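[Quick read]: This paper addresses how to effectively integrate video generation models into robot policy learning in the era of generative AI. The key to the solution is a latent video diffusion model that generates highly time-aligned predictive robot videos from multiple views with good temporal consistency. The approach makes video generation time-efficient and reduces the reliance on large amounts of high-quality task-specific robot action data while maintaining robust performance.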

Link: https://arxiv.org/abs/2502.01784
Authors: Zhengtong Xu,Qiang Qiu,Yu She
Affiliation: Purdue University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the era of generative AI, integrating video generation models into robotics opens new possibilities for the general-purpose robot agent. This paper introduces imitation learning with latent video planning (VILP). We propose a latent video diffusion model to generate predictive robot videos that adhere to temporal consistency to a good degree. Our method is able to generate highly time-aligned videos from multiple views, which is crucial for robot policy learning. Our video generation model is highly time-efficient. For example, it can generate videos from two distinct perspectives, each consisting of six frames with a resolution of 96x160 pixels, at a rate of 5 Hz. In the experiments, we demonstrate that VILP outperforms the existing video generation robot policy across several metrics: training costs, inference speed, temporal consistency of generated videos, and the performance of the policy. We also compared our method with other imitation learning methods. Our findings indicate that VILP can rely less on extensive high-quality task-specific robot action data while still maintaining robust performance. In addition, VILP possesses robust capabilities in representing multi-modal action distributions. Our paper provides a practical example of how to effectively integrate video generation models into robot policies, potentially offering insights for related fields and directions. For more details, please refer to our open-source repository this https URL.

[CV-88] Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

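[Quick read]: This paper addresses the high computational cost of Diffusion Transformers (DiTs) in video generation, where producing a few seconds of video usually takes tens of minutes even on high-performance GPUs. The key to the solution is Sparse VideoGen (SVG), a training-free framework that exploits the inherent sparsity in 3D Full Attention to improve inference efficiency. SVG dynamically identifies the sparse pattern of each attention head, classifying it as a Spatial Head or a Temporal Head, and combines this with a hardware-efficient tensor layout transformation and customized kernel implementations. It achieves end-to-end speedups of up to 2.28x and 2.33x while preserving generation quality.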

Link: https://arxiv.org/abs/2502.01776
Authors: Haocheng Xi,Shuo Yang,Yilong Zhao,Chenfeng Xu,Muyang Li,Xiuyu Li,Yujun Lin,Han Cai,Jintao Zhang,Dacheng Li,Jianfei Chen,Ion Stoica,Kurt Keutzer,Song Han
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 8 figures, 3 tables

Abstract:Diffusion Transformers (DiTs) dominate video generation but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D Full Attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy to capture the dynamic sparse patterns and predicts the type of attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.
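
As a rough illustration of the two head types described in the abstract, the sketch below builds a spatial mask (keys in the same frame) and a temporal mask (keys at the same position in any frame), then labels a head by which mask captures more of its attention mass. The function names, the frame-major token layout, and the mass-based rule are assumptions made for illustration; the abstract does not specify SVG's actual profiling procedure or kernels.

```python
def build_mask(num_frames, tokens_per_frame, kind):
    """Boolean mask[q][k]: may query token q attend to key token k?

    Tokens are assumed to be laid out frame-major, so a flat index i maps to
    (frame, position) = divmod(i, tokens_per_frame).
    """
    n = num_frames * tokens_per_frame
    mask = []
    for q in range(n):
        q_frame, q_pos = divmod(q, tokens_per_frame)
        row = []
        for k in range(n):
            k_frame, k_pos = divmod(k, tokens_per_frame)
            if kind == "spatial":
                row.append(q_frame == k_frame)  # same frame
            else:
                row.append(q_pos == k_pos)      # same position, any frame
        mask.append(row)
    return mask


def captured_mass(attn, mask):
    """Average attention probability per query that falls inside the mask."""
    total = 0.0
    for attn_row, mask_row in zip(attn, mask):
        total += sum(a for a, keep in zip(attn_row, mask_row) if keep)
    return total / len(attn)


def classify_head(attn, num_frames, tokens_per_frame):
    """Hypothetical rule: pick the pattern that captures more attention mass."""
    spatial = captured_mass(attn, build_mask(num_frames, tokens_per_frame, "spatial"))
    temporal = captured_mass(attn, build_mask(num_frames, tokens_per_frame, "temporal"))
    return "spatial" if spatial >= temporal else "temporal"


# Two frames of two tokens each (4 tokens in total).
within_frame = [[0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0],
                [0.0, 0.0, 0.5, 0.5], [0.0, 0.0, 0.5, 0.5]]
across_frames = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5],
                 [0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
print(classify_head(within_frame, 2, 2))
print(classify_head(across_frames, 2, 2))
```

In a real model the masks would be applied inside the attention kernel so that masked-out entries are never computed, which is where any speedup would come from.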

[CV-89] Coarse-to-Fine 3D Keyframe Transporter

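[Quick Read]: This paper addresses the low sample efficiency of Keyframe Imitation Learning (IL). The key is to identify and exploit the bi-equivariance of the keyframe action scheme: the authors propose a Keyframe Transporter, derived from Transporter Networks, which evaluates actions by cross-correlating features of the grasped object with features of the scene. They also propose a computationally efficient coarse-to-fine SE(3) action evaluation scheme for reasoning about intertwined translation and rotation actions. The resulting policy outperforms strong Keyframe IL baselines by an average of 10% across a wide range of simulation tasks and by an average of 55% in four physical experiments.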

Link: https://arxiv.org/abs/2502.01773
Authors: Xupeng Zhu,David Klee,Dian Wang,Boce Hu,Haojie Huang,Arsh Tangri,Robin Walters,Robert Platt
Affiliations: Northeastern University; Boston Dynamics AI Institute
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in Keyframe Imitation Learning (IL) have enabled learning-based agents to solve a diverse range of manipulation tasks. However, most approaches ignore the rich symmetries in the problem setting and, as a consequence, are sample-inefficient. This work identifies and utilizes the bi-equivariant symmetry within Keyframe IL to design a policy that generalizes to transformations of both the workspace and the objects grasped by the gripper. We make two main contributions: First, we analyze the bi-equivariance properties of the keyframe action scheme and propose a Keyframe Transporter derived from the Transporter Networks, which evaluates actions using cross-correlation between the features of the grasped object and the features of the scene. Second, we propose a computationally efficient coarse-to-fine SE(3) action evaluation scheme for reasoning the intertwined translation and rotation action. The resulting method outperforms strong Keyframe IL baselines by an average of 10% on a wide range of simulation tasks, and by an average of 55% in 4 physical experiments.
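
The cross-correlation step mentioned in the abstract can be pictured with a toy 2D example: slide a small "grasped object" feature patch over a "scene" feature map and score every placement by the sum of elementwise products. This is only a translation-only sketch with hand-written numbers; the names are hypothetical, and the paper's learned features, equivariant networks, and SE(3) coarse-to-fine search are not reproduced here.

```python
def cross_correlate(scene, kernel):
    """Valid-mode 2D cross-correlation of a scene map with a smaller kernel."""
    scene_h, scene_w = len(scene), len(scene[0])
    kern_h, kern_w = len(kernel), len(kernel[0])
    scores = []
    for i in range(scene_h - kern_h + 1):
        row = []
        for j in range(scene_w - kern_w + 1):
            row.append(sum(scene[i + di][j + dj] * kernel[di][dj]
                           for di in range(kern_h) for dj in range(kern_w)))
        scores.append(row)
    return scores


def best_placement(scene, kernel):
    """Return the (row, col) offset with the highest correlation score."""
    scores = cross_correlate(scene, kernel)
    _, position = max((score, (i, j))
                      for i, row in enumerate(scores)
                      for j, score in enumerate(row))
    return position


# Toy feature maps: the object's pattern appears in the scene at offset (1, 2).
object_features = [[1, 2],
                   [3, 4]]
scene_features = [[0, 0, 0, 0],
                  [0, 0, 1, 2],
                  [0, 0, 3, 4],
                  [0, 0, 0, 0]]
print(best_placement(scene_features, object_features))
```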

[CV-90] Generating Multi-Image Synthetic Data for Text-to-Image Customization WWW

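[Quick Read]: This paper addresses two main problems in customizing text-to-image models: costly test-time optimization, and encoders trained on single-image datasets without multi-image supervision, which leads to worse image quality. The proposed solution has three parts: a high-quality Synthetic Customization Dataset (SynCD) containing multiple images of the same object under different poses, lighting, and backgrounds; a new encoder architecture based on shared attention that better incorporates fine-grained details from the input images; and a new inference technique that mitigates overexposure during inference. With these changes, the model outperforms existing tuning-free methods on standard customization benchmarks.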

Link: https://arxiv.org/abs/2502.01720
Authors: Nupur Kumari,Xi Yin,Jun-Yan Zhu,Ishan Misra,Samaneh Azadi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project webpage: this https URL

Abstract:Customization of text-to-image models enables users to insert custom concepts and generate the concepts in unseen settings. Existing methods either rely on costly test-time optimization or train encoders on single-image training datasets without multi-image supervision, leading to worse image quality. We propose a simple approach that addresses both limitations. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. We then propose a new encoder architecture based on shared attention mechanisms that better incorporate fine-grained visual details from input images. Finally, we propose a new inference technique that mitigates overexposure issues during inference by normalizing the text and image guidance vectors. Through extensive experiments, we show that our model, trained on the synthetic dataset with the proposed encoder and inference algorithm, outperforms existing tuning-free methods on standard customization benchmarks.

[CV-91] MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation

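[Quick Read]: This paper addresses persistent challenges in video generation models: instruction misalignment, content hallucination, safety concerns, and bias. The key to the solution is the MJ-BENCH-VIDEO benchmark together with MJ-VIDEO, a Mixture-of-Experts (MoE)-based reward model. MJ-VIDEO dynamically selects relevant experts based on the input text-video pair, which enables more precise and adaptable preference judgments and improves overall and fine-grained preference judgments by 17.58% and 15.87%, respectively.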

Link: https://arxiv.org/abs/2502.01719
Authors: Haibo Tong,Zhaoyang Wang,Zhaorun Chen,Haonian Ji,Shi Qiu,Siwei Han,Kexin Geng,Zhongkai Xue,Yiyang Zhou,Peng Xia,Mingyu Ding,Rafael Rafailov,Chelsea Finn,Huaxiu Yao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in video generation have significantly improved the ability to synthesize videos from text instructions. However, existing models still struggle with key challenges such as instruction misalignment, content hallucination, safety concerns, and bias. Addressing these limitations, we introduce MJ-BENCH-VIDEO, a large-scale video preference benchmark designed to evaluate video generation across five critical aspects: Alignment, Safety, Fineness, Coherence Consistency, and Bias Fairness. This benchmark incorporates 28 fine-grained criteria to provide a comprehensive evaluation of video preference. Building upon this dataset, we propose MJ-VIDEO, a Mixture-of-Experts (MoE)-based video reward model designed to deliver fine-grained reward. MJ-VIDEO can dynamically select relevant experts to accurately judge the preference based on the input text-video pair. This architecture enables more precise and adaptable preference judgments. Through extensive benchmarking on MJ-BENCH-VIDEO, we analyze the limitations of existing video reward models and demonstrate the superior performance of MJ-VIDEO in video preference assessment, achieving 17.58% and 15.87% improvements in overall and fine-grained preference judgments, respectively. Additionally, introducing MJ-VIDEO for preference tuning in video generation enhances the alignment performance.
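
To make "dynamically select relevant experts" concrete, here is a minimal sketch of generic top-k Mixture-of-Experts gating: a router scores each expert, the top-k are kept, and their outputs are combined with renormalized weights. The function names, the top-k rule, and the numbers are assumptions for illustration; the abstract does not describe MJ-VIDEO's actual router or expert design.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_reward(router_logits, expert_scores, top_k=2):
    """Combine the scores of the top_k experts chosen by the router.

    router_logits: one routing score per expert (in a real model, computed
                   from the text-video pair).
    expert_scores: the reward each expert would assign to the same pair.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    reward = sum(probs[i] / norm * expert_scores[i] for i in chosen)
    return reward, chosen


# Five hypothetical experts, one per benchmark aspect.
logits = [2.0, 2.0, -5.0, -5.0, -5.0]
scores = [1.0, 0.0, 0.3, 0.3, 0.3]
reward, chosen = moe_reward(logits, scores, top_k=2)
print(chosen, round(reward, 3))
```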

[CV-92] A Multi-Scale Feature Fusion Framework Integrating Frequency Domain and Cross-View Attention for Dual-View X-ray Security Inspections

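[Quick Read]: This paper addresses the difficulty single-view X-ray equipment has in accurately identifying contraband in complex stacking scenarios, which stems from strong viewpoint dependency and inadequate feature representation. It proposes a multi-scale interactive feature fusion framework for dual-view X-ray security inspection image classification. The framework rests on three core modules: the Frequency Domain Interaction Module (FDIM), which enhances frequency-domain features through the Fourier transform; Multi-Scale Cross-View Feature Enhancement (MSCFE), which strengthens feature interaction through cross-view attention; and the Convolutional Attention Fusion Module (CAFM), which fuses features efficiently by combining channel attention with depthwise-separable convolutions. Experiments show that the method outperforms existing state-of-the-art approaches across multiple backbone architectures, particularly in complex scenes with occlusion and object stacking.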

Link: https://arxiv.org/abs/2502.01710
Authors: Shilong Hong,Yanzhou Zhou,Weichao Xu
Affiliations: School of Automation, Guangdong University of Technology, China; Guangzhou Institute of Science and Technology, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rapid development of modern transportation systems and the exponential growth of logistics volumes, intelligent X-ray-based security inspection systems play a crucial role in public safety. Although single-view X-ray equipment is widely deployed, it struggles to accurately identify contraband in complex stacking scenarios due to strong viewpoint dependency and inadequate feature representation. To address this, we propose an innovative multi-scale interactive feature fusion framework tailored for dual-view X-ray security inspection image classification. The framework comprises three core modules: the Frequency Domain Interaction Module (FDIM) enhances frequency-domain features through Fourier transform; the Multi-Scale Cross-View Feature Enhancement (MSCFE) leverages cross-view attention mechanisms to strengthen feature interactions; and the Convolutional Attention Fusion Module (CAFM) efficiently fuses features by integrating channel attention with depthwise-separable convolutions. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches across multiple backbone architectures, particularly excelling in complex scenarios with occlusions and object stacking.

[CV-93] CLIP-DQA: Blindly Evaluating Dehazed Images from Global and Local Perspectives Using CLIP ISCAS2025

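[Quick Read]: This paper addresses the way small datasets limit existing learning-based methods for blind dehazed image quality assessment (BDQA). The key solution is to adapt the pre-trained Contrastive Language-Image Pre-Training (CLIP) model so that it takes both global and local information from dehazed images, and to tune its vision and language branches with prompt learning for more accurate quality prediction.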

Link: https://arxiv.org/abs/2502.01707
Authors: Yirui Zeng,Jun Fu,Hadi Amirpour,Huasheng Wang,Guanghui Yue,Hantao Liu,Ying Chen,Wei Zhou
Affiliations: School of Computer Science and Informatics, Cardiff University, United Kingdom; Christian Doppler Laboratory ATHENA, Alpen-Adria-Universität, Klagenfurt, Austria; Alibaba Group, China; School of Biomedical Engineering, Shenzhen University, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ISCAS 2025 (Oral)

Abstract:Blind dehazed image quality assessment (BDQA), which aims to accurately predict the visual quality of dehazed images without any reference information, is essential for the evaluation, comparison, and optimization of image dehazing algorithms. Existing learning-based BDQA methods have achieved remarkable success, while the small scale of DQA datasets limits their performance. To address this issue, in this paper, we propose to adapt Contrastive Language-Image Pre-Training (CLIP), pre-trained on large-scale image-text pairs, to the BDQA task. Specifically, inspired by the fact that the human visual system understands images based on hierarchical features, we take global and local information of the dehazed image as the input of CLIP. To accurately map the input hierarchical information of dehazed images into the quality score, we tune both the vision branch and language branch of CLIP with prompt learning. Experimental results on two authentic DQA datasets demonstrate that our proposed approach, named CLIP-DQA, achieves more accurate quality predictions over existing BDQA methods. The code is available at this https URL.

[CV-94] HuViDPO:Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment

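[Quick Read]: This paper addresses two problems in text-to-video (T2V) generation: the lack of an effective Direct Preference Optimization (DPO) strategy to guide it, and insufficient training data, which leads to poor quality and limited flexibility in generated videos. The key solutions are: first, HuViDPO, which uses a carefully structured loss function and human feedback to align video generation with human preferences; second, small-scale human preference datasets for each action category, used for fine-tuning to improve the aesthetic quality of generated videos while reducing training cost; and third, a First-Frame-Conditioned strategy together with a SparseCausal Attention mechanism to improve the flexibility and quality of video generation.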

Link: https://arxiv.org/abs/2502.01690
Authors: Lifan Jiang,Boxi Wu,Jiahui Zhang,Xiaotong Guan,Shuang Chen
Affiliations: State Key Laboratory of CAD & CG, Zhejiang University; College of Software Technology, Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rapid development of AIGC technology, significant progress has been made in diffusion model-based technologies for text-to-image (T2I) and text-to-video (T2V). In recent years, a few studies have introduced the strategy of Direct Preference Optimization (DPO) into T2I tasks, significantly enhancing human preferences in generated images. However, existing T2V generation methods lack a well-formed pipeline with exact loss function to guide the alignment of generated videos with human preferences using DPO strategies. Additionally, challenges such as the scarcity of paired video preference data hinder effective model training. At the same time, the lack of training datasets poses a risk of insufficient flexibility and poor video generation quality in the generated videos. Based on those problems, our work proposes three targeted solutions in sequence. 1) Our work is the first to introduce the DPO strategy into the T2V tasks. By deriving a carefully structured loss function, we utilize human feedback to align video generation with human preferences. We refer to this new method as HuViDPO. 2) Our work constructs small-scale human preference datasets for each action category and fine-tunes the model, improving the aesthetic quality of the generated videos while reducing training costs. 3) We adopt a First-Frame-Conditioned strategy, leveraging the rich information from the first frame to guide the generation of subsequent frames, enhancing flexibility in video generation. At the same time, we employ a SparseCausal Attention mechanism to enhance the quality of the generated videos. Details and examples can be accessed on our website: this https URL.
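
The abstract does not give HuViDPO's derived loss. For orientation only, the sketch below implements the standard DPO objective for one preference pair: the negative log-sigmoid of beta times the difference between the policy-versus-reference log-ratio of the preferred sample and that of the rejected sample. The argument names and the beta value are illustrative assumptions, and a video diffusion model would need its own way of obtaining these log-probability terms.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for a single (preferred, rejected) pair.

    logp_*     : log-probability of each sample under the model being trained.
    ref_logp_* : log-probability of the same samples under the frozen reference.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the preferred sample: the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(round(neutral, 4), round(improved, 4))
```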

[CV-95] Semantic Communication based on Generative AI: A New Approach to Image Compression and Edge Optimization

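[Quick Read]: This thesis addresses the efficient transmission and processing of the large volumes of data that intelligent devices generate in communication networks, in particular the challenges posed by autonomous vehicles, smart sensors, and IoT systems. The key solution is to combine semantic communication with generative models (Generative Adversarial Networks and Denoising Diffusion Probabilistic Models) to optimize image compression and edge network resource allocation. By encoding only semantically relevant features, these models achieve high-quality image reconstruction while significantly improving bandwidth efficiency and reducing latency. The thesis also proposes a goal-oriented edge network optimization framework that uses the Information Bottleneck principle and stochastic optimization to allocate resources dynamically. The approach balances computational efficiency and communication effectiveness, making it suitable for real-time applications.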

Link: https://arxiv.org/abs/2502.01675
Authors: Francesco Pezone
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: PhD thesis

Abstract:As digital technologies advance, communication networks face challenges in handling the vast data generated by intelligent devices. Autonomous vehicles, smart sensors, and IoT systems necessitate new paradigms. This thesis addresses these challenges by integrating semantic communication and generative models for optimized image compression and edge network resource allocation. Unlike bit-centric systems, semantic communication prioritizes transmitting meaningful data specifically selected to convey the meaning rather than obtain a faithful representation of the original data. The communication infrastructure can benefit from significant improvements in bandwidth efficiency and latency reduction. Central to this work is the design of semantic-preserving image compression using Generative Adversarial Networks and Denoising Diffusion Probabilistic Models. These models compress images by encoding only semantically relevant features, allowing for high-quality reconstruction with minimal transmission. Additionally, a Goal-Oriented edge network optimization framework is introduced, leveraging the Information Bottleneck principle and stochastic optimization to dynamically allocate resources and enhance efficiency. By integrating semantic communication into edge networks, this approach balances computational efficiency and communication effectiveness, making it suitable for real-time applications. The thesis compares semantic-aware models with conventional image compression techniques using classical and semantic evaluation metrics. Results demonstrate the potential of combining generative AI and semantic communication to create more efficient semantic-goal-oriented communication networks that meet the demands of modern data-driven applications.

[CV-96] Leveraging Stable Diffusion for Monocular Depth Estimation via Image Semantic Encoding

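[Quick Read]: This paper addresses the accuracy of monocular depth estimation in complex outdoor environments, especially scenes that require rich contextual information, where approaches that rely on CLIP text embeddings fall short. The key solution is an image-based semantic embedding that extracts contextual information directly from visual features, significantly improving depth prediction in complex environments. Evaluated on the KITTI and Waymo datasets, the method performs comparably to state-of-the-art models while overcoming the shortcomings of CLIP embeddings in outdoor scenes, and shows improved robustness and adaptability.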

Link: https://arxiv.org/abs/2502.01666
Authors: Jingming Xia,Guanqun Cao,Guang Ma,Yiben Luo,Qinzhao Li,John Oyekan
Affiliations: University of York, UK
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Monocular depth estimation involves predicting depth from a single RGB image and plays a crucial role in applications such as autonomous driving, robotic navigation, 3D reconstruction, etc. Recent advancements in learning-based methods have significantly improved depth estimation performance. Generative models, particularly Stable Diffusion, have shown remarkable potential in recovering fine details and reconstructing missing regions through large-scale training on diverse datasets. However, models like CLIP, which rely on textual embeddings, face limitations in complex outdoor environments where rich context information is needed. These limitations reduce their effectiveness in such challenging scenarios. Here, we propose a novel image-based semantic embedding that extracts contextual information directly from visual features, significantly improving depth prediction in complex environments. Evaluated on the KITTI and Waymo datasets, our method achieves performance comparable to state-of-the-art models while addressing the shortcomings of CLIP embeddings in handling outdoor scenes. By leveraging visual semantics directly, our method demonstrates enhanced robustness and adaptability in depth estimation tasks, showcasing its potential for application to other visual perception tasks.

[CV-97] FruitPAL: An IoT-Enabled Framework for Automatic Monitoring of Fruit Consumption in Smart Healthcare

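[Quick Read]: This paper addresses health risks in fruit consumption, in particular allergic reactions and unbalanced nutritional intake. The key to the solution is two fully automated devices, FruitPAL and FruitPAL 2.0, which use a high-quality dataset of fifteen fruit types and the YOLOv8 and YOLOv5 V6.0 models to improve detection accuracy. FruitPAL identifies fruit types and notifies caregivers when an allergic reaction is detected, while FruitPAL 2.0 additionally estimates the nutritional value of the fruit to encourage healthy eating. Through real-time notifications and dietary insights, the devices help users make informed choices that balance health benefits with allergy awareness.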

Link: https://arxiv.org/abs/2502.01643
Authors: Abdulrahman Alkinani,Alakananda Mitra,Saraju P. Mohanty,Elias Kougianos
Affiliations: University of North Texas; Nebraska Water Center
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 Pages, 17 Figures, 5 Tables

Abstract:Fruits are rich sources of essential vitamins and nutrients that are vital for human health. This study introduces two fully automated devices, FruitPAL and its updated version, FruitPAL 2.0, which aim to promote safe fruit consumption while reducing health risks. Both devices leverage a high-quality dataset of fifteen fruit types and use advanced models, YOLOv8 and YOLOv5 V6.0, to enhance detection accuracy. The original FruitPAL device can identify various fruit types and notify caregivers if an allergic reaction is detected, thanks to YOLOv8's improved accuracy and rapid response time. Notifications are transmitted via the cloud to mobile devices, ensuring real-time updates and immediate accessibility. FruitPAL 2.0 builds upon this by not only detecting fruit but also estimating its nutritional value, thereby encouraging healthy consumption. Built on the YOLOv5 V6.0 model, FruitPAL 2.0 analyzes fruit intake to provide users with valuable dietary insights. This study aims to promote fruit consumption by helping individuals make informed choices, balancing health benefits with allergy awareness. By alerting users to potential allergens while encouraging the consumption of nutrient-rich fruits, these devices support both health maintenance and dietary awareness.

[CV-98] MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks

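[Quick Read]: This paper addresses the performance drop multimodal models suffer when multimodal datasets are small, especially in medicine. In addition, the size of a multimodal network tends to grow with the number of modalities, which may be undesirable in medical use cases. To address both issues, the paper proposes the Modality-INformed knowledge Distillation (MIND) framework, whose key idea is to use knowledge distillation to transfer knowledge from pre-trained deep neural networks of varying sizes into a smaller multimodal student. MIND uses multi-head joint fusion models, so unimodal samples can pass through the unimodal encoders without imputing or masking absent modalities, which improves the multimodal model and can also balance multimodal learning during training.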

Link: https://arxiv.org/abs/2502.01158
Authors: Alejandro Guerra-Manzanares,Farah E. Shamout
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in Transactions on Machine Learning Research (01/2025), this https URL

Abstract:Multimodal fusion leverages information across modalities to learn better feature representations with the goal of improving performance in fusion-based tasks. However, multimodal datasets, especially in medical settings, are typically smaller than their unimodal counterparts, which can impede the performance of multimodal models. Additionally, the increase in the number of modalities is often associated with an overall increase in the size of the multimodal network, which may be undesirable in medical use cases. Utilizing smaller unimodal encoders may lead to sub-optimal performance, particularly when dealing with high-dimensional clinical data. In this paper, we propose the Modality-INformed knowledge Distillation (MIND) framework, a multimodal model compression approach based on knowledge distillation that transfers knowledge from ensembles of pre-trained deep neural networks of varying sizes into a smaller multimodal student. The teacher models consist of unimodal networks, allowing the student to learn from diverse representations. MIND employs multi-head joint fusion models, as opposed to single-head models, enabling the use of unimodal encoders in the case of unimodal samples without requiring imputation or masking of absent modalities. As a result, MIND generates an optimized multimodal model, enhancing both multimodal and unimodal representations. It can also be leveraged to balance multimodal learning during training. We evaluate MIND on binary and multilabel clinical prediction tasks using time series data and chest X-ray images. Additionally, we assess the generalizability of the MIND framework on three non-medical multimodal multiclass datasets. Experimental results demonstrate that MIND enhances the performance of the smaller multimodal network across all five tasks, as well as various fusion methods and multimodal architectures, compared to state-of-the-art baselines.
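
For readers unfamiliar with knowledge distillation, the sketch below shows the classic temperature-softened distillation term: the KL divergence between the teacher's and the student's softened class distributions, scaled by the squared temperature. It is a generic single-teacher sketch with made-up logits; MIND's actual objective, its teacher ensembles, and its multi-head fusion are not described in enough detail in the abstract to reproduce here.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor keeps gradient magnitudes comparable across
    temperatures, as in the original knowledge distillation formulation.
    """
    teacher = softmax([t / temperature for t in teacher_logits])
    student = softmax([s / temperature for s in student_logits])
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return temperature ** 2 * kl


teacher_logits = [4.0, 1.0, 0.5]
matched = distillation_loss([4.0, 1.0, 0.5], teacher_logits)
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher_logits)
print(round(matched, 4), round(mismatched, 4))
```

In practice this term is usually added to the ordinary supervised loss on the ground-truth labels, with a weighting factor chosen on validation data.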

[CV-99] Adaptive Object Detection for Indoor Navigation Assistance: A Performance Evaluation of Real-Time Algorithms

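[Quick Read]: This paper addresses the accuracy and efficiency of real-time object detection in indoor navigation assistance for visually impaired people. The core of the study is an evaluation of four real-time object detection algorithms (YOLO, SSD, Faster R-CNN, and Mask R-CNN) in indoor environments, analyzing detection accuracy, processing speed, and adaptability to indoor scenes to inform the choice of algorithm for real-time assistive navigation.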

Link: https://arxiv.org/abs/2501.18444
Authors: Abhinav Pratap,Sushant Kumar,Suchinton Chakravarty
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, 3 tables

Abstract:This study addresses the need for accurate and efficient object detection in assistive technologies for visually impaired individuals. We evaluate four real-time object detection algorithms (YOLO, SSD, Faster R-CNN, and Mask R-CNN) within the context of indoor navigation assistance. Using the Indoor Objects Detection dataset, we analyze detection accuracy, processing speed, and adaptability to indoor environments. Our findings highlight the trade-offs between precision and efficiency, offering insights into selecting optimal algorithms for real-time assistive navigation. This research advances adaptive machine learning applications, enhancing indoor navigation solutions for the visually impaired and promoting accessibility.

[CV-100] Particle Trajectory Representation Learning with Masked Point Modeling

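[Quick Read]: This paper addresses the use of self-supervised learning (SSL) to analyze 3D particle trajectories in Time Projection Chambers (TPCs) in high energy physics (HEP). The key solution is the Point-based Liquid Argon Masked Autoencoder (PoLAr-MAE), which uses volumetric tokenization to group sparse ionization points into resolution-agnostic patches and adds an auxiliary energy infilling task to improve trajectory semantics. Without any labeled data, the approach reaches F-scores of 99.4% for track classification and 97.7% for shower classification, matching supervised baselines.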

Link: https://arxiv.org/abs/2502.02558
Authors: Sam Young,Yeon-jae Jwa,Kazuhiro Terao
Affiliations: Unknown
Subjects: High Energy Physics - Experiment (hep-ex); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 24 pages, 15 figures. Project page at this https URL

Abstract:Effective self-supervised learning (SSL) techniques have been key to unlocking large datasets for representation learning. While many promising methods have been developed using online corpora and captioned photographs, their application to scientific domains, where data encodes highly specialized knowledge, remains in its early stages. We present a self-supervised masked modeling framework for 3D particle trajectory analysis in Time Projection Chambers (TPCs). These detectors produce globally sparse (1% occupancy) but locally dense point clouds, capturing meter-scale particle trajectories at millimeter resolution. Starting with PointMAE, this work proposes volumetric tokenization to group sparse ionization points into resolution-agnostic patches, as well as an auxiliary energy infilling task to improve trajectory semantics. This approach – which we call Point-based Liquid Argon Masked Autoencoder (PoLAr-MAE) – achieves 99.4% track and 97.7% shower classification F-scores, matching that of supervised baselines without any labeled data. While the model learns rich particle trajectory representations, it struggles with sub-token phenomena like overlapping or short-lived particle trajectories. To support further research, we release PILArNet-M – the largest open LArTPC dataset (1M+ events, 5.2B labeled points) – to advance SSL in high energy physics (HEP). Project site: this https URL
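
One simple way to picture grouping sparse points into volumetric patches is a uniform voxel grid: each point is assigned to the cube it falls in, and every non-empty cube becomes one token. This is only an assumed stand-in for illustration (the function name and voxel size are hypothetical); the abstract does not describe the grouping scheme PoLAr-MAE actually uses.

```python
import math
from collections import defaultdict


def voxel_tokenize(points, voxel_size):
    """Group 3D points into non-empty voxels of side `voxel_size`.

    Returns a dict mapping integer voxel coordinates to the points inside,
    so empty space (the vast majority of a sparse detector volume) produces
    no tokens at all.
    """
    groups = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        groups[key].append((x, y, z))
    return dict(groups)


# Three hypothetical ionization points; two share a voxel.
points = [(0.1, 0.1, 0.1), (0.4, 0.2, 0.3), (1.5, 0.1, 0.1)]
tokens = voxel_tokenize(points, voxel_size=1.0)
for voxel, members in sorted(tokens.items()):
    print(voxel, len(members))
```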

[CV-101] AAD-DCE: An Aggregated Multimodal Attention Mechanism for Early and Late Dynamic Contrast Enhanced Prostate MRI Synthesis

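[Quick Read]: This paper addresses two problems in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI): the toxicity risk of gadolinium-based contrast agents, and the lack of focus on local perfusion information in previous synthesis approaches. The key is AAD-DCE, a generative adversarial network (GAN) with an aggregated attention discriminator module made up of global and local discriminators. The module provides spatially embedded attention maps that drive the generator to synthesize early and late response DCE-MRI images. AAD-DCE also takes multimodal inputs (T2-weighted, Apparent Diffusion Coefficient, and T1 pre-contrast images) to better capture local perfusion information.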

Link: https://arxiv.org/abs/2502.02555
Authors: Divya Bharti,Sriprabha Ramanarayanan,Sadhana S,Kishore Kumar M,Keerthi Ram,Harsh Agarwal,Ramesh Venkatesan,Mohanasankar Sivaprakasam
Affiliations: Indian Institute of Technology Madras (IITM), India; Healthcare Technology Innovation Centre (HTIC), India; GE HealthCare (GE), India
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DCE-MRI) is a medical imaging technique that plays a crucial role in the detailed visualization and identification of tissue perfusion in abnormal lesions and radiological suggestions for biopsy. However, DCE-MRI involves the administration of a Gadolinium based (Gad) contrast agent, which is associated with a risk of toxicity in the body. Previous deep learning approaches that synthesize DCE-MR images employ unimodal non-contrast or low-dose contrast MRI images lacking focus on the local perfusion information within the anatomy of interest. We propose AAD-DCE, a generative adversarial network (GAN) with an aggregated attention discriminator module consisting of global and local discriminators. The discriminators provide a spatial embedded attention map to drive the generator to synthesize early and late response DCE-MRI images. Our method employs multimodal inputs - T2 weighted (T2W), Apparent Diffusion Coefficient (ADC), and T1 pre-contrast for image synthesis. Extensive comparative and ablation studies on the ProstateX dataset show that our model (i) is agnostic to various generator benchmarks, (ii) outperforms other DCE-MRI synthesis approaches with improvement margins of +0.64 dB PSNR, +0.0518 SSIM, -0.015 MAE for early response and +0.1 dB PSNR, +0.0424 SSIM, -0.021 MAE for late response, and (iii) emphasizes the importance of attention ensembling. Our code is available at this https URL.
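
Two of the reported metrics, PSNR and MAE, are simple to compute, and a short sketch helps when reading the improvement margins above. It assumes images flattened to lists of intensities in the range [0, 1]; the function names and sample values are illustrative and are not taken from the paper's code.

```python
import math


def mean_absolute_error(reference, predicted):
    """Average absolute intensity difference (lower is better)."""
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


def psnr(reference, predicted, max_value=1.0):
    """Peak signal-to-noise ratio in dB (higher is better)."""
    mse = sum((r - p) ** 2 for r, p in zip(reference, predicted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A 4-pixel "image" and a slightly noisy synthesis of it.
reference = [0.0, 0.5, 1.0, 0.5]
predicted = [0.1, 0.5, 0.9, 0.5]
print(round(mean_absolute_error(reference, predicted), 3))
print(round(psnr(reference, predicted), 2))
```

Because PSNR is logarithmic, a gain of +0.64 dB corresponds to roughly a 14% reduction in mean squared error (10^(-0.064) is about 0.86).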

[CV-102] The Skin Game: Revolutionizing Standards for AI Dermatology Model Comparison

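[Quick Read]: This paper addresses methodological problems in skin disease classification research, including substantial inconsistencies in data preparation, augmentation strategies, and performance reporting. The key solution is a comprehensive training and evaluation framework, demonstrated through experiments with the DINOv2-Large vision transformer on three benchmark datasets (HAM10000, DermNet, ISIC Atlas), as a basis for standardized evaluation protocols and implementation strategies. DINOv2 reaches macro-averaged F1-scores of 0.85 (HAM10000), 0.71 (DermNet), and 0.84 (ISIC Atlas), and the analysis reveals key patterns in the model's decision-making and differences in its behavior on typical versus atypical cases.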

Link: https://arxiv.org/abs/2502.02500
Authors: Łukasz Miętkiewicz,Leon Ciechanowski,Dariusz Jemielniak
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)
Comments: 60 pages, 69 figures

Abstract:Deep Learning approaches in dermatological image classification have shown promising results, yet the field faces significant methodological challenges that impede proper evaluation. This paper presents a dual contribution: first, a systematic analysis of current methodological practices in skin disease classification research, revealing substantial inconsistencies in data preparation, augmentation strategies, and performance reporting; second, a comprehensive training and evaluation framework demonstrated through experiments with the DINOv2-Large vision transformer across three benchmark datasets (HAM10000, DermNet, ISIC Atlas). The analysis identifies concerning patterns, including pre-split data augmentation and validation-based reporting, potentially leading to overestimated metrics, while highlighting the lack of unified methodology standards. The experimental results demonstrate DINOv2’s performance in skin disease classification, achieving macro-averaged F1-scores of 0.85 (HAM10000), 0.71 (DermNet), and 0.84 (ISIC Atlas). Attention map analysis reveals critical patterns in the model’s decision-making, showing sophisticated feature recognition in typical presentations but significant vulnerabilities with atypical cases and composite images. Our findings highlight the need for standardized evaluation protocols and careful implementation strategies in clinical settings. We propose comprehensive methodological recommendations for model development, evaluation, and clinical deployment, emphasizing rigorous data preparation, systematic error analysis, and specialized protocols for different image types. To promote reproducibility, we provide our implementation code through GitHub. This work establishes a foundation for rigorous evaluation standards in dermatological image classification and provides insights for responsible AI implementation in clinical dermatology.
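
Since the headline numbers are macro-averaged F1-scores, a small sketch of the metric may help: F1 is computed per class and then averaged with equal weight, so rare classes count as much as common ones. The function name and labels below are illustrative; this is the textbook definition, not the authors' evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


y_true = ["nevus", "nevus", "melanoma", "melanoma"]
y_pred = ["nevus", "melanoma", "melanoma", "melanoma"]
print(round(macro_f1(y_true, y_pred), 4))
```

Here the "melanoma" class scores 0.8 and the "nevus" class scores about 0.667, so the macro average is about 0.733.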

[CV-103] Style transfer as data augmentation: evaluating unpaired image-to-image translation models in mammography

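[Quick Read]: This paper addresses the overfitting and poor generalization of deep learning models for breast cancer detection, which arise from differences in data domains between patient populations. The key is to use image-to-image translation models (such as CycleGAN and SynDiff) for data augmentation, transferring the characteristic feature representations (the style) of one dataset onto another to improve generalization.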

Link: https://arxiv.org/abs/2502.02475
Authors: Emir Ahmed,Spencer A. Thomas,Ciaran Bench
Affiliations: Department of Data Science and AI, National Physical Laboratory, Hampton Road, Teddington, United Kingdom, TW11 0LW
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:

Abstract:Several studies indicate that deep learning models can learn to detect breast cancer from mammograms (X-ray images of the breasts). However, challenges with overfitting and poor generalisability prevent their routine use in the clinic. Models trained on data from one patient population may not perform well on another due to differences in their data domains, emerging due to variations in scanning technology or patient characteristics. Data augmentation techniques can be used to improve generalisability by expanding the diversity of feature representations in the training data by altering existing examples. Image-to-image translation models are one approach capable of imposing the characteristic feature representations (i.e. style) of images from one dataset onto another. However, evaluating model performance is non-trivial, particularly in the absence of ground truths (a common reality in medical imaging). Here, we describe some key aspects that should be considered when evaluating style transfer algorithms, highlighting the advantages and disadvantages of popular metrics, and important factors to be mindful of when implementing them in practice. We consider two types of generative models: a cycle-consistent generative adversarial network (CycleGAN) and a diffusion-based SynDiff model. We learn unpaired image-to-image translation across three mammography datasets. We highlight that undesirable aspects of model performance may determine the suitability of some metrics, and also provide some analysis indicating the extent to which various metrics assess unique aspects of model performance. We emphasise the need to use several metrics for a comprehensive assessment of model performance.

[CV-104] Test Time Training for 4D Medical Image Interpolation

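Quick read: This paper aims to address the poor generalization of 4D medical image interpolation under distribution shift. The key is a novel test-time training framework that uses a self-supervised task (such as rotation prediction or image reconstruction) to adapt the model to a new data distribution without using any labels.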

Link: https://arxiv.org/abs/2502.02341
Authors: Qikang Zhang, Yingjie Lei, Zihao Zheng, Ziyang Chen, Zhonghao Xie
Affiliation: Aberdeen Institution of Data Science and Artificial Intelligence, South China Normal University, Foshan, China
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:4D medical image interpolation is essential for improving temporal resolution and diagnostic precision in clinical applications. Previous works ignore the problem of distribution shifts, resulting in poor generalization under different distributions. A natural solution would be to adapt the model to a new test distribution, but this cannot be done if the test input comes without a ground truth label. In this paper, we propose a novel test-time training framework which uses self-supervision to adapt the model to a new distribution without requiring any labels. Before performing frame interpolation on each test video, the model is trained on the same instance using a self-supervised task, such as rotation prediction or image reconstruction. We conduct experiments on two publicly available 4D medical image interpolation datasets, Cardiac and 4D-Lung. The experimental results show that the proposed method achieves strong performance across various evaluation metrics on both datasets. It achieves higher peak signal-to-noise ratio values: 33.73 dB on Cardiac and 34.02 dB on 4D-Lung. Our method not only advances 4D medical image interpolation but also provides a template for domain adaptation in other fields such as image segmentation and image registration.
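
The test-time training loop described in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: the linear model `W`, the reconstruction objective, the learning rate, and the helper name `adapt_on_instance` are all assumptions chosen for illustration. It shows only the core idea, in which a copy of the model takes a few self-supervised gradient steps on one unlabeled test instance before it is used.

```python
import numpy as np

def reconstruction_loss(W, x):
    """Self-supervised loss: how well W reconstructs the unlabeled input x."""
    r = W @ x - x
    return 0.5 * float(r @ r)

def adapt_on_instance(W, x, steps=5, lr=0.1):
    """Return a copy of W adapted to one test instance x (hypothetical helper).

    No label is used: the gradient comes only from the reconstruction loss.
    """
    W = W.copy()  # the stored model stays untouched for the next instance
    for _ in range(steps):
        r = W @ x - x              # reconstruction residual
        W -= lr * np.outer(r, x)   # gradient of 0.5 * ||W x - x||^2 w.r.t. W
    return W

rng = np.random.default_rng(0)
W_trained = 0.5 * np.eye(4)        # stand-in for a model trained elsewhere
x_test = rng.normal(size=4)
x_test /= np.linalg.norm(x_test)   # unit norm keeps the step size stable

W_adapted = adapt_on_instance(W_trained, x_test)
loss_before = reconstruction_loss(W_trained, x_test)
loss_after = reconstruction_loss(W_adapted, x_test)
```

In a real interpolation model the adapted weights would then be used for the frame interpolation of that same instance, and the original weights restored afterwards.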

[CV-105] Deep Ensemble approach for Enhancing Brain Tumor Segmentation in Resource-Limited Settings

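Quick read: This paper aims to address the time-consuming and subjective nature of manual brain tumor segmentation in resource-limited settings such as Sub-Saharan Africa. The key to the solution is a deep learning ensemble that combines UNet3D, V-Net, and MSA-VNet models for the semantic segmentation of gliomas. Training first on the BraTS-GLI dataset and then fine-tuning on the BraTS-SSA dataset improves model performance, yielding Dice scores of 0.8358 for Tumor Core, 0.8521 for Whole Tumor, and 0.8167 for Enhancing Tumor. This demonstrates the potential of ensemble methods to improve the accuracy and reliability of automated brain tumor segmentation.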

Link: https://arxiv.org/abs/2502.02179
Authors: Jeremiah Fadugba, Isabel Lieberman, Olabode Ajayi, Mansour Osman, Solomon Oluwole Akinola, Tinashe Mustvangwa, Dong Zhang, Udunna C Anazondo, Raymond Confidence
Affiliation: University of Ibadan, Nigeria; African Institute for Mathematical Sciences, Kigali, Rwanda; University of Cape Town, Cape Town, South Africa; South African National Bioinformatics Institute (SANBI), University of Western Cape, South Africa; Carnegie Mellon University, Kigali, Rwanda; Lawson Health Research Institute, London, Ontario, Canada; Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, Canada; Montreal Neurological Institute, McGill University, Montréal, Canada; Medical Artificial Intelligence Laboratory (MAI Lab), Lagos, Nigeria
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Segmentation of brain tumors is a critical step in treatment planning, yet manual segmentation is both time-consuming and subjective, relying heavily on the expertise of radiologists. In Sub-Saharan Africa, this challenge is magnified by overburdened medical systems and limited access to advanced imaging modalities and expert radiologists. Automating brain tumor segmentation using deep learning offers a promising solution. Convolutional Neural Networks (CNNs), especially the U-Net architecture, have shown significant potential. However, a major challenge remains: achieving generalizability across different datasets. This study addresses this gap by developing a deep learning ensemble that integrates UNet3D, V-Net, and MSA-VNet models for the semantic segmentation of gliomas. By initially training on the BraTS-GLI dataset and fine-tuning with the BraTS-SSA dataset, we enhance model performance. Our ensemble approach significantly outperforms individual models, achieving DICE scores of 0.8358 for Tumor Core, 0.8521 for Whole Tumor, and 0.8167 for Enhancing Tumor. These results underscore the potential of ensemble methods in improving the accuracy and reliability of automated brain tumor segmentation, particularly in resource-limited settings.
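
For readers unfamiliar with the metric, the following sketch shows how a Dice score is computed and one simple way to combine several models. The abstract does not say how the three networks are fused, so averaging per-voxel probabilities and thresholding at 0.5 is an assumption, and the function names are hypothetical.

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def ensemble_mask(prob_maps, threshold=0.5):
    """Average per-voxel probabilities from several models, then threshold.

    Probability averaging is an illustrative choice, not the paper's
    documented fusion rule.
    """
    return np.mean(prob_maps, axis=0) >= threshold

# Three hypothetical models scoring the same two voxels.
probs = np.array([[0.9, 0.2],
                  [0.8, 0.4],
                  [0.7, 0.6]])
fused = ensemble_mask(probs)
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

A Dice score of 1 means perfect overlap and 0 means none, which is the scale on which the reported values of 0.8358, 0.8521, and 0.8167 should be read.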

[CV-106] UD-Mamba: A pixel-level uncertainty-driven Mamba model for medical image segmentation

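Quick read: This paper aims to address the difficulty the Mamba framework has in modeling local features for medical image segmentation. The difficulty stems from the sporadic nature of traditional location-based scanning methods and from the complex, ambiguous boundaries in medical images. To overcome these challenges, the paper proposes Uncertainty-Driven Mamba (UD-Mamba), which redefines the pixel-order scanning process by incorporating channel uncertainty into the scanning mechanism. The key to UD-Mamba is two scanning techniques: 1) sequential scanning, which prioritizes high-uncertainty regions by scanning row by row; and 2) skip scanning, which processes columns vertically, moving from high to low or from low to high uncertainty at fixed intervals. The first improves segmentation precision for boundaries and foreground objects, and the second strengthens the interaction between background and foreground regions. In addition, the paper introduces four learnable parameters to balance the importance of features extracted by the different scanning methods, and adopts a cosine consistency loss to mitigate the problems caused by transitions between uncertain and certain regions during scanning.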

Link: https://arxiv.org/abs/2502.02024
Authors: Weiren Zhao, Feng Wang, Yanran Wang, Yutong Xie, Qi Wu, Yuyin Zhou
Affiliation: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages

Abstract:Recent advancements have highlighted the Mamba framework, a state-space model known for its efficiency in capturing long-range dependencies with linear computational complexity. While Mamba has shown competitive performance in medical image segmentation, it encounters difficulties in modeling local features due to the sporadic nature of traditional location-based scanning methods and the complex, ambiguous boundaries often present in medical images. To overcome these challenges, we propose Uncertainty-Driven Mamba (UD-Mamba), which redefines the pixel-order scanning process by incorporating channel uncertainty into the scanning mechanism. UD-Mamba introduces two key scanning techniques: 1) sequential scanning, which prioritizes regions with high uncertainty by scanning in a row-by-row fashion, and 2) skip scanning, which processes columns vertically, moving from high-to-low or low-to-high uncertainty at fixed intervals. Sequential scanning efficiently clusters high-uncertainty regions, such as boundaries and foreground objects, to improve segmentation precision, while skip scanning enhances the interaction between background and foreground regions, allowing for timely integration of background information to support more accurate foreground inference. Recognizing the advantages of scanning from certain to uncertain areas, we introduce four learnable parameters to balance the importance of features extracted from different scanning methods. Additionally, a cosine consistency loss is employed to mitigate the drawbacks of transitioning between uncertain and certain regions during the scanning process. Our method demonstrates robust segmentation performance, validated across three distinct medical imaging datasets involving pathology, dermatological lesions, and cardiac tasks.

[CV-107] Layer Separation: Adjustable Joint Space Width Images Synthesis in Conventional Radiography

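Quick read: This paper aims to address the data-quality challenges that computer-aided diagnostic (CAD) systems face in deep learning-based joint space width (JSW) analysis, including data imbalance, limited variety, and annotation difficulties. The key solution is an image synthesis scenario together with Layer Separation Networks (LSN), which accurately separate the soft tissue layer, the upper bone layer, and the lower bone layer in conventional radiographs of finger joints. With these separated layers, adjustable JSW images can be synthesized, which addresses the data-quality problems and enables ground truth (GT) generation. Experimental results show that LSN-based synthetic images closely resemble real radiographs and significantly improve performance on downstream tasks.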

Link: https://arxiv.org/abs/2502.01972
Authors: Haolin Wang, Yafei Ou, Prasoon Ambalathankandy, Gen Ota, Pengyu Dai, Masayuki Ikebe, Kenji Suzuki, Tamotsu Kamishima
Affiliation: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Rheumatoid arthritis (RA) is a chronic autoimmune disease characterized by joint inflammation and progressive structural damage. Joint space width (JSW) is a critical indicator in conventional radiography for evaluating disease progression, which has become a prominent research topic in computer-aided diagnostic (CAD) systems. However, deep learning-based radiological CAD systems for JSW analysis face significant challenges in data quality, including data imbalance, limited variety, and annotation difficulties. This work introduced a challenging image synthesis scenario and proposed Layer Separation Networks (LSN) to accurately separate the soft tissue layer, the upper bone layer, and the lower bone layer in conventional radiographs of finger joints. Using these layers, the adjustable JSW images can be synthesized to address data quality challenges and achieve ground truth (GT) generation. Experimental results demonstrated that LSN-based synthetic images closely resemble real radiographs, and significantly enhanced the performance in downstream tasks. The code and dataset will be available.

[CV-108] A Novel Real-Time Full-Color 3D Holographic (Diffractive) Video Capture Processing and Transmission Pipeline Using Off-The-Shelf Hardware

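Quick read: This paper reports the world's first live 3D holographic (diffractive) video call using off-the-shelf hardware. The key solution is a novel pipeline covering the capture, processing, and transmission of RGBZ data. Specifically, an iPhone captures image and depth, VividQ's software development kit (SDK) generates the hologram, and dedicated display hardware handles the final display. Together these cover the full chain from data capture to display, enabling a live 3D holographic video call.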

Link: https://arxiv.org/abs/2502.01695
Authors: Ankur Samanta, Gregor Mackenzie, Tyler Rathkamp, Adrian Cable, Darran Milne, Andrzej Kaczorowski, Ronjon Nag
Affiliation: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published and Presented at Session 63: Emerging Approaches for AR/VR/MR, SID Display Week 2022. 4 pages, 9 figures

Abstract:This paper details the world’s first live 3D holographic (diffractive) video call using off-the-shelf hardware. We introduce a novel pipeline that facilitates the capture, processing, and transmission of RGBZ data, using an iPhone for image and depth capture with VividQ’s SDK for hologram generation and hardware for display.

[CV-109] Efficient Brain Tumor Classification with Lightweight CNN Architecture: A Novel Approach

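Quick read: This paper aims to address the difficulty of balancing accuracy and computational efficiency in brain tumor classification for medical diagnosis, and to improve model robustness across different datasets. The key solution is a novel model architecture that integrates separable convolutions and squeeze-and-excitation (SE) blocks to enhance feature extraction while maintaining computational efficiency. The model also uses batch normalization and dropout to prevent overfitting and ensure stable, reliable performance. It is lightweight because separable convolutions reduce the number of parameters and global average pooling replaces fully connected layers to lower computational complexity, while accuracy remains high.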

Link: https://arxiv.org/abs/2502.01674
Authors: Priyam Ganguly, Akhilbaran Ghosh
Affiliation: Widener University; Academy of Technology, Kolkata
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in FMLDS 2024

Abstract:Brain tumor classification using MRI images is critical in medical diagnostics, where early and accurate detection significantly impacts patient outcomes. While recent advancements in deep learning (DL), particularly CNNs, have shown promise, many models struggle with balancing accuracy and computational efficiency and often lack robustness across diverse datasets. To address these challenges, we propose a novel model architecture integrating separable convolutions and squeeze-and-excitation (SE) blocks, designed to enhance feature extraction while maintaining computational efficiency. Our model further incorporates batch normalization and dropout to prevent overfitting, ensuring stable and reliable performance. The proposed model is lightweight because it uses separable convolutions, which reduce the number of parameters, and incorporates global average pooling instead of fully connected layers to minimize computational complexity while maintaining high accuracy. Experiments show that our model outperforms other models by about 0.5% to 1.0% in accuracy and 1.5% to 2.5% in loss reduction. It has a validation accuracy of 99.22% and a test accuracy of 98.44%. These results highlight the model’s ability to generalize effectively across different brain tumour types, offering a robust tool for clinical applications. Our work sets a new benchmark in the field, providing a foundation for future research in optimizing the accuracy and efficiency of DL models for medical image analysis.
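
The claim that separable convolutions reduce the parameter count can be checked with simple arithmetic. The sketch below compares a standard convolution with a depthwise separable one, ignoring bias terms; the layer sizes (3×3 kernel, 64 input channels, 128 output channels) are illustrative assumptions and are not taken from the paper.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise separable convolution (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

std = standard_conv_params(3, 64, 128)    # 73728
sep = separable_conv_params(3, 64, 128)   # 8768
reduction = std / sep                     # roughly 8.4x fewer weights
```

For this hypothetical layer the separable form needs about one eighth of the weights, which is the kind of saving the abstract relies on when it calls the model lightweight.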

[CV-110] Entropy-based measure of rock sample heterogeneity derived from micro-CT images

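Quick read: This paper aims to address the limitations of traditional methods (such as image segmentation) for measuring rock heterogeneity, namely that they are time-consuming, costly, and subjective. The key to the solution is an automated method that processes micro-computed tomography (micro-CT) images directly and quantifies the textural heterogeneity of the rock through attributes such as entropy. It is applied to a dataset of 4,935 images of cylindrical plug samples from Brazilian reservoirs. The method adapts to varying sample characteristics, aligns better with expert assessments than traditional textural attributes, and thus provides an additional parameter that can assist rock characterization.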

Link: https://arxiv.org/abs/2502.01665
Authors: Luan Coelho Vieira Silva, Júlio de Castro Vargas Fernandes, Felipe Belilaqua Foldes Guimarães, Pedro Henrique Braga Lisboa, Carlos Eduardo Menezes dos Anjos, Thais Fernandes de Matos, Marcelo Ramalho Albuquerque, Rodrigo Surmas, Alexandre Gonçalves Evsukoff
Affiliation: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 11 figures

Abstract:This study presents an automated method for objectively measuring rock heterogeneity via raw X-ray micro-computed tomography (micro-CT) images, thereby addressing the limitations of traditional methods, which are time-consuming, costly, and subjective. Unlike approaches that rely on image segmentation, the proposed method processes micro-CT images directly, identifying textural heterogeneity. The image is partitioned into subvolumes, where attributes are calculated for each one, with entropy serving as a measure of uncertainty. This method adapts to varying sample characteristics and enables meaningful comparisons across distinct sets of samples. It was applied to a dataset consisting of 4,935 images of cylindrical plug samples derived from Brazilian reservoirs. The results showed that the selected attributes play a key role in producing desirable outcomes, such as strong correlations with structural heterogeneity. To assess the effectiveness of our method, we used evaluations provided by four experts who classified 175 samples as either heterogeneous or homogeneous, where each expert assessed a different number of samples. One of the presented attributes demonstrated a statistically significant difference between the homogeneous and heterogeneous samples labelled by all the experts, whereas the other two attributes yielded nonsignificant differences for three out of the four experts. The method was shown to better align with the expert choices than traditional textural attributes known for extracting heterogeneous properties from images. This textural heterogeneity measure provides an additional parameter that can assist in rock characterization, and the automated approach ensures easy reproduction and high cost-effectiveness.
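
The partition-then-entropy idea in this abstract can be sketched as follows. The abstract does not name the attributes it computes per subvolume, so using the mean intensity is an assumption, as are the block size, the number of histogram bins, the intensity range, and the function name `subvolume_entropy`.

```python
import numpy as np

def subvolume_entropy(volume, block=4, bins=8, value_range=(0.0, 1.0)):
    """Shannon entropy (bits) of a per-subvolume attribute.

    The attribute here is the mean intensity of each block, which is an
    assumption for illustration; the paper evaluates its own attributes.
    """
    nz, ny, nx = (s // block for s in volume.shape)
    v = volume[:nz * block, :ny * block, :nx * block]   # drop ragged borders
    blocks = v.reshape(nz, block, ny, block, nx, block)
    attribute = blocks.mean(axis=(1, 3, 5)).ravel()     # one value per subvolume
    counts, _ = np.histogram(attribute, bins=bins, range=value_range)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum()) + 0.0

homogeneous = np.full((8, 8, 8), 0.5)   # same intensity everywhere
layered = np.full((8, 8, 8), 0.1)
layered[4:] = 0.9                       # two contrasting layers

h_homogeneous = subvolume_entropy(homogeneous)
h_layered = subvolume_entropy(layered)
```

A uniform sample puts every subvolume in the same histogram bin and scores zero, while a sample with contrasting regions spreads across bins and scores higher, which is the behavior a heterogeneity measure needs.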

Artificial Intelligence

[AI-0] QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Link: https://arxiv.org/abs/2502.02584
Authors: Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, Kai-Wei Chang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Language agents have become a promising solution to complex interactive tasks. One of the key ingredients to the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to sub-optimal policies and hinder the overall performance. To address this, we propose QLASS (Q-guided Language Agent Stepwise Search), to automatically generate annotations by estimating Q-values in a stepwise manner for open language agents. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis. We will release our code and data.

[AI-1] Fairness in Survival Analysis: A Novel Conditional Mutual Information Augmentation Approach

Link: https://arxiv.org/abs/2502.02567
Authors: Tianyang Xie, Yong Ge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Survival analysis, a vital tool for predicting the time to event, has been used in many domains such as healthcare, criminal justice, and finance. Like classification tasks, survival analysis can exhibit bias against disadvantaged groups, often due to biases inherent in data or algorithms. Several studies in both the IS and CS communities have attempted to address fairness in survival analysis. However, existing methods often overlook the importance of prediction fairness at pre-defined evaluation time points, which is crucial in real-world applications where decision making often hinges on specific time frames. To address this critical research gap, we introduce a new fairness concept: equalized odds (EO) in survival analysis, which emphasizes prediction fairness at pre-defined time points. To achieve the EO fairness in survival analysis, we propose a Conditional Mutual Information Augmentation (CMIA) approach, which features a novel fairness regularization term based on conditional mutual information and an innovative censored data augmentation technique. Our CMIA approach can effectively balance prediction accuracy and fairness, and it is applicable to various survival models. We evaluate the CMIA approach against several state-of-the-art methods within three different application domains, and the results demonstrate that CMIA consistently reduces prediction disparity while maintaining good accuracy and significantly outperforms the other competing methods across multiple datasets and survival models (e.g., linear COX, deep AFT).

[AI-2] Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents

Link: https://arxiv.org/abs/2502.02561
Authors: Shayan Kiyani, George Pappas, Aaron Roth, Hamed Hassani
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:A fundamental question in data-driven decision making is how to quantify the uncertainty of predictions in ways that can usefully inform downstream action. This interface between prediction uncertainty and decision-making is especially important in risk-sensitive domains, such as medicine. In this paper, we develop decision-theoretic foundations that connect uncertainty quantification using prediction sets with risk-averse decision-making. Specifically, we answer three fundamental questions: (1) What is the correct notion of uncertainty quantification for risk-averse decision makers? We prove that prediction sets are optimal for decision makers who wish to optimize their value at risk. (2) What is the optimal policy that a risk-averse decision maker should use to map prediction sets to actions? We show that a simple max-min decision policy is optimal for risk-averse decision makers. Finally, (3) How can we derive prediction sets that are optimal for such decision makers? We provide an exact characterization in the population regime and a distribution-free finite-sample construction. Answering these questions naturally leads to an algorithm, Risk-Averse Calibration (RAC), which follows a provably optimal design for deriving action policies from predictions. RAC is designed to be both practical (capable of leveraging the quality of predictions in a black-box manner to enhance downstream utility) and safe (adhering to a user-defined risk threshold and optimizing the corresponding risk quantile of the user’s downstream utility). Finally, we experimentally demonstrate the significant advantages of RAC in applications such as medical diagnosis and recommendation systems. Specifically, we show that RAC achieves a substantially improved trade-off between safety and utility, offering higher utility compared to existing methods while maintaining the safety guarantee.
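
The max-min decision policy mentioned in this abstract can be sketched in a few lines: for each action, take its worst-case utility over the labels in the prediction set, then pick the action whose worst case is best. The utility table, the action and label names, and the function name `max_min_action` are hypothetical; this is not the RAC algorithm itself, which also constructs the prediction sets.

```python
def max_min_action(utility, prediction_set):
    """Pick the action whose worst-case utility over the prediction set is largest."""
    def worst_case(action):
        return min(utility[action][label] for label in prediction_set)
    return max(utility, key=worst_case)

# Hypothetical utilities: utility[action][true_label].
utility = {
    "treat": {"sick": 1.0, "healthy": 0.6},
    "wait":  {"sick": 0.0, "healthy": 1.0},
}

uncertain_choice = max_min_action(utility, {"sick", "healthy"})
confident_choice = max_min_action(utility, {"healthy"})
```

When the prediction set contains both labels, the cautious action wins because its worst case (0.6) beats the alternative's (0.0); when the set shrinks to a single label, the policy can afford the higher-utility action.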

[AI-3] Anytime Incremental $\rho$POMDP Planning in Continuous Spaces IJCAI2025

Link: https://arxiv.org/abs/2502.02549
Authors: Ron Benchetrit, Idan Lev-Yehudi, Andrey Zhitnikov, Vadim Indelman
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Submitted to IJCAI 2025

Abstract:Partially Observable Markov Decision Processes (POMDPs) provide a robust framework for decision-making under uncertainty in applications such as autonomous driving and robotic exploration. Their extension, $\rho$POMDPs, introduces belief-dependent rewards, enabling explicit reasoning about uncertainty. Existing online $\rho$POMDP solvers for continuous spaces rely on fixed belief representations, limiting adaptability and refinement, which are critical for tasks such as information-gathering. We present $\rho$POMCPOW, an anytime solver that dynamically refines belief representations, with formal guarantees of improvement over time. To mitigate the high computational cost of updating belief-dependent rewards, we propose a novel incremental computation approach. We demonstrate its effectiveness for common entropy estimators, reducing computational cost by orders of magnitude. Experimental results show that $\rho$POMCPOW outperforms state-of-the-art solvers in both efficiency and solution quality.

[AI-4] Addressing Label Shift in Distributed Learning via Entropy Regularization ICLR2025

Link: https://arxiv.org/abs/2502.02544
Authors: Zhiyuan Wu, Changkyu Choi, Xiangcheng Cao, Volkan Cevher, Ali Ramezani-Kebrya
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the International Conference on Learning Representations (ICLR 2025)

Abstract:We address the challenge of minimizing true risk in multi-node distributed learning. These systems are frequently exposed to both inter-node and intra-node label shifts, which present a critical obstacle to effectively optimizing model performance while ensuring that data remains confined to each node. To tackle this, we propose the Versatile Robust Label Shift (VRLS) method, which enhances the maximum likelihood estimation of the test-to-train label density ratio. VRLS incorporates Shannon entropy-based regularization and adjusts the density ratio during training to better handle label shifts at the test time. In multi-node learning environments, VRLS further extends its capabilities by learning and adapting density ratios across nodes, effectively mitigating label shifts and improving overall model performance. Experiments conducted on MNIST, Fashion MNIST, and CIFAR-10 demonstrate the effectiveness of VRLS, outperforming baselines by up to 20% in imbalanced settings. These results highlight the significant improvements VRLS offers in addressing label shifts. Our theoretical analysis further supports this by establishing high-probability bounds on estimation errors.

[AI-5] Flow Q-Learning

Link: https://arxiv.org/abs/2502.02538
Authors: Seohong Park, Qiyang Li, Sergey Levine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present flow Q-learning (FQL), a simple and performant offline reinforcement learning (RL) method that leverages an expressive flow-matching policy to model arbitrarily complex action distributions in data. Training a flow policy with RL is a tricky problem, due to the iterative nature of the action generation process. We address this challenge by training an expressive one-step policy with RL, rather than directly guiding an iterative flow policy to maximize values. This way, we can completely avoid unstable recursive backpropagation, eliminate costly iterative action generation at test time, yet still mostly maintain expressivity. We experimentally show that FQL leads to strong performance across 73 challenging state- and pixel-based OGBench and D4RL tasks in offline RL and offline-to-online RL. Project page: this https URL

[AI-6] Why human-AI relationships need socioaffective alignment

Link: https://arxiv.org/abs/2502.02528
Authors: Hannah Rose Kirk, Iason Gabriel, Chris Summerfield, Bertie Vidgen, Scott A. Hale
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Humans strive to design safe AI systems that align with our goals and remain under our control. However, as AI capabilities advance, we face a new challenge: the emergence of deeper, more persistent relationships between humans and AI systems. We explore how increasingly capable AI agents may generate the perception of deeper relationships with users, especially as AI becomes more personalised and agentic. This shift, from transactional interaction to ongoing sustained social engagement with AI, necessitates a new focus on socioaffective alignment: how an AI system behaves within the social and psychological ecosystem co-created with its user, where preferences and perceptions evolve through mutual influence. Addressing these dynamics involves resolving key intrapersonal dilemmas, including balancing immediate versus long-term well-being, protecting autonomy, and managing AI companionship alongside the desire to preserve human social bonds. By framing these challenges through a notion of basic psychological needs, we seek AI systems that support, rather than exploit, our fundamental nature as social and emotional beings.

[AI-7] Adaptive Exploration for Multi-Reward Multi-Policy Evaluation

Link: https://arxiv.org/abs/2502.02516
Authors: Alessio Russo, Aldo Pacchiano
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:We study the policy evaluation problem in an online multi-reward multi-policy discounted setting, where multiple reward functions must be evaluated simultaneously for different policies. We adopt an $(\epsilon,\delta)$-PAC perspective to achieve $\epsilon$-accurate estimates with high confidence across finite or convex sets of rewards, a setting that has not been investigated in the literature. Building on prior work on Multi-Reward Best Policy Identification, we adapt the MR-NaS exploration scheme to jointly minimize sample complexity for evaluating different policies across different reward sets. Our approach leverages an instance-specific lower bound revealing how the sample complexity scales with a measure of value deviation, guiding the design of an efficient exploration policy. Although computing this bound entails a hard non-convex optimization, we propose an efficient convex approximation that holds for both finite and convex reward sets. Experiments in tabular domains demonstrate the effectiveness of this adaptive exploration scheme.

[AI-8] The Causal-Effect Score in Data Management

Link: https://arxiv.org/abs/2502.02495
Authors: Felipe Azua, Leopoldo Bertossi
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: To appear in Proceedings of the 4th Conference on Causal Learning and Reasoning, 2025

Abstract:The Causal Effect (CE) is a numerical measure of causal influence of variables on observed results. Despite being widely used in many areas, only preliminary attempts have been made to use CE as an attribution score in data management, to measure the causal strength of tuples for query answering in databases. In this work, we introduce, generalize and investigate the so-called Causal-Effect Score in the context of classical and probabilistic databases.

[AI-9] Modular Training of Neural Networks aids Interpretability

Link: https://arxiv.org/abs/2502.02470
Authors: Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, under review. arXiv admin note: text overlap with arXiv:2409.15747

Abstract:An approach to improve neural network interpretability is via clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We define a measure for clusterability and show that pre-trained models form highly enmeshed clusters via spectral graph clustering. We thus train models to be more modular using a "clusterability loss" function that encourages the formation of non-interacting clusters. Using automated interpretability techniques, we show that our method can help train models that are more modular and learn different, disjoint, and smaller circuits. We investigate CNNs trained on MNIST and CIFAR, small transformers trained on modular addition, and language models. Our approach provides a promising direction for training neural networks that learn simpler functions and are easier to interpret.

[AI-10] Model Human Learners: Computational Models to Guide Instructional Design

Link: https://arxiv.org/abs/2502.02456
Authors: Christopher J. MacLellan
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: 6 pages, 6 figures, 1 table

Abstract:Instructional designers face an overwhelming array of design choices, making it challenging to identify the most effective interventions. To address this issue, I propose the concept of a Model Human Learner, a unified computational model of learning that can aid designers in evaluating candidate interventions. This paper presents the first successful demonstration of this concept, showing that a computational model can accurately predict the outcomes of two human A/B experiments – one testing a problem sequencing intervention and the other testing an item design intervention. It also demonstrates that such a model can generate learning curves without requiring human data and provide theoretical insights into why an instructional intervention is effective. These findings lay the groundwork for future Model Human Learners that integrate cognitive and learning theories to support instructional design across diverse tasks and interventions.

[AI-11] Towards graph neural networks for provably solving convex optimization problems

Link: https://arxiv.org/abs/2502.02446
Authors: Chendi Qian, Christopher Morris
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Recently, message-passing graph neural networks (MPNNs) have shown potential for solving combinatorial and continuous optimization problems due to their ability to capture variable-constraint interactions. While existing approaches leverage MPNNs to approximate solutions or warm-start traditional solvers, they often lack guarantees for feasibility, particularly in convex optimization settings. Here, we propose an iterative MPNN framework to solve convex optimization problems with provable feasibility guarantees. First, we demonstrate that MPNNs can provably simulate standard interior-point methods for solving quadratic problems with linear constraints, covering relevant problems such as SVMs. Secondly, to ensure feasibility, we introduce a variant that starts from a feasible point and iteratively restricts the search within the feasible region. Experimental results show that our approach outperforms existing neural baselines in solution quality and feasibility, generalizes well to unseen problem sizes, and, in some cases, achieves faster solution times than state-of-the-art solvers such as Gurobi.

[AI-12] LLMER: Crafting Interactive Extended Reality Worlds with JSON Data Generated by Large Language Models

Link: https://arxiv.org/abs/2502.02441
Authors: Jiangong Chen, Xiaoyi Wu, Tian Lan, Bin Li
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments:

Abstract:The integration of Large Language Models (LLMs) like GPT-4 with Extended Reality (XR) technologies offers the potential to build truly immersive XR environments that interact with human users through natural language, e.g., generating and animating 3D scenes from audio inputs. However, the complexity of XR environments makes it difficult to accurately extract relevant contextual data and scene/object parameters from an overwhelming volume of XR artifacts. It leads to not only increased costs with pay-per-use models, but also elevated levels of generation errors. Moreover, existing approaches focusing on coding script generation are often prone to generation errors, resulting in flawed or invalid scripts, application crashes, and ultimately a degraded user experience. To overcome these challenges, we introduce LLMER, a novel framework that creates interactive XR worlds using JSON data generated by LLMs. Unlike prior approaches focusing on coding script generation, LLMER translates natural language inputs into JSON data, significantly reducing the likelihood of application crashes and processing latency. It employs a multi-stage strategy to supply only the essential contextual information adapted to the user’s request and features multiple modules designed for various XR tasks. Our preliminary user study reveals the effectiveness of the proposed system, with over 80% reduction in consumed tokens and around 60% reduction in task completion time compared to state-of-the-art approaches. The analysis of users’ feedback also illuminates a series of directions for further optimization.
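
One reason structured JSON output is less crash-prone than generated scripts is that it can be validated before anything touches the scene. The sketch below illustrates that idea only; the schema (the `action`, `object`, and `position` fields), the allowed action names, and the function `parse_scene_command` are hypothetical and are not LLMER's actual format.

```python
import json

ALLOWED_ACTIONS = {"create", "move", "delete"}  # hypothetical action vocabulary

def parse_scene_command(llm_output):
    """Return a validated command dict, or None if the LLM output is unusable."""
    try:
        cmd = json.loads(llm_output)
    except json.JSONDecodeError:
        return None                      # malformed text is rejected, not executed
    if not isinstance(cmd, dict):
        return None
    action = cmd.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    if not isinstance(cmd.get("object"), str):
        return None
    position = cmd.get("position", [0.0, 0.0, 0.0])
    if not (isinstance(position, list) and len(position) == 3
            and all(isinstance(v, (int, float)) for v in position)):
        return None
    return {"action": action,
            "object": cmd["object"],
            "position": [float(v) for v in position]}

ok = parse_scene_command('{"action": "create", "object": "cube", "position": [0, 1, 2]}')
malformed = parse_scene_command("spawn a cube please")
unknown = parse_scene_command('{"action": "explode", "object": "cube"}')
```

Invalid or unexpected output yields `None` instead of an exception, so the application can ask the model to retry rather than crash.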

[AI-13] Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment AAAI2025

Link: https://arxiv.org/abs/2502.02438
Authors: Yaling Shen, Zhixiong Zhuang, Kun Yuan, Maria-Irina Nicolae, Nassir Navab, Nicolas Padoy, Mario Fritz
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2025

Abstract:Medical multimodal large language models (MLLMs) are becoming an instrumental part of healthcare systems, assisting medical personnel with decision making and results analysis. Models for radiology report generation are able to interpret medical imagery, thus reducing the workload of radiologists. As medical data is scarce and protected by privacy regulations, medical MLLMs represent valuable intellectual property. However, these assets are potentially vulnerable to model stealing, where attackers aim to replicate their functionality via black-box access. So far, model stealing for the medical domain has focused on classification; however, existing attacks are not effective against MLLMs. In this paper, we introduce Adversarial Domain Alignment (ADA-STEAL), the first stealing attack against medical MLLMs. ADA-STEAL relies on natural images, which are public and widely available, as opposed to their medical counterparts. We show that data augmentation with adversarial noise is sufficient to overcome the data distribution gap between natural images and the domain-specific distribution of the victim MLLM. Experiments on the IU X-RAY and MIMIC-CXR radiology datasets demonstrate that Adversarial Domain Alignment enables attackers to steal the medical MLLM without any access to medical data.

[AI-14] Connections between Schedule-Free Optimizers AdEMAMix and Accelerated SGD Variants

Link: https://arxiv.org/abs/2502.02431
Authors: Depen Morwani, Nikhil Vyas, Hanlin Zhang, Sham Kakade
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent advancements in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS and Lion which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient’s weight. In this paper, we establish explicit connections between these two lines of work. We substantiate our theoretical findings with preliminary experiments on a 150m language modeling task. We find that AdEMAMix, which most closely resembles accelerated versions of stochastic gradient descent, exhibits superior performance. Building on these insights, we introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance as AdEMAMix across both large and small batch-size settings while eliminating the need for two different momentum terms. The code for Simplified-AdEMAMix is available on the repository: this https URL.

[AI-15] The Cost Perspective of Liquid Democracy: Feasibility and Control

Link: https://arxiv.org/abs/2502.02380
Authors: Shiri Alouf-Heffetz, Łukasz Janeczko, Grzegorz Lisowski, Georgios Papasotiropoulos
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)

Abstract:We examine an approval-based model of Liquid Democracy with a budget constraint on voting and delegating costs, aiming to centrally select casting voters ensuring complete representation of the electorate. From a computational complexity perspective, we focus on minimizing overall costs, maintaining short delegation paths, and preventing excessive concentration of voting power. Furthermore, we explore computational aspects of strategic control, specifically, whether external agents can change election components to influence the voting power of certain voters.

[AI-16] A Minimax Approach to Ad Hoc Teamwork AAMAS2025

Link: https://arxiv.org/abs/2502.02377
Authors: Victor Villin, Thomas Kleine Buening, Christos Dimitrakakis
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at AAMAS 2025

Abstract:We propose a minimax-Bayes approach to Ad Hoc Teamwork (AHT) that optimizes policies against an adversarial prior over partners, explicitly accounting for uncertainty about partners at time of deployment. Unlike existing methods that assume a specific distribution over partners, our approach improves worst-case performance guarantees. Extensive experiments, including evaluations on coordinated cooking tasks from the Melting Pot suite, show our method’s superior robustness compared to self-play, fictitious play, and best response learning. Our work highlights the importance of selecting an appropriate training distribution over teammates to achieve robustness in AHT.
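The contrast between optimizing for an assumed partner distribution and optimizing for the worst case can be illustrated with a toy payoff table. This is only a sketch under loud assumptions: the payoff numbers, the finite set of policies and partner types, and the restriction to pure strategies are all hypothetical, and the paper's minimax-Bayes method optimizes against an adversarial prior rather than a fixed table like this one.

```python
def best_under_prior(payoff, prior):
    """Pick the policy with the highest expected return under a fixed prior over partners."""
    expected = [sum(p * u for p, u in zip(prior, row)) for row in payoff]
    best = max(range(len(payoff)), key=lambda i: expected[i])
    return best, expected[best]


def maximin_policy(payoff):
    """Pick the policy whose worst-case return over partner types is largest."""
    worst = [min(row) for row in payoff]
    best = max(range(len(payoff)), key=lambda i: worst[i])
    return best, worst[best]


# Hypothetical returns: payoff[i][j] is the return of policy i paired with partner type j.
payoff = [
    [10.0, 0.0, 9.0],  # strong with partners 0 and 2, fails with partner 1
    [6.0, 5.0, 6.0],   # mediocre everywhere, never fails
    [8.0, 2.0, 7.0],
]
uniform_prior = [1 / 3, 1 / 3, 1 / 3]

prior_choice, prior_value = best_under_prior(payoff, uniform_prior)
robust_choice, robust_value = maximin_policy(payoff)
```

In this toy table the two criteria select different policies, which is the kind of gap a worst-case training distribution is meant to close.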

[AI-17] Evaluating the Effectiveness of LLMs in Fixing Maintainability Issues in Real-World Projects

Link: https://arxiv.org/abs/2502.02368
Authors: Henrique Nunes, Eduardo Figueiredo, Larissa Rocha, Sarah Nadi, Fischer Ferreira, Geanderson Esteves
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have gained attention for addressing coding problems, but their effectiveness in fixing code maintainability remains unclear. This study evaluates LLMs' capability to resolve 127 maintainability issues from 10 GitHub repositories. We use zero-shot prompting for Copilot Chat and Llama 3.1, and few-shot prompting with Llama only. The LLM-generated solutions are assessed for compilation errors, test failures, and new maintainability problems. Llama with few-shot prompting successfully fixed 44.9% of the methods, while Copilot Chat and Llama zero-shot fixed 32.29% and 30%, respectively. However, most solutions introduced errors or new maintainability issues. We also conducted a human study with 45 participants to evaluate the readability of 51 LLM-generated solutions. The human study showed that 68.63% of participants observed improved readability. Overall, while LLMs show potential for fixing maintainability issues, their introduction of errors highlights their current limitations.

[AI-18] EdgeGFL: Rethinking Edge Information in Graph Feature Preference Learning

Link: https://arxiv.org/abs/2502.02302
Authors: Shengda Zhuo, Jiwang Fang, Hongguang Lin, Yin Tang, Min Chen, Changdong Wang, Shuqiang Huang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graph Neural Networks (GNNs) have significant advantages in handling non-Euclidean data and have been widely applied across various areas, thus receiving increasing attention in recent years. The framework of GNN models mainly includes the information propagation phase and the aggregation phase, treating nodes and edges as information entities and propagation channels, respectively. However, most existing GNN models face the challenge of disconnection between node and edge feature information, as these models typically treat the learning of edge and node features as independent tasks. To address this limitation, we aim to develop an edge-empowered graph feature preference learning framework that can capture edge embeddings to assist node embeddings. By leveraging the learned multidimensional edge feature matrix, we construct multi-channel filters to more effectively capture accurate node features, thereby obtaining the non-local structural characteristics and fine-grained high-order node features. Specifically, the inclusion of multidimensional edge information enhances the functionality and flexibility of the GNN model, enabling it to handle complex and diverse graph data more effectively. Additionally, integrating relational representation learning into the message passing framework allows graph nodes to receive more useful information, thereby facilitating node representation learning. Finally, experiments on four real-world heterogeneous graphs demonstrate the effectiveness of the proposed model.

[AI-19] FRAUD-RLA: A new reinforcement learning adversarial attack against credit card fraud detection

Link: https://arxiv.org/abs/2502.02290
Authors: Daniele Lunghi, Yannick Molinghen, Alkis Simitsis, Tom Lenaerts, Gianluca Bontempi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Adversarial attacks pose a significant threat to data-driven systems, and researchers have spent considerable resources studying them. Despite its economic relevance, this trend largely overlooked the issue of credit card fraud detection. To address this gap, we propose a new threat model that demonstrates the limitations of existing attacks and highlights the necessity to investigate new approaches. We then design a new adversarial attack for credit card fraud detection, employing reinforcement learning to bypass classifiers. This attack, called FRAUD-RLA, is designed to maximize the attacker’s reward by optimizing the exploration-exploitation tradeoff and working with significantly less required knowledge than competitors. Our experiments, conducted on three different heterogeneous datasets and against two fraud detection systems, indicate that FRAUD-RLA is effective, even considering the severe limitations imposed by our threat model.

[AI-20] Error Distribution Smoothing: Advancing Low-Dimensional Imbalanced Regression

Link: https://arxiv.org/abs/2502.02277
Authors: Donghe Chen, Jiaxuan Yue, Tengjie Zheng, Lanxuan Wang, Lin Cheng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 12 figures

Abstract:In real-world regression tasks, datasets frequently exhibit imbalanced distributions, characterized by a scarcity of data in high-complexity regions and an abundance in low-complexity areas. This imbalance presents significant challenges for existing classification methods with clear class boundaries, while highlighting a scarcity of approaches specifically designed for imbalanced regression problems. To better address these issues, we introduce a novel concept of Imbalanced Regression, which takes into account both the complexity of the problem and the density of data points, extending beyond traditional definitions that focus only on data density. Furthermore, we propose Error Distribution Smoothing (EDS) as a solution to tackle imbalanced regression, effectively selecting a representative subset from the dataset to reduce redundancy while maintaining balance and representativeness. Through several experiments, EDS has shown its effectiveness, and the related code and dataset can be accessed at this https URL.

[AI-21] Adviser-Actor-Critic: Eliminating Steady-State Error in Reinforcement Learning Control

Link: https://arxiv.org/abs/2502.02265
Authors: Donghe Chen, Yubin Peng, Tengjie Zheng, Han Wang, Chaoran Qu, Lin Cheng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 9 figures

Abstract:High-precision control tasks present substantial challenges for reinforcement learning (RL) algorithms, frequently resulting in suboptimal performance attributed to network approximation inaccuracies and inadequate sample [...]. These issues are exacerbated when the task requires the agent to achieve a precise goal state, as is common in robotics and other real-world [...]. We introduce Adviser-Actor-Critic (AAC), designed to address the precision control dilemma by combining the precision of feedback control theory with the adaptive learning capability of RL and featuring an Adviser that mentors the actor to refine control actions, thereby enhancing the precision of goal [...]. Finally, through benchmark tests, AAC outperformed standard RL algorithms in precision-critical, goal-conditioned tasks, demonstrating AAC’s high precision, reliability, and [...]. Codes are available at: this https URL.

[AI-22] Bias Detection via Maximum Subgroup Discrepancy

Link: https://arxiv.org/abs/2502.02221
Authors: Jiří Němeček, Mark Kozdoba, Illia Kryvoviaz, Tomáš Pevný, Jakub Mareček
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Bias evaluation is fundamental to trustworthy AI, both in terms of checking data quality and in terms of checking the outputs of AI systems. In testing data quality, for example, one may study a distance of a given dataset, viewed as a distribution, to a given ground-truth reference dataset. However, classical metrics, such as the Total Variation and the Wasserstein distances, are known to have high sample complexities and, therefore, may fail to provide meaningful distinction in many practical scenarios. In this paper, we propose a new notion of distance, the Maximum Subgroup Discrepancy (MSD). In this metric, two distributions are close if, roughly, discrepancies are low for all feature subgroups. While the number of subgroups may be exponential, we show that the sample complexity is linear in the number of features, thus making it feasible for practical applications. Moreover, we provide a practical algorithm for the evaluation of the distance, based on Mixed-integer optimization (MIO). We also note that the proposed distance is easily interpretable, thus providing clearer paths to fixing the biases once they have been identified. It also provides guarantees for all subgroups. Finally, we empirically evaluate, compare with other metrics, and demonstrate the above properties of MSD on real-world datasets.
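The idea of a discrepancy taken over all feature subgroups can be sketched by brute force on tiny categorical data. The assumptions here are significant: a subgroup is taken to be a conjunction of feature=value conditions, the discrepancy is the absolute difference in subgroup frequency, and the function names are hypothetical. The abstract does not give the exact definition, and the paper evaluates the distance with mixed-integer optimization rather than enumeration, which is exponential in the number of features.

```python
from itertools import combinations


def subgroup_fraction(rows, features, values):
    """Fraction of rows whose selected features take exactly the given values."""
    hits = sum(1 for r in rows if all(r[f] == v for f, v in zip(features, values)))
    return hits / len(rows)


def max_subgroup_discrepancy(rows_p, rows_q):
    """Enumerate every conjunction of feature values and return the largest frequency gap."""
    n_features = len(rows_p[0])
    best_gap, best_group = 0.0, None
    for k in range(1, n_features + 1):
        for features in combinations(range(n_features), k):
            # Only value combinations that occur in either dataset can have a nonzero gap.
            seen = {tuple(r[f] for f in features) for r in rows_p + rows_q}
            for values in sorted(seen):
                gap = abs(
                    subgroup_fraction(rows_p, features, values)
                    - subgroup_fraction(rows_q, features, values)
                )
                if gap > best_gap:
                    best_gap, best_group = gap, dict(zip(features, values))
    return best_gap, best_group


# Two toy datasets with identical single-feature marginals but different joint subgroups.
rows_p = [(0, 0), (0, 1), (1, 0), (1, 1)]
rows_q = [(0, 0), (0, 0), (1, 1), (1, 1)]
gap, group = max_subgroup_discrepancy(rows_p, rows_q)
```

The returned subgroup names which feature values differ most between the two datasets, which illustrates the interpretability the abstract mentions.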

[AI-23] An Efficient Local Search Approach for Polarized Community Discovery in Signed Networks

Link: https://arxiv.org/abs/2502.02197
Authors: Linus Aronsson, Morteza Haghir Chehreghani
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

Abstract:Signed networks, where edges are labeled as positive or negative to indicate friendly or antagonistic interactions, offer a natural framework for studying polarization, trust, and conflict in social systems. Detecting meaningful group structures in these networks is crucial for understanding online discourse, political division, and trust dynamics. A key challenge is to identify groups that are cohesive internally yet antagonistic externally, while allowing for neutral or unaligned vertices. In this paper, we address this problem by identifying k polarized communities that are large, dense, and balanced in size. We develop an approach based on Frank-Wolfe optimization, leading to a local search procedure with provable convergence guarantees. Our method is both scalable and efficient, outperforming state-of-the-art baselines in solution quality while remaining competitive in terms of computational efficiency.

[AI-24] The Elicitation Game: Evaluating Capability Elicitation Techniques

Link: https://arxiv.org/abs/2502.02180
Authors: Felix Hofstätter, Teun van der Weij, Jayden Teoh, Henning Bartsch, Francis Rhys Ward
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Capability evaluations are required to understand and regulate AI systems that may be deployed or further developed. Therefore, it is important that evaluations provide an accurate estimation of an AI system’s capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial efforts have been made to develop methods for eliciting latent capabilities from models. In this paper, we evaluate the effectiveness of capability elicitation techniques by intentionally training model organisms – language models with hidden capabilities that are revealed by a password. We introduce a novel method for training model organisms, based on circuit breaking, which is more robust to elicitation techniques than standard password-locked models. We focus on elicitation techniques based on prompting and activation steering, and compare these to fine-tuning methods. Prompting techniques can elicit the actual capability of both password-locked and circuit-broken model organisms in an MCQA setting, while steering fails to do so. For a code-generation task, only fine-tuning can elicit the hidden capabilities of our novel model organism. Additionally, our results suggest that combining techniques improves elicitation. Still, if possible, fine-tuning should be the method of choice to improve the trustworthiness of capability evaluations.

[AI-25] Graph Neural Networks for O-RAN Mobility Management: A Link Prediction Approach

Link: https://arxiv.org/abs/2502.02170
Authors: Ana Gonzalez Bermudez, Miquel Farreras, Milan Groshev, José Antonio Trujillo, Isabel de la Bandera, Raquel Barco
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures, 2 tables. Submitted to IEEE Vehicular Technology Magazine, Special Issue on “AI for 6G O-RAN Intelligent, Cost-Efficient and Secure Automation”

Abstract:Mobility performance has been a key focus in cellular networks up to 5G. To enhance handover (HO) performance, 3GPP introduced Conditional Handover (CHO) and Layer 1/Layer 2 Triggered Mobility (LTM) mechanisms in 5G. While these reactive HO strategies address the trade-off between HO failures (HOF) and ping-pong effects, they often result in inefficient radio resource utilization due to additional HO preparations. To overcome these challenges, this article proposes a proactive HO framework for mobility management in O-RAN, leveraging user-cell link predictions to identify the optimal target cell for HO. We explore various categories of Graph Neural Networks (GNNs) for link prediction and analyze the complexity of applying them to the mobility management domain. Two GNN models are compared using a real-world dataset, with experimental results demonstrating their ability to capture the dynamic and graph-structured nature of cellular networks. Finally, we present key insights from our study and outline future steps to enable the integration of GNN-based link prediction for mobility management in 6G networks.

[AI-26] Standard Neural Computation Alone Is Insufficient for Logical Intelligence

Link: https://arxiv.org/abs/2502.02135
Authors: Youngsung Kim
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Neural networks, as currently designed, fall short of achieving true logical intelligence. Modern AI models rely on standard neural computation (inner-product-based transformations and nonlinear activations) to approximate patterns from data. While effective for inductive learning, this architecture lacks the structural guarantees necessary for deductive inference and logical consistency. As a result, deep networks struggle with rule-based reasoning, structured generalization, and interpretability without extensive post-hoc modifications. This position paper argues that standard neural layers must be fundamentally rethought to integrate logical reasoning. We advocate for Logical Neural Units (LNUs), modular components that embed differentiable approximations of logical operations (e.g., AND, OR, NOT) directly within neural architectures. We critique existing neurosymbolic approaches, highlight the limitations of standard neural computation for logical inference, and present LNUs as a necessary paradigm shift in AI. Finally, we outline a roadmap for implementation, discussing theoretical foundations, architectural integration, and key challenges for future research.
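One common way to make AND, OR and NOT differentiable is to relax truth values to the interval [0, 1] and use the product form of fuzzy logic. The sketch below assumes that relaxation purely for illustration; the abstract does not say which approximation LNUs use, and the function names are hypothetical.

```python
def soft_not(a):
    """Smooth negation: 1 - a."""
    return 1.0 - a


def soft_and(a, b):
    """Smooth conjunction (product t-norm): a * b."""
    return a * b


def soft_or(a, b):
    """Smooth disjunction (probabilistic sum): a + b - a * b."""
    return a + b - a * b


# On the Boolean corners {0, 1} these reduce to the ordinary truth tables.
corners = [(a, b) for a in (0.0, 1.0) for b in (0.0, 1.0)]
and_table = [soft_and(a, b) for a, b in corners]
or_table = [soft_or(a, b) for a, b in corners]
```

Each operation is a polynomial in its inputs, so gradients flow through it like any other layer, while inputs of exactly 0 or 1 recover classical logic.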

[AI-27] Synthesis of Model Predictive Control and Reinforcement Learning: Survey and Classification

Link: https://arxiv.org/abs/2502.02133
Authors: Rudolf Reiter, Jasper Hoffmann, Dirk Reinhardt, Florian Messerer, Katrin Baumgärtner, Shamburaj Sawant, Joschka Boedecker, Moritz Diehl, Sebastien Gros
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The fields of MPC and RL consider two successful control techniques for Markov decision processes. Both approaches are derived from similar fundamental principles, and both are widely used in practical applications, including robotics, process control, energy systems, and autonomous driving. Despite their similarities, MPC and RL follow distinct paradigms that emerged from diverse communities and different requirements. Various technical discrepancies, particularly the role of an environment model as part of the algorithm, lead to methodologies with nearly complementary advantages. Due to their orthogonal benefits, research interest in combination methods has recently increased significantly, leading to a large and growing set of complex ideas leveraging MPC and RL. This work illuminates the differences, similarities, and fundamentals that allow for different combination algorithms and categorizes existing work accordingly. Particularly, we focus on the versatile actor-critic RL approach as a basis for our categorization and examine how the online optimization approach of MPC can be used to improve the overall closed-loop performance of a policy.

[AI-28] How Memory in Optimization Algorithms Implicitly Modifies the Loss

Link: https://arxiv.org/abs/2502.02132
Authors: Matias D. Cattaneo, Boris Shigida
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:In modern optimization methods used in deep learning, each update depends on the history of previous iterations, often referred to as memory, and this dependence decays fast as the iterates go further into the past. For example, gradient descent with momentum has exponentially decaying memory through exponentially averaged past gradients. We introduce a general technique for identifying a memoryless algorithm that approximates an optimization algorithm with memory. It is obtained by replacing all past iterates in the update by the current one, and then adding a correction term arising from memory (also a function of the current iterate). This correction term can be interpreted as a perturbation of the loss, and the nature of this perturbation can inform how memory implicitly (anti-)regularizes the optimization dynamics. As an application of our theory, we find that Lion does not have the kind of implicit anti-regularization induced by memory that AdamW does, providing a theory-based explanation for Lion’s better generalization performance recently documented.
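The statement that momentum carries exponentially decaying memory can be checked numerically. The sketch below assumes the heavy-ball form m_t = beta * m_{t-1} + g_t on scalar gradients, with hypothetical function names, and shows only the leading "replace past gradients by the current one" step. The memory-induced correction term derived in the paper is not reproduced here.

```python
def momentum_recursive(grads, beta):
    """Heavy-ball buffer computed step by step: m_t = beta * m_{t-1} + g_t."""
    m = 0.0
    for g in grads:
        m = beta * m + g
    return m


def momentum_unrolled(grads, beta):
    """Same buffer written as an explicit sum with weights beta ** (age of gradient)."""
    t = len(grads)
    return sum(beta ** (t - 1 - k) * g for k, g in enumerate(grads))


def memoryless_limit(current_grad, beta):
    """Buffer obtained if every past gradient equalled the current one (geometric series)."""
    return current_grad / (1.0 - beta)


grads = [1.0, -2.0, 0.5, 3.0]
beta = 0.9
recursive_value = momentum_recursive(grads, beta)
unrolled_value = momentum_unrolled(grads, beta)
```

A gradient that is k steps old enters the buffer with weight beta ** k, so for beta = 0.9 anything older than a few dozen steps contributes almost nothing.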

[AI-29] Causally-informed Deep Learning towards Explainable and Generalizable Outcomes Prediction in Critical Care

Link: https://arxiv.org/abs/2502.02109
Authors: Yuxiao Cheng, Xinxin Song, Ziqian Wang, Qin Zhong, Kunlun He, Jinli Suo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent advances in deep learning (DL) have prompted the development of high-performing early warning score (EWS) systems, predicting clinical deteriorations such as acute kidney injury, acute myocardial infarction, or circulatory failure. DL models have proven to be powerful tools for various tasks but come at the cost of poor interpretability and limited generalizability, hindering their clinical applications. To develop a practical EWS system applicable to various outcomes, we propose a causally-informed explainable early prediction model, which leverages causal discovery to identify the underlying causal relationships of prediction and thus has two distinct advantages: it provides an explicit interpretation of the prediction while exhibiting decent performance when applied to unfamiliar environments. Benefiting from these features, our approach achieves superior accuracy for 6 different critical deteriorations and achieves better generalizability across different patient groups, compared to various baseline algorithms. In addition, we provide explicit causal pathways to serve as references for assisting clinical diagnosis and potential interventions. The proposed approach enhances the practical application of deep learning in various medical scenarios.

[AI-30] Neural Networks Learn Distance Metrics

Link: https://arxiv.org/abs/2502.02103
Authors: Alan Oursland
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 14 pages, 1 figure. Code and additional resources available at this https URL

Abstract:Neural networks may naturally favor distance-based representations, where smaller activations indicate closer proximity to learned prototypes. This contrasts with intensity-based approaches, which rely on activation magnitudes. To test this hypothesis, we conducted experiments with six MNIST architectural variants constrained to learn either distance or intensity representations. Our results reveal that the underlying representation affects model performance. We develop a novel geometric framework that explains these findings and introduce OffsetL2, a new architecture based on Mahalanobis distance equations, to further validate this framework. This work highlights the importance of considering distance-based learning in neural network design.
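For readers unfamiliar with the Mahalanobis distance mentioned above, a minimal version with a diagonal covariance is sketched below. The diagonal restriction and the function name are assumptions made for brevity; the abstract does not describe the actual OffsetL2 layer, so this shows only the distance itself, where a smaller value means the input is closer to the prototype.

```python
import math


def diagonal_mahalanobis(x, prototype, variances):
    """Distance of x from a prototype, scaling each dimension by its variance."""
    return math.sqrt(
        sum((xi - pi) ** 2 / vi for xi, pi, vi in zip(x, prototype, variances))
    )


# A point one standard deviation away along each of two axes.
example_distance = diagonal_mahalanobis([1.0, 2.0], [0.0, 0.0], [1.0, 4.0])
```

Under a distance-based reading, an activation near zero signals a close match to the learned prototype, whereas an intensity-based reading treats large activations as the strong signal.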

[AI-31] Online Clustering of Dueling Bandits

Link: https://arxiv.org/abs/2502.02079
Authors: Zhiyong Wang, Jiahang Sun, Mingze Kong, Jize Xie, Qinghua Hu, John C.S. Lui, Zhongxiang Dai
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:The contextual multi-armed bandit (MAB) is a widely used framework for problems requiring sequential decision-making under uncertainty, such as recommendation systems. In applications involving a large number of users, the performance of contextual MAB can be significantly improved by facilitating collaboration among multiple users. This has been achieved by the clustering of bandits (CB) methods, which adaptively group the users into different clusters and achieve collaboration by allowing the users in the same cluster to share data. However, classical CB algorithms typically rely on numerical reward feedback, which may not be practical in certain real-world applications. For instance, in recommendation systems, it is more realistic and reliable to solicit preference feedback between pairs of recommended items rather than absolute rewards. To address this limitation, we introduce the first “clustering of dueling bandit algorithms” to enable collaborative decision-making based on preference feedback. We propose two novel algorithms: (1) Clustering of Linear Dueling Bandits (COLDB) which models the user reward functions as linear functions of the context vectors, and (2) Clustering of Neural Dueling Bandits (CONDB) which uses a neural network to model complex, non-linear user reward functions. Both algorithms are supported by rigorous theoretical analyses, demonstrating that user collaboration leads to improved regret bounds. Extensive empirical evaluations on synthetic and real-world datasets further validate the effectiveness of our methods, establishing their potential in real-world applications involving multiple users with preference-based feedback.
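Preference feedback differs from numerical reward in that the learner only observes which of two items won. A minimal simulation is sketched below, assuming a linear latent reward and a logistic (Bradley-Terry style) link between reward difference and win probability. That link is a common modelling choice in dueling bandits but is an assumption here, as are all names and numbers; the abstract does not specify the feedback model, and the clustering step of COLDB and CONDB is not shown.

```python
import math
import random


def preference_probability(theta, x1, x2):
    """Probability that item x1 is preferred to x2 under a linear reward theta . x."""
    diff = sum(t * (a - b) for t, a, b in zip(theta, x1, x2))
    return 1.0 / (1.0 + math.exp(-diff))


def sample_preference(theta, x1, x2, rng):
    """Return 1 if x1 wins the duel and 0 otherwise; this bit is all the learner sees."""
    return 1 if rng.random() < preference_probability(theta, x1, x2) else 0


theta = [1.0, -0.5]              # hypothetical user preference vector
item_a, item_b = [1.0, 0.0], [0.0, 1.0]
rng = random.Random(0)           # seeded for reproducibility
duels = [sample_preference(theta, item_a, item_b, rng) for _ in range(1000)]
p_a_wins = preference_probability(theta, item_a, item_b)
```

Users in the same cluster would share a similar `theta`, which is what makes pooling their duel outcomes useful.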

[AI-32] CH-MARL: Constrained Hierarchical Multiagent Reinforcement Learning for Sustainable Maritime Logistics

Link: https://arxiv.org/abs/2502.02060
Authors: Saad Alqithami
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:Addressing global challenges such as greenhouse gas emissions and resource inequity demands advanced AI-driven coordination among autonomous agents. We propose CH-MARL (Constrained Hierarchical Multiagent Reinforcement Learning), a novel framework that integrates hierarchical decision-making with dynamic constraint enforcement and fairness-aware reward shaping. CH-MARL employs a real-time constraint-enforcement layer to ensure adherence to global emission caps, while incorporating fairness metrics that promote equitable resource distribution among agents. Experiments conducted in a simulated maritime logistics environment demonstrate considerable reductions in emissions, along with improvements in fairness and operational efficiency. Beyond this domain-specific success, CH-MARL provides a scalable, generalizable solution to multi-agent coordination challenges in constrained, dynamic settings, thus advancing the state of the art in reinforcement learning.

[AI-33] From Human Hands to Robotic Limbs: A Study in Motor Skill Embodiment for Telemanipulation

Link: https://arxiv.org/abs/2502.02036
Authors: Haoyi Shi, Mingxi Su, Ted Morris, Vassilios Morellas, Nikolaos Papanikolopoulos
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:This paper presents a teleoperation system for controlling a redundant degree of freedom robot manipulator using human arm gestures. We propose a GRU-based Variational Autoencoder to learn a latent representation of the manipulator’s configuration space, capturing its complex joint kinematics. A fully connected neural network maps human arm configurations into this latent space, allowing the system to mimic and generate corresponding manipulator trajectories in real time through the VAE decoder. The proposed method shows promising results in teleoperating the manipulator, enabling the generation of novel manipulator configurations from human features that were not present during training.

[AI-34] Multi-Domain Graph Foundation Models: Robust Knowledge Transfer via Topology Alignment

Link: https://arxiv.org/abs/2502.02017
Authors: Shuo Wang, Bokui Wang, Zhixiang Shen, Boyan Deng, Zhao Kang
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent advances in CV and NLP have inspired researchers to develop general-purpose graph foundation models through pre-training across diverse domains. However, a fundamental challenge arises from the substantial differences in graph topologies across domains. Additionally, real-world graphs are often sparse and prone to noisy connections and adversarial attacks. To address these issues, we propose the Multi-Domain Graph Foundation Model (MDGFM), a unified framework that aligns and leverages cross-domain topological information to facilitate robust knowledge transfer. MDGFM bridges different domains by adaptively balancing features and topology while refining original graphs to eliminate noise and align topological structures. To further enhance knowledge transfer, we introduce an efficient prompt-tuning approach. By aligning topologies, MDGFM not only improves multi-domain pre-training but also enables robust knowledge transfer to unseen domains. Theoretical analyses provide guarantees of MDGFM’s effectiveness and domain generalization capabilities. Extensive experiments on both homophilic and heterophilic graph datasets validate the robustness and efficacy of our method.

[AI-35] A Periodic Bayesian Flow for Material Generation ICLR25

Link: https://arxiv.org/abs/2502.02016
Authors: Hanlin Wu, Yuxuan Song, Jingjing Gong, Ziyao Cao, Yawen Ouyang, Jianbing Zhang, Hao Zhou, Wei-Ying Ma, Jingjing Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICLR25

Abstract:Generative modeling of crystal data distribution is an important yet challenging task due to the unique periodic physical symmetry of crystals. Diffusion-based methods have shown early promise in modeling crystal distribution. More recently, Bayesian Flow Networks were introduced to aggregate noisy latent variables, resulting in a variance-reduced parameter space that has been shown to be advantageous for modeling Euclidean data distributions with structural constraints (Song et al., 2023). Inspired by this, we seek to unlock its potential for modeling variables located in non-Euclidean manifolds, e.g., those within crystal structures, by overcoming challenging theoretical issues. We introduce CrysBFN, a novel crystal generation method by proposing a periodic Bayesian flow, which essentially differs from the original Gaussian-based BFN by exhibiting non-monotonic entropy dynamics. To successfully realize the concept of periodic Bayesian flow, CrysBFN integrates a new entropy conditioning mechanism and empirically demonstrates its significance compared to time-conditioning. Extensive experiments over both crystal ab initio generation and crystal structure prediction tasks demonstrate the superiority of CrysBFN, which consistently achieves new state-of-the-art on all benchmarks. Surprisingly, we found that CrysBFN enjoys a significant improvement in sampling efficiency, e.g., a ~100x speedup (10 vs. 2000 network forward steps) compared with previous diffusion-based methods on the MP-20 dataset. Code is available at this https URL.

[AI-36] Analytical Lyapunov Function Discovery: An RL-based Generative Approach

Link: https://arxiv.org/abs/2502.02014
Authors: Haohan Zou, Jie Feng, Hao Zhao, Yuanyuan Shi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC); Systems and Control (eess.SY)
Comments: 26 pages (8+18), preprint for discussion. Haohan and Jie contribute equally

Abstract:Despite advances in learning-based methods, finding valid Lyapunov functions for nonlinear dynamical systems remains challenging. Current neural network approaches face two main issues: challenges in scalable verification and limited interpretability. To address these, we propose an end-to-end framework using transformers to construct analytical Lyapunov functions (local), which simplifies formal verification, enhances interpretability, and provides valuable insights for control engineers. Our framework consists of a transformer-based trainer that generates candidate Lyapunov functions and a falsifier that verifies candidate expressions and refines the model via risk-seeking policy gradient. Unlike Alfarano et al. (2024), which utilizes pre-training and seeks global Lyapunov functions for low-dimensional systems, our model is trained from scratch via reinforcement learning (RL) and succeeds in finding local Lyapunov functions for high-dimensional and non-polynomial systems. Given the analytical nature of the candidates, we employ efficient optimization methods for falsification during training and formal verification tools for the final verification. We demonstrate the efficiency of our approach on a range of nonlinear dynamical systems with up to ten dimensions and show that it can discover Lyapunov functions not previously identified in the control literature.

[AI-37] LLMSecConfig: An LLM-Based Approach for Fixing Software Container Misconfigurations

Link: https://arxiv.org/abs/2502.02009
Authors: Ziyang Ye, Triet Huynh Minh Le, M. Ali Babar
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Security misconfigurations in Container Orchestrators (COs) can pose serious threats to software systems. While Static Analysis Tools (SATs) can effectively detect these security vulnerabilities, the industry currently lacks automated solutions capable of fixing these misconfigurations. The emergence of Large Language Models (LLMs), with their proven capabilities in code understanding and generation, presents an opportunity to address this limitation. This study introduces LLMSecConfig, an innovative framework that bridges this gap by combining SATs with LLMs. Our approach leverages advanced prompting techniques and Retrieval-Augmented Generation (RAG) to automatically repair security misconfigurations while preserving operational functionality. Evaluation of 1,000 real-world Kubernetes configurations achieved a 94% success rate while maintaining a low rate of introducing new misconfigurations. Our work makes a promising step towards automated container security management, reducing the manual effort required for configuration maintenance.

[AI-38] Generative Data Mining with Longtail-Guided Diffusion

Link: https://arxiv.org/abs/2502.01980
Authors: David S. Hayden, Mao Ye, Timur Garipov, Gregory P. Meyer, Carl Vondrick, Zhao Chen, Yuning Chai, Eric Wolff, Siddhartha S. Srinivasa
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages

Abstract:It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on image classification benchmarks, and can be analyzed to proactively discover, explain, and address conceptual gaps in a predictive model.

[AI-39] DHP: Discrete Hierarchical Planning for Hierarchical Reinforcement Learning Agents

Link: https://arxiv.org/abs/2502.01956
Authors: Shashank Sharma, Janina Hoffmann, Vinay Namboodiri
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In this paper, we address the challenge of long-horizon visual planning tasks using Hierarchical Reinforcement Learning (HRL). Our key contribution is a Discrete Hierarchical Planning (DHP) method, an alternative to traditional distance-based approaches. We provide theoretical foundations for the method and demonstrate its effectiveness through extensive empirical evaluations. Our agent recursively predicts subgoals in the context of a long-term goal and receives discrete rewards for constructing plans as compositions of abstract actions. The method introduces a novel advantage estimation strategy for tree trajectories, which inherently encourages shorter plans and enables generalization beyond the maximum tree depth. The learned policy function allows the agent to plan efficiently, requiring only \log N computational steps, making re-planning highly efficient. The agent, based on a soft actor-critic (SAC) framework, is trained using on-policy imagination data. Additionally, we propose a novel exploration strategy that enables the agent to generate relevant training examples for the planning modules. We evaluate our method on long-horizon visual planning tasks in a 25-room environment, where it significantly outperforms previous benchmarks in success rate and average episode length. Furthermore, an ablation study highlights the individual contributions of key modules to the overall performance.
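
The recursive subgoal prediction described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: states are integers on a line, the learned policy is replaced by a hypothetical midpoint function `predict_subgoal`, and `is_primitive` is a hypothetical test for "reachable in one abstract action". The discrete rewards and the tree-based advantage estimation are not modeled.

```python
from typing import Callable, List, Tuple

State = int  # toy state type; the paper plans over visual observations


def build_plan(
    start: State,
    goal: State,
    predict_subgoal: Callable[[State, State], State],
    is_primitive: Callable[[State, State], bool],
    max_depth: int,
) -> List[State]:
    """Expand the full subgoal tree and return its leaves as a flat plan."""
    if max_depth == 0 or is_primitive(start, goal):
        return [start, goal]
    mid = predict_subgoal(start, goal)
    left = build_plan(start, mid, predict_subgoal, is_primitive, max_depth - 1)
    right = build_plan(mid, goal, predict_subgoal, is_primitive, max_depth - 1)
    return left + right[1:]  # drop the duplicated midpoint


def next_subgoal(
    start: State,
    goal: State,
    predict_subgoal: Callable[[State, State], State],
    is_primitive: Callable[[State, State], bool],
    max_depth: int,
) -> Tuple[State, int]:
    """Follow only the leftmost branch to find the next reachable subgoal.

    Returns the subgoal and the number of predictor calls used.
    """
    target, calls = goal, 0
    while calls < max_depth and not is_primitive(start, target):
        target = predict_subgoal(start, target)
        calls += 1
    return target, calls


if __name__ == "__main__":
    midpoint = lambda s, g: (s + g) // 2       # stand-in for the learned policy
    adjacent = lambda s, g: abs(g - s) <= 1    # stand-in for "one action away"
    print(build_plan(0, 8, midpoint, adjacent, max_depth=3))
    print(next_subgoal(0, 8, midpoint, adjacent, max_depth=3))
```

One plausible reading of the \log N claim is the second function: to act, the agent only needs the next subgoal, which lies on a single root-to-leaf path of the tree, so the number of predictor calls grows with the tree depth rather than with the plan length.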

[AI-40] VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play

Link: https://arxiv.org/abs/2502.01932
Authors: Zelai Xu, Chao Yu, Ruize Zhang, Huining Yuan, Xiangmin Yi, Shilong Ji, Chuqi Wang, Wenhao Tang, Yu Wang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Multi-agent reinforcement learning (MARL) has made significant progress, largely fueled by the development of specialized testbeds that enable systematic evaluation of algorithms in controlled yet challenging scenarios. However, existing testbeds often focus on purely virtual simulations or limited robot morphologies such as robotic arms, quadrupeds, and humanoids, leaving high-mobility platforms with real-world physical constraints like drones underexplored. To bridge this gap, we present VolleyBots, a new MARL testbed where multiple drones cooperate and compete in the sport of volleyball under physical dynamics. VolleyBots features a turn-based interaction model under volleyball rules, a hierarchical decision-making process that combines motion control and strategic play, and a high-fidelity simulation for seamless sim-to-real transfer. We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative MARL and game-theoretic algorithms. Results in simulation show that while existing algorithms handle simple tasks effectively, they encounter difficulty in complex tasks that require both low-level control and high-level strategy. We further demonstrate zero-shot deployment of a simulation-learned policy to real-world drones, highlighting VolleyBots’ potential to propel MARL research involving agile robotic platforms. The project page is at this https URL.

[AI-41] Distributionally Robust Direct Preference Optimization

Link: https://arxiv.org/abs/2502.01930
Authors: Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:A major challenge in aligning large language models (LLMs) with human preferences is the issue of distribution shift. LLM alignment algorithms rely on static preference datasets, assuming that they accurately represent real-world user preferences. However, user preferences vary significantly across geographical regions, demographics, linguistic patterns, and evolving cultural trends. This preference distribution shift leads to catastrophic alignment failures in many real-world applications. We address this problem using the principled framework of distributionally robust optimization, and develop two novel distributionally robust direct preference optimization (DPO) algorithms, namely, Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO). We characterize the sample complexity of learning the optimal policy parameters for WDPO and KLDPO. Moreover, we propose scalable gradient descent-style learning algorithms by developing suitable approximations for the challenging minimax loss functions of WDPO and KLDPO. Our empirical experiments demonstrate the superior performance of WDPO and KLDPO in substantially improving the alignment when there is a preference distribution shift.

[AI-42] LAST SToP For Modeling Asynchronous Time Series

Link: https://arxiv.org/abs/2502.01922
Authors: Shubham Gupta, Thibaut Durand, Graham Taylor, Lilian W. Białokozowicz
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We present a novel prompt design for Large Language Models (LLMs) tailored to Asynchronous Time Series. Unlike regular time series, which assume values at evenly spaced time points, asynchronous time series consist of timestamped events occurring at irregular intervals, each described in natural language. Our approach effectively utilizes the rich natural language of event descriptions, allowing LLMs to benefit from their broad world knowledge for reasoning across different domains and tasks. This allows us to extend the scope of asynchronous time series analysis beyond forecasting to include tasks like anomaly detection and data imputation. We further introduce Stochastic Soft Prompting, a novel prompt-tuning mechanism that significantly improves model performance, outperforming existing fine-tuning methods such as QLoRA. Through extensive experiments on real world datasets, we demonstrate that our approach achieves state-of-the-art performance across different tasks and datasets.

[AI-43] Wake-Informed 3D Path Planning for Autonomous Underwater Vehicles Using A* and Neural Network Approximations

Link: https://arxiv.org/abs/2502.01918
Authors: Zachary Cooper-Baldock, Stephen Turnock, Karl Sammut
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures, preprint of journal paper

Abstract:Autonomous Underwater Vehicles (AUVs) encounter significant energy, control and navigation challenges in complex underwater environments, particularly during close-proximity operations, such as launch and recovery (LAR), where fluid interactions and wake effects present additional navigational and energy challenges. Traditional path planning methods fail to incorporate these detailed wake structures, resulting in increased energy consumption, reduced control stability, and heightened safety risks. This paper presents a novel wake-informed, 3D path planning approach that fully integrates localized wake effects and global currents into the planning algorithm. Two variants of the A* algorithm - a current-informed planner and a wake-informed planner - are created to assess its validity and two neural network models are then trained to approximate these planners for real-time applications. Both the A* planners and NN models are evaluated using important metrics such as energy expenditure, path length, and encounters with high-velocity and turbulent regions. The results demonstrate a wake-informed A* planner consistently achieves the lowest energy expenditure and minimizes encounters with high-velocity regions, reducing energy consumption by up to 11.3%. The neural network models are observed to offer computational speedup of 6 orders of magnitude, but exhibit 4.51 - 19.79% higher energy expenditures and 9.81 - 24.38% less optimal paths. These findings underscore the importance of incorporating detailed wake structures into traditional path planning algorithms and the benefits of neural network approximations to enhance energy efficiency and operational safety for AUVs in complex 3D domains.

[AI-44] Displacement-Sparse Neural Optimal Transport

Link: https://arxiv.org/abs/2502.01889
Authors: Peter Chen, Yue Xie, Qingpeng Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 6 figures

Abstract:Optimal Transport (OT) theory seeks to determine the map T: X \to Y that transports a source measure P to a target measure Q, minimizing the cost c(\mathbf{x}, T(\mathbf{x})) between \mathbf{x} and its image T(\mathbf{x}). Building upon the Input Convex Neural Network OT solver and incorporating the concept of displacement-sparse maps, we introduce a sparsity penalty into the minimax Wasserstein formulation, promote sparsity in displacement vectors \Delta(\mathbf{x}) := T(\mathbf{x}) - \mathbf{x}, and enhance the interpretability of the resulting map. However, increasing sparsity often reduces feasibility, causing T_{\#}P to deviate more significantly from the target measure. In low-dimensional settings, we propose a heuristic framework to balance the trade-off between sparsity and feasibility by dynamically adjusting the sparsity intensity parameter during training. For high-dimensional settings, we directly constrain the dimensionality of displacement vectors by enforcing \dim(\Delta(\mathbf{x})) \leq l, where l < d for X \subseteq \mathbb{R}^d. Among maps satisfying this constraint, we aim to identify the most feasible one. This goal can be effectively achieved by adapting our low-dimensional heuristic framework without resorting to dimensionality reduction. We validate our method on both synthesized sc-RNA and real 4i cell perturbation datasets, demonstrating improvements over existing methods.

[AI-45] A Privacy-Preserving Domain Adversarial Federated learning for multi-site brain functional connectivity analysis

Link: https://arxiv.org/abs/2502.01885
Authors: Yipu Zhang, Likai Wang, Kuan-Jui Su, Aiying Zhang, Hao Zhu, Xiaowen Liu, Hui Shen, Vince D. Calhoun, Yuping Wang, Hongwen Deng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 34 pages, 13 figures

Abstract:Resting-state functional magnetic resonance imaging (rs-fMRI) and its derived functional connectivity networks (FCNs) have become critical for understanding neurological disorders. However, collaborative analyses and the generalizability of models still face significant challenges due to privacy regulations and the non-IID (non-independent and identically distributed) property of multiple data sources. To mitigate these difficulties, we propose Domain Adversarial Federated Learning (DAFed), a novel federated deep learning framework specifically designed for non-IID fMRI data analysis in multi-site settings. DAFed addresses these challenges through feature disentanglement, decomposing the latent feature space into domain-invariant and domain-specific components, to ensure robust global learning while preserving local data specificity. Furthermore, adversarial training facilitates effective knowledge transfer between labeled and unlabeled datasets, while a contrastive learning module enhances the global representation of domain-invariant features. We evaluated DAFed on the diagnosis of ASD and further validated its generalizability in the classification of AD, demonstrating its superior classification accuracy compared to state-of-the-art methods. Additionally, an enhanced Score-CAM module identifies key brain regions and functional connectivity significantly associated with ASD and MCI, respectively, uncovering shared neurobiological patterns across sites. These findings highlight the potential of DAFed to advance multi-site collaborative research in neuroimaging while protecting data confidentiality.

[AI-46] Online Curvature-Aware Replay: Leveraging 2nd Order Information for Online Continual Learning

Link: https://arxiv.org/abs/2502.01866
Authors: Edoardo Urettini, Antonio Carta
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Online Continual Learning (OCL) models continuously adapt to nonstationary data streams, usually without task information. These settings are complex and many traditional CL methods fail, while online methods (mainly replay-based) suffer from instabilities after the task shift. To address this issue, we formalize replay-based OCL as a second-order online joint optimization with explicit KL-divergence constraints on replay data. We propose Online Curvature-Aware Replay (OCAR) to solve the problem: a method that leverages second-order information of the loss using a K-FAC approximation of the Fisher Information Matrix (FIM) to precondition the gradient. The FIM acts as a stabilizer to prevent forgetting while also accelerating the optimization in non-interfering directions. We show how to adapt the estimation of the FIM to a continual setting stabilizing second-order optimization for non-iid data, uncovering the role of the Tikhonov regularization in the stability-plasticity tradeoff. Empirical results show that OCAR outperforms state-of-the-art methods in continual metrics achieving higher average accuracy throughout the training process in three different benchmarks.
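
The abstract does not spell out OCAR's update rule, so the following is only the generic form of a Tikhonov-damped, Fisher-preconditioned gradient step, given to illustrate why the damping term matters. The symbols \eta (learning rate), \lambda (Tikhonov damping), \hat{F}_t (a Fisher estimate, which the paper obtains via K-FAC) and \mathcal{L} (the joint loss on new and replay data) are notation assumed here, not taken from the paper.

```latex
\theta_{t+1} = \theta_t - \eta \left( \hat{F}_t + \lambda I \right)^{-1} \nabla_\theta \mathcal{L}(\theta_t)

% Limiting behaviour of the preconditioner:
\left( \hat{F}_t + \lambda I \right)^{-1} \to \hat{F}_t^{-1}
  \quad \text{as } \lambda \to 0 \ (\text{if } \hat{F}_t \text{ is invertible}),
\qquad
\left( \hat{F}_t + \lambda I \right)^{-1} \approx \tfrac{1}{\lambda} I
  \quad \text{as } \lambda \to \infty
```

With small \lambda the step follows the estimated curvature closely, while with large \lambda it degrades to plain gradient descent with step size \eta / \lambda. This interpolation is presumably the knob behind the stability-plasticity tradeoff the abstract mentions.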

[AI-47] Learning Human Perception Dynamics for Informative Robot Communication

Link: https://arxiv.org/abs/2502.01857
Authors: Shenghui Chen, Ruihan Zhao, Sandeep Chinchali, Ufuk Topcu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Human-robot cooperative navigation is challenging in environments with incomplete information. We introduce CoNav-Maze, a simulated robotics environment where a robot navigates using local perception while a human operator provides guidance based on an inaccurate map. The robot can share its camera views to improve the operator’s understanding of the environment. To enable efficient human-robot cooperation, we propose Information Gain Monte Carlo Tree Search (IG-MCTS), an online planning algorithm that balances autonomous movement and informative communication. Central to IG-MCTS is a neural human perception dynamics model that estimates how humans distill information from robot communications. We collect a dataset through a crowdsourced mapping task in CoNav-Maze and train this model using a fully convolutional architecture with data augmentation. User studies show that IG-MCTS outperforms teleoperation and instruction-following baselines, achieving comparable task performance with significantly less communication and lower human cognitive load, as evidenced by eye-tracking metrics.

[AI-48] Sample Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification

Link: https://arxiv.org/abs/2502.01839
Authors: Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one – typically by verifying each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation that uses only random sampling and direct self-verification results in sustained performance improvements that, for example, elevate the Gemini v1.5 Pro model’s reasoning capabilities past that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts – chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies.
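
The minimalist loop the abstract describes (random sampling followed by direct self-verification) can be sketched as follows. The callables `sample` and `verify` are hypothetical stand-ins for model calls and are not part of any real API; averaging several verification calls per candidate is also an assumption made here for illustration.

```python
from typing import Callable, List, Tuple


def sampling_based_search(
    sample: Callable[[], str],
    verify: Callable[[str], float],
    n_samples: int,
    n_verifications: int = 1,
) -> Tuple[str, List[float]]:
    """Sample candidates, score each by averaged self-verification, return the best.

    `sample()` returns one candidate response; `verify(response)` returns a
    correctness score in [0, 1]. Both are hypothetical stand-ins for model calls.
    """
    if n_samples < 1 or n_verifications < 1:
        raise ValueError("n_samples and n_verifications must be >= 1")
    candidates = [sample() for _ in range(n_samples)]
    scores = [
        sum(verify(c) for _ in range(n_verifications)) / n_verifications
        for c in candidates
    ]
    # Pick the index of the highest score; ties resolve to the earliest candidate.
    best = max(range(n_samples), key=lambda i: scores[i])
    return candidates[best], scores


if __name__ == "__main__":
    # Toy stand-ins: a "model" that yields fixed answers and a "verifier"
    # that only accepts the answer "42".
    answers = iter(["41", "42", "43", "42"])
    best, scores = sampling_based_search(
        sample=lambda: next(answers),
        verify=lambda r: 1.0 if r == "42" else 0.0,
        n_samples=4,
    )
    print(best, scores)
```

Both `n_samples` and `n_verifications` are the axes along which such a loop can be scaled. The comparison-across-responses principle from the abstract is not modeled in this sketch.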

[AI-49] TESS: A Scalable Temporally and Spatially Local Learning Rule for Spiking Neural Networks

Link: https://arxiv.org/abs/2502.01837
Authors: Marco Paul E. Apolinario, Kaushik Roy, Charlotte Frenkel
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 2 figures

Abstract:The demand for low-power inference and training of deep neural networks (DNNs) on edge devices has intensified the need for algorithms that are both scalable and energy-efficient. While spiking neural networks (SNNs) allow for efficient inference by processing complex spatio-temporal dynamics in an event-driven fashion, training them on resource-constrained devices remains challenging due to the high computational and memory demands of conventional error backpropagation (BP)-based approaches. In this work, we draw inspiration from biological mechanisms such as eligibility traces, spike-timing-dependent plasticity, and neural activity synchronization to introduce TESS, a temporally and spatially local learning rule for training SNNs. Our approach addresses both temporal and spatial credit assignments by relying solely on locally available signals within each neuron, thereby allowing computational and memory overheads to scale linearly with the number of neurons, independently of the number of time steps. Despite relying on local mechanisms, we demonstrate performance comparable to the backpropagation through time (BPTT) algorithm, within ~1.4 accuracy points on challenging computer vision scenarios relevant at the edge, such as the IBM DVS Gesture dataset, CIFAR10-DVS, and temporal versions of CIFAR10 and CIFAR100. Being able to produce comparable performance to BPTT while keeping low time and memory complexity, TESS enables efficient and scalable on-device learning at the edge.

[AI-50] Building a Cognitive Twin Using a Distributed Cognitive System and an Evolution Strategy

Link: https://arxiv.org/abs/2502.01834
Authors: Wandemberg Gibaut, Ricardo Gudwin
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: first submitted on 09/22/2022, published on 01/20/2025

Abstract:This work presents a technique to build interaction-based Cognitive Twins (a computational version of an external agent) using input-output training and an Evolution Strategy on top of a framework for distributed Cognitive Architectures. Here, we show that it’s possible to orchestrate many simple physical and virtual devices to achieve good approximations of a person’s interaction behavior by training the system in an end-to-end fashion and present performance metrics. The generated Cognitive Twin may later be used to automate tasks, generate more realistic human-like artificial agents or further investigate its behaviors.

[AI-51] Assessing Data Augmentation-Induced Bias in Training and Testing of Machine Learning Models

Link: https://arxiv.org/abs/2502.01825
Authors: Riddhi More, Jeremy S. Bradbury
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 4 pages

Abstract:Data augmentation has become a standard practice in software engineering to address limited or imbalanced data sets, particularly in specialized domains like test classification and bug detection where data can be scarce. Although techniques such as SMOTE and mutation-based augmentation are widely used in software testing and debugging applications, a rigorous understanding of how augmented training data impacts model bias is lacking. It is especially critical to consider bias in scenarios where augmented data sets are used not just in training but also in testing models. Through a comprehensive case study of flaky test classification, we demonstrate how to test for bias and understand the impact that the inclusion of augmented samples in testing sets can have on model evaluation.

[AI-52] Agentic Bug Reproduction for Effective Automated Program Repair at Google

Link: https://arxiv.org/abs/2502.01821
Authors: Runxiang Cheng, Michele Tufano, Jürgen Cito, José Cambronero, Pat Rondon, Renyao Wei, Aaron Sun, Satish Chandra
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Bug reports often lack sufficient detail for developers to reproduce and fix the underlying defects. Bug Reproduction Tests (BRTs), tests that fail when the bug is present and pass when it has been resolved, are crucial for debugging, but they are rarely included in bug reports, both in open-source and in industrial settings. Thus, automatically generating BRTs from bug reports has the potential to accelerate the debugging process and lower time to repair. This paper investigates automated BRT generation within an industry setting, specifically at Google, focusing on the challenges of a large-scale, proprietary codebase and considering real-world industry bugs extracted from Google’s internal issue tracker. We adapt and evaluate a state-of-the-art BRT generation technique, LIBRO, and present our agent-based approach, BRT Agent, which makes use of a fine-tuned Large Language Model (LLM) for code editing. Our BRT Agent significantly outperforms LIBRO, achieving a 28% plausible BRT generation rate, compared to 10% by LIBRO, on 80 human-reported bugs from Google’s internal issue tracker. We further investigate the practical value of generated BRTs by integrating them with an Automated Program Repair (APR) system at Google. Our results show that providing BRTs to the APR system results in 30% more bugs with plausible fixes. Additionally, we introduce Ensemble Pass Rate (EPR), a metric which leverages the generated BRTs to select the most promising fixes from all fixes generated by the APR system. Our evaluation of EPR for Top-K and threshold-based fix selection demonstrates promising results and trade-offs. For example, EPR correctly selects a plausible fix from a pool of 20 candidates in 70% of cases, based on its top-1 ranking.
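
The abstract names Ensemble Pass Rate but does not define it. The sketch below assumes one natural reading: the EPR of a candidate fix is the fraction of generated BRTs that pass once the fix is applied, and fixes are then chosen by Top-K ranking or by a threshold. The function names and the boolean-list input format are hypothetical, introduced only for illustration.

```python
from typing import Dict, List


def ensemble_pass_rate(brt_results: List[bool]) -> float:
    """Assumed definition: share of generated BRTs that pass with the fix applied."""
    if not brt_results:
        return 0.0
    return sum(brt_results) / len(brt_results)


def rank_fixes(outcomes: Dict[str, List[bool]]) -> List[str]:
    """Order candidate fixes from highest to lowest pass rate (Top-K selection)."""
    return sorted(
        outcomes,
        key=lambda fix: ensemble_pass_rate(outcomes[fix]),
        reverse=True,
    )


def select_by_threshold(outcomes: Dict[str, List[bool]], threshold: float) -> List[str]:
    """Keep every fix whose pass rate meets the threshold, best first."""
    return [
        fix
        for fix in rank_fixes(outcomes)
        if ensemble_pass_rate(outcomes[fix]) >= threshold
    ]


if __name__ == "__main__":
    # Hypothetical outcomes: for each candidate fix, whether each BRT passed.
    outcomes = {
        "fix_a": [True, False, False],
        "fix_b": [True, True, True],
        "fix_c": [True, True, False],
    }
    print(rank_fixes(outcomes)[:1])            # top-1 selection
    print(select_by_threshold(outcomes, 0.5))  # threshold-based selection
```

Top-K always returns a fix, even when none is good, whereas a threshold can return nothing. That difference is one source of the trade-offs the abstract alludes to.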

[AI-53] Score as Action: Fine-Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning

Link: https://arxiv.org/abs/2502.01819
Authors: Hanyang Zhao, Haoxian Chen, Ji Zhang, David D. Yao, Wenpin Tang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: arXiv admin note: text overlap with arXiv:2409.08400

Abstract:Reinforcement learning from human feedback (RLHF), which aligns a diffusion model with the input prompt, has become a crucial step in building reliable generative AI models. Most works in this area use a discrete-time formulation, which is prone to induced errors and often not applicable to models with higher-order/black-box solvers. The objective of this study is to develop a disciplined approach to fine-tune diffusion models using continuous-time RL, formulated as a stochastic control problem with a reward function that aligns the end result (terminal state) with the input prompt. The key idea is to treat score matching as controls or actions, thereby making connections to policy optimization and regularization in continuous-time RL. To carry out this idea, we lay out a new policy optimization framework for continuous-time RL, and illustrate its potential in enhancing the value network design space by leveraging the structural property of diffusion models. We validate the advantages of our method by experiments in downstream tasks of fine-tuning large-scale Text2Image models of Stable Diffusion v1.5.

[AI-54] Toward Neurosymbolic Program Comprehension

Link: https://arxiv.org/abs/2502.01806
Authors: Alejandro Velasco, Aya Garryyeva, David N. Palacio, Antonio Mastropaolo, Denys Poshyvanyk
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Recent advancements in Large Language Models (LLMs) have paved the way for Large Code Models (LCMs), enabling automation in complex software engineering tasks, such as code generation, software testing, and program comprehension, among others. Tools like GitHub Copilot and ChatGPT have shown substantial benefits in supporting developers across various practices. However, the ambition to scale these models to trillion-parameter sizes, exemplified by GPT-4, poses significant challenges that limit the usage of Artificial Intelligence (AI)-based systems powered by large Deep Learning (DL) models. These include rising computational demands for training and deployment and issues related to trustworthiness, bias, and interpretability. Such factors can make managing these models impractical for many organizations, while their "black-box" nature undermines key aspects, including transparency and accountability. In this paper, we question the prevailing assumption that increasing model parameters is always the optimal path forward, provided there is sufficient new data to learn additional patterns. In particular, we advocate for a Neurosymbolic research direction that combines the strengths of existing DL techniques (e.g., LLMs) with traditional symbolic methods, which are renowned for their reliability, speed, and determinism. To this end, we outline the core features and present preliminary results for our envisioned approach, aimed at establishing the first Neurosymbolic Program Comprehension (NsPC) framework to aid in identifying defective code components.

[AI-55] Discovering Chunks in Neural Embeddings for Interpretability

Link: https://arxiv.org/abs/2502.01803
Authors: Shuchen Wu, Stephan Alaniz, Eric Schulz, Zeynep Akata
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Understanding neural networks is challenging due to their high-dimensional, interacting components. Inspired by human cognition, which processes complex sensory data by chunking it into recurring entities, we propose leveraging this principle to interpret artificial neural population activities. Biological and artificial intelligence share the challenge of learning from structured, naturalistic data, and we hypothesize that the cognitive mechanism of chunking can provide insights into artificial systems. We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities, observing that their hidden states reflect these patterns, which can be extracted as a dictionary of chunks that influence network responses. Extending this to large language models (LLMs) like LLaMA, we identify similar recurring embedding states corresponding to concepts in the input, with perturbations to these states activating or inhibiting the associated concepts. By exploring methods to extract dictionaries of identifiable chunks across neural embeddings of varying complexity, our findings introduce a new framework for interpreting neural networks, framing their population activity as structured reflections of the data they process.

[AI-56] Flow-based Domain Randomization for Learning and Sequencing Robotic Skills

Link: https://arxiv.org/abs/2502.01800
Authors: Aidan Curtis, Eric Li, Michael Noseworthy, Nishad Gothoskar, Sachin Chitta, Hui Li, Leslie Pack Kaelbling, Nicole Carey
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Domain randomization in reinforcement learning is an established technique for increasing the robustness of control policies trained in simulation. By randomizing environment properties during training, the learned policy can become robust to uncertainties along the randomized dimensions. While the environment distribution is typically specified by hand, in this paper we investigate automatically discovering a sampling distribution via entropy-regularized reward maximization of a normalizing-flow-based neural sampling distribution. We show that this architecture is more flexible and provides greater robustness than existing approaches that learn simpler, parameterized sampling distributions, as demonstrated in six simulated and one real-world robotics domain. Lastly, we explore how these learned sampling distributions, combined with a privileged value function, can be used for out-of-distribution detection in an uncertainty-aware multi-step manipulation planner.

[AI-57] An Agentic AI Workflow for Detecting Cognitive Concerns in Real-world Data

Link: https://arxiv.org/abs/2502.01789
Authors: Jiazi Tian, Liqin Wang, Pedram Fard, Valdery Moura Junior, Deborah Blacker, Jennifer S. Haas, Chirag Patel, Shawn N. Murphy, Lidia M.V.R. Moura, Hossein Estiri
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Notes:

Click to view abstract

Abstract:Early identification of cognitive concerns is critical but often hindered by subtle symptom presentation. This study developed and validated a fully automated, multi-agent AI workflow using LLaMA 3 8B to identify cognitive concerns in 3,338 clinical notes from Mass General Brigham. The agentic workflow, leveraging task-specific agents that dynamically collaborate to extract meaningful insights from clinical notes, was compared to an expert-driven benchmark. Both workflows achieved high classification performance, with F1-scores of 0.90 and 0.91, respectively. The agentic workflow demonstrated improved specificity (1.00) and achieved prompt refinement in fewer iterations. Although both workflows showed reduced performance on validation data, the agentic workflow maintained perfect specificity. These findings highlight the potential of fully automated multi-agent AI workflows to achieve expert-level accuracy with greater efficiency, offering a scalable and cost-effective solution for detecting cognitive concerns in clinical settings.

[AI-58] Grokking Explained: A Statistical Phenomenon

Link: https://arxiv.org/abs/2502.01774
Authors: Breno W. Carvalho, Artur S. d’Avila Garcez, Luís C. Lamb, Emílio Vital Brazil
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:Grokking, or delayed generalization, is an intriguing learning phenomenon where test set loss decreases sharply only after a model’s training set loss has converged. This challenges conventional understanding of the training dynamics in deep learning networks. In this paper, we formalize and investigate grokking, highlighting that a key factor in its emergence is a distribution shift between training and test data. We introduce two synthetic datasets specifically designed to analyze grokking. One dataset examines the impact of limited sampling, and the other investigates transfer learning’s role in grokking. By inducing distribution shifts through controlled imbalanced sampling of sub-categories, we systematically reproduce the phenomenon, demonstrating that while small-sampling is strongly associated with grokking, it is not its cause. Instead, small-sampling serves as a convenient mechanism for achieving the necessary distribution shift. We also show that when classes form an equivariant map, grokking can be explained by the model’s ability to learn from similar classes or sub-categories. Unlike earlier work suggesting that grokking primarily arises from high regularization and sparse data, we demonstrate that it can also occur with dense data and minimal hyper-parameter tuning. Our findings deepen the understanding of grokking and pave the way for developing better stopping criteria in future training processes.

[AI-59] Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers

Link: https://arxiv.org/abs/2502.01770
Authors: Mark Horton, Tergel Molom-Ochir, Peter Liu, Bhavna Gopal, Chiyue Wei, Cong Guo, Brady Taylor, Deliang Fan, Shan X. Wang, Hai Li, Yiran Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Notes:

Click to view abstract

Abstract:Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, often limiting real-world deployment due to their high computational and memory requirements. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into {-1, +1} vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, which further reduces the cost of processing long-context sequences. Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, leading to substantially improved accuracy compared to prior transformer binarization methods. We evaluate HAD on a range of tasks and models, including the GLUE benchmark, ImageNet, and QuALITY, demonstrating state-of-the-art performance among binarized Transformers while drastically reducing the computational costs of long-context inference. We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention. HAD achieves just 1.78% performance losses on GLUE compared to 9.08% in state-of-the-art binarization work, and 2.5% performance losses on ImageNet compared to 12.14%, all while targeting custom hardware with a 79% area reduction and 87% power reduction compared to its standard attention counterpart.
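
The replacement of dot products by Hamming distances rests on a simple identity: for two vectors in {-1, +1}^d, the dot product equals d minus twice the number of positions where they differ. The sketch below checks that identity in plain Python. It assumes sign-based binarization and bit-packing into an integer; the function names are hypothetical and are not taken from the HAD implementation, which also involves distillation and sparsification not shown here.

```python
def binarize(vec):
    """Map each real-valued entry to +1 or -1 by its sign (zero maps to +1)."""
    return [1 if x >= 0 else -1 for x in vec]


def pack_bits(signs):
    """Pack a +/-1 vector into an integer: bit i is set where signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def hamming_distance(a, b):
    """Count the bit positions in which the two packed vectors differ."""
    return bin(a ^ b).count("1")


def binary_dot(q_bits, k_bits, dim):
    """Dot product of two packed +/-1 vectors: dim - 2 * Hamming distance."""
    return dim - 2 * hamming_distance(q_bits, k_bits)


# Illustrative query and key (made-up values).
query = [0.3, -1.2, 0.8, -0.1]
key = [0.5, 0.4, -0.9, -0.7]

q_signs, k_signs = binarize(query), binarize(key)
exact = sum(q * k for q, k in zip(q_signs, k_signs))
fast = binary_dot(pack_bits(q_signs), pack_bits(k_signs), len(query))
```

Here `exact` and `fast` agree because each matching position contributes +1 and each differing position contributes -1 to the dot product. On hardware, the XOR plus population count replaces d multiply-accumulate operations.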

[AI-60] Robust Federated Finetuning of LLMs via Alternating Optimization of LoRA ICML24

Link: https://arxiv.org/abs/2502.01755
Authors: Shuangyi Chen, Yuanxin Guo, Yue Ju, Harik Dalal, Ashish Khisti
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: A preliminary version was in ICML24 workshop, arXiv:2409.02346

Click to view abstract

Abstract:Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) optimize federated training by reducing computational and communication costs. We propose RoLoRA, a federated framework using alternating optimization to fine-tune LoRA adapters. Our approach emphasizes the importance of learning up and down projection matrices to enhance expressiveness and robustness. We use both theoretical analysis and extensive experiments to demonstrate the advantages of RoLoRA over prior approaches that either generate imperfect model updates or limit expressiveness of the model. We present theoretical analysis on a simplified linear model to demonstrate the importance of learning both down-projection and up-projection matrices in LoRA. We provide extensive experimental evaluations on a toy neural network on MNIST as well as large language models including RoBERTa-Large, Llama-2-7B on diverse tasks to demonstrate the advantages of RoLoRA over other methods.
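
One way to read the "imperfect model updates" mentioned above: a LoRA update is a product of two factors, and averaging the factors separately across clients is not the same as averaging the products. If one factor is held fixed and shared during a round, the aggregation becomes exact. The toy below illustrates this with rank-1 adapters in plain Python. It is an interpretation for illustration only, with made-up values and hypothetical helper names; it is not the RoLoRA algorithm itself.

```python
def outer(col, row):
    """Outer product of two vectors, returned as a list of rows."""
    return [[c * r for r in row] for c in col]


def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def mean_matrices(mats):
    """Element-wise mean of equally sized matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[r][c] for m in mats) / n for c in range(cols)]
            for r in range(rows)]


# Two hypothetical clients, each holding a rank-1 adapter: update = outer(b, a).
client_b = [[1.0, 0.0], [0.0, 1.0]]
client_a = [[1.0, 0.0], [0.0, 1.0]]

# What the server would ideally apply: the mean of the clients' full updates.
ideal = mean_matrices([outer(b, a) for b, a in zip(client_b, client_a)])
# What it gets by averaging each factor separately and multiplying afterwards.
naive = outer(mean_vectors(client_b), mean_vectors(client_a))

# With the `a` factor frozen and shared, averaging only `b` is exact.
shared_a = [1.0, 2.0]
ideal_frozen = mean_matrices([outer(b, shared_a) for b in client_b])
agg_frozen = outer(mean_vectors(client_b), shared_a)
```

In this toy, `ideal` and `naive` differ while `ideal_frozen` and `agg_frozen` coincide. Alternating which factor is trained keeps each round's aggregation exact without permanently freezing either factor.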

[AI-61] Grokking vs. Learning: Same Features, Different Encodings

Link: https://arxiv.org/abs/2502.01739
Authors: Dmitry Manning-Coe, Jacopo Gliozzi, Alexander G. Stapleton, Edward Hirst, Giuseppe De Tomasi, Barry Bradlyn, David S. Berman
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)
Notes: Code available at: this https URL

Click to view abstract

Abstract:Grokking typically achieves similar loss to ordinary, “steady”, learning. We ask whether these different learning paths, grokking versus ordinary training, lead to fundamental differences in the learned models. To do so we compare the features, compressibility, and learning dynamics of models trained via each path in two tasks. We find that grokked and steadily trained models learn the same features, but there can be large differences in the efficiency with which these features are encoded. In particular, we find a novel “compressive regime” of steady training in which there emerges a linear trade-off between model loss and compressibility, and which is absent in grokking. In this regime, we can achieve compression factors of 25x relative to the base model, and 5x the compression achieved in grokking. We then track how model features and compressibility develop through training. We show that model development in grokking is task-dependent, and that peak compressibility is achieved immediately after the grokking plateau. Finally, novel information-geometric measures are introduced which demonstrate that models undergoing grokking follow a straight path in information space.

[AI-62] Process-Supervised Reinforcement Learning for Code Generation

Link: https://arxiv.org/abs/2502.01715
Authors: Yufan Ye, Ting Zhang, Wenbin Jiang, Hua Huang
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:Existing reinforcement learning strategies based on outcome supervision have proven effective in enhancing the performance of large language models (LLMs) for code generation. While reinforcement learning based on process supervision has shown great promise in handling multi-step reasoning tasks, its effectiveness in code generation remains largely underexplored and underjustified. The primary obstacle stems from the resource-intensive nature of constructing high-quality process-supervised data, which demands substantial human expertise and computational resources. In response to this challenge, we propose a “statement mutation/refactoring-compile and execution verification” strategy: mutating and refactoring code line-by-line through a teacher model, and utilizing compiler execution results to automatically label each line, resulting in line-by-line process-supervised data, which is pivotal for training a process-supervised reward model. The trained reward model is then integrated into the PRLCoder framework, followed by experimental validation on several benchmarks. Experimental results demonstrate that process-supervised reinforcement learning significantly surpasses methods relying solely on outcome supervision. Notably, in tackling complex code generation tasks, process-supervised reinforcement learning shows a clear advantage, ensuring both the integrity of the code generation process and the correctness of the generation results.

[AI-63] Position: Towards a Responsible LLM-empowered Multi-Agent Systems

Link: https://arxiv.org/abs/2502.01714
Authors: Jinwei Hu, Yi Dong, Shuang Ao, Zhuoyun Li, Boxuan Wang, Lokesh Singh, Guangliang Cheng, Sarvapali D. Ramchurn, Xiaowei Huang
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Notes: Under Review

Click to view abstract

Abstract:The rise of Agent AI and Large Language Model-powered Multi-Agent Systems (LLM-MAS) has underscored the need for responsible and dependable system operation. Tools like LangChain and Retrieval-Augmented Generation have expanded LLM capabilities, enabling deeper integration into MAS through enhanced knowledge retrieval and reasoning. However, these advancements introduce critical challenges: LLM agents exhibit inherent unpredictability, and uncertainties in their outputs can compound across interactions, threatening system stability. To address these risks, a human-centered design approach with active dynamic moderation is essential. Such an approach enhances traditional passive oversight by facilitating coherent inter-agent communication and effective system governance, allowing MAS to achieve desired outcomes more efficiently.

[AI-64] Aspects of Artificial Intelligence: Transforming Machine Learning Systems Naturally

Link: https://arxiv.org/abs/2502.01708
Authors: Xiuzhan Guo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Discrete Mathematics (cs.DM)
Notes:

Click to view abstract

Abstract:In this paper, we study machine learning elements of interest together as a machine learning system, consisting of a collection of machine learning elements and a collection of relations between the elements. The relations we consider are algebraic operations, binary relations, and binary relations with composition that can be reasoned about categorically. A machine learning system transformation between two systems is a map between the systems that preserves these relations. The system transformations given by quotient or clustering, representable functor, and Yoneda embedding are highlighted and discussed through machine learning examples. An adjunction between machine learning systems, a special machine learning system transformation loop, provides the optimal way of solving problems. Machine learning system transformations are linked and compared by maps between them at the 2-cell level, namely natural transformations. New insights and structures can be obtained from universal properties and algebraic structures given by monads, which are generated from adjunctions.

[AI-65] Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation

Link: https://arxiv.org/abs/2502.01694
Authors: Juno Kim, Denny Wu, Jason Lee, Taiji Suzuki
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: 55 pages, 3 figures

Click to view abstract

Abstract:A key paradigm to improve the reasoning capabilities of large language models (LLMs) is to allocate more inference-time compute to search against a verifier or reward model. This process can then be utilized to refine the pretrained model or distill its reasoning patterns into more efficient models. In this paper, we study inference-time compute by viewing chain-of-thought (CoT) generation as a metastable Markov process: easy reasoning steps (e.g., algebraic manipulations) form densely connected clusters, while hard reasoning steps (e.g., applying a relevant theorem) create sparse, low-probability edges between clusters, leading to phase transitions at longer timescales. Under this framework, we prove that implementing a search protocol that rewards sparse edges improves CoT by decreasing the expected number of steps to reach different clusters. In contrast, we establish a limit on reasoning capability when the model is restricted to local information of the pretrained graph. We also show that the information gained by search can be utilized to obtain a better reasoning model: (1) the pretrained model can be directly finetuned to favor sparse edges via policy gradient methods, and moreover (2) a compressed metastable representation of the reasoning dynamics can be distilled into a smaller, more efficient model.

[AI-66] Graph Neural Networks for Identifying Steady-State Behavior in Complex Networks

Link: https://arxiv.org/abs/2502.01693
Authors: Priodyuti Pradhan, Amit Reza
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO)
Notes: 12 pages, 7 figures

Click to view abstract

Abstract:In complex systems, information propagation can be defined as diffused or delocalized, weakly localized, and strongly localized. Can a machine learning model learn the behavior of a linear dynamical system on networks? In this work, we develop a graph neural network framework for identifying the steady-state behavior of the linear dynamical system. We reveal that our model learns the different states with high accuracy. To understand the explainability of our model, we provide an analytical derivation for the forward and backward propagation of our framework. Finally, we use the real-world graphs in our model for validation.

[AI-67] Fast Direct: Query-Efficient Online Black-box Guidance for Diffusion-model Target Generation

Link: https://arxiv.org/abs/2502.01692
Authors: Kim Yong Tan, Yueming Lyu, Ivor Tsang, Yew-Soon Ong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:Guided diffusion-model generation is a promising direction for customizing the generation process of a pre-trained diffusion model to address specific downstream tasks. Existing guided diffusion models either rely on training of the guidance model with pre-collected datasets or require the objective functions to be differentiable. However, for most real-world tasks, the offline datasets are often unavailable, and their objective functions are often not differentiable, such as image generation with human preferences, molecular generation for drug discovery, and material design. Thus, we need an online algorithm capable of collecting data during runtime and supporting a black-box objective function. Moreover, the query efficiency of the algorithm is also critical because the objective evaluation of the query is often expensive in real-world scenarios. In this work, we propose a novel and simple algorithm, Fast Direct, for query-efficient online black-box target generation. Our Fast Direct builds a pseudo-target on the data manifold to update the noise sequence of the diffusion model with a universal direction, which is promising to perform query-efficient guided generation. Extensive experiments on twelve high-resolution (1024 × 1024) image target generation tasks and six 3D-molecule target generation tasks show 6× up to 10× query efficiency improvement and 11× up to 44× query efficiency improvement, respectively. Our implementation is publicly available at: this https URL

[AI-68] Leveraging Joint Predictive Embedding and Bayesian Inference in Graph Self Supervised Learning

Link: https://arxiv.org/abs/2502.01684
Authors: Srinitish Srinivasan, Omkumar CU
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Notes:

Click to view abstract

Abstract:Graph representation learning has emerged as a cornerstone for tasks like node classification and link prediction, yet prevailing self-supervised learning (SSL) methods face challenges such as computational inefficiency, reliance on contrastive objectives, and representation collapse. Existing approaches often depend on feature reconstruction, negative sampling, or complex decoders, which introduce training overhead and hinder generalization. Further, current techniques which address such limitations fail to account for the contribution of node embeddings to a certain prediction in the absence of labeled nodes. To address these limitations, we propose a novel joint embedding predictive framework for graph SSL that eliminates contrastive objectives and negative sampling while preserving semantic and structural information. Additionally, we introduce a semantic-aware objective term that incorporates pseudo-labels derived from Gaussian Mixture Models (GMMs), enhancing node discriminability by evaluating latent feature contributions. Extensive experiments demonstrate that our framework outperforms state-of-the-art graph SSL methods across benchmarks, achieving superior performance without contrastive loss or complex decoders. Key innovations include (1) a non-contrastive, view-invariant joint embedding predictive architecture, (2) Leveraging single context and multiple targets relationship between subgraphs, and (3) GMM-based pseudo-label scoring to capture semantic contributions. This work advances graph SSL by offering a computationally efficient, collapse-resistant paradigm that bridges spatial and semantic graph features for downstream tasks. The code for our paper can be found at this https URL

[AI-69] Neurosymbolic AI for Travel Demand Prediction: Integrating Decision Tree Rules into Neural Networks

Link: https://arxiv.org/abs/2502.01680
Authors: Kamal Acharya, Mehul Lad, Liang Sun, Houbing Song
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 9 pages, 5 figures, this paper is under review in the conference

Click to view abstract

Abstract:Travel demand prediction is crucial for optimizing transportation planning, resource allocation, and infrastructure development, ensuring efficient mobility and economic sustainability. This study introduces a Neurosymbolic Artificial Intelligence (Neurosymbolic AI) framework that integrates decision tree (DT)-based symbolic rules with neural networks (NNs) to predict travel demand, leveraging the interpretability of symbolic reasoning and the predictive power of neural learning. The framework utilizes data from diverse sources, including geospatial, economic, and mobility datasets, to build a comprehensive feature set. DTs are employed to extract interpretable if-then rules that capture key patterns, which are then incorporated as additional features into a NN to enhance its predictive capabilities. Experimental results show that the combined dataset, enriched with symbolic rules, consistently outperforms standalone datasets across multiple evaluation metrics, including Mean Absolute Error (MAE), R^2, and Common Part of Commuters (CPC). Rules selected at finer variance thresholds (e.g., 0.0001) demonstrate superior effectiveness in capturing nuanced relationships, reducing prediction errors, and aligning with observed commuter patterns. By merging symbolic and neural learning paradigms, this Neurosymbolic approach achieves both interpretability and accuracy.
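
Two pieces of this pipeline are easy to sketch: turning if-then rules into extra binary input features, and scoring predictions with Common Part of Commuters. The code below is a minimal illustration in plain Python. The rules, thresholds, and feature layout are hypothetical; in the paper the rules come from fitted decision trees. The CPC function follows the usual definition, 2 · Σ min(observed, predicted) / (Σ observed + Σ predicted).

```python
def rule_features(row, rules):
    """Append one 0/1 indicator per if-then rule to a numeric feature row."""
    return row + [1.0 if rule(row) else 0.0 for rule in rules]


def common_part_of_commuters(observed, predicted):
    """CPC in [0, 1]; 1.0 means predicted flows match observed flows exactly."""
    overlap = sum(min(o, p) for o, p in zip(observed, predicted))
    total = sum(observed) + sum(predicted)
    return 2.0 * overlap / total if total else 0.0


# Hypothetical rules over rows of [distance_km, population_thousands].
rules = [
    lambda r: r[0] <= 10.0 and r[1] > 50.0,  # short trip into a populous area
    lambda r: r[0] > 100.0,                  # long-distance trip
]

# The augmented row is what a neural network would receive as input.
augmented = rule_features([8.0, 120.0], rules)

# Made-up observed and predicted commuter flows for three origin-destination pairs.
cpc = common_part_of_commuters([10, 0, 5], [8, 2, 5])
```

Encoding each rule as an indicator keeps the symbolic knowledge inspectable: a rule's effect on the network can be examined through a single named input column.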

[AI-70] LEAD: Large Foundation Model for EEG-Based Alzheimer’s Disease Detection

Link: https://arxiv.org/abs/2502.01678
Authors: Yihe Wang, Nan Huang, Nadia Mammone, Marco Cecchi, Xiang Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Signal Processing (eess.SP)
Notes:

Click to view abstract

Abstract:Electroencephalogram (EEG) provides a non-invasive, highly accessible, and cost-effective solution for Alzheimer’s Disease (AD) detection. However, existing methods, whether based on manual feature extraction or deep learning, face two major challenges: the lack of large-scale datasets for robust feature learning and evaluation, and poor detection performance due to inter-subject variations. To address these challenges, we curate an EEG-AD corpus containing 813 subjects, which forms the world’s largest EEG-AD dataset to the best of our knowledge. Using this unique dataset, we propose LEAD, the first large foundation model for EEG-based AD detection. Our method encompasses an entire pipeline, from data selection and preprocessing to self-supervised contrastive pretraining, fine-tuning, and key setups such as subject-independent evaluation and majority voting for subject-level detection. We pre-train the model on 11 EEG datasets and fine-tune it in a unified manner on 5 AD datasets. Our self-supervised pre-training design includes sample-level and subject-level contrasting to extract useful general EEG features. Fine-tuning is performed on 5 channel-aligned datasets together. The backbone encoder incorporates temporal and channel embeddings to capture features across both temporal and spatial dimensions. Our method demonstrates outstanding AD detection performance, achieving up to a 9.86% increase in F1 score at the sample level and up to 9.31% at the subject level compared to state-of-the-art methods. The results of our model strongly confirm the effectiveness of contrastive pre-training and channel-aligned unified fine-tuning for addressing inter-subject variation. The source code is at this https URL.
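
Majority voting for subject-level detection can be sketched in a few lines: each EEG segment gets its own prediction, and a subject's label is the most frequent prediction among that subject's segments. The example below is a minimal sketch in plain Python. The subject IDs, label names, and the tie-breaking behavior are assumptions made for illustration, not details taken from the LEAD code.

```python
from collections import Counter, defaultdict


def subject_level_predictions(sample_preds):
    """Aggregate per-segment predictions into one label per subject.

    sample_preds: iterable of (subject_id, predicted_label) pairs.
    Ties go to the label that was seen first for that subject.
    """
    votes = defaultdict(Counter)
    for subject_id, label in sample_preds:
        votes[subject_id][label] += 1
    return {sid: counts.most_common(1)[0][0] for sid, counts in votes.items()}


# Made-up segment-level predictions for two subjects.
segment_preds = [
    ("s1", "AD"), ("s1", "AD"), ("s1", "HC"),
    ("s2", "HC"), ("s2", "HC"), ("s2", "AD"),
]
subject_preds = subject_level_predictions(segment_preds)
```

For the vote to be meaningful under subject-independent evaluation, all segments from a given subject must sit on the same side of the train/test split, so that no test subject has been seen during training.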

[AI-71] AI Scaling: From Up to Down and Out

Link: https://arxiv.org/abs/2502.01677
Authors: Yunke Wang, Yanxi Li, Chang Xu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:AI Scaling has traditionally been synonymous with Scaling Up, which builds larger and more powerful models. However, the growing demand for efficiency, adaptability, and collaboration across diverse applications necessitates a broader perspective. This position paper presents a holistic framework for AI scaling, encompassing Scaling Up, Scaling Down, and Scaling Out. It argues that while Scaling Up of models faces inherent bottlenecks, the future trajectory of AI scaling lies in Scaling Down and Scaling Out. These paradigms address critical technical and societal challenges, such as reducing carbon footprint, ensuring equitable access, and enhancing cross-domain collaboration. We explore transformative applications in healthcare, smart manufacturing, and content creation, demonstrating how AI Scaling can enable breakthroughs in efficiency, personalization, and global connectivity. Additionally, we highlight key challenges, including balancing model complexity with interpretability, managing resource constraints, and fostering ethical development. By synthesizing these approaches, we propose a unified roadmap that redefines the future of AI research and application, paving the way for advancements toward Artificial General Intelligence (AGI).

[AI-72] Life-Cycle Emissions of AI Hardware: A Cradle-To-Grave Approach and Generational Trends

Link: https://arxiv.org/abs/2502.01671
Authors: Ian Schneider, Hui Xu, Stephan Benecke, David Patterson, Keguo Huang, Parthasarathy Ranganathan, Cooper Elsworth
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:Specialized hardware accelerators aid the rapid advancement of artificial intelligence (AI), and their efficiency impacts AI’s environmental sustainability. This study presents the first publication of a comprehensive AI accelerator life-cycle assessment (LCA) of greenhouse gas emissions, including the first publication of manufacturing emissions of an AI accelerator. Our analysis of five Tensor Processing Units (TPUs) encompasses all stages of the hardware lifespan - from raw material extraction, manufacturing, and disposal, to energy consumption during development, deployment, and serving of AI models. Using first-party data, it offers the most comprehensive evaluation to date of AI hardware’s environmental impact. We include detailed descriptions of our LCA to act as a tutorial, road map, and inspiration for other computer engineers to perform similar LCAs to help us all understand the environmental impacts of our chips and of AI. A byproduct of this study is the new metric compute carbon intensity (CCI) that is helpful in evaluating AI hardware sustainability and in estimating the carbon footprint of training and inference. This study shows that CCI improves 3x from TPU v4i to TPU v6e. Moreover, while this paper’s focus is on hardware, software advancements leverage and amplify these gains.

[AI-73] Addressing Delayed Feedback in Conversion Rate Prediction via Influence Functions

Link: https://arxiv.org/abs/2502.01669
Authors: Chenlu Ding, Jiancan Wu, Yancheng Yuan, Junfeng Fang, Cunchun Li, Xiang Wang, Xiangnan He
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Notes:

Click to view abstract

Abstract:In the realm of online digital advertising, conversion rate (CVR) prediction plays a pivotal role in maximizing revenue under cost-per-conversion (CPA) models, where advertisers are charged only when users complete specific actions, such as making a purchase. A major challenge in CVR prediction lies in the delayed feedback problem: conversions may occur hours or even weeks after initial user interactions. This delay complicates model training, as recent data may be incomplete, leading to biases and diminished performance. Although existing methods attempt to address this issue, they often fall short in adapting to evolving user behaviors and depend on auxiliary models, which introduces computational inefficiencies and the risk of model inconsistency. In this work, we propose an Influence Function-empowered framework for Delayed Feedback Modeling (IF-DFM). IF-DFM leverages influence functions to estimate how newly acquired and delayed conversion data impact model parameters, enabling efficient parameter updates without the need for full retraining. Additionally, we present a scalable algorithm that efficiently computes parameter updates by reframing the inverse Hessian-vector product as an optimization problem, striking a balance between computational efficiency and effectiveness. Extensive experiments on benchmark datasets demonstrate that IF-DFM consistently surpasses state-of-the-art methods, significantly enhancing both prediction accuracy and model adaptability.
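
The idea of reframing an inverse Hessian-vector product as an optimization problem can be illustrated with a standard fact: for a symmetric positive definite matrix H, the minimizer of f(x) = ½ xᵀHx − vᵀx is exactly H⁻¹v, so the product can be approximated iteratively without ever inverting H. The sketch below uses plain gradient descent on a made-up 2×2 example. The abstract does not say which solver IF-DFM uses, so the solver, step size, and the final update rule here are assumptions for illustration only.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def inverse_hvp(hessian, vec, step=0.1, iters=500):
    """Approximate H^{-1} v by gradient descent on 0.5 * x^T H x - v^T x.

    Assumes H is symmetric positive definite and that `step` is smaller
    than 2 / (largest eigenvalue of H), so the iteration converges.
    """
    x = [0.0] * len(vec)
    for _ in range(iters):
        # The gradient of the quadratic objective is H x - v.
        grad = [hx - v for hx, v in zip(matvec(hessian, x), vec)]
        x = [xi - step * g for xi, g in zip(x, grad)]
    return x


# Made-up Hessian and loss gradient contributed by newly arrived conversion labels.
hessian = [[4.0, 1.0], [1.0, 3.0]]
grad_new_data = [1.0, 2.0]

ihvp = inverse_hvp(hessian, grad_new_data)

# Hypothetical influence-style update: shift parameters by -H^{-1} g.
theta = [0.5, -0.2]
theta_updated = [t - d for t, d in zip(theta, ihvp)]
```

In a real model only Hessian-vector products are needed inside the loop, which automatic differentiation can supply without materializing H. That is what makes this reformulation attractive at scale.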

[AI-74] Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking

Link: https://arxiv.org/abs/2502.01667
Authors: Jie Ren, Yuhang Zhang, Dongrui Liu, Xiaopeng Zhang, Qi Tian
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:Direct preference optimization (DPO) has shown success in aligning diffusion models with human preference. Previous approaches typically assume a consistent preference label between final generations and noisy samples at intermediate steps, and directly apply DPO to these noisy samples for fine-tuning. However, we theoretically identify inherent issues in this assumption and its impacts on the effectiveness of preference alignment. We first demonstrate the inherent issues from two perspectives: gradient direction and preference order, and then propose a Tailored Preference Optimization (TailorPO) framework for aligning diffusion models with human preference, underpinned by some theoretical insights. Our approach directly ranks intermediate noisy samples based on their step-wise reward, and effectively resolves the gradient direction issues through a simple yet efficient design. Additionally, we incorporate the gradient guidance of diffusion models into preference alignment to further enhance the optimization effectiveness. Experimental results demonstrate that our method significantly improves the model’s ability to generate aesthetically pleasing and human-preferred images.

[AI-75] Employee Turnover Prediction: A Cross-component Attention Transformer with Consideration of Competitor Influence and Contagious Effect

Link: https://arxiv.org/abs/2502.01660
Authors: Hao Liu (Deakin University), Yong Ge (The University of Arizona)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:Employee turnover refers to an individual’s termination of employment from the current organization. It is one of the most persistent challenges for firms, especially those in the Information Technology (IT) industry that confront high turnover rates. Effective prediction of potential employee turnovers benefits multiple stakeholders such as firms and online recruiters. Prior studies have focused on either the turnover prediction within a single firm or the aggregated employee movement among firms. How to predict individual employees’ turnovers among multiple firms has received little attention in the literature, and thus remains a great research challenge. In this study, we propose a novel deep learning approach based on job embeddedness theory to predict the turnovers of individual employees across different firms. Through extensive experimental evaluations using a real-world dataset, our developed method demonstrates superior performance over several state-of-the-art benchmark methods. Additionally, we estimate the cost saving for recruiters by using our turnover prediction solution and interpret the attributions of various driving factors to employee turnover to showcase its practical business value.

[AI-76] Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques

Link: https://arxiv.org/abs/2502.01659
Authors: Nathaniel Tomczak, Sanmukh Kuppannagari
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)

Abstract:Transformers have demonstrated great success in numerous domains including natural language processing and bioinformatics. This success stems from the use of the attention mechanism by these models in order to represent and propagate pairwise interactions between individual tokens of sequential data. However, the primary limitation of this operation is its quadratic memory and time complexity in relation to the input’s context length - the length of a sequence over which the interactions need to be captured. This significantly limits the length of sequences that can be inferred upon by these models. Extensive research has been conducted to reduce the number of pairwise interactions to sub-quadratic in relation to the context length by introducing sparsity into the attention mechanism through the development of sparse attention masks. However, efficient implementations that achieve “true sparsity” are lacking. In this work, we address this issue by proposing a graph computing view of attention where tokens are perceived as nodes of the graph and the attention mask determines the edges of the graph. Using this view, we develop graph processing algorithms to implement the attention mechanism. Both theoretically and empirically, we demonstrate that our algorithms only perform the needed computations, i.e., they are work optimal. We also perform extensive experimentation using popular attention masks to explore the impact of sparsity on execution time and achievable context length. Our experiments demonstrate significant speedups in execution times compared to state-of-the-art attention implementations such as FlashAttention for large sequence lengths. We also demonstrate that our algorithms are able to achieve extremely long sequence lengths of as high as 160 million on a single NVIDIA A100 GPU (SXM4 80GB). 
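
The graph view of attention described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `edge_list_attention` and `sliding_window_edges` are hypothetical, and the code runs on plain Python lists rather than on a GPU. It only shows the idea that each allowed (query, key) pair is an edge, so the work grows with the number of edges rather than with the square of the sequence length.

```python
import math


def edge_list_attention(q, k, v, edges):
    """Softmax attention evaluated only on the (query, key) pairs in `edges`.

    q, k, v are lists of float vectors (one per token); `edges` is a list of
    (query_index, key_index) tuples derived from the attention mask.
    Work is proportional to len(edges), not to len(q) * len(k).
    """
    d = len(q[0])
    # One scaled dot-product score per edge; no dense score matrix is built.
    scores = [
        sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for i, j in edges
    ]

    # Per-query maximum, used for a numerically stable softmax.
    row_max = {}
    for (i, _), s in zip(edges, scores):
        row_max[i] = max(row_max.get(i, -math.inf), s)

    # Unnormalized weights and their per-query sums.
    weights = [math.exp(s - row_max[i]) for (i, _), s in zip(edges, scores)]
    row_sum = {}
    for (i, _), w in zip(edges, weights):
        row_sum[i] = row_sum.get(i, 0.0) + w

    # Accumulate the weighted value vectors along each edge.
    out = [[0.0] * len(v[0]) for _ in q]
    for (i, j), w in zip(edges, weights):
        coeff = w / row_sum[i]
        for c, value in enumerate(v[j]):
            out[i][c] += coeff * value
    return out


def sliding_window_edges(n, window):
    """Edges of a causal sliding-window mask: token i attends to i-window..i."""
    return [(i, j) for i in range(n) for j in range(max(0, i - window), i + 1)]
```

With a sliding window of fixed size, the edge count is linear in the sequence length, which is the kind of sparsity the abstract refers to when it speaks of performing only the needed computations.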

[AI-77] Improving Rule-based Reasoning in LLMs via Neurosymbolic Representations

Link: https://arxiv.org/abs/2502.01657
Authors: Varun Dhanraj, Chris Eliasmith
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) continue to face challenges in reliably solving reasoning tasks, particularly tasks that involve precise rule following, as often found in mathematical reasoning tasks. This paper introduces a novel neurosymbolic method that improves LLM reasoning by encoding hidden states into neurosymbolic vectors, allowing for problem-solving within a neurosymbolic vector space. The results are decoded and combined with the original hidden state, boosting the model’s performance on numerical reasoning tasks. By offloading computation through neurosymbolic representations, this method improves efficiency, reliability, and interpretability. Our experimental results demonstrate an average of 82.86% lower cross entropy loss and 24.50 times more problems correctly solved on a suite of mathematical reasoning problems compared to chain-of-thought prompting and supervised fine-tuning (LoRA), while at the same time not hindering the performance of the LLM on other tasks.

[AI-78] A binary PSO based ensemble under-sampling model for rebalancing imbalanced training data

Link: https://arxiv.org/abs/2502.01655
Authors: Jinyan Li, Yaoyang Wu, Simon Fong, Antonio J. Tallón-Ballesteros, Xin-she Yang, Sabah Mohammed, Feng Wu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 22 pages, 18 figures

Abstract:Ensemble and under-sampling techniques are both effective tools for imbalanced dataset classification problems. In this paper, a novel ensemble method is proposed that combines the advantages of ensemble learning for biasing classifiers with a new under-sampling method. The under-sampling method is named Binary PSO instance selection; it works together with ensemble classifiers to find the most suitable length and combination of majority class samples to build a new dataset with the minority class samples. The proposed method adopts a multi-objective strategy, and its contribution is a notable improvement in imbalanced classification performance while preserving the integrity of the original dataset as far as possible. We evaluated the proposed method and compared its performance on imbalanced datasets with several other conventional basic ensemble methods. Experiments were also conducted on these imbalanced datasets using an improved version in which ensemble classifiers are wrapped in the Binary PSO instance selection. According to the experimental results, our proposed methods outperform single ensemble methods, state-of-the-art under-sampling methods, and combinations of these methods with the traditional PSO instance selection algorithm.

[AI-79] Hybrid Group Relative Policy Optimization: A Multi-Sample Approach to Enhancing Policy Optimization

Link: https://arxiv.org/abs/2502.01652
Authors: Soham Sane
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 18 equations, 1 table

Abstract:Hybrid Group Relative Policy Optimization (Hybrid GRPO) is a reinforcement learning framework that extends Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) by incorporating empirical multi-sample action evaluation while preserving the stability of value function-based learning. Unlike DeepSeek GRPO, which eliminates the value function in favor of purely empirical reward estimation, Hybrid GRPO introduces a structured advantage computation method that balances empirical action sampling with bootstrapped value estimation. This approach enhances sample efficiency, improves learning stability, and mitigates variance amplification observed in purely empirical methods. A detailed mathematical comparison between PPO, DeepSeek GRPO, and Hybrid GRPO is presented, highlighting key differences in advantage estimation and policy updates. Experimental validation in a controlled reinforcement learning environment demonstrates that Hybrid GRPO achieves superior convergence speed, more stable policy updates, and improved sample efficiency compared to existing methods. Several extensions to Hybrid GRPO are explored, including entropy-regularized sampling, hierarchical multi-step sub-sampling, adaptive reward normalization, and value-based action selection. Beyond reinforcement learning in simulated environments, Hybrid GRPO provides a scalable framework for bridging the gap between large language models (LLMs) and real-world agent-based decision-making. By integrating structured empirical sampling with reinforcement learning stability mechanisms, Hybrid GRPO has potential applications in autonomous robotics, financial modeling, and AI-driven control systems. These findings suggest that Hybrid GRPO serves as a robust and adaptable reinforcement learning methodology, paving the way for further advancements in policy optimization.
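
The abstract contrasts three ways of estimating an advantage but does not state the formulas, so the sketch below is illustrative only. `td_advantage` is the usual one-step bootstrapped advantage of actor-critic methods such as PPO, and `group_relative_advantages` is the group-normalized reward used in GRPO-style training. `hybrid_advantage` is a hypothetical blend written for this example (several sampled rewards averaged, value baseline kept) and should not be read as the paper's definition.

```python
import statistics


def td_advantage(reward, value_s, value_next, gamma=0.99):
    """One-step bootstrapped advantage: r + gamma * V(s') - V(s)."""
    return reward + gamma * value_next - value_s


def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each reward normalized against its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def hybrid_advantage(sampled_rewards, value_s, value_next, gamma=0.99):
    """Hypothetical blend (an assumption, not the paper's formula):
    the mean of several sampled rewards replaces the single reward,
    while the bootstrapped value baseline is retained."""
    mean_reward = statistics.fmean(sampled_rewards)
    return mean_reward + gamma * value_next - value_s
```

The first function relies entirely on the value function, the second on empirical rewards alone; the third shows one way the two sources of information could be combined.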

[AI-80] Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency

Link: https://arxiv.org/abs/2502.01651
Authors: Sazzad Hossain, Touhidul Alam Seyam, Avijit Chowdhury, Munis Xamidov, Rajib Ghose, Abhijit Pathak
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, conference paper. International Conference on Artificial Intelligence and Future Civilization

Abstract:This paper presents a comparative study aimed at optimizing Llama2 inference, a critical aspect of machine learning and natural language processing (NLP). We evaluate various programming languages and frameworks, including TensorFlow, PyTorch, Python, Mojo, C++, and Java, analyzing their performance in terms of speed, memory consumption, and ease of implementation through extensive benchmarking. Strengths and limitations of each approach are highlighted, along with proposed optimization strategies for parallel processing and hardware utilization. Furthermore, we investigate the Mojo SDK, a novel framework designed for large language model (LLM) inference on Apple Silicon, benchmarking its performance against implementations in C, C++, Rust, Zig, Go, and Julia. Our experiments, conducted on an Apple M1 Max, demonstrate Mojo SDK’s competitive performance, ease of use, and seamless Python compatibility, positioning it as a strong alternative for LLM inference on Apple Silicon. We also discuss broader implications for LLM deployment on resource-constrained hardware and identify potential directions for future research.

[AI-81] From Public Square to Echo Chamber: The Fragmentation of Online Discourse

Link: https://arxiv.org/abs/2501.18441
Authors: Abhinav Pratap, Amit Pathak
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 6 pages, 7 figures, 1 table

Abstract:This paper examines how social media algorithms and filter bubbles contribute to the fragmentation of online discourse, fostering ideological divides and undermining shared understanding. Drawing on Michael Sandel’s philosophical emphasis on community and shared values, the study explores how digital platforms amplify discriminatory discourse, including sexism, racism, xenophobia, ableism, homophobia, and religious intolerance, during periods of heightened societal tension. By analyzing the dynamics of digital communities, the research highlights mechanisms driving the emergence and evolution of discourse fragments in response to real-world events. The findings reveal how social media structures exacerbate polarization, restrict cross-group dialogue, and erode the collective reasoning essential for a just society. This study situates philosophical perspectives within a computational analysis of social media interactions, offering a nuanced understanding of the challenges posed by fragmented discourse in the digital age.

[AI-82] Personalized Image Generation with Large Multimodal Models WWW’25

Link: https://arxiv.org/abs/2410.14170
Authors: Yiyan Xu, Wenjie Wang, Yang Zhang, Biao Tang, Peng Yan, Fuli Feng, Xiangnan He
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Accepted for publication in WWW’25

Abstract:Personalized content filtering, such as recommender systems, has become a critical infrastructure to alleviate information overload. However, these systems merely filter existing content and are constrained by its limited diversity, making it difficult to meet users’ varied content needs. To address this limitation, personalized content generation has emerged as a promising direction with broad applications. Nevertheless, most existing research focuses on personalized text generation, with relatively little attention given to personalized image generation. The limited work in personalized image generation faces challenges in accurately capturing users’ visual preferences and needs from noisy user-interacted images and complex multimodal instructions. Worse still, there is a lack of supervised data for training personalized image generation models. To overcome the challenges, we propose a Personalized Image Generation Framework named Pigeon, which adopts exceptional large multimodal models with three dedicated modules to capture users’ visual preferences and needs from noisy user history and multimodal instructions. To alleviate the data scarcity, we introduce a two-stage preference alignment scheme, comprising masked preference reconstruction and pairwise preference alignment, to align Pigeon with the personalized image generation task. We apply Pigeon to personalized sticker and movie poster generation, where extensive quantitative results and human evaluation highlight its superiority over various generative baselines.
Related DOI: https://doi.org/10.1145/3696410.3714843

[AI-83] Accurate Pocket Identification for Binding-Site-Agnostic Docking

Link: https://arxiv.org/abs/2502.02371
Authors: Yaroslav Balytskyi, Inna Hubenko, Alina Balytska, Christopher V. Kelly
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Medical Physics (physics.med-ph)

Abstract:Accurate identification of druggable pockets is essential for structure-based drug design. However, most pocket-identification algorithms prioritize geometric properties over downstream docking performance. To address this limitation, we developed RAPID-Net, a pocket-finding algorithm for seamless integration with docking workflows. When guiding AutoDock Vina, RAPID-Net outperforms DiffBindFR on the PoseBusters benchmark and enables blind docking on large proteins that AlphaFold 3 cannot process as a whole. Furthermore, RAPID-Net surpasses PUResNet and Kalasanty in docking accuracy and pocket-ligand intersection rates across diverse datasets, including PoseBusters, Astex Diverse Set, BU48, and Coach420. When accuracy is evaluated as “at least one correct pose in the ensemble”, RAPID-Net outperforms AlphaFold 3 on the PoseBusters benchmark, suggesting that our approach can be further improved with a suitable pose reweighting tool, offering a cost-effective and competitive alternative to AlphaFold 3 for docking. Finally, using several therapeutically relevant examples, we demonstrate the ability of RAPID-Net to identify remote functional sites, highlighting its potential to facilitate the development of innovative therapeutics.

[AI-84] Heteroscedastic Double Bayesian Elastic Net

Link: https://arxiv.org/abs/2502.02032
Authors: Masanari Kimura
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:In many practical applications, regression models are employed to uncover relationships between predictors and a response variable, yet the common assumption of constant error variance is frequently violated. This issue is further compounded in high-dimensional settings where the number of predictors exceeds the sample size, necessitating regularization for effective estimation and variable selection. To address this problem, we propose the Heteroscedastic Double Bayesian Elastic Net (HDBEN), a novel framework that jointly models the mean and log-variance using hierarchical Bayesian priors incorporating both $\ell_1$ and $\ell_2$ penalties. Our approach simultaneously induces sparsity and grouping in the regression coefficients and variance parameters, capturing complex variance structures in the data. Theoretical results demonstrate that the proposed HDBEN achieves posterior concentration, variable selection consistency, and asymptotic normality under mild conditions, which justifies its behavior. Simulation studies further illustrate that HDBEN outperforms existing methods, particularly in scenarios characterized by heteroscedasticity and high dimensionality.

[AI-85] Theoretical and Practical Analysis of Frechet Regression via Comparison Geometry

Link: https://arxiv.org/abs/2502.01995
Authors: Masanari Kimura, Howard Bondell
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Fréchet regression extends classical regression methods to non-Euclidean metric spaces, enabling the analysis of data relationships on complex structures such as manifolds and graphs. This work establishes a rigorous theoretical analysis for Fréchet regression through the lens of comparison geometry which leads to important considerations for its use in practice. The analysis provides key results on the existence, uniqueness, and stability of the Fréchet mean, along with statistical guarantees for nonparametric regression, including exponential concentration bounds and convergence rates. Additionally, insights into angle stability reveal the interplay between curvature of the manifold and the behavior of the regression estimator in these non-Euclidean contexts. Empirical experiments validate the theoretical findings, demonstrating the effectiveness of proposed hyperbolic mappings, particularly for data with heteroscedasticity, and highlighting the practical usefulness of these results.
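
For readers unfamiliar with the objects named in this abstract, the standard definitions can be written as follows. The notation ($M$, $d$, $Y$, $X$, $\mu_\oplus$, $m_\oplus$) is chosen here for illustration and may differ from the paper's.

```latex
% Fréchet mean of a random object Y taking values in a metric space (M, d)
\mu_\oplus = \operatorname*{arg\,min}_{p \in M} \mathbb{E}\left[ d^2(Y, p) \right]

% Conditional Fréchet mean (the regression target) given a Euclidean predictor X = x
m_\oplus(x) = \operatorname*{arg\,min}_{p \in M} \mathbb{E}\left[ d^2(Y, p) \mid X = x \right]
```

When $M$ is Euclidean space with the usual distance, the first expression reduces to the ordinary mean $\mathbb{E}[Y]$ and the second to the conditional expectation $\mathbb{E}[Y \mid X = x]$. On a curved space the minimizer need not exist or be unique, which is why the existence and uniqueness results mentioned in the abstract matter.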

[AI-86] scGSDR: Harnessing Gene Semantics for Single-Cell Pharmacological Profiling

Link: https://arxiv.org/abs/2502.01689
Authors: Yu-An Huang, Xiyue Cao, Zhu-Hong You, Yue-Chao Li, Xuequn Shang, Zhi-An Huang
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)

Abstract:The rise of single-cell sequencing technologies has revolutionized the exploration of drug resistance, revealing the crucial role of cellular heterogeneity in advancing precision medicine. By building computational models from existing single-cell drug response data, we can rapidly annotate cellular responses to drugs in subsequent trials. To this end, we developed scGSDR, a model that integrates two computational pipelines grounded in the knowledge of cellular states and gene signaling pathways, both essential for understanding biological gene semantics. scGSDR enhances predictive performance by incorporating gene semantics and employs an interpretability module to identify key pathways contributing to drug resistance phenotypes. Our extensive validation, which included 16 experiments covering 11 drugs, demonstrates scGSDR’s superior predictive accuracy, when trained with either bulk-seq or scRNA-seq data, achieving high AUROC, AUPR, and F1 Scores. The model’s application has extended from single-drug predictions to scenarios involving drug combinations. Leveraging pathways of known drug target genes, we found that scGSDR’s cell-pathway attention scores are biologically interpretable, which helped us identify other potential drug-related genes. Literature review of top-ranking genes in our predictions such as BCL2, CCND1, the AKT family, and PIK3CA for PLX4720; and ICAM1, VCAM1, NFKB1, NFKBIA, and RAC1 for Paclitaxel confirmed their relevance. In conclusion, scGSDR, by incorporating gene semantics, enhances predictive modeling of cellular responses to diverse drugs, proving invaluable for scenarios involving both single drug and combination therapies and effectively identifying key resistance-related pathways, thus advancing precision medicine and targeted therapy development.

[AI-87] Doubly Robust Monte Carlo Tree Search

Link: https://arxiv.org/abs/2502.01672
Authors: Manqing Liu, Andrew L. Beam
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:We present Doubly Robust Monte Carlo Tree Search (DR-MCTS), a novel algorithm that integrates Doubly Robust (DR) off-policy estimation into Monte Carlo Tree Search (MCTS) to enhance sample efficiency and decision quality in complex environments. Our approach introduces a hybrid estimator that combines MCTS rollouts with DR estimation, offering theoretical guarantees of unbiasedness and variance reduction under specified conditions. Empirical evaluations in Tic-Tac-Toe and the partially observable VirtualHome environment demonstrate DR-MCTS’s superior performance over standard MCTS. In Tic-Tac-Toe, DR-MCTS achieves an 88% win rate compared to a 10% win rate for standard MCTS. In compound VirtualHome tasks, DR-MCTS attains a 20.7% success rate versus 10.3% for standard MCTS. Our scaling analysis reveals that DR-MCTS exhibits better sample efficiency, notably outperforming standard MCTS with larger language models while using a smaller model. These results underscore DR-MCTS’s potential for efficient decision-making in complex, real-world scenarios where sample efficiency is paramount.
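
The doubly robust (DR) off-policy estimator mentioned in this abstract has a well-known recursive form for trajectories, sketched below. How the paper combines it with MCTS rollouts is not stated in the abstract, so this only illustrates the DR building block; the function name `doubly_robust_return` and the dictionary keys are hypothetical.

```python
def doubly_robust_return(steps, gamma=1.0):
    """Recursive doubly robust estimate of a trajectory's value.

    `steps` is a time-ordered list of dicts with keys:
      reward : observed reward r_t
      rho    : importance ratio pi(a_t | s_t) / mu(a_t | s_t)
      q_hat  : model estimate of Q(s_t, a_t)
      v_hat  : model estimate of V(s_t) under the target policy
    The estimate is built backwards from the end of the trajectory:
      V_DR(t) = v_hat_t + rho_t * (r_t + gamma * V_DR(t+1) - q_hat_t)
    """
    estimate = 0.0
    for step in reversed(steps):
        correction = step["reward"] + gamma * estimate - step["q_hat"]
        estimate = step["v_hat"] + step["rho"] * correction
    return estimate
```

The model terms act as a baseline and the importance-weighted residual corrects their error. If the ratio is zero the estimate falls back to the model value, and if the model terms cancel with a ratio of one it reduces to the plain discounted return.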

[AI-88] How to Build a Quantum Supercomputer: Scaling from Hundreds to Millions of Qubits

Link: https://arxiv.org/abs/2411.10406
Authors: Masoud Mohseni, Artur Scherer, K. Grace Johnson, Oded Wertheim, Matthew Otten, Navid Anjum Aadit, Yuri Alexeev, Kirk M. Bresniker, Kerem Y. Camsari, Barbara Chapman, Soumitra Chatterjee, Gebremedhin A. Dagnew, Aniello Esposito, Farah Fahim, Marco Fiorentino, Archit Gajjar, Abdullah Khalid, Xiangzhou Kong, Bohdan Kulchytskyy, Elica Kyoseva, Ruoyu Li, P. Aaron Lott, Igor L. Markov, Robert F. McDermott, Giacomo Pedretti, Pooja Rao, Eleanor Rieffel, Allyson Silva, John Sorebo, Panagiotis Spentzouris, Ziv Steiner, Boyan Torosov, Davide Venturelli, Robert J. Visser, Zak Webb, Xin Zhan, Yonatan Cohen, Pooya Ronagh, Alan Ho, Raymond G. Beausoleil, John M. Martinis
Categories: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 76 pages, 46 figures. General revision, added figures, added references, added appendices

Abstract:In the span of four decades, quantum computation has evolved from an intellectual curiosity to a potentially realizable technology. Today, small-scale demonstrations have become possible for quantum algorithmic primitives on hundreds of physical qubits and proof-of-principle error-correction on a single logical qubit. Nevertheless, despite significant progress and excitement, the path toward a full-stack scalable technology is largely unknown. There are significant outstanding quantum hardware, fabrication, software architecture, and algorithmic challenges that are either unresolved or overlooked. These issues could seriously undermine the arrival of utility-scale quantum computers for the foreseeable future. Here, we provide a comprehensive review of these scaling challenges. We show how the road to scaling could be paved by adopting existing semiconductor technology to build much higher-quality qubits, employing system engineering approaches, and performing distributed quantum computation within heterogeneous high-performance computing infrastructures. These opportunities for research and development could unlock certain promising applications, in particular, efficient quantum simulation/learning of quantum data generated by natural or engineered quantum systems. To estimate the true cost of such promises, we provide a detailed resource and sensitivity analysis for classically hard quantum chemistry calculations on surface-code error-corrected quantum computers given current, target, and desired hardware specifications based on superconducting qubits, accounting for a realistic distribution of errors. Furthermore, we argue that, to tackle industry-scale classical optimization and machine learning problems in a cost-effective manner, heterogeneous quantum-probabilistic computing with custom-designed accelerators should be considered as a complementary path toward scalability.

Machine Learning

[LG-0] Open Materials Generation with Stochastic Interpolants

Link: https://arxiv.org/abs/2502.02582
Authors: Philipp Hoellmer, Thomas Egg, Maya M. Martirossyan, Eric Fuemmeler, Amit Gupta, Zeren Shui, Pawan Prakash, Adrian Roitberg, Mingjie Liu, George Karypis, Mark Transtrum, Richard G. Hennig, Ellad B. Tadmor, Stefano Martiniani
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:The discovery of new materials is essential for enabling technological advancements. Computational approaches for predicting novel materials must effectively learn the manifold of stable crystal structures within an infinite design space. We introduce Open Materials Generation (OMG), a unifying framework for the generative design and discovery of inorganic crystalline materials. OMG employs stochastic interpolants (SI) to bridge an arbitrary base distribution to the target distribution of inorganic crystals via a broad class of tunable stochastic processes, encompassing both diffusion models and flow matching as special cases. In this work, we adapt the SI framework by integrating an equivariant graph representation of crystal structures and extending it to account for periodic boundary conditions in unit cell representations. Additionally, we couple the SI flow over spatial coordinates and lattice vectors with discrete flow matching for atomic species. We benchmark OMG’s performance on two tasks: Crystal Structure Prediction (CSP) for specified compositions, and ‘de novo’ generation (DNG) aimed at discovering stable, novel, and unique structures. In our ground-up implementation of OMG, we refine and extend both CSP and DNG metrics compared to previous works. OMG establishes a new state-of-the-art in generative modeling for materials discovery, outperforming purely flow-based and diffusion-based implementations. These results underscore the importance of designing flexible deep learning frameworks to accelerate progress in materials science.

[LG-1] Hierarchical Sparse Bayesian Multitask Model with Scalable Inference for Microbiome Analysis

Link: https://arxiv.org/abs/2502.02552
Authors: Haonan Zhu, Andre R. Goncalves, Camilo Valdes, Hiranmayi Ranganathan, Boya Zhang, Jose Manuel Martí, Car Reen Kok, Monica K. Borucki, Nisha J. Mulakken, James B. Thissen, Crystal Jaing, Alfred Hero, Nicholas A. Be
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Applications (stat.AP); Computation (stat.CO); Methodology (stat.ME)

Abstract:This paper proposes a hierarchical Bayesian multitask learning model that is applicable to the general multi-task binary classification learning problem where the model assumes a shared sparsity structure across different tasks. We derive a computationally efficient inference algorithm based on variational inference to approximate the posterior distribution. We demonstrate the potential of the new approach on various synthetic datasets and for predicting human health status based on microbiome profile. Our analysis incorporates data pooled from multiple microbiome studies, along with a comprehensive comparison with other benchmark methods. Results in synthetic datasets show that the proposed approach has superior support recovery property when the underlying regression coefficients share a common sparsity structure across different tasks. Our experiments on microbiome classification demonstrate the utility of the method in extracting informative taxa while providing well-calibrated predictions with uncertainty quantification and achieving competitive performance in terms of prediction metrics. Notably, despite the heterogeneity of the pooled datasets (e.g., different experimental objectives, laboratory setups, sequencing equipment, patient demographics), our method delivers robust results.

[LG-2] Optimal Spectral Transitions in High-Dimensional Multi-Index Models

Link: https://arxiv.org/abs/2502.02545
Authors: Leonardo Defilippis, Yatin Dandi, Pierre Mergny, Florent Krzakala, Bruno Loureiro
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)

Abstract:We consider the problem of how many samples from a Gaussian multi-index model are required to weakly reconstruct the relevant index subspace. Despite the model’s increasing popularity as a testbed for investigating the computational complexity of neural networks, results beyond the single-index setting remain elusive. In this work, we introduce spectral algorithms based on the linearization of a message passing scheme tailored to this problem. Our main contribution is to show that the proposed methods achieve the optimal reconstruction threshold. Leveraging a high-dimensional characterization of the algorithms, we show that above the critical threshold the leading eigenvector correlates with the relevant index subspace, a phenomenon reminiscent of the Baik-Ben Arous-Peche (BBP) transition in spiked models arising in random matrix theory. Supported by numerical experiments and a rigorous theoretical framework, our work bridges critical gaps in the computational limits of weak learnability in multi-index models.

[LG-3] OVERTHINKING: Slowdown Attacks on Reasoning LLMs

Link: https://arxiv.org/abs/2502.02542
Authors: Abhinav Kumar, Jaechul Roh, Ali Naseh, Marzena Karpinska, Mohit Iyyer, Amir Houmansadr, Eugene Bagdasarian
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:We increase overhead for applications that rely on reasoning LLMs: we force models to spend an amplified number of reasoning tokens, i.e., “overthink”, to respond to the user query while providing contextually correct answers. The adversary performs an OVERTHINK attack by injecting decoy reasoning problems into the public content that is used by the reasoning LLM (e.g., for RAG applications) during inference time. Due to the nature of our decoy problems (e.g., a Markov Decision Process), modified texts do not violate safety guardrails. We evaluated our attack across closed-weights (OpenAI o1, o1-mini, o3-mini) and open-weights (DeepSeek R1) reasoning models on the FreshQA and SQuAD datasets. Our results show up to 46x slowdown and high transferability of the attack across models. To protect applications, we discuss and implement defenses leveraging LLM-based and system design approaches. Finally, we discuss societal, financial, and energy impacts of the OVERTHINK attack, which could amplify the costs for third-party applications operating reasoning models.

[LG-4] Deep Linear Network Training Dynamics from Random Initialization: Data, Width, Depth, and Hyperparameter Transfer

Link: https://arxiv.org/abs/2502.02531
Authors: Blake Bordelon, Cengiz Pehlevan
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
点击查看摘要

Abstract:We theoretically characterize gradient descent dynamics in deep linear networks trained at large width from random initialization and on large quantities of random data. Our theory captures the “wider is better” effect of mean-field/maximum-update parameterized networks as well as hyperparameter transfer effects, which can be contrasted with the neural-tangent parameterization where optimal learning rates shift with model width. We provide asymptotic descriptions of both non-residual and residual neural networks, the latter of which enables an infinite depth limit when branches are scaled as $1/\sqrt{\text{depth}}$. We also compare training with one-pass stochastic gradient descent to the dynamics when training data are repeated at each iteration. Lastly, we show that this model recovers the accelerated power law training dynamics for power law structured data in the rich regime observed in recent works.
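
The branch scaling mentioned in this abstract can be made concrete with a small sketch of a deep linear residual network in which every residual branch is multiplied by one over the square root of the depth. The function name `residual_linear_forward` and the list-of-lists weight layout are assumptions made for this example; the abstract does not specify initialization or parameterization details, and none are implied here.

```python
import math


def residual_linear_forward(x, weights):
    """Forward pass of a deep linear residual network.

    Each layer computes h <- h + (1 / sqrt(L)) * W h, where L = len(weights)
    is the depth and every W is a square matrix given as a list of rows.
    """
    depth = len(weights)
    scale = 1.0 / math.sqrt(depth)
    h = list(x)
    for w in weights:
        # Residual branch: a plain matrix-vector product (no nonlinearity).
        branch = [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in w]
        # Skip connection plus the depth-scaled branch.
        h = [h_i + scale * b_i for h_i, b_i in zip(h, branch)]
    return h
```

Scaling each branch by this factor keeps the accumulated contribution of many layers from growing with depth, which is what allows an infinite-depth limit to be considered.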

[LG-5] TabPFN Unleashed: A Scalable and Effective Solution to Tabular Classification Problems

Link: https://arxiv.org/abs/2502.02527
Authors: Si-Yang Liu, Han-Jia Ye
Categories: Machine Learning (cs.LG)

Abstract:TabPFN has emerged as a promising in-context learning model for tabular data, capable of directly predicting the labels of test samples given labeled training examples. It has demonstrated competitive performance, particularly on small-scale classification tasks. However, despite its effectiveness, TabPFN still requires further refinement in several areas, including handling high-dimensional features, aligning with downstream datasets, and scaling to larger datasets. In this paper, we revisit existing variants of TabPFN and observe that most approaches focus on reducing either bias or variance, often neglecting the need to address the other side, while also increasing inference overhead. To fill this gap, we propose Beta (Bagging and Encoder-based Fine-tuning for TabPFN Adaptation), a novel and effective method designed to minimize both bias and variance. To reduce bias, we introduce a lightweight encoder to better align downstream tasks with the pre-trained TabPFN. By increasing the number of encoders in a lightweight manner, Beta mitigates variance, thereby further improving the model’s performance. Additionally, bootstrapped sampling is employed to further reduce the impact of data perturbations on the model, all while maintaining computational efficiency during inference. Our approach enhances TabPFN’s ability to handle high-dimensional data and scale to larger datasets. Experimental results on over 200 benchmark classification datasets demonstrate that Beta either outperforms or matches state-of-the-art methods.

[LG-6] Brief analysis of DeepSeek R1 and its implications for Generative AI

Link: https://arxiv.org/abs/2502.02523
Authors: Sarah Mercer,Samuel Spillard,Daniel P. Martin
Categories: Machine Learning (cs.LG)
Comments:

Abstract: In late January 2025, DeepSeek released its new reasoning model, DeepSeek R1, which was developed at a fraction of the cost yet remains competitive with OpenAI’s models, despite the US’s GPU export ban. This report discusses the model and what its release means for the field of Generative AI more widely. We briefly discuss other models released from China in recent weeks and their similarities: innovative use of Mixture of Experts (MoE), Reinforcement Learning (RL) and clever engineering appear to be key factors in the capabilities of these models. This think piece has been written to a tight time-scale, providing broad coverage of the topic, and serves as introductory material for those looking to understand the model’s technical advancements, as well as its place in the ecosystem. Several further areas of research are identified.

[LG-7] Generative Modeling on Lie Groups via Euclidean Generalized Score Matching

Link: https://arxiv.org/abs/2502.02513
Authors: Marco Bertolini,Tuan Le,Djork-Arné Clevert
Categories: Machine Learning (cs.LG)
Comments: 27 pages

Abstract:We extend Euclidean score-based diffusion processes to generative modeling on Lie groups. Through the formalism of Generalized Score Matching, our approach yields a Langevin dynamics which decomposes as a direct sum of Lie algebra representations, enabling generative processes on Lie groups while operating in Euclidean space. Unlike equivariant models, which restrict the space of learnable functions by quotienting out group orbits, our method can model any target distribution on any (non-Abelian) Lie group. Standard score matching emerges as a special case of our framework when the Lie group is the translation group. We prove that our generalized generative processes arise as solutions to a new class of paired stochastic differential equations (SDEs), introduced here for the first time. We validate our approach through experiments on diverse data types, demonstrating its effectiveness in real-world applications such as SO(3)-guided molecular conformer generation and modeling ligand-specific global SE(3) transformations for molecular docking, showing improvement in comparison to Riemannian diffusion on the group itself. We show that an appropriate choice of Lie group enhances learning efficiency by reducing the effective dimensionality of the trajectory space and enables the modeling of transitions between complex data distributions. Additionally, we demonstrate the universality of our approach by deriving how it extends to flow matching.

[LG-8] Learning to generate physical ocean states: Towards hybrid climate modeling

Link: https://arxiv.org/abs/2502.02499
Authors: Etienne Meunier,David Kamm,Guillaume Gachon,Redouane Lguensat,Julie Deshayes
Categories: Machine Learning (cs.LG)
Comments:

Abstract: Ocean General Circulation Models require extensive computational resources to reach equilibrium states, while deep learning emulators, despite offering fast predictions, lack the physical interpretability and long-term stability necessary for climate scientists to understand climate sensitivity (to greenhouse gas emissions) and mechanisms of abrupt variability such as tipping points. We propose to take the best from both worlds by leveraging deep generative models to produce physically consistent oceanic states that can serve as initial conditions for climate projections. We assess the viability of this hybrid approach through both physical metrics and numerical experiments, and highlight the benefits of enforcing physical constraints during generation. Although we train here on ocean variables from idealized numerical simulations, we claim that this hybrid approach, combining the computational efficiency of deep learning with the physical accuracy of numerical models, can effectively reduce the computational burden of running climate models to equilibrium, and reduce uncertainties in climate projections by minimizing drifts in baseline simulations.

[LG-9] Deep Weight Factorization: Sparse Learning Through the Lens of Artificial Symmetries ICLR2025

Link: https://arxiv.org/abs/2502.02496
Authors: Chris Kolb,Tobias Weber,Bernd Bischl,David Rügamer
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: accepted at ICLR 2025

Abstract: Sparse regularization techniques are well-established in machine learning, yet their application in neural networks remains challenging due to the non-differentiability of penalties like the $L_1$ norm, which is incompatible with stochastic gradient descent. A promising alternative is shallow weight factorization, where weights are decomposed into two factors, allowing for smooth optimization of $L_1$-penalized neural networks by adding differentiable $L_2$ regularization to the factors. In this work, we introduce deep weight factorization, extending previous shallow approaches to more than two factors. We theoretically establish equivalence of our deep factorization with non-convex sparse regularization and analyze its impact on training dynamics and optimization. Due to the limitations posed by standard training practices, we propose a tailored initialization scheme and identify important learning rate requirements necessary for training factorized networks. We demonstrate the effectiveness of our deep weight factorization through experiments on various architectures and datasets, consistently outperforming its shallow counterpart and widely used pruning methods.
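
As a worked illustration of the shallow (two-factor) case mentioned in this abstract, the following is the standard textbook argument for why an $L_2$ penalty on the factors acts like an $L_1$ penalty on the product. The symbols $w$, $u$, $v$ and $\lambda$ are notation assumed here for illustration and are not taken from the paper.

```latex
% Assumed notation: weight vector w = u \odot v (elementwise product), penalty strength \lambda > 0.
% For each coordinate i, the AM-GM inequality gives
\[
  \tfrac{1}{2}\left(u_i^2 + v_i^2\right) \;\ge\; |u_i v_i| \;=\; |w_i|,
  \qquad \text{with equality iff } |u_i| = |v_i| = \sqrt{|w_i|}.
\]
% Summing over coordinates and minimizing over all factorizations of a fixed w:
\[
  \min_{u \odot v = w} \; \frac{\lambda}{2}\left(\lVert u \rVert_2^2 + \lVert v \rVert_2^2\right)
  \;=\; \lambda \sum_i |w_i|
  \;=\; \lambda \lVert w \rVert_1 .
\]
```

The left-hand side is differentiable in $u$ and $v$, which is why the factorized problem can be optimized smoothly even though the $L_1$ norm itself is not differentiable at zero.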

[LG-10] EasySpec: Layer-Parallel Speculative Decoding for Efficient Multi-GPU Utilization

Link: https://arxiv.org/abs/2502.02493
Authors: Yize Wu,Ke Gao,Yanjun Wu
Categories: Machine Learning (cs.LG)
Comments:

Abstract: Speculative decoding is an effective and lossless method for Large Language Model (LLM) inference acceleration. It employs a smaller model to generate a draft token sequence, which is then verified by the original base model. In multi-GPU systems, inference latency can be further reduced through tensor parallelism (TP), while the optimal TP size of the draft model is typically smaller than that of the base model, leading to GPU idling during the drafting stage. To solve this problem, we propose EasySpec, a layer-parallel speculation strategy that optimizes the efficiency of multi-GPU utilization. EasySpec breaks the sequential execution order of layers in the drafting model, enabling multi-layer parallelization across devices, albeit with some induced approximation errors. After each drafting-and-verification iteration, the draft model’s key-value (KV) cache is calibrated in a single forward pass, preventing long-term error accumulation at minimal additional latency. We evaluated EasySpec on several mainstream open-source LLMs, using smaller versions of models from the same series as drafters. The results demonstrate that EasySpec can achieve a peak speedup of 4.17x compared to vanilla decoding, while preserving the original distribution of the base LLMs. Specifically, the drafting stage can be accelerated by up to 1.62x with a maximum accuracy drop of only 7%, requiring no training or fine-tuning on the draft models.
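
The draft-then-verify loop that this abstract builds on can be sketched with toy models. This is a minimal illustration of generic greedy speculative decoding, not of EasySpec's layer-parallel drafting or KV-cache calibration. The names `speculative_decode`, `greedy_decode`, `base_model` and `draft_model` are hypothetical, and the "models" are simple deterministic functions standing in for real LLMs.

```python
from typing import Callable, List

# A "model" here is any function mapping a token sequence to the next token (an assumption).
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain (vanilla) greedy decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(base: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the base verifies them."""
    seq = list(prompt)
    target_len = len(prompt) + n_new
    while len(seq) < target_len:
        # Drafting stage: the small model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # Verification stage: a real system scores all positions in one batched
        # forward pass of the base model; here we call it position by position.
        accepted: List[int] = []
        for tok in proposal:
            expected = base(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Every draft token was accepted, so the base adds one extra token.
            accepted.append(base(seq + accepted))
        seq.extend(accepted)
    return seq[:target_len]


def base_model(seq: List[int]) -> int:
    """Toy stand-in for the large model."""
    return (seq[-1] * 3 + len(seq)) % 7


def draft_model(seq: List[int]) -> int:
    """Toy stand-in for the small model: deviates from the base after a 0 token."""
    return base_model(seq) if seq[-1] != 0 else 1
```

Because every emitted token is either confirmed or replaced by the base model's own greedy choice, the output matches vanilla greedy decoding of the base model; the speedup in practice comes from verifying several draft tokens in a single base-model pass.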

[LG-11] Do Graph Diffusion Models Accurately Capture and Generate Substructure Distributions?

Link: https://arxiv.org/abs/2502.02488
Authors: Xiyuan Wang,Yewei Liu,Lexi Pang,Siwei Chen,Muhan Zhang
Categories: Machine Learning (cs.LG)
Comments: Under Review

Abstract:Diffusion models have gained popularity in graph generation tasks; however, the extent of their expressivity concerning the graph distributions they can learn is not fully understood. Unlike models in other domains, popular backbones for graph diffusion models, such as Graph Transformers, do not possess universal expressivity to accurately model the distribution scores of complex graph data. Our work addresses this limitation by focusing on the frequency of specific substructures as a key characteristic of target graph distributions. When evaluating existing models using this metric, we find that they fail to maintain the distribution of substructure counts observed in the training set when generating new graphs. To address this issue, we establish a theoretical connection between the expressivity of Graph Neural Networks (GNNs) and the overall performance of graph diffusion models, demonstrating that more expressive GNN backbones can better capture complex distribution patterns. By integrating advanced GNNs into the backbone architecture, we achieve significant improvements in substructure generation.

[LG-12] Distributional Diffusion Models with Scoring Rules

Link: https://arxiv.org/abs/2502.02483
Authors: Valentin De Bortoli,Alexandre Galashov,J. Swaroop Guntupalli,Guangyao Zhou,Kevin Murphy,Arthur Gretton,Arnaud Doucet
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract: Diffusion models generate high-quality synthetic data. They operate by defining a continuous-time forward process which gradually adds Gaussian noise to data until fully corrupted. The corresponding reverse process progressively “denoises” a Gaussian sample into a sample from the data distribution. However, generating high-quality outputs requires many discretization steps to obtain a faithful approximation of the reverse process. This is expensive and has motivated the development of many acceleration methods. We propose to accomplish sample generation by learning the posterior distribution of clean data samples given their noisy versions, instead of only the mean of this distribution. This allows us to sample from the probability transitions of the reverse process on a coarse time scale, significantly accelerating inference with minimal degradation of the quality of the output. This is accomplished by replacing the standard regression loss used to estimate conditional means with a scoring rule. We validate our method on image and robot trajectory generation, where we consistently outperform standard diffusion models at few discretization steps.

[LG-13] Stable Port-Hamiltonian Neural Networks

Link: https://arxiv.org/abs/2502.02480
Authors: Fabian J. Roth,Dominik K. Klein,Maximilian Kannapinn,Jan Peters,Oliver Weeger
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In recent years, nonlinear dynamic system identification using artificial neural networks has garnered attention due to its manifold potential applications in virtually all branches of science and engineering. However, purely data-driven approaches often struggle with extrapolation and may yield physically implausible forecasts. Furthermore, the learned dynamics can exhibit instabilities, making it difficult to apply such models safely and robustly. This article proposes stable port-Hamiltonian neural networks, a machine learning architecture that incorporates the physical biases of energy conservation or dissipation while guaranteeing global Lyapunov stability of the learned dynamics. Evaluations with illustrative examples and real-world measurement data demonstrate the model’s ability to generalize from sparse data, outperforming purely data-driven approaches and avoiding instability issues. In addition, the model’s potential for data-driven surrogate modeling is highlighted in application to multi-physics simulation data.

[LG-14] Using Random Noise Equivariantly to Boost Graph Neural Networks Universally

Link: https://arxiv.org/abs/2502.02479
Authors: Xiyuan Wang,Muhan Zhang
Categories: Machine Learning (cs.LG)
Comments: Under review

Abstract:Recent advances in Graph Neural Networks (GNNs) have explored the potential of random noise as an input feature to enhance expressivity across diverse tasks. However, naively incorporating noise can degrade performance, while architectures tailored to exploit noise for specific tasks excel yet lack broad applicability. This paper tackles these issues by laying down a theoretical framework that elucidates the increased sample complexity when introducing random noise into GNNs without careful design. We further propose Equivariant Noise GNN (ENGNN), a novel architecture that harnesses the symmetrical properties of noise to mitigate sample complexity and bolster generalization. Our experiments demonstrate that using noise equivariantly significantly enhances performance on node-level, link-level, subgraph, and graph-level tasks and achieves comparable performance to models designed for specific tasks, thereby offering a general method to boost expressivity across various graph tasks.

[LG-15] Orientation-aware interaction-based deep material network in polycrystalline materials modeling

Link: https://arxiv.org/abs/2502.02457
Authors: Ting-Ju Wei,Tung-Huan Su,Chuin-Shan Chen
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:

Abstract:Multiscale simulations are indispensable for connecting microstructural features to the macroscopic behavior of polycrystalline materials, but their high computational demands limit their practicality. Deep material networks (DMNs) have been proposed as efficient surrogate models, yet they fall short of capturing texture evolution. To address this limitation, we propose the orientation-aware interaction-based deep material network (ODMN), which incorporates an orientation-aware mechanism and an interaction mechanism grounded in the Hill-Mandel principle. The orientation-aware mechanism learns the crystallographic textures, while the interaction mechanism captures stress-equilibrium directions among representative volume element (RVE) subregions, offering insight into internal microstructural mechanics. Notably, ODMN requires only linear elastic data for training yet generalizes effectively to complex nonlinear and anisotropic responses. Our results show that ODMN accurately predicts both mechanical responses and texture evolution under complex plastic deformation, thus expanding the applicability of DMNs to polycrystalline materials. By balancing computational efficiency with predictive fidelity, ODMN provides a robust framework for multiscale simulations of polycrystalline materials.

[LG-16] Sparse Data Generation Using Diffusion Models

Link: https://arxiv.org/abs/2502.02448
Authors: Phil Ostheimer,Mayank Nagda,Marius Kloft,Sophie Fellenz
Categories: Machine Learning (cs.LG)
Comments:

Abstract: Sparse data is ubiquitous, appearing in numerous domains, from economics and recommender systems to astronomy and biomedical sciences. However, efficiently and realistically generating sparse data remains a significant challenge. We introduce Sparse Data Diffusion (SDD), a novel method for generating sparse data. SDD extends continuous state-space diffusion models by explicitly modeling sparsity through the introduction of Sparsity Bits. Empirical validation on image data from various domains (including two scientific applications, physics and biology) demonstrates that SDD achieves high fidelity in representing data sparsity while preserving the quality of the generated data.

[LG-17] mPOLICE: Provable Enforcement of Multi-Region Affine Constraints in Deep Neural Networks

Link: https://arxiv.org/abs/2502.02434
Authors: Mohammadmehdi Ataei,Hyunmin Cheong,Adrian Butscher
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Deep neural networks are increasingly employed in fields such as climate modeling, robotics, and industrial control, where strict output constraints must be upheld. Although prior methods like the POLICE algorithm can enforce affine constraints in a single convex region by adjusting network parameters, they struggle with multiple disjoint regions, often leading to conflicts or unintended affine extensions. We present mPOLICE, a new method that extends POLICE to handle constraints imposed on multiple regions. mPOLICE assigns a distinct activation pattern to each constrained region, preserving exact affine behavior locally while avoiding overreach into other parts of the input domain. We formulate a layer-wise optimization problem that adjusts both the weights and biases to assign unique activation patterns to each convex region, ensuring that constraints are met without conflicts, while maintaining the continuity and smoothness of the learned function. Our experiments show the enforcement of multi-region constraints for multiple scenarios, including regression and classification, function approximation, and non-convex regions through approximation. Notably, mPOLICE adds zero inference overhead and minimal training overhead.

[LG-18] TransformDAS: Mapping Phi-OTDR Signals to Riemannian Manifold for Robust Classification

Link: https://arxiv.org/abs/2502.02428
Authors: Jiaju Kang,Puyu Han,Yang Chun,Xu Wang,Luqi Gong
Categories: Machine Learning (cs.LG)
Comments:

Abstract: Phase-sensitive optical time-domain reflectometry (Φ-OTDR) is a widely used distributed fiber optic sensing system in engineering. Machine learning algorithms for Φ-OTDR event classification require large, high-quality datasets; however, such datasets are currently extremely scarce in the field, leading to a lack of robustness in models, which is manifested by higher false alarm rates in real-world scenarios. One promising approach to address this issue is to augment existing data using generative models combined with a small amount of real-world data. We explored mapping both Φ-OTDR features in a GAN-based generative pipeline and signal features in a Transformer classifier to hyperbolic space to seek more effective model generalization. The results indicate that state-of-the-art models exhibit stronger generalization performance and lower false alarm rates in real-world scenarios when trained on augmented datasets. TransformDAS, in particular, demonstrates the best classification performance, highlighting the benefits of Riemannian manifold mapping in Φ-OTDR data generation and model classification.

[LG-19] Pruning-aware Loss Functions for STOI-Optimized Pruned Recurrent Autoencoders for the Compression of the Stimulation Patterns of Cochlear Implants at Zero Delay

Link: https://arxiv.org/abs/2502.02424
Authors: Reemt Hinrichs,Jörn Ostermann
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Preprint of Asilomar 2024 Paper

Abstract: Cochlear implants (CIs) are surgically implanted hearing devices, which can restore a sense of hearing in people suffering from profound hearing loss. Wireless streaming of audio from external devices to CI signal processors has become commonplace. Specialized compression based on the stimulation patterns of a CI by deep recurrent autoencoders can decrease the power consumption in such a wireless streaming application through bit-rate reduction at zero latency. While previous research achieved considerable bit-rate reductions, model sizes were ignored, which can be of crucial importance in hearing aids due to their limited computational resources. This work investigates maximizing objective speech intelligibility of the coded stimulation patterns of deep recurrent autoencoders while minimizing model size. For this purpose, a pruning-aware loss is proposed, which captures the impact of pruning during training. This training with a pruning-aware loss is compared to conventional magnitude-informed pruning and is found to yield considerable improvements in objective intelligibility, especially at higher pruning rates. After fine-tuning, little to no degradation of objective intelligibility is observed up to a pruning rate of about 55%. The proposed pruning-aware loss yields substantial gains in objective speech intelligibility scores after pruning compared to the magnitude-informed baseline for pruning rates above 45%.

[LG-20] CVKAN: Complex-Valued Kolmogorov-Arnold Networks

Link: https://arxiv.org/abs/2502.02417
Authors: Matthias Wolff,Florian Eilers,Xiaoyi Jiang
Categories: Machine Learning (cs.LG)
Comments:

Abstract: In this work we propose CKAN, a complex-valued KAN, to join the intrinsic interpretability of KANs and the advantages of Complex-Valued Neural Networks (CVNNs). We show how to transfer a KAN and the necessary associated mechanisms into the complex domain. To confirm that CKAN meets expectations, we conduct experiments on symbolic complex-valued function fitting and physically meaningful formulae as well as on a more realistic dataset from knot theory. Our proposed CKAN is more stable and performs on par with or better than real-valued KANs while requiring fewer parameters and a shallower network architecture, making it more explainable.

[LG-21] Towards Fast Graph Generation via Autoregressive Noisy Filtration Modeling

Link: https://arxiv.org/abs/2502.02415
Authors: Markus Krimmel,Jenna Wiens,Karsten Borgwardt,Dexiong Chen
Categories: Machine Learning (cs.LG)
Comments: 32 pages, 27 tables, 6 figures

Abstract:Graph generative models often face a critical trade-off between learning complex distributions and achieving fast generation speed. We introduce Autoregressive Noisy Filtration Modeling (ANFM), a novel approach that addresses both challenges. ANFM leverages filtration, a concept from topological data analysis, to transform graphs into short sequences of monotonically increasing subgraphs. This formulation extends the sequence families used in previous autoregressive models. To learn from these sequences, we propose a novel autoregressive graph mixer model. Our experiments suggest that exposure bias might represent a substantial hurdle in autoregressive graph generation and we introduce two mitigation strategies to address it: noise augmentation and a reinforcement learning approach. Incorporating these techniques leads to substantial performance gains, making ANFM competitive with state-of-the-art diffusion models across diverse synthetic and real-world datasets. Notably, ANFM produces remarkably short sequences, achieving a 100-fold speedup in generation time compared to diffusion models. This work marks a significant step toward high-throughput graph generation.
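
The idea of turning a graph into a short sequence of monotonically increasing subgraphs can be illustrated with a simple edge-weight filtration. This is only a sketch under assumptions: the abstract does not specify the filtration function, so the per-edge `weights`, the `thresholds`, and the function name `edge_filtration` are hypothetical choices made here for illustration.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

Edge = Tuple[int, int]


def edge_filtration(weights: Dict[Edge, float],
                    thresholds: Sequence[float]) -> List[FrozenSet[Edge]]:
    """Return nested edge sets, one per threshold, in increasing threshold order.

    Step t keeps every edge whose (assumed) weight is at most t, so each
    subgraph contains all edges of the previous one.
    """
    steps: List[FrozenSet[Edge]] = []
    for t in sorted(thresholds):
        steps.append(frozenset(e for e, w in weights.items() if w <= t))
    return steps


# Usage: a triangle whose edges appear one at a time as the threshold grows.
triangle = {(0, 1): 0.1, (1, 2): 0.5, (0, 2): 0.9}
sequence = edge_filtration(triangle, thresholds=[1.0, 0.2, 0.6])
```

An autoregressive model could then be trained to predict each subgraph from its predecessors; using only a few thresholds keeps the sequence short, which is the property the abstract links to fast generation.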

[LG-22] Transolver++: An Accurate Neural Solver for PDEs on Million-Scale Geometries

Link: https://arxiv.org/abs/2502.02414
Authors: Huakun Luo,Haixu Wu,Hang Zhou,Lanxiang Xing,Yichen Di,Jianmin Wang,Mingsheng Long
Categories: Machine Learning (cs.LG)
Comments:

Abstract: Although deep models have been widely explored in solving partial differential equations (PDEs), previous works are primarily limited to data with at most tens of thousands of mesh points, far from the million-point scale required by industrial simulations that involve complex geometries. In the spirit of advancing neural PDE solvers to real industrial applications, we present Transolver++, a highly parallel and efficient neural solver that can accurately solve PDEs on million-scale geometries. Building upon previous advancements in solving PDEs by learning physical states via Transolver, Transolver++ is further equipped with an extremely optimized parallelism framework and a local adaptive mechanism to efficiently capture eidetic physical states from massive mesh points, successfully tackling the thorny challenges in computation and physics learning when scaling up input mesh size. Transolver++ increases the single-GPU input capacity to million-scale points for the first time and is capable of continuously scaling input size in linear complexity by increasing GPUs. Experimentally, Transolver++ yields a 13% relative improvement across six standard PDE benchmarks and achieves over 20% performance gain in million-scale high-fidelity industrial simulations, whose sizes are 100× larger than previous benchmarks, covering car and 3D aircraft designs.

[LG-23] Privacy Amplification by Structured Subsampling for Deep Differentially Private Time Series Forecasting

Link: https://arxiv.org/abs/2502.02410
Authors: Jan Schuchardt,Mina Dalirrooyfard,Jed Guzelkabaagac,Anderson Schneider,Yuriy Nevmyvaka,Stephan Günnemann
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Comments:

Abstract:Many forms of sensitive data, such as web traffic, mobility data, or hospital occupancy, are inherently sequential. The standard method for training machine learning models while ensuring privacy for units of sensitive information, such as individual hospital visits, is differentially private stochastic gradient descent (DP-SGD). However, we observe in this work that the formal guarantees of DP-SGD are incompatible with timeseries-specific tasks like forecasting, since they rely on the privacy amplification attained by training on small, unstructured batches sampled from an unstructured dataset. In contrast, batches for forecasting are generated by (1) sampling sequentially structured time series from a dataset, (2) sampling contiguous subsequences from these series, and (3) partitioning them into context and ground-truth forecast windows. We theoretically analyze the privacy amplification attained by this structured subsampling to enable the training of forecasting models with sound and tight event- and user-level privacy guarantees. Towards more private models, we additionally prove how data augmentation amplifies privacy in self-supervised training of sequence models. Our empirical evaluation demonstrates that amplification by structured subsampling enables the training of forecasting models with strong formal privacy guarantees.
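
The three-step batch construction listed in this abstract can be sketched as follows. This is a minimal illustration, not the paper's procedure: sampling series uniformly with replacement is an assumption (the abstract does not state the sampling scheme), and the function name `sample_forecasting_batch` and its parameters are hypothetical. No differential privacy mechanism is implemented here.

```python
import random
from typing import List, Sequence, Tuple

Window = Tuple[List[float], List[float]]  # (context, ground-truth forecast)


def sample_forecasting_batch(dataset: Sequence[Sequence[float]], batch_size: int,
                             context_len: int, forecast_len: int,
                             rng: random.Random) -> List[Window]:
    """Build one training batch by structured subsampling of time series."""
    window = context_len + forecast_len
    # Only series long enough to hold one full window can be used.
    eligible = [series for series in dataset if len(series) >= window]
    batch: List[Window] = []
    for _ in range(batch_size):
        # (1) Sample a time series from the dataset (uniformly, with replacement: an assumption).
        series = rng.choice(eligible)
        # (2) Sample a contiguous subsequence of the required length.
        start = rng.randrange(len(series) - window + 1)
        sub = list(series[start:start + window])
        # (3) Partition it into a context window and a forecast window.
        batch.append((sub[:context_len], sub[context_len:]))
    return batch


# Usage with two toy series of consecutive integers.
toy_dataset = [list(range(10)), list(range(100, 130))]
toy_batch = sample_forecasting_batch(toy_dataset, batch_size=5, context_len=4,
                                     forecast_len=2, rng=random.Random(0))
```

The point of the sketch is that a single sensitive event can land in either the context or the forecast window of many overlapping subsequences, which is why the privacy analysis for unstructured batches does not carry over directly.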

[LG-24] Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers

Link: https://arxiv.org/abs/2502.02393
Authors: Alireza Amiri,Xinting Huang,Mark Rofin,Michael Hahn
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC)
Comments:

Abstract: Chain-of-thought reasoning and scratchpads have emerged as critical tools for enhancing the computational capabilities of transformers. While theoretical results show that polynomial-length scratchpads can extend transformers’ expressivity from $TC^0$ to PTIME, their required length remains poorly understood. Empirical evidence suggests that transformers need scratchpads even for many problems in $TC^0$, such as Parity or Multiplication, challenging optimistic bounds derived from circuit complexity. In this work, we initiate the study of systematic lower bounds for the number of CoT steps across different algorithmic problems, in the hard-attention regime. We study a variety of algorithmic problems, and provide bounds that are tight up to logarithmic factors. Overall, these results contribute to an emerging understanding of the power and limitations of chain-of-thought reasoning.

[LG-25] Achieving Hiding and Smart Anti-Jamming Communication: A Parallel DRL Approach against Moving Reactive Jammer

Link: https://arxiv.org/abs/2502.02385
Authors: Yangyang Li,Yuhua Xu,Wen Li,Guoxin Li,Zhibing Feng,Songyi Liu,Jiatao Du,Xinran Li
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract: This paper addresses the challenge of anti-jamming in moving reactive jamming scenarios. The moving reactive jammer initiates high-power tracking jamming upon detecting any transmission activity, and when unable to detect a signal, resorts to indiscriminate jamming. This presents dual imperatives: maintaining hiding to avoid the jammer’s detection and simultaneously evading indiscriminate jamming. Spread spectrum techniques effectively reduce transmitting power to elude detection but fall short in countering indiscriminate jamming. Conversely, changing communication frequencies can help evade indiscriminate jamming but makes the transmission vulnerable to tracking jamming without spread spectrum techniques to remain hidden. Current methodologies struggle with the complexity of simultaneously optimizing these two requirements due to the expansive joint action spaces and the dynamics of moving reactive jammers. To address these challenges, we propose a parallelized deep reinforcement learning (DRL) strategy. The approach includes a parallelized network architecture designed to decompose the action space. A parallel exploration-exploitation selection mechanism replaces the ε-greedy mechanism, accelerating convergence. Simulations demonstrate a nearly 90% increase in normalized throughput.

[LG-26] No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets

Link: https://arxiv.org/abs/2502.02379
Authors: Corinna Coupette,Jeremy Wayland,Emily Simons,Bastian Rieck
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Comments:

Abstract: Benchmark datasets have proved pivotal to the success of graph learning, and good benchmark datasets are crucial to guide the development of the field. Recent research has highlighted problems with graph-learning datasets and benchmarking practices, revealing, for example, that methods which ignore the graph structure can outperform graph-based approaches on popular benchmark datasets. Such findings raise two questions: (1) What makes a good graph-learning dataset, and (2) how can we evaluate dataset quality in graph learning? Our work addresses these questions. As the classic evaluation setup uses datasets to evaluate models, it does not apply to dataset evaluation. Hence, we start from first principles. Observing that graph-learning datasets uniquely combine two modes (the graph structure and the node features), we introduce RINGS, a flexible and extensible mode-perturbation framework to assess the quality of graph-learning datasets based on dataset ablations, i.e., by quantifying differences between the original dataset and its perturbed representations. Within this framework, we propose two measures (performance separability and mode complementarity) as evaluation tools, each assessing, from a distinct angle, the capacity of a graph dataset to benchmark the power and efficacy of graph-learning methods. We demonstrate the utility of our framework for graph-learning dataset evaluation in an extensive set of experiments and derive actionable recommendations for improving the evaluation of graph-learning methods. Our work opens new research directions in data-centric graph learning, and it constitutes a first step toward the systematic evaluation of evaluations.

[LG-27] Exploring the Feasibility of AI-Assisted Spine MRI Protocol Optimization Using DICOM Image Metadata

Link: https://arxiv.org/abs/2502.02351
Authors: Alice Vian,Diego Andre Eifer,Mauricio Anes,Guilherme Ribeiro Garcia,Mariana Recamonde-Mendoza
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Artificial intelligence (AI) is increasingly being utilized to optimize magnetic resonance imaging (MRI) protocols. Given that image details are critical for diagnostic accuracy, optimizing MRI acquisition protocols is essential for enhancing image quality. While medical physicists are responsible for this optimization, the variability in equipment usage and the wide range of MRI protocols in clinical settings pose significant challenges. This study aims to validate the application of AI in optimizing MRI protocols using dynamic data from clinical practice, specifically DICOM metadata. To achieve this, four MRI spine exam databases were created, with the target attribute being the binary classification of image quality (good or bad). Five AI models were trained to identify trends in acquisition parameters that influence image quality, grounded in MRI theory. These trends were analyzed using SHAP graphs. The models achieved F1 performance ranging from 77% to 93% for datasets containing 292 or more instances, with the observed trends aligning with MRI theory. The models effectively reflected the practical realities of clinical MRI settings, offering a valuable tool for medical physicists in quality control tasks. In conclusion, AI has demonstrated its potential to optimize MRI protocols, supporting medical physicists in improving image quality and enhancing the efficiency of quality control in clinical practice.

[LG-28] Optimal Subspace Inference for the Laplace Approximation of Bayesian Neural Networks

Link: https://arxiv.org/abs/2502.02345
Authors: Josua Faller, Jörg Martin
Categories: Machine Learning (cs.LG)
Comments: for associated code, see this https URL

Abstract:Subspace inference for neural networks assumes that a subspace of their parameter space suffices to produce a reliable uncertainty quantification. In this work, we mathematically derive the optimal subspace model to a Bayesian inference scenario based on the Laplace approximation. We demonstrate empirically that, in the optimal case, often a fraction of parameters less than 1% is sufficient to obtain a reliable estimate of the full Laplace approximation. Since the optimal solution is derived, we can evaluate all other subspace models against a baseline. In addition, we give an approximation of our method that is applicable to larger problem settings, in which the optimal solution is not computable, and compare it to existing subspace models from the literature. In general, our approximation scheme outperforms previous work. Furthermore, we present a metric to qualitatively compare different subspace models even if the exact Laplace approximation is unknown.

[LG-29] Identifying Large-Scale Linear Parameter Varying Systems with Dynamic Mode Decomposition Methods

Link: https://arxiv.org/abs/2502.02336
Authors: Jean Panaioti Jordanou, Eduardo Camponogara, Eduardo Gildin
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 39 pages, 8 figures. Submitted to Journal of Computational Physics

Abstract:Linear Parameter Varying (LPV) Systems are a well-established class of nonlinear systems with a rich theory for stability analysis, control, and analytical response finding, among other aspects. Although there are works on data-driven identification of such systems, the literature is quite scarce in terms of works that tackle the identification of LPV models for large-scale systems. Since large-scale systems are ubiquitous in practice, this work develops a methodology for the local and global identification of large-scale LPV systems based on nonintrusive reduced-order modeling. The developed method is coined as DMD-LPV for being inspired in the Dynamic Mode Decomposition (DMD). To validate the proposed identification method, we identify a system described by a discretized linear diffusion equation, with the diffusion gain defined by a polynomial over a parameter. The experiments show that the proposed method can easily identify a reduced-order LPV model of a given large-scale system without the need to perform identification in the full-order dimension, and with almost no performance decay over performing a reduction, given that the model structure is well-established.

[LG-30] Policy-Guided Causal State Representation for Offline Reinforcement Learning Recommendation

Link: https://arxiv.org/abs/2502.02327
Authors: Siyu Wang, Xiaocong Chen, Lina Yao
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:In offline reinforcement learning-based recommender systems (RLRS), learning effective state representations is crucial for capturing user preferences that directly impact long-term rewards. However, raw state representations often contain high-dimensional, noisy information and components that are not causally relevant to the reward. Additionally, missing transitions in offline data make it challenging to accurately identify features that are most relevant to user satisfaction. To address these challenges, we propose Policy-Guided Causal Representation (PGCR), a novel two-stage framework for causal feature selection and state representation learning in offline RLRS. In the first stage, we learn a causal feature selection policy that generates modified states by isolating and retaining only the causally relevant components (CRCs) while altering irrelevant components. This policy is guided by a reward function based on the Wasserstein distance, which measures the causal effect of state components on the reward and encourages the preservation of CRCs that directly influence user interests. In the second stage, we train an encoder to learn compact state representations by minimizing the mean squared error (MSE) loss between the latent representations of the original and modified states, ensuring that the representations focus on CRCs. We provide a theoretical analysis proving the identifiability of causal effects from interventions, validating the ability of PGCR to isolate critical state components for decision-making. Extensive experiments demonstrate that PGCR significantly improves recommendation performance, confirming its effectiveness for offline RL-based recommender systems.

[LG-31] DIME: Diffusion-Based Maximum Entropy Reinforcement Learning

Link: https://arxiv.org/abs/2502.02316
Authors: Onur Celik, Zechu Li, Denis Blessing, Ge Li, Daniel Palanicek, Jan Peters, Georgia Chalvatzaki, Gerhard Neumann
Categories: Machine Learning (cs.LG)
Comments: 8 pages main text, 18 pages all included

Abstract:Maximum entropy reinforcement learning (MaxEnt-RL) has become the standard approach to RL due to its beneficial exploration properties. Traditionally, policies are parameterized using Gaussian distributions, which significantly limits their representational capacity. Diffusion-based policies offer a more expressive alternative, yet integrating them into MaxEnt-RL poses challenges, primarily due to the intractability of computing their marginal entropy. To overcome this, we propose Diffusion-Based Maximum Entropy RL (DIME). DIME leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Additionally, we propose a policy iteration scheme that provably converges to the optimal diffusion policy. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of MaxEnt-RL, significantly outperforming other diffusion-based methods on challenging high-dimensional control benchmarks. It is also competitive with state-of-the-art non-diffusion based RL methods while requiring fewer algorithmic design choices and smaller update-to-data ratios, reducing computational complexity.

[LG-32] MAGNNET: Multi-Agent Graph Neural Network-based Efficient Task Allocation for Autonomous Vehicles with Deep Reinforcement Learning

Link: https://arxiv.org/abs/2502.02311
Authors: Lavanya Ratnabala, Aleksey Fedoseev, Robinroy Peter, Dzmitry Tsetserukou
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Submitted to IEEE Intelligent Vehicle Symposium (2025)

Abstract:This paper addresses the challenge of decentralized task allocation within heterogeneous multi-agent systems operating under communication constraints. We introduce a novel framework that integrates graph neural networks (GNNs) with a centralized training and decentralized execution (CTDE) paradigm, further enhanced by a tailored Proximal Policy Optimization (PPO) algorithm for multi-agent deep reinforcement learning (MARL). Our approach enables unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) to dynamically allocate tasks efficiently without necessitating central coordination in a 3D grid environment. The framework minimizes total travel time while simultaneously avoiding conflicts in task assignments. For the cost calculation and routing, we employ reservation-based A* and R* path planners. Experimental results revealed that our method achieves a high 92.5% conflict-free success rate, with only a 7.49% performance gap compared to the centralized Hungarian method, while outperforming the heuristic decentralized baseline based on greedy approach. Additionally, the framework exhibits scalability with up to 20 agents with allocation processing of 2.8 s and robustness in responding to dynamically generated tasks, underscoring its potential for real-world applications in complex multi-agent scenarios.

[LG-33] Real-Time Operator Takeover for Visuomotor Diffusion Policy Training

Link: https://arxiv.org/abs/2502.02308
Authors: Nils Ingelhag, Jesper Munkeby, Michael C. Welle, Marco Moletta, Danica Kragic
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:We present a Real-Time Operator Takeover (RTOT) paradigm enabling operators to seamlessly take control of a live visuomotor diffusion policy, guiding the system back into desirable states or reinforcing specific demonstrations. We present new insights into using the Mahalanobis distance to automatically identify undesirable states. Once the operator has intervened and redirected the system, the control is seamlessly returned to the policy, which resumes generating actions until further intervention is required. We demonstrate that incorporating the targeted takeover demonstrations significantly improves policy performance compared to training solely with an equivalent number of, but longer, initial demonstrations. We provide an in-depth analysis of using the Mahalanobis distance to detect out-of-distribution states, illustrating its utility for identifying critical failure points during execution. Supporting materials, including videos of initial and takeover demonstrations and all rice-scooping experiments, are available on the project website: this https URL
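
The idea of flagging out-of-distribution states with the Mahalanobis distance can be illustrated with a toy sketch. This is not the authors' implementation: it assumes 2-D state features so that the covariance can be inverted in closed form, and the function names and the threshold of 3.0 are hypothetical choices made for illustration only.

```python
import math

def fit_gaussian_2d(samples):
    """Estimate the mean and (unbiased) covariance of 2-D samples."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance sqrt((x - mu)^T Sigma^{-1} (x - mu)) in 2-D."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det.
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(q, 0.0))

def is_out_of_distribution(x, mean, cov, threshold=3.0):
    """Flag a state whose distance exceeds a (hypothetical) threshold."""
    return mahalanobis_2d(x, mean, cov) > threshold

# Toy "in-distribution" states collected from demonstrations (made-up data).
train_states = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0)]
mean, cov = fit_gaussian_2d(train_states)
print(is_out_of_distribution((1.5, 1.5), mean, cov))
print(is_out_of_distribution((6.0, 1.0), mean, cov))
```

In a real system the features would typically be higher-dimensional embeddings, and the threshold would be calibrated on held-out data rather than fixed by hand.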

[LG-34] Density Ratio Estimation with Conditional Probability Paths

Link: https://arxiv.org/abs/2502.02300
Authors: Hanlin Yu, Arto Klami, Aapo Hyvärinen, Anna Korba, Omar Chehab
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Density ratio estimation in high dimensions can be reframed as integrating a certain quantity, the time score, over probability paths which interpolate between the two densities. In practice, the time score has to be estimated based on samples from the two densities. However, existing methods for this problem remain computationally expensive and can yield inaccurate estimates. Inspired by recent advances in generative modeling, we introduce a novel framework for time score estimation, based on a conditioning variable. Choosing the conditioning variable judiciously enables a closed-form objective function. We demonstrate that, compared to previous approaches, our approach results in faster learning of the time score and competitive or better estimation accuracies of the density ratio on challenging tasks. Furthermore, we establish theoretical guarantees on the error of the estimated density ratio.

[LG-35] Adaptive Resource Allocation Optimization Using Large Language Models in Dynamic Wireless Environments

Link: https://arxiv.org/abs/2502.02287
Authors: Hyeonho Noh, Byonghyo Shim, Hyun Jong Yang
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments:

Abstract:Deep learning (DL) has made notable progress in addressing complex radio access network control challenges that conventional analytic methods have struggled to solve. However, DL has shown limitations in solving constrained NP-hard problems often encountered in network optimization, such as those involving quality of service (QoS) or discrete variables like user indices. Current solutions rely on domain-specific architectures or heuristic techniques, and a general DL approach for constrained optimization remains undeveloped. Moreover, even minor changes in communication objectives demand time-consuming retraining, limiting their adaptability to dynamic environments where task objectives, constraints, environmental factors, and communication scenarios frequently change. To address these challenges, we propose a large language model for resource allocation optimizer (LLM-RAO), a novel approach that harnesses the capabilities of LLMs to address the complex resource allocation problem while adhering to QoS constraints. By employing a prompt-based tuning strategy to flexibly convey ever-changing task descriptions and requirements to the LLM, LLM-RAO demonstrates robust performance and seamless adaptability in dynamic environments without requiring extensive retraining. Simulation results reveal that LLM-RAO achieves up to a 40% performance enhancement compared to conventional DL methods and up to an 80 % improvement over analytical approaches. Moreover, in scenarios with fluctuating communication objectives, LLM-RAO attains up to 2.9 times the performance of traditional DL-based networks.

[LG-36] A Revisit of Total Correlation in Disentangled Variational Auto-Encoder with Partial Disentanglement

Link: https://arxiv.org/abs/2502.02279
Authors: Chengrui Li, Yunmiao Wang, Yule Wang, Weihan Li, Dieter Jaeger, Anqi Wu
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:A fully disentangled variational auto-encoder (VAE) aims to identify disentangled latent components from observations. However, enforcing full independence between all latent components may be too strict for certain datasets. In some cases, multiple factors may be entangled together in a non-separable manner, or a single independent semantic meaning could be represented by multiple latent components within a higher-dimensional manifold. To address such scenarios with greater flexibility, we develop the Partially Disentangled VAE (PDisVAE), which generalizes the total correlation (TC) term in fully disentangled VAEs to a partial correlation (PC) term. This framework can handle group-wise independence and can naturally reduce to either the standard VAE or the fully disentangled VAE. Validation through three synthetic experiments demonstrates the correctness and practicality of PDisVAE. When applied to real-world datasets, PDisVAE discovers valuable information that is difficult to find using fully disentangled VAEs, implying its versatility and effectiveness.

[LG-37] A User Guide to Sampling Strategies for Sliced Optimal Transport

Link: https://arxiv.org/abs/2502.02275
Authors: Keanu Sisouk, Julie Delon, Julien Tierny
Categories: Machine Learning (cs.LG); Probability (math.PR)
Comments:

Abstract:This paper serves as a user guide to sampling strategies for sliced optimal transport. We provide reminders and additional regularity results on the Sliced Wasserstein distance. We detail the construction methods, generation time complexity, theoretical guarantees, and conditions for each strategy. Additionally, we provide insights into their suitability for sliced optimal transport in theory. Extensive experiments on both simulated and real-world data offer a representative comparison of the strategies, culminating in practical recommendations for their best usage.
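
For readers unfamiliar with the quantity being estimated, the following is a minimal sketch of the Sliced Wasserstein distance using plain i.i.d. uniform sampling of projection directions, which is the simplest Monte Carlo baseline. It is not taken from the paper: the function names are hypothetical, and it assumes both point clouds have the same number of points so that the 1-D transport cost reduces to matching sorted projections.

```python
import math
import random

def wasserstein_1d(xs, ys, p=2):
    """W_p^p between two equal-size 1-D empirical measures (sort and match)."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)

def random_direction(dim, rng):
    """Uniform direction on the unit sphere via a normalized Gaussian vector."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]

def sliced_wasserstein(X, Y, n_projections=200, p=2, rng=None):
    """Monte Carlo estimate of the Sliced Wasserstein distance SW_p(X, Y)."""
    if len(X) != len(Y):
        raise ValueError("this sketch assumes equal-size point clouds")
    rng = rng or random.Random(0)
    dim = len(X[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_direction(dim, rng)
        # Project every point onto the sampled direction.
        px = [sum(t * c for t, c in zip(theta, x)) for x in X]
        py = [sum(t * c for t, c in zip(theta, y)) for y in Y]
        total += wasserstein_1d(px, py, p)
    return (total / n_projections) ** (1.0 / p)

# Toy usage: a unit square and a copy shifted by 3 along the first axis.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
Y = [(x + 3.0, y) for x, y in X]
print(sliced_wasserstein(X, Y))
```

Other sampling strategies would replace `random_direction` with a different way of choosing the set of directions while leaving the rest of the estimator unchanged.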

[LG-38] Exact Sequence Classification with Hardmax Transformers

Link: https://arxiv.org/abs/2502.02270
Authors: Albert Alcalde, Giovanni Fantuzzi, Enrique Zuazua
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 14 pages, 5 figures. Funded by the European Union (Horizon Europe MSCA project ModConFlex, grant number 101073558)

Abstract:We prove that hardmax attention transformers perfectly classify datasets of $N$ labeled sequences in $\mathbb{R}^d$, $d \geq 2$. Specifically, given $N$ sequences with an arbitrary but finite length in $\mathbb{R}^d$, we construct a transformer with $\mathcal{O}(N)$ blocks and $\mathcal{O}(Nd)$ parameters perfectly classifying this dataset. Our construction achieves the best complexity estimate to date, independent of the length of the sequences, by innovatively alternating feed-forward and self-attention layers and by capitalizing on the clustering effect inherent to the latter. Our novel constructive method also uses low-rank parameter matrices within the attention mechanism, a common practice in real-life transformer implementations. Consequently, our analysis holds twofold significance: it substantially advances the mathematical theory of transformers and it rigorously justifies their exceptional real-world performance in sequence classification tasks.

[LG-39] Adversarial ML Problems Are Getting Harder to Solve and to Evaluate

Link: https://arxiv.org/abs/2502.02260
Authors: Javier Rando, Jie Zhang, Nicholas Carlini, Florian Tramèr
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Abstract:In the past decade, considerable research effort has been devoted to securing machine learning (ML) models that operate in adversarial settings. Yet, progress has been slow even for simple “toy” problems (e.g., robustness to small adversarial perturbations) and is often hindered by non-rigorous evaluations. Today, adversarial ML research has shifted towards studying larger, general-purpose language models. In this position paper, we argue that the situation is now even worse: in the era of LLMs, the field of adversarial ML studies problems that are (1) less clearly defined, (2) harder to solve, and (3) even more challenging to evaluate. As a result, we caution that yet another decade of work on adversarial ML may fail to produce meaningful progress.

[LG-40] Flatten Graphs as Sequences: Transformers are Scalable Graph Generators

Link: https://arxiv.org/abs/2502.02216
Authors: Dexiong Chen, Markus Krimmel, Karsten Borgwardt
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:We introduce AutoGraph, a novel autoregressive framework for generating large attributed graphs using decoder-only transformers. At the core of our approach is a reversible “flattening” process that transforms graphs into random sequences. By sampling and learning from these sequences, AutoGraph enables transformers to model and generate complex graph structures in a manner akin to natural language. In contrast to diffusion models that rely on computationally intensive node features, our approach operates exclusively on these sequences. The sampling complexity and sequence length scale linearly with the number of edges, making AutoGraph highly scalable for generating large sparse graphs. Empirically, AutoGraph achieves state-of-the-art performance across diverse synthetic and molecular graph generation benchmarks, while delivering a 100-fold generation and a 3-fold training speedup compared to leading diffusion models. Additionally, it demonstrates promising transfer capabilities and supports substructure-conditioned generation without additional fine-tuning. By extending language modeling techniques to graph generation, this work paves the way for developing graph foundation models.

[LG-41] On the Expressivity of Selective State-Space Layers: A Multivariate Polynomial Approach

Link: https://arxiv.org/abs/2502.02209
Authors: Edo Cohen-Karlik, Itamar Zimerman, Liane Galanti, Ido Atad, Amir Globerson, Lior Wolf
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Recent advances in efficient sequence modeling have introduced selective state-space layers, a key component of the Mamba architecture, which have demonstrated remarkable success in a wide range of NLP and vision tasks. While Mamba’s empirical performance has matched or surpassed SoTA transformers on such diverse benchmarks, the theoretical foundations underlying its powerful representational capabilities remain less explored. In this work, we investigate the expressivity of selective state-space layers using multivariate polynomials, and prove that they surpass linear transformers in expressiveness. Consequently, our findings reveal that Mamba offers superior representational power over linear attention-based models for long sequences, while not sacrificing their generalization. Our theoretical insights are validated by a comprehensive set of empirical experiments on various datasets.

[LG-42] From Uncertain to Safe: Conformal Fine-Tuning of Diffusion Models for Safe PDE Control

Link: https://arxiv.org/abs/2502.02205
Authors: Peiyan Hu, Xiaowei Qian, Wenhao Deng, Rui Wang, Haodong Feng, Ruiqi Feng, Tao Zhang, Long Wei, Yue Wang, Zhi-Ming Ma, Tailin Wu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The application of deep learning for partial differential equation (PDE)-constrained control is gaining increasing attention. However, existing methods rarely consider safety requirements crucial in real-world applications. To address this limitation, we propose Safe Diffusion Models for PDE Control (SafeDiffCon), which introduce the uncertainty quantile as model uncertainty quantification to achieve optimal control under safety constraints through both post-training and inference phases. Firstly, our approach post-trains a pre-trained diffusion model to generate control sequences that better satisfy safety constraints while achieving improved control objectives via a reweighted diffusion loss, which incorporates the uncertainty quantile estimated using conformal prediction. Secondly, during inference, the diffusion model dynamically adjusts both its generation process and parameters through iterative guidance and fine-tuning, conditioned on control targets while simultaneously integrating the estimated uncertainty quantile. We evaluate SafeDiffCon on three control tasks: 1D Burgers’ equation, 2D incompressible fluid, and controlled nuclear fusion problem. Results demonstrate that SafeDiffCon is the only method that satisfies all safety constraints, whereas other classical and deep learning baselines fail. Furthermore, while adhering to safety constraints, SafeDiffCon achieves the best control performance.
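
The abstract mentions an uncertainty quantile estimated with conformal prediction. As general background, here is a sketch of the standard split-conformal quantile; it is not the SafeDiffCon procedure itself. The function name is hypothetical, and the scores are assumed to be nonconformity scores (for example, absolute errors of a model on a held-out calibration set).

```python
import math

def conformal_quantile(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Under exchangeability, a fresh score falls at or below this value with
    probability at least 1 - alpha.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this coverage level.
        return math.inf
    return sorted(calibration_scores)[rank - 1]

# Toy usage with made-up absolute errors on a calibration set.
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5]
q = conformal_quantile(scores, alpha=0.5)
print(q)
```

A prediction interval for a new input would then be the model output plus or minus `q`, which is where a safety margin could be read off in a control setting.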

[LG-43] Multi-level Supervised Contrastive Learning

Link: https://arxiv.org/abs/2502.02202
Authors: Naghmeh Ghanooni, Barbod Pajoum, Harshit Rawal, Sophie Fellenz, Vo Nguyen Le Duy, Marius Kloft
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Contrastive learning is a well-established paradigm in representation learning. The standard framework of contrastive learning minimizes the distance between “similar” instances and maximizes the distance between dissimilar ones in the projection space, disregarding the various aspects of similarity that can exist between two samples. Current methods rely on a single projection head, which fails to capture the full complexity of different aspects of a sample, leading to suboptimal performance, especially in scenarios with limited training data. In this paper, we present a novel supervised contrastive learning method in a unified framework called multilevel contrastive learning (MLCL) that can be applied to both multi-label and hierarchical classification tasks. The key strength of the proposed method is the ability to capture similarities between samples across different labels and/or hierarchies using multiple projection heads. Extensive experiments on text and image datasets demonstrate that the proposed approach outperforms state-of-the-art contrastive learning methods.

[LG-44] Discovering Quality-Diversity Algorithms via Meta-Black-Box Optimization

Link: https://arxiv.org/abs/2502.02190
Authors: Maxence Faldor, Robert Tjarko Lange, Antoine Cully
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments:

Abstract:Quality-Diversity has emerged as a powerful family of evolutionary algorithms that generate diverse populations of high-performing solutions by implementing local competition principles inspired by biological evolution. While these algorithms successfully foster diversity and innovation, their specific mechanisms rely on heuristics, such as grid-based competition in MAP-Elites or nearest-neighbor competition in unstructured archives. In this work, we propose a fundamentally different approach: using meta-learning to automatically discover novel Quality-Diversity algorithms. By parameterizing the competition rules using attention-based neural architectures, we evolve new algorithms that capture complex relationships between individuals in the descriptor space. Our discovered algorithms demonstrate competitive or superior performance compared to established Quality-Diversity baselines while exhibiting strong generalization to higher dimensions, larger populations, and out-of-distribution domains like robot control. Notably, even when optimized solely for fitness, these algorithms naturally maintain diverse populations, suggesting meta-learning rediscovers that diversity is fundamental to effective optimization.

[LG-45] deCIFer: Crystal Structure Prediction from Powder Diffraction Data using Autoregressive Language Models

Link: https://arxiv.org/abs/2502.02189
Authors: Frederik Lizak Johansen, Ulrik Friis-Jensen, Erik Bjørnager Dam, Kirsten Marie Ørnsbjerg Jensen, Rocío Mercado, Raghavendra Selvan
Categories: Machine Learning (cs.LG)
Comments: 24 pages, 17 figures, 6 tables

Abstract:Novel materials drive progress across applications from energy storage to electronics. Automated characterization of material structures with machine learning methods offers a promising strategy for accelerating this key step in material design. In this work, we introduce an autoregressive language model that performs crystal structure prediction (CSP) from powder diffraction data. The presented model, deCIFer, generates crystal structures in the widely used Crystallographic Information File (CIF) format and can be conditioned on powder X-ray diffraction (PXRD) data. Unlike earlier works that primarily rely on high-level descriptors like composition, deCIFer performs CSP from diffraction data. We train deCIFer on nearly 2.3M unique crystal structures and validate on diverse sets of PXRD patterns for characterizing challenging inorganic crystal systems. Qualitative and quantitative assessments using the residual weighted profile and Wasserstein distance show that deCIFer produces structures that more accurately match the target diffraction data when conditioned, compared to the unconditioned case. Notably, deCIFer can achieve a 94% match rate on unseen data. deCIFer bridges experimental diffraction data with computational CSP, lending itself as a powerful tool for crystal structure characterization and accelerating materials discovery.

[LG-46] Generative Kernel Spectral Clustering

Link: https://arxiv.org/abs/2502.02185
Authors: David Winant, Sonny Achten, Johan A. K. Suykens
Categories: Machine Learning (cs.LG)
Comments: Accepted for publication at ESANN 2025

Abstract:Modern clustering approaches often trade interpretability for performance, particularly in deep learning-based methods. We present Generative Kernel Spectral Clustering (GenKSC), a novel model combining kernel spectral clustering with generative modeling to produce both well-defined clusters and interpretable representations. By augmenting weighted variance maximization with reconstruction and clustering losses, our model creates an explorable latent space where cluster characteristics can be visualized through traversals along cluster directions. Results on MNIST and FashionMNIST datasets demonstrate the model’s ability to learn meaningful cluster representations.

[LG-47] Deep Neural Cellular Potts Models

Link: https://arxiv.org/abs/2502.02129
Authors: Koen Minartz, Tim d’Hondt, Leon Hillmann, Jörn Starruß, Lutz Brusch, Vlado Menkovski
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:The cellular Potts model (CPM) is a powerful computational method for simulating collective spatiotemporal dynamics of biological cells. To drive the dynamics, CPMs rely on physics-inspired Hamiltonians. However, as first principles remain elusive in biology, these Hamiltonians only approximate the full complexity of real multicellular systems. To address this limitation, we propose NeuralCPM, a more expressive cellular Potts model that can be trained directly on observational data. At the core of NeuralCPM lies the Neural Hamiltonian, a neural network architecture that respects universal symmetries in collective cellular dynamics. Moreover, this approach enables seamless integration of domain knowledge by combining known biological mechanisms and the expressive Neural Hamiltonian into a hybrid model. Our evaluation with synthetic and real-world multicellular systems demonstrates that NeuralCPM is able to model cellular dynamics that cannot be accounted for by traditional analytical Hamiltonians.

[LG-48] BILBO: BILevel Bayesian Optimization

Link: https://arxiv.org/abs/2502.02121
Authors: Ruth Wan Theng Chew, Quoc Phong Nguyen, Bryan Kian Hsiang Low
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Bilevel optimization is characterized by a two-level optimization structure, where the upper-level problem is constrained by optimal lower-level solutions, and such structures are prevalent in real-world problems. The constraint by optimal lower-level solutions poses significant challenges, especially in noisy, constrained, and derivative-free settings, as repeating lower-level optimizations is sample inefficient and predicted lower-level solutions may be suboptimal. We present BILevel Bayesian Optimization (BILBO), a novel Bayesian optimization algorithm for general bilevel problems with blackbox functions, which optimizes both upper- and lower-level problems simultaneously, without the repeated lower-level optimization required by existing methods. BILBO samples from confidence-bounds based trusted sets, which bounds the suboptimality on the lower level. Moreover, BILBO selects only one function query per iteration, where the function query selection strategy incorporates the uncertainty of estimated lower-level solutions and includes a conditional reassignment of the query to encourage exploration of the lower-level objective. The performance of BILBO is theoretically guaranteed with a sublinear regret bound for commonly used kernels and is empirically evaluated on several synthetic and real-world problems.

[LG-49] Concept-Aware Latent and Explicit Knowledge Integration for Enhanced Cognitive Diagnosis

Link: https://arxiv.org/abs/2502.02104
Authors: Yawen Chen, Jiande Sun, Jing Li, Huaxiang Zhang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Cognitive diagnosis can infer the students’ mastery of specific knowledge concepts based on historical response logs. However, the existing cognitive diagnostic models (CDMs) represent students’ proficiency via a unidimensional perspective, which can’t assess the students’ mastery on each knowledge concept comprehensively. Moreover, the Q-matrix binarizes the relationship between exercises and knowledge concepts, and it can’t represent the latent relationship between exercises and knowledge concepts. Especially, when the granularity of knowledge attributes refines increasingly, the Q-matrix becomes incomplete correspondingly and the sparse binary representation (0/1) fails to capture the intricate relationships among knowledge concepts. To address these issues, we propose a Concept-aware Latent and Explicit Knowledge Integration model for cognitive diagnosis (CLEKI-CD). Specifically, a multidimensional vector is constructed according to the students’ mastery and exercise difficulty for each knowledge concept from multiple perspectives, which enhances the representation capabilities of the model. Moreover, a latent Q-matrix is generated by our proposed attention-based knowledge aggregation method, and it can uncover the coverage degree of exercises over latent knowledge. The latent Q-matrix can supplement the sparse explicit Q-matrix with the inherent relationships among knowledge concepts, and mitigate the knowledge coverage problem. Furthermore, we employ a combined cognitive diagnosis layer to integrate both latent and explicit knowledge, further enhancing cognitive diagnosis performance. Extensive experiments on real-world datasets demonstrate that CLEKI-CD outperforms the state-of-the-art models. The proposed CLEKI-CD is promising in practical applications in the field of intelligent education, as it exhibits good interpretability with diagnostic results.

[LG-50] A New Rejection Sampling Approach to $k$-$\mathtt{means}$++ With Improved Trade-Offs

Link: https://arxiv.org/abs/2502.02085
Authors: Poojan Shah, Shashwat Agrawal, Ragesh Jaiswal
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments:

Abstract:The $k$-$\mathtt{means}$++ seeding algorithm (Arthur & Vassilvitskii, 2007) is widely used in practice for the $k$-means clustering problem where the goal is to cluster a dataset $\mathcal{X} \subset \mathbb{R}^d$ into $k$ clusters. The popularity of this algorithm is due to its simplicity and provable guarantee of being $O(\log k)$ competitive with the optimal solution in expectation. However, its running time is $O(|\mathcal{X}|kd)$, making it expensive for large datasets. In this work, we present a simple and effective rejection sampling based approach for speeding up $k$-$\mathtt{means}$++. Our first method runs in time $\tilde{O}(\mathtt{nnz}(\mathcal{X}) + \beta k^2 d)$ while still being $O(\log k)$ competitive in expectation. Here, $\beta$ is a parameter which is the ratio of the variance of the dataset to the optimal $k$-$\mathtt{means}$ cost in expectation and $\tilde{O}$ hides logarithmic factors in $k$ and $|\mathcal{X}|$. Our second method presents a new trade-off between computational cost and solution quality. It incurs an additional scale-invariant factor of $k^{-\Omega(m/\beta)} \operatorname{Var}(\mathcal{X})$ in addition to the $O(\log k)$ guarantee of $k$-$\mathtt{means}$++ improving upon a result of (Bachem et al, 2016a) who get an additional factor of $m^{-1}\operatorname{Var}(\mathcal{X})$ while still running in time $\tilde{O}(\mathtt{nnz}(\mathcal{X}) + mk^2 d)$. We perform extensive empirical evaluations to validate our theoretical results and to show the effectiveness of our approach on real datasets.
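
As background for the abstract above, the following sketch shows the classical k-means++ seeding step (often called D² sampling) that the paper sets out to accelerate. It is the standard baseline, not the rejection sampling method proposed by the authors, and the function names are hypothetical. Each round rescans the whole dataset, which is the cost the abstract states as linear in the dataset size, k, and d.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_seeding(points, k, rng=None):
    """Classical k-means++ seeding: pick each new center with probability
    proportional to its squared distance from the nearest chosen center."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0.0:
            # Every point coincides with a center; fall back to uniform choice.
            centers.append(rng.choice(points))
            continue
        idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # One full pass over the data per new center.
        d2 = [min(old, squared_distance(p, points[idx]))
              for old, p in zip(d2, points)]
    return centers

# Toy usage on a handful of made-up 2-D points.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (-5.0, 7.0)]
print(kmeans_pp_seeding(pts, 3))
```

Points that are already centers have weight zero and are never drawn again, so the returned centers are distinct whenever the input points are.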

[LG-51] ContinuouSP: Generative Model for Crystal Structure Prediction with Invariance and Continuity AAAI

Link: https://arxiv.org/abs/2502.02026
Authors: Yuji Tone,Masatoshi Hanai,Mitsuaki Kawamura,Kenjiro Taura,Toyotaro Suzumura
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*Comments: Accepted at the 4th Annual AAAI Workshop on AI to Accelerate Science and Engineering (AI2ASE)

Click to view abstract

Abstract:The discovery of new materials using crystal structure prediction (CSP) based on generative machine learning models has become a significant research topic in recent years. In this paper, we study invariance and continuity in the generative machine learning for CSP. We propose a new model, called ContinuouSP, which effectively handles symmetry and periodicity in crystals. We clearly formulate the invariance and the continuity, and construct a model based on the energy-based model. Our preliminary evaluation demonstrates the effectiveness of this model with the CSP task.

[LG-52] Causal bandits with backdoor adjustment on unknown Gaussian DAGs

Link: https://arxiv.org/abs/2502.02020
Authors: Yijia Zhao,Qing Zhou
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
*Comments:

Click to view abstract

Abstract:The causal bandit problem aims to sequentially learn the intervention that maximizes the expectation of a reward variable within a system governed by a causal graph. Most existing approaches assume prior knowledge of the graph structure, or impose unrealistically restrictive conditions on the graph. In this paper, we assume a Gaussian linear directed acyclic graph (DAG) over arms and the reward variable, and study the causal bandit problem when the graph structure is unknown. We identify backdoor adjustment sets for each arm using sequentially generated experimental and observational data during the decision process, which allows us to estimate causal effects and construct upper confidence bounds. By integrating estimates from both data sources, we develop a novel bandit algorithm, based on modified upper confidence bounds, to sequentially determine the optimal intervention. We establish both case-dependent and case-independent upper bounds on the cumulative regret for our algorithm, which improve upon the bounds of the standard multi-armed bandit algorithms. Our empirical study demonstrates its advantage over existing methods with respect to cumulative regret and computation time.

[LG-53] Dual Ensembled Multiagent Q-Learning with Hypernet Regularizer AAMAS2025

Link: https://arxiv.org/abs/2502.02018
Authors: Yaodong Yang,Guangyong Chen,Hongyao Tang,Furui Liu,Danruo Deng,Pheng Ann Heng
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*Comments: 15 pages, AAMAS 2025 version with appendix

Click to view abstract

Abstract:Overestimation in single-agent reinforcement learning has been extensively studied. In contrast, overestimation in the multiagent setting has received comparatively little attention although it increases with the number of agents and leads to severe learning instability. Previous works concentrate on reducing overestimation in the estimation process of target Q-value. They ignore the follow-up optimization process of online Q-network, thus making it hard to fully address the complex multiagent overestimation problem. To solve this challenge, in this study, we first establish an iterative estimation-optimization analysis framework for multiagent value-mixing Q-learning. Our analysis reveals that multiagent overestimation not only comes from the computation of target Q-value but also accumulates in the online Q-network’s optimization. Motivated by it, we propose the Dual Ensembled Multiagent Q-Learning with Hypernet Regularizer algorithm to tackle multiagent overestimation from two aspects. First, we extend the random ensemble technique into the estimation of target individual and global Q-values to derive a lower update target. Second, we propose a novel hypernet regularizer on hypernetwork weights and biases to constrain the optimization of online global Q-network to prevent overestimation accumulation. Extensive experiments in MPE and SMAC show that the proposed method successfully addresses overestimation across various tasks.

[LG-54] T-SCEND: Test-time Scalable MCTS-enhanced Diffusion Model

Link: https://arxiv.org/abs/2502.01989
Authors: Tao Zhang,Jia-Shu Pan,Ruiqi Feng,Tailin Wu
Subjects: Machine Learning (cs.LG)
*Comments: 20 pages, 12 figures

Click to view abstract

Abstract:We introduce Test-time Scalable MCTS-enhanced Diffusion Model (T-SCEND), a novel framework that significantly improves diffusion models’ reasoning capabilities with better energy-based training and scaling up test-time computation. We first show that naïvely scaling up the inference budget for diffusion models yields marginal gain. To address this, the training of T-SCEND consists of a novel linear-regression negative contrastive learning objective to improve the performance-energy consistency of the energy landscape, and a KL regularization to reduce adversarial sampling. During inference, T-SCEND integrates the denoising process with a novel hybrid Monte Carlo Tree Search (hMCTS), which sequentially performs best-of-N random search and MCTS as denoising proceeds. On challenging reasoning tasks of Maze and Sudoku, we demonstrate the effectiveness of T-SCEND’s training objective and scalable inference method. In particular, trained with Maze sizes of up to $6\times6$, our T-SCEND solves 88% of Maze problems with much larger sizes of $15\times15$, while standard diffusion completely fails. Code to reproduce the experiments can be found at this https URL.

[LG-55] Ilargi: a GPU Compatible Factorized ML Model Training Framework

Link: https://arxiv.org/abs/2502.01985
Authors: Wenbo Sun,Rihan Hai
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:

Click to view abstract

Abstract:Machine learning (ML) training over disparate data sources traditionally involves materialization, which can impose substantial time and space overhead due to data movement and replication. Factorized learning, which leverages direct computation on disparate sources through linear algebra (LA) rewriting, has emerged as a viable alternative to improve computational efficiency. However, the adaptation of factorized learning to leverage the full capabilities of modern LA-friendly hardware like GPUs has been limited, often requiring manual intervention for algorithm compatibility. This paper introduces Ilargi, a novel factorized learning framework that utilizes matrix-represented data integration (DI) metadata to facilitate automatic factorization across CPU and GPU environments without the need for costly relational joins. Ilargi incorporates an ML-based cost estimator to intelligently select between factorization and materialization based on data properties, algorithm complexity, hardware environments, and their interactions. This strategy ensures up to 8.9x speedups on GPUs and achieves over 20% acceleration in batch ML training workloads, thereby enhancing the practicability of ML training across diverse data integration scenarios and hardware platforms. To our knowledge, this work is the very first effort in GPU-compatible factorized learning.

[LG-56] MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving

Link: https://arxiv.org/abs/2502.01960
Authors: Shiju Zhao,Junhao Hu,Rongxiao Huang,Jiaqi Zheng,Guihai Chen
Subjects: Machine Learning (cs.LG)
*Comments: 14 pages, 11 figures, the first version

Click to view abstract

Abstract:Context caching is currently employed by prevailing serving platforms to accelerate Multimodal Large Language Model (MLLM) inference. However, this approach merely reuses the Key-Value (KV) cache of the initial sequence of the prompt, resulting in full KV cache recomputation even if the prefix differs slightly. This becomes particularly inefficient in the context of interleaved text and images, as well as multimodal retrieval-augmented generation. This paper proposes position-independent caching as a more effective approach for multimodal information management. We have designed and implemented a caching system, named MPIC, to address both system-level and algorithm-level challenges. MPIC stores the KV cache on local or remote disks when receiving multimodal data, and calculates and loads the KV cache in parallel during inference. To mitigate accuracy degradation, we have incorporated integrated reuse and recompute mechanisms within the system. The experimental results demonstrate that MPIC can achieve up to 54% reduction in response time compared to existing context caching systems, while maintaining negligible or no accuracy loss.

[LG-57] Constrained belief updates explain geometric structures in transformer representations

Link: https://arxiv.org/abs/2502.01954
Authors: Mateusz Piotrowski,Paul M. Riechers,Daniel Filan,Adam S. Shai
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:What computational structures emerge in transformers trained on next-token prediction? In this work, we provide evidence that transformers implement constrained Bayesian belief updating – a parallelized version of partial Bayesian inference shaped by architectural constraints. To do this, we integrate the model-agnostic theory of optimal prediction with mechanistic interpretability to analyze transformers trained on a tractable family of hidden Markov models that generate rich geometric patterns in neural activations. We find that attention heads carry out an algorithm with a natural interpretation in the probability simplex, and create representations with distinctive geometric structure. We show how both the algorithmic behavior and the underlying geometry of these representations can be theoretically predicted in detail – including the attention pattern, OV-vectors, and embedding vectors – by modifying the equations for optimal future token predictions to account for the architectural constraints of attention. Our approach provides a principled lens on how gradient descent resolves the tension between optimal prediction and architectural design.

[LG-58] On the Emergence of Position Bias in Transformers

Link: https://arxiv.org/abs/2502.01951
Authors: Xinyi Wu,Yifei Wang,Stefanie Jegelka,Ali Jadbabaie
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Recent studies have revealed various manifestations of position bias in transformer architectures, from the “lost-in-the-middle” phenomenon to attention sinks, yet a comprehensive theoretical understanding of how attention masks and positional encodings shape these biases remains elusive. This paper introduces a novel graph-theoretic framework to analyze position bias in multi-layer attention. Modeling attention masks as directed graphs, we quantify how tokens interact with contextual information based on their sequential positions. We uncover two key insights: First, causal masking inherently biases attention toward earlier positions, as tokens in deeper layers attend to increasingly more contextualized representations of earlier tokens. Second, we characterize the competing effects of the causal mask and relative positional encodings, such as the decay mask and rotary positional encoding (RoPE): while both mechanisms introduce distance-based decay within individual attention maps, their aggregate effect across multiple attention layers – coupled with the causal mask – leads to a trade-off between the long-term decay effects and the cumulative importance of early sequence positions. Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs. Our framework offers a principled foundation for understanding positional biases in transformers, shedding light on the complex interplay of attention mechanism components and guiding more informed architectural design.
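
A toy illustration of the first insight, under assumptions that are not taken from the paper: every token attends uniformly to itself and to all earlier tokens (a causal mask with uniform weights), and stacking layers is modeled as multiplying the attention matrices. All helper names are hypothetical.

```python
def causal_uniform_attention(n):
    """Row i spreads weight 1/(i+1) over positions 0..i (causal mask)."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain square matrix product on nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def last_token_influence(n, layers):
    """Weights with which the last token aggregates each input position
    after `layers` stacked attention layers (toy model: A ** layers)."""
    attn = causal_uniform_attention(n)
    prod = attn
    for _ in range(layers - 1):
        prod = matmul(attn, prod)
    return prod[-1]

# Share of the first position in the last token's representation.
for depth in (1, 2, 4, 8):
    print(depth, round(last_token_influence(8, depth)[0], 4))
```

In this toy setting the first position acts as an absorbing state of the underlying directed graph, so its share of the last token's representation grows with depth, which is consistent with the bias toward earlier positions described in the abstract.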

[LG-59] Query-Based and Unnoticeable Graph Injection Attack from Neighborhood Perspective

Link: https://arxiv.org/abs/2502.01936
Authors: Chang Liu,Hai Huang,Yujie Xing,Xingquan Zuo
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments:

Click to view abstract

Abstract:The robustness of Graph Neural Networks (GNNs) has become an increasingly important topic due to their expanding range of applications. Various attack methods have been proposed to explore the vulnerabilities of GNNs, ranging from Graph Modification Attacks (GMA) to the more practical and flexible Graph Injection Attacks (GIA). However, existing methods face two key challenges: (i) their reliance on surrogate models, which often leads to reduced attack effectiveness due to structural differences and prior biases, and (ii) existing GIA methods often sacrifice attack success rates in undefended settings to bypass certain defense models, thereby limiting their overall effectiveness. To overcome these limitations, we propose QUGIA, a Query-based and Unnoticeable Graph Injection Attack. QUGIA injects nodes by first selecting edges based on victim node connections and then generating node features using a Bayesian framework. This ensures that the injected nodes are similar to the original graph nodes, implicitly preserving homophily and making the attack more unnoticeable. Unlike previous methods, QUGIA does not rely on surrogate models, thereby avoiding performance degradation and achieving better generalization. Extensive experiments on six real-world datasets with diverse characteristics demonstrate that QUGIA achieves unnoticeable attacks and outperforms state-of-the-art attackers. The code will be released upon acceptance.

[LG-60] Anomaly Detection via Autoencoder Composite Features and NCE

Link: https://arxiv.org/abs/2502.01920
Authors: Yalin Liao,Austin J. Brockmeier
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Unsupervised anomaly detection is a challenging task. Autoencoders (AEs) or generative models are often employed to model the data distribution of normal inputs and subsequently identify anomalous, out-of-distribution inputs by high reconstruction error or low likelihood, respectively. However, AEs may generalize and achieve small reconstruction errors on abnormal inputs. We propose a decoupled training approach for anomaly detection that uses both an AE and a likelihood model trained with noise contrastive estimation (NCE). After training the AE, NCE estimates a probability density function, to serve as the anomaly score, on the joint space of the AE’s latent representation combined with features of the reconstruction quality. To further reduce the false negative rate in NCE, we systematically vary the reconstruction features to augment the training and optimize the contrastive Gaussian noise distribution. Experimental assessments on multiple benchmark datasets demonstrate that the proposed approach matches the performance of prevalent state-of-the-art anomaly detection algorithms.

[LG-61] Generalizable and Fast Surrogates: Model Predictive Control of Articulated Soft Robots using Physics-Informed Neural Networks

Link: https://arxiv.org/abs/2502.01916
Authors: Tim-Lukas Habich,Aran Mohammad,Simon F. G. Ehlers,Martin Bensch,Thomas Seel,Moritz Schappler
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Soft robots can revolutionize several applications with high demands on dexterity and safety. When operating these systems, real-time estimation and control require fast and accurate models. However, prediction with first-principles (FP) models is slow, and learned black-box models have poor generalizability. Physics-informed machine learning offers excellent advantages here, but it is currently limited to simple, often simulated systems without considering changes after training. We propose physics-informed neural networks (PINNs) for articulated soft robots (ASRs) with a focus on data efficiency. The amount of expensive real-world training data is reduced to a minimum - one dataset in one system domain. Two hours of data in different domains are used for a comparison against two gold-standard approaches: In contrast to a recurrent neural network, the PINN provides a high generalizability. The prediction speed of an accurate FP model is improved with the PINN by up to a factor of 466 at slightly reduced accuracy. This enables nonlinear model predictive control (MPC) of the pneumatic ASR. In nine dynamic MPC experiments, an average joint-tracking error of 1.3° is achieved.

[LG-62] Composite Gaussian Processes Flows for Learning Discontinuous Multimodal Policies

Link: https://arxiv.org/abs/2502.01913
Authors: Shu-yuan Wang,Hikaru Sasaki,Takamitsu Matsubara
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Learning control policies for real-world robotic tasks often involves challenges such as multimodality, local discontinuities, and the need for computational efficiency. These challenges arise from the complexity of robotic environments, where multiple solutions may coexist. To address these issues, we propose Composite Gaussian Processes Flows (CGP-Flows), a novel semi-parametric model for robotic policy. CGP-Flows integrate Overlapping Mixtures of Gaussian Processes (OMGPs) with the Continuous Normalizing Flows (CNFs), enabling them to model complex policies addressing multimodality and local discontinuities. This hybrid approach retains the computational efficiency of OMGPs while incorporating the flexibility of CNFs. Experiments conducted in both simulated and real-world robotic tasks demonstrate that CGP-Flows significantly improve performance in modeling control policies. In a simulation task, we confirmed that CGP-Flows had a higher success rate compared to the baseline method, and the success rate of CGP-Flows was significantly different from the success rate of other baselines in chi-square tests.

[LG-63] Unlocking Efficient Large Inference Models: One-Bit Unrolling Tips the Scales

Link: https://arxiv.org/abs/2502.01908
Authors: Arian Eamaz,Farhang Yeganegi,Mojtaba Soltanalian
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Recent advancements in Large Language Model (LLM) compression, such as BitNet and BitNet b1.58, have marked significant strides in reducing the computational demands of LLMs through innovative one-bit quantization techniques. We extend this frontier by looking at Large Inference Models (LIMs) that have become indispensable across various applications. However, their scale and complexity often come at a significant computational cost. We introduce a novel approach that leverages one-bit algorithm unrolling, effectively integrating information from the physical world in the model architecture. Our method achieves a bit-per-link rate significantly lower than the 1.58 bits reported in prior work, thanks to the natural sparsity that emerges in our network architectures. We numerically demonstrate that the proposed one-bit algorithm unrolling scheme can improve both training and test outcomes by effortlessly increasing the number of layers while substantially compressing the network. Additionally, we provide theoretical results on the generalization gap, convergence rate, stability, and sensitivity of our proposed one-bit algorithm unrolling.

[LG-64] Reinforcement Learning with Segment Feedback

Link: https://arxiv.org/abs/2502.01876
Authors: Yihan Du,Anna Winnicki,Gal Dalal,Shie Mannor,R. Srikant
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Standard reinforcement learning (RL) assumes that an agent can observe a reward for each state-action pair. However, in practical applications, it is often difficult and costly to collect a reward for each state-action pair. While there have been several works considering RL with trajectory feedback, it is unclear if trajectory feedback is inefficient for learning when trajectories are long. In this work, we consider a model named RL with segment feedback, which offers a general paradigm filling the gap between per-state-action feedback and trajectory feedback. In this model, we consider an episodic Markov decision process (MDP), where each episode is divided into m segments, and the agent observes reward feedback only at the end of each segment. Under this model, we study two popular feedback settings: binary feedback and sum feedback, where the agent observes a binary outcome and a reward sum according to the underlying reward function, respectively. To investigate the impact of the number of segments m on learning performance, we design efficient algorithms and establish regret upper and lower bounds for both feedback settings. Our theoretical and experimental results show that: under binary feedback, increasing the number of segments m decreases the regret at an exponential rate; in contrast, surprisingly, under sum feedback, increasing m does not reduce the regret significantly.
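
The feedback model described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the function name `segment_feedback` is hypothetical, equal-length segments are assumed, and drawing the binary outcome from a sigmoid of the segment's reward sum is an assumption made here for concreteness.

```python
import math
import random

def segment_feedback(rewards, m, mode="sum", rng=None):
    """Split one episode's per-step rewards into m equal segments and
    return what the agent would observe at the end of each segment."""
    horizon = len(rewards)
    if m <= 0 or horizon % m != 0:
        raise ValueError("episode length must be a positive multiple of m")
    seg_len = horizon // m
    sums = [sum(rewards[i:i + seg_len]) for i in range(0, horizon, seg_len)]
    if mode == "sum":
        return sums  # sum feedback: the reward sum of each segment
    if mode == "binary":
        rng = rng or random.Random()
        # Assumed link function: P(outcome = 1) = sigmoid(segment sum)
        return [1 if rng.random() < 1.0 / (1.0 + math.exp(-s)) else 0
                for s in sums]
    raise ValueError("mode must be 'sum' or 'binary'")
```

With `m = 1` this reduces to trajectory feedback, and with `m` equal to the episode length it reduces to per-state-action feedback, which are the two extremes the model interpolates between.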

[LG-65] Optimizing Online Advertising with Multi-Armed Bandits: Mitigating the Cold Start Problem under Auction Dynamics

Link: https://arxiv.org/abs/2502.01867
Authors: Anastasiia Soboleva,Andrey Pudovikov,Roman Snetkov,Alina Babenko,Egor Samosvat,Yuriy Dorn
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Online advertising platforms often face a common challenge: the cold start problem. Insufficient behavioral data (clicks) makes accurate click-through rate (CTR) forecasting of new ads challenging. CTR for “old” items can also be significantly underestimated due to their early performance influencing their long-term behavior on the platform. The cold start problem has far-reaching implications for businesses, including missed long-term revenue opportunities. To mitigate this issue, we developed a UCB-like algorithm under multi-armed bandit (MAB) setting for positional-based model (PBM), specifically tailored to auction pay-per-click systems. Our proposed algorithm successfully combines theory and practice: we obtain theoretical upper estimates of budget regret, and conduct a series of experiments on synthetic and real-world data that confirm the applicability of the method on the real platform. In addition to increasing the platform’s long-term profitability, we also propose a mechanism for maintaining short-term profits through controlled exploration and exploitation of items.
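
As background for the "UCB-like" idea, here is a minimal sketch of generic UCB1 on simulated click feedback. It deliberately ignores ad positions and auction dynamics, so it is not the paper's algorithm; the CTR values in the usage line and the names `ucb_index` and `run_ucb1` are made up for illustration.

```python
import math
import random

def ucb_index(mean, pulls, t):
    """UCB1 index: empirical CTR plus an exploration bonus that shrinks
    as an ad accumulates impressions."""
    return mean + math.sqrt(2.0 * math.log(t) / pulls)

def run_ucb1(true_ctrs, horizon, seed=0):
    """Simulate UCB1 over ads with Bernoulli clicks; returns per-ad
    impression and click counts."""
    rng = random.Random(seed)
    k = len(true_ctrs)
    pulls = [0] * k
    clicks = [0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # show every ad once so each index is defined
        else:
            arm = max(range(k),
                      key=lambda a: ucb_index(clicks[a] / pulls[a], pulls[a], t))
        pulls[arm] += 1
        if rng.random() < true_ctrs[arm]:
            clicks[arm] += 1
    return pulls, clicks

pulls, clicks = run_ucb1([0.02, 0.05, 0.10], horizon=2000)
print(pulls, clicks)
```

The exploration bonus is what gives a new ad with few impressions a chance to be shown despite a noisy early CTR estimate, which is the cold start concern the abstract raises.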

[LG-66] Enhancing Generalization via Sharpness-Aware Trajectory Matching for Dataset Condensation

Link: https://arxiv.org/abs/2502.01865
Authors: Boyan Gao,Bo Zhao,Shreyank N Gowda,Xingrun Xing,Yibo Yang,Timothy Hospedales,David A. Clifton
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Dataset condensation aims to synthesize datasets with a few representative samples that can effectively represent the original datasets. This enables efficient training and produces models with performance close to that of models trained on the original sets. Most existing dataset condensation methods conduct dataset learning under bilevel (inner- and outer-loop) optimization. However, these methods achieve limited dataset generalization due to the notoriously complicated loss landscape and the expensive time-space complexity of the inner-loop unrolling of bilevel optimization. These issues worsen when the datasets are learned via matching the trajectories of networks trained on the real and synthetic datasets with a long-horizon inner loop. To address these issues, we introduce Sharpness-Aware Trajectory Matching (SATM), which enhances the generalization capability of learned synthetic datasets by optimising the sharpness of the loss landscape and objective simultaneously. Moreover, our approach is coupled with an efficient hypergradient approximation that is mathematically well-supported and straightforward to implement along with controllable computational overhead. Empirical evaluations of SATM demonstrate its effectiveness across various applications, including in-domain benchmarks and out-of-domain settings. Moreover, its easy-to-implement properties afford flexibility, allowing it to integrate with other advanced sharpness-aware minimizers. Our code will be released.

[LG-67] Learning Hyperparameters via a Data-Emphasized Variational Objective

Link: https://arxiv.org/abs/2502.01861
Authors: Ethan Harvey,Mikhail Petrov,Michael C. Hughes
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: arXiv admin note: text overlap with arXiv:2410.19675

Click to view abstract

Abstract:When training large flexible models, practitioners often rely on grid search to select hyperparameters that control over-fitting. This grid search has several disadvantages: the search is computationally expensive, requires carving out a validation set that reduces the available data for training, and requires users to specify candidate values. In this paper, we propose an alternative: directly learning regularization hyperparameters on the full training set via the evidence lower bound (“ELBo”) objective from variational methods. For deep neural networks with millions of parameters, we recommend a modified ELBo that upweights the influence of the data likelihood relative to the prior. Our proposed technique overcomes all three disadvantages of grid search. In a case study on transfer learning of image classifiers, we show how our method reduces the 88+ hour grid search of past work to under 3 hours while delivering comparable accuracy. We further demonstrate how our approach enables efficient yet accurate approximations of Gaussian processes with learnable length-scale kernels.
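
One way to write the idea of upweighting the data likelihood is shown below. The abstract does not give the exact objective, so the weight $\kappa$ and the notation ($q$ for the variational posterior, $\eta$ for the regularization hyperparameters that enter through the prior) are assumptions made here for illustration.

```latex
% Standard ELBo for parameters \theta, data y, prior hyperparameters \eta
\mathcal{L}_{\mathrm{ELBo}}(q, \eta)
  = \mathbb{E}_{q(\theta)}\big[\log p(y \mid \theta)\big]
  - \mathrm{KL}\big(q(\theta) \,\|\, p_{\eta}(\theta)\big)

% Data-emphasized variant: likelihood term scaled by an assumed factor \kappa > 1
\mathcal{L}_{\kappa}(q, \eta)
  = \kappa \, \mathbb{E}_{q(\theta)}\big[\log p(y \mid \theta)\big]
  - \mathrm{KL}\big(q(\theta) \,\|\, p_{\eta}(\theta)\big)
```

Setting $\kappa = 1$ recovers the standard ELBo. Because both $q$ and $\eta$ are optimized on the same objective, no separate validation split is needed to choose $\eta$.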

[LG-68] SE Arena: Benchmarking Software Engineering Chatbots with Iterative Interactions

Link: https://arxiv.org/abs/2502.01860
Authors: Zhimin Zhao
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
*Comments: this https URL

Click to view abstract

Abstract:Foundation models (FMs), particularly large language models (LLMs), have shown significant promise in various software engineering (SE) tasks, including code generation, debugging, and requirement refinement. Despite these advances, existing evaluation frameworks are insufficient for assessing model performance in iterative, context-rich workflows characteristic of SE activities. To address this limitation, we introduce SE Arena, an interactive platform designed to evaluate SE-focused chatbots. SE Arena provides a transparent, open-source leaderboard, supports multi-round conversational workflows, and enables end-to-end model comparisons. Moreover, SE Arena incorporates a new feature called RepoChat, which automatically injects repository-related context (e.g., issues, commits, pull requests) into the conversation, further aligning evaluations with real-world development processes. This paper outlines the design and capabilities of SE Arena, emphasizing its potential to advance the evaluation and practical application of FMs in software engineering.

[LG-69] How to warm-start your unfolding network

Link: https://arxiv.org/abs/2502.01854
Authors: Vicky Kouni
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*Comments:

Click to view abstract

Abstract:We present a new ensemble framework for boosting the performance of overparameterized unfolding networks solving the compressed sensing problem. We combine a state-of-the-art overparameterized unfolding network with a continuation technique, to warm-start a crucial quantity of the said network’s architecture; we coin the resulting continued network C-DEC. Moreover, for training and evaluating C-DEC, we incorporate the log-cosh loss function, which enjoys both linear and quadratic behavior. Finally, we numerically assess C-DEC’s performance on real-world images. Results showcase that the combination of continuation with the overparameterized unfolded architecture, trained and evaluated with the chosen loss function, yields smoother loss landscapes and improved reconstruction and generalization performance of C-DEC, consistently for all datasets.
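
The remark that the log-cosh loss "enjoys both linear and quadratic behavior" can be checked numerically: log(cosh(r)) is close to r^2/2 for small residuals and close to |r| - log 2 for large ones. The sketch below is a generic, numerically stable implementation, not the C-DEC training code; the function names are illustrative.

```python
import math

def log_cosh(r):
    """Numerically stable log(cosh(r)).
    Uses log(cosh(r)) = |r| + log(1 + exp(-2|r|)) - log(2),
    which avoids overflow in cosh for large |r|."""
    a = abs(r)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def log_cosh_loss(pred, target):
    """Mean log-cosh loss over paired predictions and targets."""
    return sum(log_cosh(p - t) for p, t in zip(pred, target)) / len(pred)

# Quadratic regime: small residual, compare with r^2 / 2
print(log_cosh(1e-3), 0.5 * 1e-3 ** 2)
# Linear regime: large residual, compare with |r| - log 2
print(log_cosh(50.0), 50.0 - math.log(2.0))
```

This mix is why the loss behaves like squared error near zero while being less sensitive to large residuals, much like an absolute-error loss.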

[LG-70] Security and Quality in LLM-Generated Code: A Multi-Language Multi-Model Analysis

Link: https://arxiv.org/abs/2502.01853
Authors: Mohammed Kharma,Soohyeon Choi,Mohammed AlKhanafseh,David Mohaisen
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*Comments: 12 pages, 10 tables. In submission to IEEE Transactions on Dependable and Secure Computing

Click to view abstract

Abstract:Artificial Intelligence (AI)-driven code generation tools are increasingly used throughout the software development lifecycle to accelerate coding tasks. However, the security of AI-generated code using Large Language Models (LLMs) remains underexplored, with studies revealing various risks and weaknesses. This paper analyzes the security of code generated by LLMs across different programming languages. We introduce a dataset of 200 tasks grouped into six categories to evaluate the performance of LLMs in generating secure and maintainable code. Our research shows that while LLMs can automate code creation, their security effectiveness varies by language. Many models fail to utilize modern security features in recent compiler and toolkit updates, such as Java 17. Moreover, outdated methods are still commonly used, particularly in C++. This highlights the need for advancing LLMs to enhance security and quality while incorporating emerging best practices in programming languages.

[LG-71] From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment

Link: https://arxiv.org/abs/2502.01828
Authors: Yilin Wu, Ran Tian, Gokul Swamy, Andrea Bajcsy
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:While generative robot policies have demonstrated significant potential in learning complex, multimodal behaviors from demonstrations, they still exhibit diverse failures at deployment-time. Policy steering offers an elegant solution to reducing the chance of failure by using an external verifier to select from low-level actions proposed by an imperfect generative policy. Here, one might hope to use a Vision Language Model (VLM) as a verifier, leveraging its open-world reasoning capabilities. However, off-the-shelf VLMs struggle to understand the consequences of low-level robot actions as they are represented fundamentally differently than the text and images the VLM was trained on. In response, we propose FOREWARN, a novel framework to unlock the potential of VLMs as open-vocabulary verifiers for runtime policy steering. Our key idea is to decouple the VLM’s burden of predicting action outcomes (foresight) from evaluation (forethought). For foresight, we leverage a latent world model to imagine future latent states given diverse low-level action plans. For forethought, we align the VLM with these predicted latent states to reason about the consequences of actions in its native representation (natural language) and effectively filter proposed plans. We validate our framework across diverse robotic manipulation tasks, demonstrating its ability to bridge representational gaps and provide robust, generalizable policy steering.

[LG-72] Self-supervised Subgraph Neural Network With Deep Reinforcement Walk Exploration

Link: https://arxiv.org/abs/2502.01809
Authors: Jianming Huang, Hiroyuki Kasai
Categories: Machine Learning (cs.LG)
Comments: 20 pages, 5 figures

Abstract:Graph data, with its structurally variable nature, represents complex real-world phenomena like chemical compounds, protein structures, and social networks. Traditional Graph Neural Networks (GNNs) primarily utilize the message-passing mechanism, but their expressive power is limited and their predictions lack explainability. To address these limitations, researchers have focused on graph substructures. Subgraph neural networks (SGNNs) and GNN explainers have emerged as potential solutions, but each has its limitations. SGNNs compute graph representations based on bags of subgraphs to enhance expressive power. However, they often rely on predefined algorithm-based sampling strategies, which are inefficient. GNN explainers adopt data-driven approaches to generate important subgraphs that provide explanations. Nevertheless, their explanations are difficult to translate into practical improvements to GNNs. To overcome these issues, we propose a novel self-supervised framework that integrates SGNNs with the generation approach of GNN explainers, named the Reinforcement Walk Exploration SGNN (RWE-SGNN). Our approach features a sampling model trained in an explainer fashion, optimizing subgraphs to enhance model performance. To achieve a data-driven sampling approach, unlike traditional subgraph generation approaches, we propose a novel walk exploration process, which efficiently extracts important substructures, simplifying the embedding process and avoiding isomorphism problems. Moreover, we prove that our proposed walk exploration process has equivalent generation capability to the traditional subgraph generation process. Experimental results on various graph datasets validate the effectiveness of our proposed method, demonstrating significant improvements in performance and precision.

[LG-73] Policy Design for Two-sided Platforms with Participation Dynamics

Link: https://arxiv.org/abs/2502.01792
Authors: Haruka Kiyohara, Fan Yao, Sarah Dean
Categories: Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: preprint, under review

Abstract:In two-sided platforms (e.g., video streaming or e-commerce), viewers and providers engage in interactive dynamics, where an increased provider population results in higher viewer utility and the increase of viewer population results in higher provider utility. Despite the importance of such “population effects” on long-term platform health, recommendation policies do not generally take the participation dynamics into account. This paper thus studies the dynamics and policy design on two-sided platforms under the population effects for the first time. Our control- and game-theoretic findings warn against the use of myopic-greedy policy and shed light on the importance of provider-side considerations (i.e., effectively distributing exposure among provider groups) to improve social welfare via population growth. We also present a simple algorithm to optimize long-term objectives by considering the population effects, and demonstrate its effectiveness in synthetic and real-data experiments.

[LG-74] GNN-DT: Graph Neural Network Enhanced Decision Transformer for Efficient Optimization in Dynamic Environments

Link: https://arxiv.org/abs/2502.01778
Authors: Stavros Orfanoudakis, Nanda Kishor Panda, Peter Palensky, Pedro P. Vergara
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Reinforcement Learning (RL) methods used for solving real-world optimization problems often involve dynamic state-action spaces, larger scale, and sparse rewards, leading to significant challenges in convergence, scalability, and efficient exploration of the solution space. This study introduces GNN-DT, a novel Decision Transformer (DT) architecture that integrates Graph Neural Network (GNN) embedders with a novel residual connection between input and output tokens crucial for handling dynamic environments. By learning from previously collected trajectories, GNN-DT reduces dependence on accurate simulators and tackles the sparse rewards limitations of online RL algorithms. We evaluate GNN-DT on the complex electric vehicle (EV) charging optimization problem and prove that its performance is superior and requires significantly fewer training trajectories, thus improving sample efficiency compared to existing DT baselines. Furthermore, GNN-DT exhibits robust generalization to unseen environments and larger action spaces, addressing a critical gap in prior DT-based approaches.

[LG-75] On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning

Link: https://arxiv.org/abs/2502.01763
Authors: Thomas T. Zhang, Behrad Moniri, Ansh Nagwekar, Faraz Rahman, Anton Xue, Hamed Hassani, Nikolai Matni
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer’s weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise (“diagonal”) preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, linear representation learning and single-index learning, which are widely used to study how typical algorithms efficiently learn useful features to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and well-conditioned settings typically assumed in prior work. We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning.

[LG-76] Choose Your Model Size: Any Compression by a Single Gradient Descent

Link: https://arxiv.org/abs/2502.01717
Authors: Martin Genzel, Patrick Putzky, Pengfei Zhao, Sebastian Schulze, Mattes Mollenhauer, Robert Seidel, Stefan Dietzel, Thomas Wollmann
Categories: Machine Learning (cs.LG)

Abstract:The adoption of Foundation Models in resource-constrained environments remains challenging due to their large size and inference costs. A promising way to overcome these limitations is post-training compression, which aims to balance reduced model size against performance degradation. This work presents Any Compression via Iterative Pruning (ACIP), a novel algorithmic approach to determine a compression-performance trade-off from a single stochastic gradient descent run. To ensure parameter efficiency, we use an SVD-reparametrization of linear layers and iteratively prune their singular values with a sparsity-inducing penalty. The resulting pruning order gives rise to a global parameter ranking that allows us to materialize models of any target size. Importantly, the compressed models exhibit strong predictive downstream performance without the need for costly fine-tuning. We evaluate ACIP on a large selection of open-weight LLMs and tasks, and demonstrate state-of-the-art results compared to existing factorisation-based compression methods. We also show that ACIP seamlessly complements common quantization-based compression techniques.
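
To make the size arithmetic behind an SVD reparametrization concrete, here is a small sketch. It is not ACIP itself: it only counts parameters for a rank-truncated factorization of one linear layer, assuming the singular values are stored as a separate vector. All function names are hypothetical.

```python
def dense_params(rows: int, cols: int) -> int:
    """Parameter count of an ordinary rows x cols weight matrix."""
    return rows * cols


def factorized_params(rows: int, cols: int, rank: int) -> int:
    """Parameters of U (rows x rank), singular values (rank), V (cols x rank)."""
    return rank * (rows + cols + 1)


def max_rank_for_budget(rows: int, cols: int, budget: int) -> int:
    """Largest rank whose factorized form fits within a parameter budget."""
    return min(min(rows, cols), budget // (rows + cols + 1))


# A 4096 x 4096 layer compressed to at most half of its dense size.
full = dense_params(4096, 4096)
rank = max_rank_for_budget(4096, 4096, full // 2)
```

The count shows why pruning singular values only pays off below a break-even rank: a full-rank factorization of a square layer stores roughly twice as many parameters as the dense matrix.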

[LG-77] Auditing a Dutch Public Sector Risk Profiling Algorithm Using an Unsupervised Bias Detection Tool

Link: https://arxiv.org/abs/2502.01713
Authors: Floris Holstege, Mackenzie Jorgensen, Kirtan Padh, Jurriaan Parie, Joel Persson, Krsto Prorokovic, Lukas Snoek
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:Algorithms are increasingly used to automate or aid human decisions, yet recent research shows that these algorithms may exhibit bias across legally protected demographic groups. However, data on these groups may be unavailable to organizations or external auditors due to privacy legislation. This paper studies bias detection using an unsupervised clustering tool when data on demographic groups are unavailable. We collaborate with the Dutch Executive Agency for Education to audit an algorithm that was used to assign risk scores to college students at the national level in the Netherlands between 2012 and 2023. Our audit covers more than 250,000 students from the whole country. The unsupervised clustering tool highlights known disparities between students with a non-European migration background and students of Dutch origin. Our contributions are three-fold: (1) we assess bias in a real-world, large-scale and high-stakes decision-making process by a governmental organization; (2) we use simulation studies to highlight potential pitfalls of using the unsupervised clustering tool to detect true bias when demographic group data are unavailable and provide recommendations for valid inferences; (3) we provide the unsupervised clustering tool in an open-source library. Our work serves as a starting point for a deliberative assessment by human experts to evaluate potential discrimination in algorithmic-supported decision-making processes.

[LG-78] Adapter-Based Multi-Agent AVSR Extension for Pre-Trained ASR Models ICASSP2025

Link: https://arxiv.org/abs/2502.01709
Authors: Christopher Simic, Korbinian Riedhammer, Tobias Bocklet
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at ICASSP 2025

Abstract:We present an approach to Audio-Visual Speech Recognition that builds on a pre-trained Whisper model. To infuse visual information into this audio-only model, we extend it with an AV fusion module and LoRA adapters, one of the most up-to-date adapter approaches. One advantage of adapter-based approaches is that only a relatively small number of parameters are trained, while the basic model remains unchanged. Common AVSR approaches train single models to handle several noise categories and noise levels simultaneously. Taking advantage of the lightweight nature of adapter approaches, we train noise-scenario-specific adapter-sets, each covering individual noise-categories or a specific noise-level range. The most suitable adapter-set is selected by previously classifying the noise-scenario. This enables our models to achieve an optimum coverage across different noise-categories and noise-levels, while training only a minimum number of parameters. Compared to a full fine-tuning approach with SOTA performance, our models achieve almost comparable results over the majority of the tested noise-categories and noise-levels, with up to 88.5% fewer trainable parameters. Our approach can be extended by further noise-specific adapter-sets to cover additional noise scenarios. It is also possible to utilize the underlying powerful ASR model when no visual information is available, as it remains unchanged.
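
The claim that the base model remains unchanged follows from how a low-rank adapter is added on top of a frozen weight matrix. The sketch below shows the generic LoRA update on one linear layer in plain Python. It is not the paper's AV fusion module, and the names `lora_forward` and `matvec` are hypothetical.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r) train.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen pre-trained weights
A = [[0.5, -0.5]]              # rank-1 down-projection
B_zero = [[0.0], [0.0]]        # zero-initialized up-projection
B_trained = [[1.0], [2.0]]     # hypothetical values after training

unchanged = lora_forward(W, A, B_zero, [2.0, 1.0])
adapted = lora_forward(W, A, B_trained, [2.0, 1.0])
```

With `B` at zero the layer reproduces the frozen model exactly, and swapping in a different `(A, B)` pair switches behavior without touching `W`, which is what makes per-noise-scenario adapter sets cheap to store.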

[LG-79] Progressive Binarization with Semi-Structured Pruning for LLMs

Link: https://arxiv.org/abs/2502.01705
Authors: Xianglong Yan, Tianao Zhang, Zhiteng Li, Yulun Zhang
Categories: Machine Learning (cs.LG)

Abstract:Large language models (LLMs) have achieved remarkable success in natural language processing tasks, but their high computational and memory demands pose challenges for deployment on resource-constrained devices. Binarization, as an efficient compression method that reduces model weights to just 1 bit, significantly lowers both computational and memory requirements. Despite this, the binarized LLM still contains redundancy, which can be further compressed. Semi-structured pruning provides a promising approach to achieve this, which offers a better trade-off between model performance and hardware efficiency. However, simply combining binarization with semi-structured pruning can lead to a significant performance drop. To address this issue, we propose a Progressive Binarization with Semi-Structured Pruning (PBS$^2$P) method for LLM compression. We first propose a Stepwise semi-structured Pruning with Binarization Optimization (SPBO). Our optimization strategy significantly reduces the total error caused by pruning and binarization, even below that of the no-pruning scenario. Furthermore, we design a Coarse-to-Fine Search (CFS) method to select pruning elements more effectively. Extensive experiments demonstrate that PBS$^2$P achieves superior accuracy across various LLM families and evaluation metrics, noticeably outperforming state-of-the-art (SOTA) binary PTQ methods. The code and models will be available at this https URL.
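
The two building blocks named in the abstract can be shown separately on a toy weight vector. This is a minimal sketch of textbook 1-bit binarization (sign plus a mean-magnitude scale) and magnitude-based N:M pruning. It is not the PBS$^2$P, SPBO, or CFS procedure, and the function names are hypothetical.

```python
def binarize(weights):
    """Approximate weights by alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def prune_n_of_m(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m, zero the rest."""
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        order = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


alpha, signs = binarize([0.5, -1.5, 1.0, -1.0])
sparse = prune_n_of_m([0.1, -0.9, 0.5, 0.2])
```

Applying these two steps naively one after the other is the simple combination that the abstract reports as causing a significant accuracy drop, which is the motivation for the progressive scheme.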

[LG-80] Al-Khwarizmi: Discovering Physical Laws with Foundation Models

Link: https://arxiv.org/abs/2502.01702
Authors: Christopher E. Mower, Haitham Bou-Ammar
Categories: Machine Learning (cs.LG)

Abstract:Inferring physical laws from data is a central challenge in science and engineering, including but not limited to healthcare, physical sciences, biosciences, social sciences, sustainability, climate, and robotics. Deep networks offer high-accuracy results but lack interpretability, prompting interest in models built from simple components. The Sparse Identification of Nonlinear Dynamics (SINDy) method has become the go-to approach for building such modular and interpretable models. SINDy leverages sparse regression with L1 regularization to identify key terms from a library of candidate functions. However, SINDy’s choice of candidate library and optimization method requires significant technical expertise, limiting its widespread applicability. This work introduces Al-Khwarizmi, a novel agentic framework for physical law discovery from data, which integrates foundational models with SINDy. Leveraging LLMs, VLMs, and Retrieval-Augmented Generation (RAG), our approach automates physical law discovery, incorporating prior knowledge and iteratively refining candidate solutions via reflection. Al-Khwarizmi operates in two steps: it summarizes system observations (comprising textual descriptions, raw data, and plots), followed by a secondary step that generates candidate feature libraries and optimizer configurations to identify hidden physics laws correctly. Evaluating our algorithm on over 198 models, we demonstrate state-of-the-art performance compared to alternatives, reaching a 20 percent increase against the best-performing alternative.
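
The sparse-regression step that SINDy relies on can be sketched on a one-dimensional toy system. The example below fits dx/dt = -2x against a hypothetical candidate library [1, x, x^2] using L1-penalized coordinate descent. It is a stand-in for illustration only, not the optimizer used by SINDy implementations or by Al-Khwarizmi, and all names are assumptions.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; exact zero inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(features, targets, lam, sweeps=100):
    """Minimize 0.5 * ||targets - features @ coefs||^2 + lam * ||coefs||_1."""
    n_terms = len(features[0])
    coefs = [0.0] * n_terms
    for _ in range(sweeps):
        for j in range(n_terms):
            rho = 0.0      # correlation of term j with the partial residual
            norm_sq = 0.0  # squared norm of term j
            for row, y in zip(features, targets):
                partial = sum(row[k] * coefs[k] for k in range(n_terms) if k != j)
                rho += row[j] * (y - partial)
                norm_sq += row[j] ** 2
            coefs[j] = soft_threshold(rho, lam) / norm_sq
    return coefs


xs = [-2.0 + 0.5 * i for i in range(9)]       # sampled states
library = [[1.0, x, x * x] for x in xs]       # candidate terms: 1, x, x^2
dxdt = [-2.0 * x for x in xs]                 # noiseless derivatives of dx/dt = -2x
coefs = lasso_coordinate_descent(library, dxdt, lam=0.1)
```

The L1 penalty sets the constant and quadratic coefficients exactly to zero and leaves a slightly shrunk coefficient on x, so the recovered model reads dx/dt ≈ -2x. Choosing the library and the penalty is the expertise-heavy part that the abstract proposes to automate.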

[LG-81] Learning with Differentially Private (Sliced) Wasserstein Gradients

Link: https://arxiv.org/abs/2502.01701
Authors: Clément Lalanne (IMT, ANITI), Jean-Michel Loubes (IMT, ANITI), David Rodríguez-Vítores (UVa, IMUVA)
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:In this work, we introduce a novel framework for privately optimizing objectives that rely on Wasserstein distances between data-dependent empirical measures. Our main theoretical contribution is, based on an explicit formulation of the Wasserstein gradient in a fully discrete setting, a control on the sensitivity of this gradient to individual data points, allowing strong privacy guarantees at minimal utility cost. Building on these insights, we develop a deep learning approach that incorporates gradient and activations clipping, originally designed for DP training of problems with a finite-sum structure. We further demonstrate that privacy accounting methods extend to Wasserstein-based objectives, facilitating large-scale private training. Empirical results confirm that our framework effectively balances accuracy and privacy, offering a theoretically sound solution for privacy-preserving machine learning tasks relying on optimal transport distances such as Wasserstein distance or sliced-Wasserstein distance.
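
The gradient clipping mentioned in the abstract is the standard device for bounding per-sample sensitivity before adding noise. The sketch below shows the generic clip-then-add-Gaussian-noise step in the style of DP-SGD. It does not reproduce the paper's sensitivity analysis for Wasserstein gradients or any privacy accounting, and the function names and parameters are hypothetical.

```python
import math
import random


def clip_to_norm(vector, max_norm):
    """Rescale vector so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vector]


def private_mean_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        for i, v in enumerate(clip_to_norm(grad, max_norm)):
            total[i] += v
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_sample_grads) for v in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]
noiseless = private_mean_gradient(grads, 1.0, 0.0, random.Random(0))
noisy = private_mean_gradient(grads, 1.0, 1.0, random.Random(0))
```

Setting `noise_multiplier` to zero isolates the effect of clipping. Any actual privacy guarantee depends on the noise scale and on an accounting method that this sketch leaves out.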

[LG-82] EdgeMark: An Automation and Benchmarking System for Embedded Artificial Intelligence Tools

Link: https://arxiv.org/abs/2502.01700
Authors: Mohammad Amin Hasanpour, Mikkel Kirkegaard, Xenofon Fafoutis
Categories: Machine Learning (cs.LG)

Abstract:The integration of artificial intelligence (AI) into embedded devices, a paradigm known as embedded artificial intelligence (eAI) or tiny machine learning (TinyML), is transforming industries by enabling intelligent data processing at the edge. However, the many tools available in this domain leave researchers and developers wondering which one is best suited to their needs. This paper provides a review of existing eAI tools, highlighting their features, trade-offs, and limitations. Additionally, we introduce EdgeMark, an open-source automation system designed to streamline the workflow for deploying and benchmarking machine learning (ML) models on embedded platforms. EdgeMark simplifies model generation, optimization, conversion, and deployment while promoting modularity, reproducibility, and scalability. Experimental benchmarking results showcase the performance of widely used eAI tools, including TensorFlow Lite Micro (TFLM), Edge Impulse, Ekkono, and Renesas eAI Translator, across a wide range of models, revealing insights into their relative strengths and weaknesses. The findings provide guidance for researchers and developers in selecting the most suitable tools for specific application requirements, while EdgeMark lowers the barriers to adoption of eAI technologies.

[LG-83] BrainOOD: Out-of-distribution Generalizable Brain Network Analysis

Link: https://arxiv.org/abs/2502.01688
Authors: Jiaxing Xu, Yongqiang Chen, Xia Dong, Mengcheng Lan, Tiancheng Huang, Qingtian Bian, James Cheng, Yiping Ke
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Abstract:In neuroscience, identifying distinct patterns linked to neurological disorders, such as Alzheimer’s and Autism, is critical for early diagnosis and effective intervention. Graph Neural Networks (GNNs) have shown promise in analyzing brain networks, but there are two major challenges in using GNNs: (1) distribution shifts in multi-site brain network data, leading to poor Out-of-Distribution (OOD) generalization, and (2) limited interpretability in identifying key brain regions critical to neurological disorders. Existing graph OOD methods, while effective in other domains, struggle with the unique characteristics of brain networks. To bridge these gaps, we introduce BrainOOD, a novel framework tailored for brain networks that enhances GNNs’ OOD generalization and interpretability. The BrainOOD framework consists of a feature selector and a structure extractor, which incorporates various auxiliary losses including an improved Graph Information Bottleneck (GIB) objective to recover causal subgraphs. By aligning structure selection across brain networks and filtering noisy features, BrainOOD offers reliable interpretations of critical brain regions. Our approach outperforms 16 existing methods and improves generalization to OOD subjects by up to 8.5%. Case studies highlight the scientific validity of the patterns extracted, which align with the findings in known neuroscience literature. We also propose the first OOD brain network benchmark, which provides a foundation for future research in this field. Our code is available at this https URL.

[LG-84] DeepGate4: Efficient and Effective Representation Learning for Circuit Design at Scale

Link: https://arxiv.org/abs/2502.01681
Authors: Ziyang Zheng, Shan Huang, Jianyuan Zhong, Zhengyuan Shi, Guohao Dai, Ningyi Xu, Qiang Xu
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)

Abstract:Circuit representation learning has become pivotal in electronic design automation, enabling critical tasks such as testability analysis, logic reasoning, power estimation, and SAT solving. However, existing models face significant challenges in scaling to large circuits due to limitations like over-squashing in graph neural networks and the quadratic complexity of transformer-based models. To address these issues, we introduce DeepGate4, a scalable and efficient graph transformer specifically designed for large-scale circuits. DeepGate4 incorporates several key innovations: (1) an update strategy tailored for circuit graphs, which reduces memory complexity to sub-linear and is adaptable to any graph transformer; (2) a GAT-based sparse transformer with global and local structural encodings for AIGs; and (3) an inference acceleration CUDA kernel that fully exploits the unique sparsity patterns of AIGs. Our extensive experiments on the ITC99 and EPFL benchmarks show that DeepGate4 significantly surpasses state-of-the-art methods, achieving 15.5% and 31.1% performance improvements over the next-best models. Furthermore, the Fused-DeepGate4 variant reduces runtime by 35.1% and memory usage by 46.8%, making it highly efficient for large-scale circuit analysis. These results demonstrate the potential of DeepGate4 to handle complex EDA tasks while offering superior scalability and efficiency.

[LG-85] A Hardware-Efficient Photonic Tensor Core: Accelerating Deep Neural Networks with Structured Compression

Link: https://arxiv.org/abs/2502.01670
Authors: Shupeng Ning, Hanqing Zhu, Chenghao Feng, Jiaqi Gu, David Z. Pan, Ray T. Chen
Categories: Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:Recent advancements in artificial intelligence (AI) and deep neural networks (DNNs) have revolutionized numerous fields, enabling complex tasks by extracting intricate features from large datasets. However, the exponential growth in computational demands has outstripped the capabilities of traditional electrical hardware accelerators. Optical computing offers a promising alternative due to its inherent advantages of parallelism, high computational speed, and low power consumption. Yet, current photonic integrated circuits (PICs) designed for general matrix multiplication (GEMM) are constrained by large footprints, high costs of electro-optical (E-O) interfaces, and high control complexity, limiting their scalability. To overcome these challenges, we introduce a block-circulant photonic tensor core (CirPTC) for a structure-compressed optical neural network (StrC-ONN) architecture. By applying a structured compression strategy to weight matrices, StrC-ONN significantly reduces model parameters and hardware requirements while preserving the universal representability of networks and maintaining comparable expressivity. Additionally, we propose a hardware-aware training framework to compensate for on-chip nonidealities to improve model robustness and accuracy. We experimentally demonstrate image processing and classification tasks, achieving up to a 74.91% reduction in trainable parameters while maintaining competitive accuracies. Performance analysis expects a computational density of 5.84 tera operations per second (TOPS) per mm^2 and a power efficiency of 47.94 TOPS/W, marking a 6.87-times improvement achieved through the hardware-software co-design approach. By reducing both hardware requirements and control complexity across multiple dimensions, this work explores a new pathway to push the limits of optical computing in the pursuit of high efficiency and scalability.
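
The parameter savings from a block-circulant structure come from each circulant block being fully determined by a single vector. The sketch below illustrates this on a plain circulant matrix and counts parameters for a block-circulant layout. It says nothing about the photonic hardware or the CirPTC design, and the function names are hypothetical.

```python
def dense_circulant(first_column):
    """Expand a length-n vector into the n x n circulant matrix it defines."""
    n = len(first_column)
    return [[first_column[(i - j) % n] for j in range(n)] for i in range(n)]


def circulant_matvec(first_column, x):
    """Multiply by the circulant matrix without materializing it."""
    n = len(first_column)
    return [sum(first_column[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def block_circulant_params(rows, cols, block):
    """Parameters when a rows x cols matrix is tiled by block x block circulants."""
    return (rows // block) * (cols // block) * block


matrix = dense_circulant([1, 2, 3])
product = circulant_matvec([1, 2, 3], [0, 1, 0])
compressed = block_circulant_params(8, 8, 4)
```

An 8 x 8 layer tiled with 4 x 4 circulant blocks stores 16 values instead of 64. A circulant product is also a circular convolution, so it can be evaluated with Fourier-domain operations instead of a full matrix multiplication.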

[LG-86] Predicting concentration levels of air pollutants by transfer learning and recurrent neural network

Link: https://arxiv.org/abs/2502.01654
Authors: Iat Hang Fong, Tengyue Li, Simon Fong, Raymond K. Wong, Antonio J. Tallón-Ballesteros
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: Forecasting, environment monitoring, transfer learning, recurrent neural network, airborne particle matter

Abstract:Air pollution (AP) poses a great threat to human health, and people are paying more attention than ever to its prediction. Accurate prediction of AP helps people plan their outdoor activities and aids in protecting human health. In this paper, long short-term memory (LSTM) recurrent neural networks (RNNs) have been used to predict the future concentration of air pollutants (APS) in Macau. Additionally, meteorological data and data on the concentration of APS have been utilized. Moreover, in Macau, some air quality monitoring stations (AQMSs) have fewer observations overall, and some AQMSs have recorded fewer observations of certain types of APS. Therefore, transfer learning and pre-trained neural networks have been employed to help AQMSs with fewer observations build a neural network with high prediction accuracy. The experimental sample covers a period longer than 12 years and includes daily measurements of several APS as well as other, more classical meteorological values. Records from five stations (four AQMSs and one automatic weather station) were prepared for this period and then processed with computational intelligence techniques to build a knowledge-based prediction system. Experiments show that LSTM RNNs initialized with transfer learning achieve higher prediction accuracy and require shorter training time than randomly initialized recurrent neural networks.

[LG-87] Fundamental Challenges in Evaluating Text2SQL Solutions and Detecting Their Limitations

Link: https://arxiv.org/abs/2501.18197
Authors: Cedric Renggli, Ihab F. Ilyas, Theodoros Rekatsinas
Categories: Machine Learning (cs.LG); Databases (cs.DB)

Abstract:In this work, we dive into the fundamental challenges of evaluating Text2SQL solutions and highlight potential failure causes and the potential risks of relying on aggregate metrics in existing benchmarks. We identify two largely unaddressed limitations in current open benchmarks: (1) data quality issues in the evaluation data, mainly attributed to the lack of capturing the probabilistic nature of translating a natural language description into a structured query (e.g., NL ambiguity), and (2) the bias introduced by using different match functions as approximations for SQL equivalence. To put both limitations into context, we propose a unified taxonomy of all Text2SQL limitations that can lead to both prediction and evaluation errors. We then motivate the taxonomy by providing a survey of Text2SQL limitations using state-of-the-art Text2SQL solutions and benchmarks. We describe the causes of limitations with real-world examples and propose potential mitigation solutions for each category in the taxonomy. We conclude by highlighting the open challenges encountered when deploying such mitigation strategies or attempting to automatically apply the taxonomy.

[LG-88] Catoni Contextual Bandits are Robust to Heavy-tailed Rewards

Link: https://arxiv.org/abs/2502.02486
Authors: Chenlu Ye, Yujia Jin, Alekh Agarwal, Tong Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Typical contextual bandit algorithms assume that the rewards at each round lie in some fixed range $[0, R]$, and their regret scales polynomially with this reward range $R$. However, many practical scenarios naturally involve heavy-tailed rewards or rewards where the worst-case range can be substantially larger than the variance. In this paper, we develop an algorithmic approach building on Catoni’s estimator from robust statistics, and apply it to contextual bandits with general function approximation. When the variance of the reward at each round is known, we use a variance-weighted regression approach and establish a regret bound that depends only on the cumulative reward variance and logarithmically on the reward range $R$ as well as the number of rounds $T$. For the unknown-variance case, we further propose a careful peeling-based algorithm and remove the need for cumbersome variance estimation. With additional dependence on the fourth moment, our algorithm also enjoys a variance-based bound with logarithmic reward-range dependence. Moreover, we demonstrate the optimality of the leading-order term in our regret bound through a matching lower bound.
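
Catoni's estimator, which the abstract builds on, replaces the sample mean with the root of a sum of bounded-growth influence terms. The sketch below implements the scalar mean estimator with bisection. It is not the paper's bandit algorithm, and the scale parameter `alpha` is set by hand here as an assumption, whereas in the theory it is tuned from the variance and sample size.

```python
import math


def catoni_psi(x):
    """Catoni's influence function: sign(x) * log(1 + |x| + x^2 / 2)."""
    magnitude = math.log(1.0 + abs(x) + 0.5 * x * x)
    return magnitude if x >= 0 else -magnitude


def catoni_mean(samples, alpha, iterations=200):
    """Solve sum_i psi(alpha * (x_i - theta)) = 0 for theta by bisection."""
    lo, hi = min(samples), max(samples)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        score = sum(catoni_psi(alpha * (x - mid)) for x in samples)
        if score > 0:   # the score decreases in theta, so the root lies above mid
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rewards = [1.0] * 9 + [1000.0]            # one heavy-tailed outlier
sample_mean = sum(rewards) / len(rewards)
robust_mean = catoni_mean(rewards, alpha=1.0)
```

Because the influence function grows only logarithmically, a single extreme reward moves the estimate far less than it moves the sample mean. This is the property that allows regret bounds to depend on the variance instead of the worst-case range.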

[LG-89] SDE Matching: Scalable and Simulation-Free Training of Latent Stochastic Differential Equations

Link: https://arxiv.org/abs/2502.02472
Authors: Grigory Bartosh, Dmitry Vetrov, Christian A. Naesseth
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:The Latent Stochastic Differential Equation (SDE) is a powerful tool for time series and sequence modeling. However, training Latent SDEs typically relies on adjoint sensitivity methods, which depend on simulation and backpropagation through approximate SDE solutions, which limit scalability. In this work, we propose SDE Matching, a new simulation-free method for training Latent SDEs. Inspired by modern Score- and Flow Matching algorithms for learning generative dynamics, we extend these ideas to the domain of stochastic dynamics for time series and sequence modeling, eliminating the need for costly numerical simulations. Our results demonstrate that SDE Matching achieves performance comparable to adjoint sensitivity methods while drastically reducing computational complexity.

[LG-90] Distribution Transformers: Fast Approximate Bayesian Inference With On-The-Fly Prior Adaptation

Link: https://arxiv.org/abs/2502.02463
Authors: George Whittle, Juliusz Ziomek, Jacob Rawling, Michael A Osborne
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:While Bayesian inference provides a principled framework for reasoning under uncertainty, its widespread adoption is limited by the intractability of exact posterior computation, necessitating the use of approximate inference. However, existing methods are often computationally expensive, or demand costly retraining when priors change, limiting their utility, particularly in sequential inference problems such as real-time sensor fusion. To address these challenges, we introduce the Distribution Transformer, a novel architecture that can learn arbitrary distribution-to-distribution mappings. Our method can be trained to map a prior to the corresponding posterior, conditioned on some dataset, thus performing approximate Bayesian inference. Our novel architecture represents a prior distribution as a (universally-approximating) Gaussian Mixture Model (GMM), and transforms it into a GMM representation of the posterior. The components of the GMM attend to each other via self-attention, and to the datapoints via cross-attention. We demonstrate that Distribution Transformers both maintain flexibility to vary the prior and significantly reduce computation times (from minutes to milliseconds) while achieving log-likelihood performance on par with or superior to existing approximate inference methods across tasks such as sequential inference, quantum system parameter inference, and Gaussian Process predictive posterior inference with hyperpriors.

[LG-91] A Scalable Crawling Algorithm Utilizing Noisy Change-Indicating Signals

Link: https://arxiv.org/abs/2502.02430
Authors: Róbert Busa-Fekete, Julian Zimmert, András György, Linhai Qiu, Tzu-Wei Sung, Hao Shen, Hyomin Choi, Sharmila Subramaniam, Li Xiao
Categories: Machine Learning (stat.ML); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract: Web refresh crawling is the problem of keeping a cache of web pages fresh, that is, having the most recent copy available when a page is requested, given a limited bandwidth available to the crawler. Under the assumption that the change events and the request events for each web page follow independent Poisson processes, the optimal scheduling policy was derived by Azar et al. (2018). In this paper, we study an extension of this problem where side information indicating content changes is available, such as various types of web pings (for example, signals from sitemaps or content delivery networks). Incorporating such side information into the crawling policy is challenging, because (i) the signals can be noisy, with false positive events and with missing change events; and (ii) the crawler should achieve a fair performance over web pages regardless of the quality of the side information, which might differ from web page to web page. We propose a scalable crawling algorithm which (i) uses the noisy side information in an optimal way under mild assumptions; (ii) can be deployed without heavy centralized computation; (iii) is able to crawl web pages at a constant total rate without spikes in the total bandwidth usage over any time interval, and automatically adapts to the new optimal solution when the total bandwidth changes, without centralized computation. Experiments clearly demonstrate the versatility of our approach.

[LG-92] Circular Microalgae-Based Carbon Control for Net Zero

Link: https://arxiv.org/abs/2502.02382
Authors: Federico Zocco, Joan García, Wassim M. Haddad
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: To be submitted

Abstract: The alteration of the climate in various areas of the world is of increasing concern, since climate stability is a necessary condition for the survival of humans and every other living organism. The main cause of climate change is the greenhouse effect caused by the accumulation of carbon dioxide in the atmosphere. In this paper, we design a networked system underpinned by compartmental dynamical thermodynamics to circulate the atmospheric carbon dioxide. Specifically, in the carbon dioxide emitter compartment, we develop an initial-condition-dependent finite-time stabilizing controller that guarantees stability within a desired time by leveraging the system property of affinity in the control. Then, to compensate for carbon emissions, we show that a cultivation of microalgae with a volume 625 times larger than that of the carbon emitter is required. To increase the carbon uptake of the microalgae, we implement the nonaffine-in-the-control microalgae dynamical equations as an environment of a state-of-the-art library for reinforcement learning (RL), namely Stable-Baselines3, and then, through the library, we test the performance of eight RL algorithms for training a controller that maximizes the microalgae absorption of carbon through the light intensity. All eight controllers increased the carbon absorption of the cultivation during a training of 200,000 time steps with a maximum episode length of 200 time steps and with no termination conditions. This work is a first step towards approaching net zero as a classical and learning-based network control problem. The source code is publicly available.

[LG-93] FAB-PPI: Frequentist Assisted by Bayes Prediction-Powered Inference

Link: https://arxiv.org/abs/2502.02363
Authors: Stefano Cortinovis, François Caron
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 28 pages, 13 figures

Abstract:Prediction-powered inference (PPI) enables valid statistical inference by combining experimental data with machine learning predictions. When a sufficient number of high-quality predictions is available, PPI results in more accurate estimates and tighter confidence intervals than traditional methods. In this paper, we propose to inform the PPI framework with prior knowledge on the quality of the predictions. The resulting method, which we call frequentist, assisted by Bayes, PPI (FAB-PPI), improves over PPI when the observed prediction quality is likely under the prior, while maintaining its frequentist guarantees. Furthermore, when using heavy-tailed priors, FAB-PPI adaptively reverts to standard PPI in low prior probability regions. We demonstrate the benefits of FAB-PPI in real and synthetic examples.
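
For readers unfamiliar with the baseline that FAB-PPI builds on, the following is a minimal sketch of the standard prediction-powered point estimate of a population mean: the average model prediction on a large unlabeled pool, corrected by the average prediction error observed on a small labeled sample. The function name `ppi_mean` and the toy numbers are assumptions made for illustration. The prior-informed adjustment that distinguishes FAB-PPI is not described in enough detail in the abstract to reproduce and is not shown here.

```python
from statistics import mean


def ppi_mean(y_labeled, preds_labeled, preds_unlabeled):
    """Standard PPI point estimate of a mean (illustrative sketch only)."""
    # Plug-in term: average prediction over the large unlabeled pool.
    plug_in = mean(preds_unlabeled)
    # Rectifier: average prediction error measured on the labeled sample.
    rectifier = mean([y - p for y, p in zip(y_labeled, preds_labeled)])
    return plug_in + rectifier


# Toy data (hypothetical): the model over-predicts by 0.5 on labeled points.
y_labeled = [1.0, 2.0, 3.0]
preds_labeled = [1.5, 2.5, 3.5]
preds_unlabeled = [2.5, 3.5, 4.5, 5.5]

estimate = ppi_mean(y_labeled, preds_labeled, preds_unlabeled)
print(estimate)
```

In this toy case the rectifier removes the systematic bias of the predictions, so the estimate is lower than the raw average prediction on the unlabeled pool.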

[LG-94] Coreset-Based Task Selection for Sample-Efficient Meta-Reinforcement Learning

Link: https://arxiv.org/abs/2502.02332
Authors: Donglin Zhan, Leonardo F. Toso, James Anderson
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:

Abstract: We study task selection to enhance sample efficiency in model-agnostic meta-reinforcement learning (MAML-RL). Traditional meta-RL typically assumes that all available tasks are equally important, which can lead to task redundancy when they share significant similarities. To address this, we propose a coreset-based task selection approach that selects a weighted subset of tasks based on how diverse they are in gradient space, prioritizing the most informative and diverse tasks. Such task selection reduces the number of samples needed to find an $\epsilon$-close stationary solution by a factor of $O(1/\epsilon)$. Consequently, it guarantees a faster adaptation to unseen tasks while focusing training on the most relevant tasks. As a case study, we incorporate task selection into MAML-LQR (Toso et al., 2024b), and prove a sample complexity reduction proportional to $O(\log(1/\epsilon))$ when the task-specific costs also satisfy gradient dominance. Our theoretical guarantees underscore task selection as a key component for scalable and sample-efficient meta-RL. We numerically validate this trend across multiple RL benchmark problems, illustrating the benefits of task selection beyond the LQR baseline.

[LG-95] On the Impact of Performative Risk Minimization for Binary Random Variables

Link: https://arxiv.org/abs/2502.02331
Authors: Nikita Tsoy, Ivan Kirev, Negin Rahimiyazdi, Nikola Konstantinov
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Performativity, the phenomenon where outcomes are influenced by predictions, is particularly prevalent in social contexts where individuals strategically respond to a deployed model. In order to preserve the high accuracy of machine learning models under distribution shifts caused by performativity, Perdomo et al. (2020) introduced the concept of performative risk minimization (PRM). While this framework ensures model accuracy, it overlooks the impact of the PRM on the underlying distributions and the predictions of the model. In this paper, we initiate the analysis of the impact of PRM, by studying performativity for a sequential performative risk minimization problem with binary random variables and linear performative shifts. We formulate two natural measures of impact. In the case of full information, where the distribution dynamics are known, we derive explicit formulas for the PRM solution and our impact measures. In the case of partial information, we provide performative-aware statistical estimators, as well as simulations. Our analysis contrasts PRM to alternatives that do not model data shift and indicates that PRM can have amplified side effects compared to such methods.

[LG-96] Information-Theoretic Proofs for Diffusion Sampling

Link: https://arxiv.org/abs/2502.02305
Authors: Galen Reeves, Henry D. Pfister
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Abstract:This paper provides an elementary, self-contained analysis of diffusion-based sampling methods for generative modeling. In contrast to existing approaches that rely on continuous-time processes and then discretize, our treatment works directly with discrete-time stochastic processes and yields precise non-asymptotic convergence guarantees under broad assumptions. The key insight is to couple the sampling process of interest with an idealized comparison process that has an explicit Gaussian-convolution structure. We then leverage simple identities from information theory, including the I-MMSE relationship, to bound the discrepancy (in terms of the Kullback-Leibler divergence) between these two discrete-time processes. In particular, we show that, if the diffusion step sizes are chosen sufficiently small and one can approximate certain conditional mean estimators well, then the sampling distribution is provably close to the target distribution. Our results also provide a transparent view on how to accelerate convergence by introducing additional randomness in each step to match higher order moments in the comparison process.

[LG-97] Comparative Analysis of FPGA and GPU Performance for Machine Learning-Based Track Reconstruction at LHCb

Link: https://arxiv.org/abs/2502.02304
Authors: Fotis I. Giasemis, Vladimir Lončar, Bertrand Granado, Vladimir Vava Gligorov
Categories: High Energy Physics - Experiment (hep-ex); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Instrumentation and Detectors (physics.ins-det)
Comments:

Abstract: In high-energy physics, the increasing luminosity and detector granularity at the Large Hadron Collider are driving the need for more efficient data processing solutions. Machine Learning has emerged as a promising tool for reconstructing charged particle tracks, due to its potentially linear computational scaling with detector hits. The recent implementation of a graph neural network-based track reconstruction pipeline in the first level trigger of the LHCb experiment on GPUs serves as a platform for comparative studies between computational architectures in the context of high-energy physics. This paper presents a novel comparison of the throughput of ML model inference between FPGAs and GPUs, focusing on the first step of the track reconstruction pipeline: an implementation of a multilayer perceptron. Using HLS4ML for FPGA deployment, we benchmark its performance against the GPU implementation and demonstrate the potential of FPGAs for high-throughput, low-latency inference without requiring expertise in FPGA development and while consuming significantly less power.

[LG-98] Variance-Adjusted Cosine Distance as Similarity Metric

Link: https://arxiv.org/abs/2502.02233
Authors: Satyajeet Sahoo, Jhareswar Maiti
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 6 pages

Abstract: Cosine similarity is a popular measure of the similarity between two vectors in an inner product space. It is widely used in many data classification algorithms, such as K-Nearest Neighbors and clustering. This study demonstrates limitations in the application of cosine similarity. In particular, it shows that the traditional cosine similarity metric is valid only in Euclidean space, whereas the original data resides in a random variable space. When there is variance and correlation in the data, cosine distance is not a completely accurate measure of similarity. While new similarity and distance metrics have been developed to make up for the limitations of cosine similarity, these metrics are used as substitutes for cosine distance and do not modify cosine distance itself to overcome its limitations. We therefore propose a modified cosine similarity metric, in which cosine distance is adjusted by the variance-covariance of the data. Application of variance-adjusted cosine distance gives better similarity performance compared to traditional cosine distance. KNN modelling on the Wisconsin Breast Cancer Dataset is performed using both traditional and modified cosine similarity measures, and the results are compared. The modified formula shows 100% test accuracy on the data.
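
The abstract does not state the adjusted formula, so the snippet below is only one plausible reading, offered as a sketch: each coordinate is rescaled by its standard deviation before the usual cosine is computed. This corresponds to assuming a diagonal covariance matrix, which ignores the correlations the paper also mentions. The function name `variance_scaled_cosine` and the example variances are hypothetical and should not be taken as the authors' method.

```python
import math


def variance_scaled_cosine(x, y, variances):
    """Cosine similarity after per-coordinate rescaling (assumed diagonal covariance)."""
    # Divide each coordinate by its standard deviation.
    xs = [a / math.sqrt(v) for a, v in zip(x, variances)]
    ys = [b / math.sqrt(v) for b, v in zip(y, variances)]
    dot = sum(a * b for a, b in zip(xs, ys))
    norm = math.sqrt(sum(a * a for a in xs)) * math.sqrt(sum(b * b for b in ys))
    return dot / norm


x, y = [1.0, 0.0], [1.0, 1.0]
# With unit variances this reduces to the ordinary cosine similarity.
plain = variance_scaled_cosine(x, y, [1.0, 1.0])
# A high-variance second coordinate is down-weighted, so x and y look more alike.
adjusted = variance_scaled_cosine(x, y, [1.0, 4.0])
print(plain, adjusted)
```

The design point is simply that a coordinate with large spread contributes less to the angle once it is rescaled, which is the kind of effect a variance adjustment is meant to capture.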

[LG-99] SurvHive: a package to consistently access multiple survival-analysis packages

Link: https://arxiv.org/abs/2502.02223
Authors: Giovanni Birolo, Ivan Rossi, Flavio Sartori, Cesare Rollo, Tiziana Sanavia, Piero Fariselli
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 11 pages, 1 table

Abstract:Survival analysis, a foundational tool for modeling time-to-event data, has seen growing integration with machine learning (ML) approaches to handle the complexities of censored data and time-varying risks. Despite these advances, leveraging state-of-the-art survival models remains a challenge due to the fragmented nature of existing implementations, which lack standardized interfaces and require extensive preprocessing. We introduce SurvHive, a Python-based framework designed to unify survival analysis methods within a coherent and extensible interface modeled on scikit-learn. SurvHive integrates classical statistical models with cutting-edge deep learning approaches, including transformer-based architectures and parametric survival models. Using a consistent API, SurvHive simplifies model training, evaluation, and optimization, significantly reducing the barrier to entry for ML practitioners exploring survival analysis. The package includes enhanced support for hyper-parameter tuning, time-dependent risk evaluation metrics, and cross-validation strategies tailored to censored data. With its extensibility and focus on usability, SurvHive provides a bridge between survival analysis and the broader ML community, facilitating advancements in time-to-event modeling across domains. The SurvHive code and documentation are available freely at this https URL.

[LG-100] EFKAN: A KAN-Integrated Neural Operator For Efficient Magnetotelluric Forward Modeling

Link: https://arxiv.org/abs/2502.02195
Authors: Feng Wang, Hong Qiu, Yingying Huang, Xiaozhe Gu, Renfang Wang, Bo Yang
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments: Submitted to Computers & Geosciences

Abstract: Magnetotelluric (MT) forward modeling is fundamental for improving the accuracy and efficiency of MT inversion. Neural operators (NOs) have been effectively used for rapid MT forward modeling, demonstrating promising performance in solving the partial differential equations (PDEs) that govern MT forward modeling. In particular, they can obtain the electromagnetic field at arbitrary locations and frequencies. In these NOs, the projection layers have been dominated by multi-layer perceptrons (MLPs), which may reduce solution accuracy because they inherit the disadvantages of MLPs, such as lack of interpretability and overfitting. Therefore, to improve the accuracy of MT forward modeling with NOs and explore potential alternatives to MLPs, we propose a novel neural operator by extending the Fourier neural operator (FNO) with a Kolmogorov-Arnold network (EFKAN). Within the EFKAN framework, the FNO serves as the branch network to calculate the apparent resistivity and phase from the resistivity model in the frequency domain. Meanwhile, the KAN acts as the trunk network to project the resistivity and phase, determined by the FNO, to the desired locations and frequencies. Experimental results demonstrate that the proposed method not only achieves higher accuracy in obtaining apparent resistivity and phase than the NO equipped with MLPs at the desired frequencies and locations, but also outperforms traditional numerical methods in terms of computational speed.

[LG-101] An Information-Theoretic Analysis of Thompson Sampling with Infinite Action Spaces ICASSP

Link: https://arxiv.org/abs/2502.02140
Authors: Amaury Gouverneur, Borja Rodriguez Gálvez, Tobias Oechtering, Mikael Skoglund
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 5 pages, accepted to ICASSP

Abstract:This paper studies the Bayesian regret of the Thompson Sampling algorithm for bandit problems, building on the information-theoretic framework introduced by Russo and Van Roy (2015). Specifically, it extends the rate-distortion analysis of Dong and Van Roy (2018), which provides near-optimal bounds for linear bandits. A limitation of these results is the assumption of a finite action space. We address this by extending the analysis to settings with infinite and continuous action spaces. Additionally, we specialize our results to bandit problems with expected rewards that are Lipschitz continuous with respect to the action space, deriving a regret bound that explicitly accounts for the complexity of the action space.
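
As background for the algorithm being analyzed, here is a minimal sketch of classical Thompson Sampling on a finite-armed Bernoulli bandit with Beta(1, 1) priors. It covers only the finite-action setting that the paper takes as its starting point, not the infinite or continuous action spaces the paper addresses. The function name `thompson_sampling`, the arm means, and the seed are assumptions chosen for illustration.

```python
import random


def thompson_sampling(true_means, rounds, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1] * k  # Beta posterior parameter: 1 + observed successes
    beta = [1] * k   # Beta posterior parameter: 1 + observed failures
    pulls = [0] * k
    for _ in range(rounds):
        # Draw one plausible mean per arm from its current posterior.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        # Act greedily with respect to the sampled means.
        arm = max(range(k), key=lambda a: samples[a])
        # Observe a Bernoulli reward from the (hidden) true mean.
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_sampling([0.2, 0.8], rounds=2000)
print(pulls)
```

Because each arm is chosen with the posterior probability that it is optimal, exploration fades naturally as the posteriors concentrate, and the better arm ends up receiving most of the pulls.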

[LG-102] The Ball-Proximal (=“Broximal”) Point Method: a New Algorithm, Convergence Theory, and Applications

Link: https://arxiv.org/abs/2502.02002
Authors: Kaja Gruntkowska, Hanmin Li, Aadi Rane, Peter Richtárik
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 44 pages, 3 figures

Abstract: Non-smooth and non-convex global optimization poses significant challenges across various applications, where standard gradient-based methods often struggle. We propose the Ball-Proximal Point Method, Broximal Point Method, or Ball Point Method (BPM) for short, a novel algorithmic framework inspired by the classical Proximal Point Method (PPM) (Rockafellar, 1976), which, as we show, sheds new light on several foundational optimization paradigms and phenomena, including non-convex and non-smooth optimization, acceleration, smoothing, adaptive stepsize selection, and trust-region methods. At the core of BPM lies the ball-proximal (“broximal”) operator, which arises from the classical proximal operator by replacing the quadratic distance penalty by a ball constraint. Surprisingly, and in sharp contrast with the sublinear rate of PPM in the nonsmooth convex regime, we prove that BPM converges linearly and in a finite number of steps in the same regime. Furthermore, by introducing the concept of ball-convexity, we prove that BPM retains the same global convergence guarantees under weaker assumptions, making it a powerful tool for a broader class of potentially non-convex optimization problems. Just as PPM plays the role of a conceptual method inspiring the development of practically efficient algorithms and algorithmic elements, e.g., gradient descent, adaptive step sizes, acceleration (Ahn & Sra, 2020), and “W” in AdamW (Zhuang et al., 2022), we believe that BPM should be understood in the same manner: as a blueprint and inspiration for further development.
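
To make the ball-constraint idea concrete, the sketch below applies it to the one-dimensional nonsmooth convex function f(z) = |z|: each step minimizes f over the interval of a given radius around the current point. The function names `brox_abs` and `bpm`, the `radius` parameter, and the choice of test function are assumptions made for this example and are not taken from the paper. On this toy problem the iteration reaches the exact minimizer after finitely many steps, in line with the finite-step behavior the abstract claims for the nonsmooth convex regime.

```python
def brox_abs(x, radius):
    """Ball-proximal step for f(z) = |z|: minimize |z| subject to |z - x| <= radius."""
    lo, hi = x - radius, x + radius
    # Minimizing |z| over [lo, hi] amounts to projecting 0 onto that interval.
    return min(max(0.0, lo), hi)


def bpm(x0, radius, max_steps=100):
    """Iterate the ball-proximal step until the point stops moving."""
    x, steps = x0, 0
    while steps < max_steps:
        nxt = brox_abs(x, radius)
        if nxt == x:  # fixed point reached: x already minimizes f on its ball
            break
        x = nxt
        steps += 1
    return x, steps


print(bpm(3.5, 1.0))  # iterates 3.5 -> 2.5 -> 1.5 -> 0.5 -> 0.0
```

Unlike a quadratic-penalty proximal step on |z|, which shrinks the iterate by a fixed amount controlled by the penalty weight, the ball-constrained step always moves as far as the radius allows and lands exactly on the minimizer once it is within reach.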

[LG-103] FinRLlama: A Solution to LLM-Engineered Signals Challenge at FinRL Contest 2024

Link: https://arxiv.org/abs/2502.01992
Authors: Arnav Grover
Categories: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG)
Comments: Competition Track FinRL, ICAIF 2024

Abstract: In response to Task II of the FinRL Challenge at ACM ICAIF 2024, this study proposes a novel prompt framework for fine-tuning large language models (LLMs) with Reinforcement Learning from Market Feedback (RLMF). Our framework incorporates market-specific features and short-term price dynamics to generate more precise trading signals. Traditional LLMs, while competent in sentiment analysis, lack contextual alignment for financial market applications. To bridge this gap, we fine-tune the LLaMA-3.2-3B-Instruct model using a custom RLMF prompt design that integrates historical market data and reward-based feedback. Our evaluation shows that this RLMF-tuned framework outperforms baseline methods in signal consistency and achieves tighter trading outcomes; it was awarded winner of Task II. You can find the code for this project on GitHub.

[LG-104] ReMiDi: Reconstruction of Microstructure Using a Differentiable Diffusion MRI Simulator

Link: https://arxiv.org/abs/2502.01988
Authors: Prathamesh Pradeep Khole, Zahra Kais Petiwala, Shri Prathaa Magesh, Ehsan Mirafzali, Utkarsh Gupta, Jing-Rebecca Li, Andrada Ianus, Razvan Marinescu
Categories: Image and Video Processing (eess.IV); Graphics (cs.GR); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments: 15 pages, ~13 figures, 2 algorithms, 1 table

Abstract: We propose ReMiDi, a novel method for inferring neuronal microstructure as arbitrary 3D meshes using a differentiable diffusion Magnetic Resonance Imaging (dMRI) simulator. We first implemented in PyTorch a differentiable dMRI simulator that simulates the forward diffusion process using a finite-element method on an input 3D microstructure mesh. To achieve significantly faster simulations, we solve the differential equation semi-analytically using a matrix formalism approach. Given a reference dMRI signal $S_{\text{ref}}$, we use the differentiable simulator to iteratively update the input mesh such that it matches $S_{\text{ref}}$ using gradient-based learning. Since directly optimizing the 3D coordinates of the vertices is challenging, particularly due to ill-posedness of the inverse problem, we instead optimize a lower-dimensional latent space representation of the mesh. The mesh is first encoded into spectral coefficients, which are further encoded into a latent $\mathbf{z}$ using an auto-encoder, and are then decoded back into the true mesh. We present an end-to-end differentiable pipeline that simulates signals that can be tuned to match a reference signal by iteratively updating the latent representation $\mathbf{z}$. We demonstrate the ability to reconstruct microstructures of arbitrary shapes represented by finite-element meshes, with a focus on axonal geometries found in the brain white matter, including bending, fanning and beading fibers. Our source code will be made available online.

[LG-105] Local minima of the empirical risk in high dimension: General theorems and convex examples

Link: https://arxiv.org/abs/2502.01953
Authors: Kiana Asgari, Andrea Montanari, Basil Saeed
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 95 pages, 5 figures

Abstract: We consider a general model for high-dimensional empirical risk minimization whereby the data $\mathbf{x}_i$ are $d$-dimensional isotropic Gaussian vectors, the model is parametrized by $\mathbf{\Theta}\in\mathbb{R}^{d\times k}$, and the loss depends on the data via the projection $\mathbf{\Theta}^{\mathsf{T}}\mathbf{x}_i$. This setting covers as special cases classical statistics methods (e.g. multinomial regression and other generalized linear models), but also two-layer fully connected neural networks with $k$ hidden neurons. We use the Kac-Rice formula from Gaussian process theory to derive a bound on the expected number of local minima of this empirical risk, under the proportional asymptotics in which $n,d\to\infty$, with $n\asymp d$. Via Markov’s inequality, this bound allows us to determine the positions of these minimizers (with exponential deviation bounds) and hence derive sharp asymptotics on the estimation and prediction error. In this paper, we apply our characterization to convex losses, where high-dimensional asymptotics were not (in general) rigorously established for $k\ge 2$. We show that our approach is tight and allows us to prove previously conjectured results. In addition, we characterize the spectrum of the Hessian at the minimizer. A companion paper applies our general result to non-convex examples.

[LG-106] Poisson Hierarchical Indian Buffet Processes for Within and Across Group Sharing of Latent Features-With Indications for Microbiome Species Sampling Models

Link: https://arxiv.org/abs/2502.01919
Authors: Lancelot F. James, Juho Lee, Abhinav Pandey
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
Comments: experiments to be added

Abstract:In this work, we present a comprehensive Bayesian posterior analysis of what we term Poisson Hierarchical Indian Buffet Processes, designed for complex random sparse count species sampling models that allow for the sharing of information across and within groups. This analysis covers a potentially infinite number of species and unknown parameters, which, within a Bayesian machine learning context, we are able to learn from as more information is sampled. To achieve our refined results, we employ a range of methodologies drawn from Bayesian latent feature models, random occupancy models, and excursion theory. Despite this complexity, our goal is to make our findings accessible to practitioners, including those who may not be familiar with these areas. To facilitate understanding, we adopt a pseudo-expository style that emphasizes clarity and practical utility. We aim to express our findings in a language that resonates with experts in microbiome and ecological studies, addressing gaps in modeling capabilities while acknowledging that we are not experts ourselves in these fields. This approach encourages the use of our models as basic components of more sophisticated frameworks employed by domain experts, embodying the spirit of the seminal work on the Dirichlet Process. Ultimately, our refined posterior analysis not only yields tractable computational procedures but also enables practical statistical implementation and provides a clear mapping to relevant quantities in microbiome analysis.

[LG-107] Graph Canonical Correlation Analysis

Link: https://arxiv.org/abs/2502.01780
Authors: Hongju Park, Shuyang Bai, Zhenyao Ye, Hwiyoung Lee, Tianzhou Ma, Shuo Chen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 40 pages, 3 figures

Abstract:Canonical correlation analysis (CCA) is a widely used technique for estimating associations between two sets of multi-dimensional variables. Recent advancements in CCA methods have expanded their application to decipher the interactions of multiomics datasets, imaging-omics datasets, and more. However, conventional CCA methods are limited in their ability to incorporate structured patterns in the cross-correlation matrix, potentially leading to suboptimal estimations. To address this limitation, we propose the graph Canonical Correlation Analysis (gCCA) approach, which calculates canonical correlations based on the graph structure of the cross-correlation matrix between the two sets of variables. We develop computationally efficient algorithms for gCCA, and provide theoretical results for finite sample analysis of best subset selection and canonical correlation estimation by introducing concentration inequalities and stopping time rule based on martingale theories. Extensive simulations demonstrate that gCCA outperforms competing CCA methods. Additionally, we apply gCCA to a multiomics dataset of DNA methylation and RNA-seq transcriptomics, identifying both positively and negatively regulated gene expression pathways by DNA methylation pathways.

[LG-108] Adaptive Observation Cost Control for Variational Quantum Eigensolvers ICML2024

Link: https://arxiv.org/abs/2502.01704
Authors: Christopher J. Anders, Kim A. Nicoli, Bingting Wu, Naima Elosegui, Samuele Pedrielli, Lena Funcke, Karl Jansen, Stefan Kühn, Shinichi Nakajima
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures, 41st International Conference on Machine Learning (ICML 2024)

Abstract:The objective to be minimized in the variational quantum eigensolver (VQE) has a restricted form, which allows a specialized sequential minimal optimization (SMO) that requires only a few observations in each iteration. However, the SMO iteration is still costly due to the observation noise – one observation at a point typically requires averaging over hundreds to thousands of repeated quantum measurement shots for achieving a reasonable noise level. In this paper, we propose an adaptive cost control method, named subspace in confident region (SubsCoRe), for SMO. SubsCoRe uses the Gaussian process (GP) surrogate, and requires it to have low uncertainty over the subspace being updated, so that optimization in each iteration is performed with guaranteed accuracy. The adaptive cost control is performed by first setting the required accuracy according to the progress of the optimization, and then choosing the minimum number of measurement shots and their distribution such that the required accuracy is satisfied. We demonstrate that SubsCoRe significantly improves the efficiency of SMO, and outperforms the state-of-the-art methods.

[LG-109] Privacy-Preserving Edge Speech Understanding with Tiny Foundation Models

Link: https://arxiv.org/abs/2502.01649
Authors: Afsara Benazir, Felix Xiaozhu Lin
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Abstract: Robust speech recognition systems rely on cloud service providers for inference. Such systems must ensure that an untrustworthy provider cannot deduce the sensitive content in speech. Speech content can be sanitized, provided that doing so does not compromise transcription accuracy. Recognizing the underutilized capabilities of tiny speech foundation models (FMs), we propose, for the first time, a novel use: enhancing speech privacy on resource-constrained devices. We introduce XYZ, an edge/cloud privacy-preserving speech inference engine that can filter sensitive entities without compromising transcript accuracy. We use a timestamp-based on-device masking approach that relies on a token-to-entity prediction model to filter sensitive entities. Our choice of mask strategically conceals parts of the input and hides sensitive data. The masked input is sent to a trusted cloud service or to a local hub to generate the masked output. The effectiveness of XYZ hinges on how well the entity time segments are masked. Our recovery is a confidence-score-based approach that chooses the best prediction between the cloud and on-device models. We implement XYZ on a 64-bit Raspberry Pi 4B. Experiments show that our solution leads to robust speech recognition without forsaking privacy. With 100 MB of memory, XYZ achieves state-of-the-art (SOTA) speech transcription performance while filtering about 83% of private entities directly on-device. XYZ is 16x smaller in memory and 17x more compute efficient than prior privacy-preserving speech frameworks, and achieves a relative reduction in word error rate (WER) of 38.8-77.5% compared to existing offline transcription services.

Information Retrieval

[IR-0] Combinatorial Optimization Perspective based Framework for Multi-behavior Recommendation KDD2025

Link: https://arxiv.org/abs/2502.02232
Authors: Chenhao Zhai, Chang Meng, Yu Yang, Kexin Zhang, Xuhao Zhao, Xiu Li
Categories: Information Retrieval (cs.IR)
Comments: Accepted by KDD 2025 Research Track

Abstract:In real-world recommendation scenarios, users engage with items through various types of behaviors. Leveraging diversified user behavior information for learning can enhance the recommendation of target behaviors (e.g., buy), as demonstrated by recent multi-behavior methods. The mainstream multi-behavior recommendation framework consists of two steps: fusion and prediction. Recent approaches utilize graph neural networks for multi-behavior fusion and employ multi-task learning paradigms for joint optimization in the prediction step, achieving significant success. However, these methods have limited perspectives on multi-behavior fusion, which leads to inaccurate capture of user behavior patterns in the fusion step. Moreover, when using multi-task learning for prediction, the relationship between the target task and auxiliary tasks is not sufficiently coordinated, resulting in negative information transfer. To address these problems, we propose a novel multi-behavior recommendation framework based on the combinatorial optimization perspective, named COPF. Specifically, we treat multi-behavior fusion as a combinatorial optimization problem, imposing different constraints at various stages of each behavior to restrict the solution space, thus significantly enhancing fusion efficiency (COGCN). In the prediction step, we improve both forward and backward propagation during the generation and aggregation of multiple experts to mitigate negative transfer caused by differences in both feature and label distributions (DFME). Comprehensive experiments on three real-world datasets indicate the superiority of COPF. Further analyses also validate the effectiveness of the COGCN and DFME modules. Our code is available at this https URL.

[IR-1] Large Language Models for Recommendation with Deliberative User Preference Alignment

Link: https://arxiv.org/abs/2502.02061
Authors: Yi Fang, Wenjie Wang, Yang Zhang, Fengbin Zhu, Qifan Wang, Fuli Feng, Xiangnan He
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:While recent advancements in aligning Large Language Models (LLMs) with recommendation tasks have shown great potential and promising performance overall, these aligned recommendation LLMs still face challenges in complex scenarios. This is primarily due to the current alignment approach focusing on optimizing LLMs to generate user feedback directly, without incorporating deliberation. To overcome this limitation and develop more reliable LLMs for recommendations, we propose a new Deliberative Recommendation task, which incorporates explicit reasoning about user preferences as an additional alignment goal. We then introduce the Deliberative User Preference Alignment framework, designed to enhance reasoning capabilities by utilizing verbalized user feedback in a step-wise manner to tackle this task. The framework employs collaborative step-wise experts and tailored training strategies for each expert. Experimental results across three real-world datasets demonstrate the rationality of the deliberative task formulation and the superior performance of the proposed framework in improving both prediction accuracy and reasoning quality.

Attachment Download

Click to download today's full paper list
