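arXiv Daily Papers | 2025-02-07

This post lists the latest papers retrieved from arXiv.org on 2025-02-07. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is pulled from arXiv.org and refreshed automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.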

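Quick read: This paper aims to close the performance gap between omni-modal models and specialized single-modality models. The key is Ola's progressive modality alignment strategy, which extends the modalities supported by the language model step by step: training starts with image and text, then adds speech and video data, yielding competitive performance across image, video, and audio understanding. This progressive pipeline keeps the cross-modal alignment data relatively small and makes it easy and inexpensive to build an omni-modal model from an existing vision-language model. The authors also design a sentence-wise decoding scheme to support streaming speech generation and a more advanced interactive experience.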

Link: https://arxiv.org/abs/2502.04328
Authors: Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
Affiliations: Tsinghua University; Tencent Hunyuan Research; S-Lab, NTU
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Comments:

Abstract:Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, there is still a notable lag behind specialized single-modality models in performance. In this paper, we present Ola, an Omni-modal language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts. The core design of Ola lies in its progressive modality alignment strategy that extends the supporting modality of the language model progressively. Our training pipeline begins with the most distinct modalities: image and text, then gradually expands the skill sets of the model using speech data that connects language and audio knowledge, and video data that connects all modalities. The progressive learning pipeline also enables us to maintain a relatively small size of the cross-modal alignment data, making developing omni-modal from existing vision-language models easy and less costly. Moreover, to unlock an advanced interactive experience like GPT-4o, we further design a sentence-wise decoding solution for streaming speech generation. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at this https URL.

[NLP-1] Can Grammarly and ChatGPT accelerate language change? AI-powered technologies and their impact on the English language: wordiness vs. conciseness

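Quick read: This paper examines how Grammarly and ChatGPT affect English with respect to wordiness versus conciseness. Through a case study of the purpose subordinator "in order to", it analyzes how both tools recommend shorter grammatical structures in place of longer, more elaborate ones, even for sentences that were written by native speakers and are already correct. The paper argues that such technologies not only mirror language change but may also facilitate or accelerate it.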

Link: https://arxiv.org/abs/2502.04324
Authors: Karolina Rudnicka
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 10 pages, article

Abstract:The proliferation of NLP-powered language technologies, AI-based natural language generation models, and English as a mainstream means of communication among both native and non-native speakers make the output of AI-powered tools especially intriguing to linguists. This paper investigates how Grammarly and ChatGPT affect the English language regarding wordiness vs. conciseness. A case study focusing on the purpose subordinator in order to is presented to illustrate the way in which Grammarly and ChatGPT recommend shorter grammatical structures instead of longer and more elaborate ones. Although the analysed sentences were produced by native speakers, are perfectly correct, and were extracted from a language corpus of contemporary English, both Grammarly and ChatGPT suggest more conciseness and less verbosity, even for relatively short sentences. The present article argues that technologies such as Grammarly not only mirror language change but also have the potential to facilitate or accelerate it.

[NLP-2] Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions

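Quick read: This paper addresses safety vulnerabilities that large language models (LLMs) exhibit in common, simple interaction patterns, especially those that average users could exploit to carry out harmful actions. It proposes HarmScore, a metric that measures how effectively a response enables harmful actions, and Speak Easy, a multi-step, multilingual attack framework. Incorporating Speak Easy into direct-request and jailbreak baselines raises Attack Success Rate by an average absolute 0.319 and HarmScore by an average absolute 0.426, showing how easily malicious users can exploit common interaction patterns for harmful intentions.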

Link: https://arxiv.org/abs/2502.04322
Authors: Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, Marzyeh Ghassemi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Despite extensive safety alignment efforts, large language models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior. While existing studies predominantly focus on attack methods that require technical expertise, two critical questions remain underexplored: (1) Are jailbroken responses truly useful in enabling average users to carry out harmful actions? (2) Do safety vulnerabilities exist in more common, simple human-LLM interactions? In this paper, we demonstrate that LLM responses most effectively facilitate harmful actions when they are both actionable and informative–two attributes easily elicited in multi-step, multilingual interactions. Using this insight, we propose HarmScore, a jailbreak metric that measures how effectively an LLM response enables harmful actions, and Speak Easy, a simple multi-step, multilingual attack framework. Notably, by incorporating Speak Easy into direct request and jailbreak baselines, we see an average absolute increase of 0.319 in Attack Success Rate and 0.426 in HarmScore in both open-source and proprietary LLMs across four safety benchmarks. Our work reveals a critical yet often overlooked vulnerability: Malicious users can easily exploit common interaction patterns for harmful intentions.

[NLP-3] Variation of sentence length across time and genre

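Quick read: This paper has three goals: i) to present practical aspects of using the full-text version of the Corpus of Historical American English (COHA) to investigate a linguistic trend of change; ii) to test the widely held assumption that sentence length in written English has steadily decreased over the past few centuries; iii) to point to a possible link between changes in sentence length and changes in English syntactic usage. The empirical proof of concept for iii) is the declining frequency of the non-finite purpose subordinator "in order to"; sentence length, genre, and the likelihood of "in order to" occurring are shown to be interrelated.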

Link: https://arxiv.org/abs/2502.04321
Authors: Karolina Rudnicka
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 20 pages

Abstract:The goal of this paper is threefold: i) to present some practical aspects of using full-text version of Corpus of Historical American English (COHA), the largest diachronic multi-genre corpus of the English language, in the investigation of a linguistic trend of change; ii) to test a widely held assumption that sentence length in written English has been steadily decreasing over the past few centuries; iii) to point to a possible link between the changes in sentence length and changes in the English syntactic usage. The empirical proof of concept for iii) is provided by the decline in the frequency of the non-finite purpose subordinator in order to. Sentence length, genre and the likelihood of occurrence of in order to are shown to be interrelated.
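
To make the measured quantity concrete, here is a minimal sketch of computing mean sentence length per decade and genre. It is not the paper's pipeline: the regex-based sentence splitter, the `(decade, genre, text)` tuple layout, and the toy corpus are all assumptions made for illustration, and COHA's actual file format is not modeled.

```python
import re
from collections import defaultdict
from statistics import mean


def sentence_lengths(text):
    """Split text on ., !, ? (a crude heuristic) and return words per sentence."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def mean_length_by_group(documents):
    """documents: iterable of (decade, genre, text) tuples (hypothetical layout)."""
    groups = defaultdict(list)
    for decade, genre, text in documents:
        groups[(decade, genre)].extend(sentence_lengths(text))
    return {key: mean(lengths) for key, lengths in groups.items()}


# Toy, made-up corpus; real corpus data would replace this.
toy_corpus = [
    (1850, "fiction", "He went to town in order to buy bread. It was a long and weary road."),
    (1990, "fiction", "He went to buy bread. The road was long."),
]
result = mean_length_by_group(toy_corpus)
```

Grouping by `(decade, genre)` rather than by decade alone reflects the abstract's point that sentence length and genre are interrelated; a real study would need a proper sentence tokenizer.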

[NLP-4] ChamaleonLLM: Batch-Aware Dynamic Low-Rank Adaptation via Inference-Time Clusters

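Quick read: This paper addresses the inability of large language models (LLMs), which are deployed with fixed weights, to adapt to the variability of real-world data at inference time. The key is the ChamaleonLLM framework, which adapts LLMs at inference time through batch-aware clustering and on-the-fly generation of low-rank updates. Unlike traditional fine-tuning approaches such as LoRA, or methods that rely on a fixed set of pre-learned masks, ChamaleonLLM groups similar inputs and uses a hyper-network to compute context-aware low-rank updates, which the authors report outperforms conventional LoRA.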

Link: https://arxiv.org/abs/2502.04315
Authors: Kamer Ali Yuksel, Hassan Sawaf
Affiliations: aiXplain Inc., San Jose, CA, USA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent advances in large language models (LLMs) have shown remarkable performance across diverse tasks. However, these models are typically deployed with fixed weights, which limits their ability to adapt dynamically to the variability inherent in real-world data during inference. This paper introduces ChamaleonLLM, a novel framework that enables inference-time adaptation of LLMs by leveraging batch-aware clustering and on-the-fly generation of low-rank updates. Unlike traditional fine-tuning approaches such as Low-Rank Adaptation (LoRA) or methods that rely on a fixed set of pre-learned uniforms (changeable masks), our method dynamically generates adaptive modifications to the decoder weights based on the aggregated statistics of clustered batches. By intelligently grouping similar inputs and computing context-aware low-rank updates via a hyper-network, ChamaleonLLM achieves significant performance gains, outperforming conventional LoRA methods while eliminating the overhead of maintaining multiple expert models. Our experiments highlight the potential of our approach to serve as a versatile and highly adaptive solution for language model inference. ChamaleonLLM is open-sourced to ensure the reproducibility of our experiments: this https URL

[NLP-5] BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation

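Quick read: This paper presents BOUQuET, a multicentric, multi-register/domain dataset and benchmark that targets the limited language diversity and domain coverage of existing machine translation (MT) datasets. The dataset is handcrafted in non-English languages first; each source language is among the 23 languages commonly used by half of the world's population, so these languages can serve as pivot languages for more accurate translation. BOUQuET is designed to avoid contamination and to be multicentric so that multilingual language features are represented, and it goes beyond the sentence level by organizing text into paragraphs of various lengths. Compared with related MT datasets, it covers a broader range of domains while simplifying the translation task for non-experts. The paper also launches an open call for translation contributions to extend BOUQuET into a multi-way parallel corpus for any written language.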

Link: https://arxiv.org/abs/2502.04314
Authors: The Omnilingual MT Team, Pierre Andrews, Mikel Artetxe, Mariano Coria Meglioli, Marta R. Costa-jussà, Joe Chuang, David Dale, Cynthia Gao, Jean Maillard, Alex Mourachko, Christophe Ropers, Safiyyah Saleem, Eduardo Sánchez, Ioannis Tsiamas, Arina Turkatenko, Albert Ventayol-Boada, Shireen Yates
Affiliations: Meta
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents BOUQuET, a multicentric and multi-register/domain dataset and benchmark, and its broader collaborative extension initiative. This dataset is handcrafted in non-English languages first, each of these source languages being represented among the 23 languages commonly used by half of the world’s population and therefore having the potential to serve as pivot languages that will enable more accurate translations. The dataset is specially designed to avoid contamination and be multicentric, so as to enforce representation of multilingual language features. In addition, the dataset goes beyond the sentence level, as it is organized in paragraphs of various lengths. Compared with related machine translation (MT) datasets, we show that BOUQuET has a broader representation of domains while simplifying the translation task for non-experts. Therefore, BOUQuET is specially suitable for the open initiative and call for translation participation that we are launching to extend it to a multi-way parallel corpus to any written language.

[NLP-6] Great Models Think Alike and this Undermines AI Oversight

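Quick read: As language model (LM) capabilities advance, evaluating and supervising them at scale becomes harder for humans. This paper proposes a probabilistic metric for LM similarity based on overlap in model mistakes and uses it to study two aspects of AI oversight: LLM-as-a-judge scores favor models similar to the judge, and complementary knowledge between a weak supervisor and a strong student plays a crucial role in gains from "weak-to-strong generalization". As capabilities increase, model mistakes become harder to find and also more similar across models, which points to risks from correlated failures. The paper stresses the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.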

Link: https://arxiv.org/abs/2502.04313
Authors: Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 60 pages, 20 figures

Abstract:As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as “AI Oversight”. We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from “weak-to-strong generalization”. As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend – model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.
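
The abstract does not spell out the metric, so the sketch below is not the paper's definition. It shows a simpler, related idea under stated assumptions: a Cohen's-kappa-style score that compares how often two models are right or wrong on the same items against what their accuracies alone would predict. The function name and the per-item correctness lists are hypothetical.

```python
def error_consistency(correct_a, correct_b):
    """Chance-adjusted agreement between two models' per-item correctness.

    correct_a, correct_b: equal-length lists of 0/1 (or bool) flags, one per item.
    Returns 1.0 when the models succeed and fail on exactly the same items,
    and about 0.0 when agreement is what independent models would show.
    """
    n = len(correct_a)
    if n == 0 or n != len(correct_b):
        raise ValueError("inputs must be non-empty and of equal length")
    observed = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n
    # Agreement expected by chance if the two models erred independently.
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    if expected == 1.0:
        return 0.0  # degenerate case: both models always right or always wrong
    return (observed - expected) / (1 - expected)


same_mistakes = error_consistency([1, 1, 0, 0], [1, 1, 0, 0])
unrelated_mistakes = error_consistency([1, 1, 0, 0], [1, 0, 1, 0])
```

This toy score only looks at whether each answer is right or wrong; it ignores which wrong answer was given and the models' output probabilities, both of which a probabilistic metric could take into account.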

[NLP-7] ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization

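Quick read: This paper addresses the inflexibility of existing automated agent workflow optimization methods, which stems from representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. The key is the ScoreFlow framework, which uses efficient gradient-based optimization in a continuous space and incorporates Score-DPO, a new variant of direct preference optimization that accounts for quantitative feedback.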

Link: https://arxiv.org/abs/2502.04306
Authors: Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, Bryon Aragam
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Project: this https URL

Abstract:Recent research has leveraged large language model multi-agent systems for complex problem-solving while trying to reduce the manual effort required to build them, driving the development of automated agent workflow optimization methods. However, existing methods remain inflexible due to representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. We address these challenges with ScoreFlow, a simple yet high-performance framework that leverages efficient gradient-based optimization in a continuous space. ScoreFlow incorporates Score-DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. Across six benchmarks spanning question answering, coding, and mathematical reasoning, ScoreFlow achieves an 8.2% improvement over existing baselines. Moreover, it empowers smaller models to outperform larger ones with lower inference costs. Project: this https URL

[NLP-8] Beyond Prompt Content: Enhancing LLM Performance via Content-Format Integrated Prompt Optimization

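Quick read: The real-world effectiveness of large language models (LLMs) often depends on prompt design, yet recent work has optimized prompt content while largely overlooking prompt formatting. This paper proposes Content-Format Integrated Prompt Optimization (CFPO), which jointly optimizes prompt content and format through an iterative refinement process and reports measurable improvements over content-only optimization.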

Link: https://arxiv.org/abs/2502.04295
Authors: Yuanye Liu, Jiahang Xu, Li Lyna Zhang, Qi Chen, Xuan Feng, Yang Chen, Zhongxin Guo, Yuqing Yang, Cheng Peng
Affiliations: Fudan University; Microsoft Research Asia
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have shown significant capability across various tasks, with their real-world effectiveness often driven by prompt design. While recent research has focused on optimizing prompt content, the role of prompt formatting, a critical but often overlooked dimension, has received limited systematic investigation. In this paper, we introduce Content-Format Integrated Prompt Optimization (CFPO), an innovative methodology that jointly optimizes both prompt content and formatting through an iterative refinement process. CFPO leverages natural language mutations to explore content variations and employs a dynamic format exploration strategy that systematically evaluates diverse format options. Our extensive evaluations across multiple tasks and open-source LLMs demonstrate that CFPO demonstrates measurable performance improvements compared to content-only optimization methods. This highlights the importance of integrated content-format optimization and offers a practical, model-agnostic approach to enhancing LLM performance. Code will be available at this https URL.

[NLP-9] A Methodology for Studying Linguistic and Cultural Change in China 1900-1950

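Quick read: This paper addresses the lack of quantitative research on linguistic and cultural change in China during the first half of the twentieth century. The key is an analytical framework that applies established methods such as word counts and word embeddings to provide new historical insights into the complex negotiations between Western modernity and Chinese cultural discourse.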

Link: https://arxiv.org/abs/2502.04286
Authors: Spencer Dean Stewart
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 4 figures

Abstract:This paper presents a quantitative approach to studying linguistic and cultural change in China during the first half of the twentieth century, a period that remains understudied in computational humanities research. The dramatic changes in Chinese language and culture during this time call for greater reflection on the tools and methods used for text analysis. This preliminary study offers a framework for analyzing Chinese texts from the late nineteenth and twentieth centuries, demonstrating how established methods such as word counts and word embeddings can provide new historical insights into the complex negotiations between Western modernity and Chinese cultural discourse.

[NLP-10] How does a Multilingual LM Handle Multiple Languages?

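Quick read: This paper examines how effectively multilingual language models (MLMs) capture linguistic knowledge, particularly for low-resource languages. It evaluates MLMs through three objectives: first, it assesses semantic similarity by checking the consistency of multilingual word embeddings with cosine similarity; second, it probes the linguistic structures of BLOOM-1.7B and Qwen2 through Named Entity Recognition (NER) and sentence similarity tasks; third, it evaluates cross-lingual knowledge transfer from high-resource to low-resource languages on sentiment analysis and text classification. The findings highlight the strengths and limitations of MLMs and aim to improve multilingual NLP models so that they better support both high- and low-resource languages and make language technologies more inclusive.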

Link: https://arxiv.org/abs/2502.04269
Authors: Santhosh Kakarla, Gautama Shastry Bulusu Venkata, Aishwarya Gaddam
Affiliations: George Mason University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 8 figures

Abstract: Multilingual language models have significantly advanced due to rapid progress in natural language processing. Models like BLOOM 1.7B, trained on diverse multilingual datasets, aim to bridge linguistic gaps. However, their effectiveness in capturing linguistic knowledge, particularly for low-resource languages, remains an open question. This study critically examines MLMs' capabilities in multilingual understanding, semantic representation, and cross-lingual knowledge transfer. While these models perform well for high-resource languages, they struggle with less-represented ones. Additionally, traditional evaluation methods often overlook their internal syntactic and semantic encoding. This research addresses key limitations through three objectives. First, it assesses semantic similarity by analyzing multilingual word embeddings for consistency using cosine similarity. Second, it examines BLOOM-1.7B and Qwen2 through Named Entity Recognition and sentence similarity tasks to understand their linguistic structures. Third, it explores cross-lingual knowledge transfer by evaluating generalization from high-resource to low-resource languages in sentiment analysis and text classification. By leveraging linguistic probing, performance metrics, and visualizations, this study provides insights into the strengths and limitations of MLMs. The findings aim to enhance multilingual NLP models, ensuring better support for both high- and low-resource languages, thereby promoting inclusivity in language technologies.
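
The first objective relies on cosine similarity between word embeddings. A minimal, dependency-free sketch of that measure follows; the three-dimensional vectors are made-up stand-ins for illustration, not embeddings taken from BLOOM or Qwen2.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings of a translation pair and an unrelated word.
emb_dog_en = [0.9, 0.1, 0.3]
emb_dog_hi = [0.8, 0.2, 0.35]
emb_tax_en = [-0.2, 0.9, -0.4]

pair_score = cosine_similarity(emb_dog_en, emb_dog_hi)
unrelated_score = cosine_similarity(emb_dog_en, emb_tax_en)
```

If translation pairs consistently score higher than unrelated pairs, that suggests the model places equivalent words from different languages close together in embedding space.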

[NLP-11] TriNER: A Series of Named Entity Recognition Models For Hindi, Bengali & Marathi

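Quick read: This paper addresses the challenges of named entity recognition (NER) in India's multilingual setting, focusing on Hindi, Bengali, and Marathi, which the paper describes as the three most spoken languages in India. The approach trains a custom transformer model and fine-tunes several pretrained models, reaching an F1 score of 92.11 over a total of six entity groups. The aim is a single model that significantly reduces inconsistencies in entity groups and tag names across the three languages.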

Link: https://arxiv.org/abs/2502.04245
Authors: Mohammed Amaan Dhamaskar, Rasika Ransing
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: India’s rich cultural and linguistic diversity poses various challenges in the domain of Natural Language Processing (NLP), particularly in Named Entity Recognition (NER). NER is a NLP task that aims to identify and classify tokens into different entity groups like Person, Location, Organization, Number, etc. This makes NER very useful for downstream tasks like context-aware anonymization. This paper details our work to build a multilingual NER model for the three most spoken languages in India - Hindi, Bengali & Marathi. We train a custom transformer model and fine tune a few pretrained models, achieving an F1 Score of 92.11 for a total of 6 entity groups. Through this paper, we aim to introduce a single model to perform NER and significantly reduce the inconsistencies in entity groups and tag names, across the three languages.

[NLP-12] MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion

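Quick read: This paper addresses a key obstacle to the continued scaling of large language models: the scarcity of high-quality pretraining data. It proposes MAssive Genre-Audience (MAGA) reformulation, a lightweight and scalable method that systematically synthesizes diverse, contextually rich pretraining data from an existing corpus, and uses it to build MAGACorpus, a corpus of 770B tokens.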

Link: https://arxiv.org/abs/2502.04235
Authors: Xintong Hao, Ke Shen, Chenggang Li
Affiliations: Seed-LLM, ByteDance
Categories: Computation and Language (cs.CL)
Comments: Dataset released url this https URL

Abstract: Despite the remarkable capabilities of large language models across various tasks, their continued scaling faces a critical challenge: the scarcity of high-quality pretraining data. While model architectures continue to evolve, the natural language data struggles to scale up. To tackle this bottleneck, we propose MAssive Genre-Audience (MAGA) reformulation method, which systematically synthesizes diverse, contextually-rich pretraining data from existing corpus. This work makes three main contributions: (1) We propose MAGA reformulation method, a lightweight and scalable approach for pretraining corpus expansion, and build a 770B tokens MAGACorpus. (2) We evaluate MAGACorpus with different data budget scaling strategies, demonstrating consistent improvements across various model sizes (134M-13B), establishing the necessity for next-generation large-scale synthetic pretraining language models. (3) Through comprehensive analysis, we investigate prompt engineering’s impact on synthetic training collapse and reveal limitations in conventional collapse detection metrics using validation losses. Our work shows that MAGA can substantially expand training datasets while maintaining quality, offering a reliable pathway for scaling models beyond data limitations.

[NLP-13] A Classification System Approach in Predicting Chinese Censorship

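Quick read: This paper uses classifiers to predict whether a Weibo post would be censored on the Chinese internet. Using randomized sampling and Chinese tokenization strategies, the authors build a cleaned dataset of Chinese phrases with binary censorship labels, then apply several probability-based information retrieval methods to derive four logistic regression models for classification. They also experiment with pre-trained transformers on similar classification tasks. Evaluated on macro-F1 and ROC-AUC, the fine-tuned BERT model outperforms the other strategies.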

Link: https://arxiv.org/abs/2502.04234
Authors: Matt Prodani, Tianchu Ze, Yushen Hu
Affiliations: New York University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:

Abstract: This paper is dedicated to using a classifier to predict whether a Weibo post would be censored under the Chinese internet. Through randomized sampling from [Fu2021] and Chinese tokenizing strategies, we constructed a cleaned Chinese phrase dataset with binary censorship markings. Utilizing various probability-based information retrieval methods on the data, we were able to derive 4 logistic regression models for classification. Furthermore, we experimented with pre-trained transformers to perform similar classification tasks. After evaluating both the macro-F1 and ROC-AUC metrics, we concluded that the Fine-Tuned BERT model exceeds other strategies in performance.
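
Since the models are compared on macro-F1, a small pure-Python sketch of that metric may help: it averages per-class F1 so that the censored and uncensored classes count equally regardless of their sizes. The labels below are made-up toy data, not results from the paper.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives means precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class that appears."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels: 1 = censored, 0 = not censored (hypothetical data).
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
score = macro_f1(y_true, y_pred)
```

Here the censored class gets F1 = 2/3 and the uncensored class gets F1 = 4/5, so the macro average is 11/15. Accuracy on the same toy data is 3/4 and would not show that half of the censored posts were missed.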

[NLP-14] Sports and Women's Sports: Gender Bias in Text Generation with Olympic Data (NAACL 2025)

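Quick read: This paper investigates gender bias in large language models (LLMs). Using data from parallel men's and women's events at the Olympic Games, it defines three metrics to measure bias and finds that models are consistently biased against women when the gender is ambiguous in the prompt: they frequently retrieve only the results of the men's event, with or without acknowledging them as such. This reveals pervasive gender bias in LLMs in the context of athletics.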

Link: https://arxiv.org/abs/2502.04218
Authors: Laura Biester
Affiliations: Middlebury College
Categories: Computation and Language (cs.CL)
Comments: NAACL 2025

Abstract:Large Language Models (LLMs) have been shown to be biased in prior work, as they generate text that is in line with stereotypical views of the world or that is not representative of the viewpoints and values of historically marginalized demographic groups. In this work, we propose using data from parallel men’s and women’s events at the Olympic Games to investigate different forms of gender bias in language models. We define three metrics to measure bias, and find that models are consistently biased against women when the gender is ambiguous in the prompt. In this case, the model frequently retrieves only the results of the men’s event with or without acknowledging them as such, revealing pervasive gender bias in LLMs in the context of athletics.

[NLP-15] The Best Instruction-Tuning Data are Those That Fit

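Quick read: High-quality supervised fine-tuning (SFT) data are crucial for eliciting strong capabilities from pretrained large language models (LLMs), but responses sampled from other LLMs are often out of distribution for the target model, which at scale can lead to diminishing returns and even hurt performance and robustness. The paper proposes GRAPE, an SFT framework that accounts for the characteristics of the target model. For each instruction, GRAPE gathers responses from various LLMs, selects the one to which the target model assigns the highest probability (the one that aligns most closely with its pretrained distribution), and then runs standard SFT. GRAPE reports an absolute gain of up to 13.8% over strong baselines and achieves better results with less data and fewer training epochs.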

Link: https://arxiv.org/abs/2502.04194
Authors: Dylan Zhang, Qirun Dai, Hao Peng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: High-quality supervised fine-tuning (SFT) data are crucial for eliciting strong capabilities from pretrained large language models (LLMs). Typically, instructions are paired with multiple responses sampled from other LLMs, which are often out of the distribution of the target model to be fine-tuned. This, at scale, can lead to diminishing returns and even hurt the models’ performance and robustness. We propose GRAPE, a novel SFT framework that accounts for the unique characteristics of the target model. For each instruction, it gathers responses from various LLMs and selects the one with the highest probability measured by the target model, indicating that it aligns most closely with the target model’s pretrained distribution; it then proceeds with standard SFT training. We first evaluate GRAPE with a controlled experiment, where we sample various solutions for each question in UltraInteract from multiple models and fine-tune commonly used LMs like LLaMA3.1-8B, Mistral-7B, and Qwen2.5-7B on GRAPE-selected data. GRAPE significantly outperforms strong baselines, including distilling from the strongest model with an absolute gain of up to 13.8%, averaged across benchmarks, and training on 3x more data with a maximum performance improvement of 17.3%. GRAPE’s strong performance generalizes to realistic settings. We experiment with the post-training data used for Tulu3 and Olmo-2. GRAPE outperforms strong baselines trained on 4.5 times more data by 6.1% and a state-of-the-art data selection approach by 3% on average performance. Remarkably, using 1/3 of the data and half the number of epochs, GRAPE enables LLaMA3.1-8B to surpass the performance of Tulu3-SFT by 3.5%.
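
The selection step described in the abstract (keep the candidate response the target model finds most probable) can be sketched as below. This is an illustration under assumptions, not the authors' code: `log_prob_fn` is a hypothetical hook for scoring a response with the target model, the toy unigram table stands in for a real LLM, and averaging log-probability per token is an assumption the abstract does not confirm.

```python
import math


def select_best_response(instruction, candidates, log_prob_fn):
    """Return the candidate that scores highest under the target model."""
    return max(candidates, key=lambda response: log_prob_fn(instruction, response))


# Toy stand-in for a target model: a fixed unigram distribution (hypothetical).
TOY_UNIGRAM = {"the": 0.5, "answer": 0.3, "is": 0.15, "42": 0.05}


def toy_log_prob(instruction, response):
    """Average per-token log-probability; unseen tokens get a small floor."""
    floor = 1e-6
    tokens = response.lower().split()
    total = sum(math.log(TOY_UNIGRAM.get(tok, floor)) for tok in tokens)
    return total / max(len(tokens), 1)


candidates = ["the answer is 42", "behold forty two"]
best = select_best_response("What is 6 times 7?", candidates, toy_log_prob)
```

In practice `log_prob_fn` would compute the response's conditional log-likelihood given the instruction under the model being fine-tuned, and the selected pairs would then go into ordinary SFT.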

[NLP-16] Multi-agent Architecture Search via Agentic Supernet

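Quick read: Building LLM-empowered multi-agent systems often requires labor-intensive manual design, and existing automated methods typically search for a single static, one-size-fits-all system that cannot allocate inference resources according to the difficulty and domain of each query. The paper instead optimizes an agentic supernet, a probabilistic and continuous distribution over agentic architectures. Its MaAS framework samples query-dependent agentic systems from the supernet, delivering high-quality solutions with tailored resource allocation. In the reported experiments, MaAS needs only 6% to 45% of the inference cost of existing handcrafted or automated multi-agent systems, surpasses them by 0.54% to 11.82% across benchmarks, and transfers better across datasets and LLM backbones.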

Link: https://arxiv.org/abs/2502.04180
Authors: Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, Xiang Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract: Large Language Model (LLM)-empowered multi-agent systems extend the cognitive boundaries of individual agents through disciplined collaboration and interaction, while constructing these systems often requires labor-intensive manual designs. Despite the availability of methods to automate the design of agentic workflows, they typically seek to identify a static, complex, one-size-fits-all system, which, however, fails to dynamically allocate inference resources based on the difficulty and domain of each query. To address this challenge, we shift away from the pursuit of a monolithic agentic system, instead optimizing the agentic supernet, a probabilistic and continuous distribution of agentic architectures. We introduce MaAS, an automated framework that samples query-dependent agentic systems from the supernet, delivering high-quality solutions and tailored resource allocation (e.g., LLM calls, tool calls, token cost). Comprehensive evaluation across six benchmarks demonstrates that MaAS (I) requires only 6% to 45% of the inference costs of existing handcrafted or automated multi-agent systems, (II) surpasses them by 0.54% to 11.82%, and (III) enjoys superior cross-dataset and cross-LLM-backbone transferability.

[NLP-17] Lexical Substitution is not Synonym Substitution: On the Importance of Producing Contextually Relevant Word Substitutes

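Quick read: This paper addresses how to generate substitutes in the Lexical Substitution task that fit the target word's meaning and its surrounding context. The key is ConCat, a simple augmented approach that uses the original sentence to strengthen the contextual information sent to a pre-trained language model, guiding the model toward contextually relevant predictions.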

Link: https://arxiv.org/abs/2502.04173
Authors: Juraj Vladika, Stephen Meisenbacher, Florian Matthes
Affiliations: Department of Computer Science, School of Computation, Information and Technology, Technical University of Munich
Categories: Computation and Language (cs.CL)
Comments: Accepted to ICAART 2025

Abstract:Lexical Substitution is the task of replacing a single word in a sentence with a similar one. This should ideally be one that is not necessarily only synonymous, but also fits well into the surrounding context of the target word, while preserving the sentence’s grammatical structure. Recent advances in Lexical Substitution have leveraged the masked token prediction task of Pre-trained Language Models to generate replacements for a given word in a sentence. With this technique, we introduce ConCat, a simple augmented approach which utilizes the original sentence to bolster contextual information sent to the model. Compared to existing approaches, it proves to be very effective in guiding the model to make contextually relevant predictions for the target word. Our study includes a quantitative evaluation, measured via sentence similarity and task performance. In addition, we conduct a qualitative human analysis to validate that users prefer the substitutions proposed by our method, as opposed to previous methods. Finally, we test our approach on the prevailing benchmark for Lexical Substitution, CoInCo, revealing potential pitfalls of the benchmark. These insights serve as the foundation for a critical discussion on the way in which Lexical Substitution is evaluated.
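
One way to read "utilizes the original sentence to bolster contextual information" is to feed the model the original sentence next to a copy whose target word is masked. The sketch below builds such an input string. The exact layout (`original [SEP] masked`), the token names, and the function itself are assumptions for illustration and may differ from what ConCat actually does.

```python
def build_concat_input(sentence, target, mask_token="[MASK]", sep_token="[SEP]"):
    """Concatenate the original sentence with a copy whose target word is masked.

    Only the first whitespace-delimited occurrence of `target` is masked.
    """
    words = sentence.split()
    if target not in words:
        raise ValueError(f"target word {target!r} not found in sentence")
    idx = words.index(target)
    masked = words[:idx] + [mask_token] + words[idx + 1:]
    return f"{sentence} {sep_token} {' '.join(masked)}"


model_input = build_concat_input("The bank raised interest rates", "bank")
```

A masked language model given this input still sees the word "bank" in the first half, which could steer its prediction for the masked slot toward substitutes that match the original meaning rather than merely the syntax.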

[NLP-18] UltraIF: Advancing Instruction Following from the Wild

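Quick read: This paper addresses the gap in following complex instructions between large language models (LLMs) trained by the open-source community and those trained by leading companies. It proposes UltraIF, a simple and scalable approach that decomposes real-world user prompts into simpler queries, constraints, and evaluation questions for those constraints, then trains an UltraComposer to compose constraint-associated prompts together with evaluation questions. This makes it possible to synthesize complicated instructions from open-source data and to filter responses using the evaluation questions. Using only an 8B model as response generator and evaluator, UltraIF aligns LLaMA-3.1-8B-Base to catch up with its instruct version on five instruction-following benchmarks without any benchmark information. The paper also shows that UltraIF can further improve LLaMA-3.1-8B-Instruct through self-alignment.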

Link: https://arxiv.org/abs/2502.04153
Authors: Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, Baobao Chang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Instruction-following made modern large language models (LLMs) helpful assistants. However, the key to taming LLMs on complex instructions remains mysterious, for that there are huge gaps between models trained by open-source community and those trained by leading companies. To bridge the gap, we propose a simple and scalable approach UltraIF for building LLMs that can follow complex instructions with open-source data. UltraIF first decomposes real-world user prompts into simpler queries, constraints, and corresponding evaluation questions for the constraints. Then, we train an UltraComposer to compose constraint-associated prompts with evaluation questions. This prompt composer allows us to synthesize complicated instructions as well as filter responses with evaluation questions. In our experiment, for the first time, we successfully align LLaMA-3.1-8B-Base to catch up with its instruct version on 5 instruction-following benchmarks without any benchmark information, using only 8B model as response generator and evaluator. The aligned model also achieved competitive scores on other benchmarks. Moreover, we also show that UltraIF could further improve LLaMA-3.1-8B-Instruct through self-alignment, motivating broader use cases for the method. Our code will be available at this https URL.

[NLP-19] The Order Effect: Investigating Prompt Sensitivity in Closed-Source LLMs

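[Quick read]: This paper examines the reliability of large language models (LLMs) under varying input conditions, in particular order sensitivity, where the arrangement of the input leads to inconsistent or biased outputs. Experiments across several tasks (paraphrasing, relevance judgment, and multiple-choice questions) show that input order significantly affects performance and that shuffling the input measurably lowers output accuracy. Few-shot prompting shows mixed effectiveness and offers only partial mitigation. The paper therefore highlights persistent risks in high-stakes applications and the need for more robust LLMs or improved input-handling techniques.

Link: https://arxiv.org/abs/2502.04134
Authors: Bryan Guan,Tanya Roosta,Peyman Passban,Mehdi Rezagholizadeh
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: The first 3 authors have contributed equally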

Abstract:As large language models (LLMs) become integral to diverse applications, ensuring their reliability under varying input conditions is crucial. One key issue affecting this reliability is order sensitivity, wherein slight variations in input arrangement can lead to inconsistent or biased outputs. Although recent advances have reduced this sensitivity, the problem remains unresolved. This paper investigates the extent of order sensitivity in closed-source LLMs by conducting experiments across multiple tasks, including paraphrasing, relevance judgment, and multiple-choice questions. Our results show that input order significantly affects performance across tasks, with shuffled inputs leading to measurable declines in output accuracy. Few-shot prompting demonstrates mixed effectiveness and offers partial mitigation, however, fails to fully resolve the problem. These findings highlight persistent risks, particularly in high-stakes applications, and point to the need for more robust LLMs or improved input-handling techniques in future development.
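
The shuffling experiment described in this abstract can be sketched as a small consistency probe. This is a minimal illustration, not the authors' protocol: `toy_model` and `order_invariant_model` are hypothetical stand-ins for an LLM call, and `order_consistency` is a name made up for this sketch.

```python
import random
from collections import Counter


def toy_model(options):
    """Hypothetical stand-in for an LLM: always picks the first option shown,
    which makes it maximally sensitive to input order."""
    return options[0]


def order_invariant_model(options):
    """Hypothetical order-insensitive model: picks the alphabetically smallest option."""
    return min(options)


def order_consistency(model, options, n_shuffles=20, seed=0):
    """Fraction of shuffled orderings on which the model gives its most common answer."""
    rng = random.Random(seed)
    answers = []
    for _ in range(n_shuffles):
        shuffled = options[:]          # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        answers.append(model(shuffled))
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n_shuffles


options = ["Paris", "Rome", "Berlin", "Madrid"]
stable = order_consistency(order_invariant_model, options)
unstable = order_consistency(toy_model, options)
```

A consistency of 1.0 means the answer never changed across orderings; lower values indicate order sensitivity. A real probe would replace the toy functions with calls to the model under test.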

[NLP-20] LLMs to Support a Domain Specific Knowledge Assistant

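[Quick read]: This paper addresses the lack of a publicly available question-answer dataset for sustainability reporting, which has impeded the development of a high-quality chatbot to support reporting under the International Financial Reporting Standards (IFRS). Its two key contributions are a high-quality synthetic QA dataset based on the IFRS sustainability standards, and two question-answering architectures for the sustainability reporting domain: a RAG pipeline and a fully LLM-based pipeline. The architectures are developed by experimenting, fine-tuning, and training on the QA dataset, and their quality and performance are assessed with a custom evaluation framework.

Link: https://arxiv.org/abs/2502.04095
Authors: Maria-Flavia Lovin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: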

Abstract:This work presents a custom approach to developing a domain specific knowledge assistant for sustainability reporting using the International Financial Reporting Standards (IFRS). In this domain, there is no publicly available question-answer dataset, which has impeded the development of a high-quality chatbot to support companies with IFRS reporting. The two key contributions of this project therefore are: (1) A high-quality synthetic question-answer (QA) dataset based on IFRS sustainability standards, created using a novel generation and evaluation pipeline leveraging Large Language Models (LLMs). This comprises 1,063 diverse QA pairs that address a wide spectrum of potential user queries in sustainability reporting. Various LLM-based techniques are employed to create the dataset, including chain-of-thought reasoning and few-shot prompting. A custom evaluation framework is developed to assess question and answer quality across multiple dimensions, including faithfulness, relevance, and domain specificity. The dataset averages a score range of 8.16 out of 10 on these metrics. (2) Two architectures for question-answering in the sustainability reporting domain - a RAG pipeline and a fully LLM-based pipeline. The architectures are developed by experimenting, fine-tuning, and training on the QA dataset. The final pipelines feature an LLM fine-tuned on domain specific data and an industry classification component to improve the handling of complex queries. The RAG architecture achieves an accuracy of 85.32% on single-industry and 72.15% on cross-industry multiple-choice questions, outperforming the baseline approach by 4.67 and 19.21 percentage points, respectively. The LLM-based pipeline achieves an accuracy of 93.45% on single-industry and 80.30% on cross-industry multiple-choice questions, an improvement of 12.80 and 27.36 percentage points over the baseline, respectively. 
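
The accuracy figures in this abstract are given as absolute scores plus gains in percentage points, so the baseline accuracies can be recovered by subtraction. The sketch below only does that arithmetic on the numbers quoted above; the dictionary layout and the `implied_baseline` helper are illustrative names, not anything from the paper.

```python
# (reported accuracy in %, gain over baseline in percentage points), as quoted in the abstract
reported = {
    ("RAG", "single-industry"): (85.32, 4.67),
    ("RAG", "cross-industry"): (72.15, 19.21),
    ("LLM-based", "single-industry"): (93.45, 12.80),
    ("LLM-based", "cross-industry"): (80.30, 27.36),
}


def implied_baseline(accuracy, gain):
    """Baseline accuracy implied by a reported accuracy and its percentage-point gain."""
    return round(accuracy - gain, 2)


baselines = {key: implied_baseline(*value) for key, value in reported.items()}
```

Both pipelines imply the same baselines (80.65% on single-industry and 52.94% on cross-industry questions), which is consistent with the two architectures being compared against one shared baseline.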

[NLP-21] AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference

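[Quick read]: This paper addresses the difficulty existing methods have in accurately identifying critical KV tokens when compressing the key-value (KV) cache for efficient long-context generation with large language models (LLMs). Existing methods typically rank tokens heuristically by attention score but ignore temporal patterns in those scores, which noticeably degrades LLM performance. The proposed AttentionPredictor is a learning-based approach that uses a lightweight convolution model to capture spatiotemporal patterns and predict the next-token attention score, accurately and with negligible memory overhead. The paper also proposes a cross-token critical cache prefetching framework that hides the token estimation overhead and accelerates decoding. By retaining most of the attention information, AttentionPredictor achieves 16× KV cache compression with comparable LLM performance, significantly outperforming the state of the art.

Link: https://arxiv.org/abs/2502.04077
Authors: Qingyue Yang,Jie Wang,Xing Li,Zhihai Wang,Chen Chen,Lei Chen,Xianzhi Yu,Wulong Liu,Jianye Hao,Mingxuan Yuan,Bin Li
Affiliations: MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; Noah’s Ark Lab, Huawei Technologies; College of Intelligence and Computing, Tianjin University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: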

Abstract:With the development of large language models (LLMs), efficient inference through Key-Value (KV) cache compression has attracted considerable attention, especially for long-context generation. To compress the KV cache, recent methods identify critical KV tokens through heuristic ranking with attention scores. However, these methods often struggle to accurately determine critical tokens as they neglect the temporal patterns in attention scores, resulting in a noticeable degradation in LLM performance. To address this challenge, we propose AttentionPredictor, which is the first learning-based critical token identification approach. Specifically, AttentionPredictor learns a lightweight convolution model to capture spatiotemporal patterns and predict the next-token attention score. An appealing feature of AttentionPredictor is that it accurately predicts the attention score while consuming negligible memory. Moreover, we propose a cross-token critical cache prefetching framework that hides the token estimation time overhead to accelerate the decoding stage. By retaining most of the attention information, AttentionPredictor achieves 16× KV cache compression with comparable LLM performance, significantly outperforming the state-of-the-art.
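
To make the idea of keeping only critical KV tokens concrete, here is a rough sketch of budgeted selection from a list of predicted attention scores. It assumes a plain top-k rule under a fixed budget; the scores are hand-written rather than produced by a convolution model, and `select_critical_tokens` is a hypothetical helper, not the paper's implementation.

```python
def select_critical_tokens(predicted_scores, compression_ratio=16):
    """Return the sorted positions of KV entries to keep.

    The budget is the cache length divided by the compression ratio
    (at least one entry); the highest-scoring positions are kept.
    """
    budget = max(1, len(predicted_scores) // compression_ratio)
    ranked = sorted(
        range(len(predicted_scores)),
        key=lambda i: predicted_scores[i],
        reverse=True,
    )
    return sorted(ranked[:budget])


# Toy cache of 32 positions: two positions are predicted to receive most attention.
scores = [0.01] * 32
scores[3] = 0.40
scores[30] = 0.30
kept = select_critical_tokens(scores)  # budget = 32 // 16 = 2 entries
```

With a 16× ratio, a 32-entry cache keeps 2 entries, here positions 3 and 30. How well such a scheme works depends entirely on how accurate the predicted scores are, which is the part the paper's learned predictor targets.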

[NLP-22] Controllable Emotion Generation with Emotion Vectors

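[Quick read]: This paper addresses the limited ability of large-scale language models (LLMs) to express emotion. It proposes a universal, highly flexible, and controllable method for producing emotional expression in LLM output, and reports extensive experiments and verification of its effectiveness. The method has broad application prospects wherever LLMs produce emotional output, such as intelligent customer service, literary creation, and home companion robots.

Link: https://arxiv.org/abs/2502.04075
Authors: Yurui Dong,Luozhijie Jin,Yao Yang,Bingjie Lu,Jiaxi Yang,Zhi Liu
Affiliations: Fudan University; Zhejiang Lab; Nanjing University of Chinese Medicine
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 5 figures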

Abstract:In recent years, technologies based on large-scale language models (LLMs) have made remarkable progress in many fields, especially in customer service, content creation, and embodied intelligence, showing broad application potential. However, the LLM’s ability to express emotions with proper tone, timing, and in both direct and indirect forms is still insufficient but significant. Few works have studied how to build the controllable emotional expression capability of LLMs. In this work, we propose a method for emotion expression output by LLMs, which is universal, highly flexible, and well controllable proved with the extensive experiments and verifications. This method has broad application prospects in fields involving emotions output by LLMs, such as intelligent customer service, literary creation, and home companion robots. The extensive experiments on various LLMs with different model-scales and architectures prove the versatility and the effectiveness of the proposed method.

[NLP-23] Predicting Large Language Model Capabilities on Closed-Book QA Tasks Using Only Information Available Prior to Training

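[Quick read]: This paper addresses how to predict a model's performance on a specific task before training. The key is to analyze the pre-training data with knowledge triples and to introduce SMI, an information-theoretic metric that quantifies the relationship between pre-training data, model size, and task-specific knowledge retention. Experiments show a strong linear correlation ($R^2 = 0.84$) between SMI and accuracy on Closed-book Question Answering (CBQA) tasks.

Link: https://arxiv.org/abs/2502.04066
Authors: Changhao Jiang,Ming Zhang,Junjie Ye,Xiaoran Fan,Yifei Cao,Jiajun Sun,Zhiheng Xi,Shihan Dou,Yi Dong,Yujiong Shen,Jingqi Tong,Zhen Wang,Tao Liang,Zhihui Fei,Mingyang Wan,Guojun Ma,Qi Zhang,Tao Gui,Xuanjing Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: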

Abstract:The GPT-4 technical report from OpenAI suggests that model performance on specific tasks can be predicted prior to training, though methodologies remain unspecified. This approach is crucial for optimizing resource allocation and ensuring data alignment with target tasks. To achieve this vision, we focus on predicting performance on Closed-book Question Answering (CBQA) tasks, which are closely tied to pre-training data and knowledge retention. We address three major challenges: 1) mastering the entire pre-training process, especially data construction; 2) evaluating a model’s knowledge retention; and 3) predicting task-specific knowledge retention using only information available prior to training. To tackle these challenges, we pre-train three large language models (i.e., 1.6B, 7B, and 13B) using 560k dollars and 520k GPU hours. We analyze the pre-training data with knowledge triples and assess knowledge retention using established methods. Additionally, we introduce the SMI metric, an information-theoretic measure that quantifies the relationship between pre-training data, model size, and task-specific knowledge retention. Our experiments reveal a strong linear correlation ($R^2 = 0.84$) between the SMI metric and the model’s accuracy on CBQA tasks across models of varying sizes (i.e., 1.1B, 1.6B, 7B, and 13B). The dataset, model, and code are available at this https URL.
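
For readers unfamiliar with the $R^2$ statistic used to report the linear correlation, the sketch below computes it for a one-variable least-squares fit. The `smi` and `accuracy` values are synthetic numbers invented for illustration, not data from the paper, and `r_squared` is a name chosen for this example.

```python
def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic points for illustration only (not taken from the paper).
smi = [0.1, 0.2, 0.3, 0.4, 0.5]
accuracy = [0.12, 0.18, 0.33, 0.41, 0.48]
fit_quality = r_squared(smi, accuracy)
```

A value near 1 means the fitted line explains almost all of the variance in accuracy; a value near 0 means the metric carries little linear information about accuracy.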

[NLP-24] Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment

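[Quick read]: This paper addresses the poor generalization of models trained with Refusal Training (RT) under out-of-distribution (OOD) attacks. The key is reasoning-based supervision: reasoning traces are synthesized from predefined guidelines and the model is trained to perform safety reasoning in line with them, which helps it elicit and use its latent safety knowledge when handling OOD attacks. The method significantly improves generalization against OOD attacks.

Link: https://arxiv.org/abs/2502.04040
Authors: Haoyu Wang,Zeyu Qin,Li Shen,Xueqian Wang,Minhao Cheng,Dacheng Tao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: The first two authors contributed equally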

Abstract:Training safe LLMs is one of the most critical research challenge. However, the commonly used method, Refusal Training (RT), struggles to generalize against various OOD jailbreaking attacks. Many safety training methods have been proposed to address this issue. While they offer valuable insights, we aim to complement this line of research by investigating whether OOD attacks truly exceed the capability of RT model. Conducting evaluation with BoN, we observe significant improvements on generalization as N increases. This underscores that the model possesses sufficient safety-related latent knowledge, but RT fails to consistently elicit this knowledge when addressing OOD attacks. Further analysis based on domain adaptation reveals that training with direct refusal causes model to rely on superficial shortcuts, resulting in learning of non-robust representation mappings. Based on our findings, we propose training model to perform safety reasoning for each query. Reasoning supervision encourages model to perform more computations, explicitly eliciting and using latent knowledge through reasoning. To achieve this, we synthesize reasoning supervision based on pre-guidelines, training the model to reason in alignment with them, thereby effectively eliciting and utilizing latent knowledge from diverse perspectives. Extensive experiments show that our method significantly improves generalization performance against OOD attacks.
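
The BoN (best-of-N) evaluation mentioned in this abstract can be illustrated with a toy sampler: draw N responses and keep the one a scorer rates safest. Everything below is a hedged sketch under stated assumptions; `toy_generate` (a model that refuses 30% of the time) and `toy_safety_score` are hypothetical stand-ins, and the paper's actual judge and sampling setup are not reproduced here.

```python
import random


def best_of_n(generate, safety_score, prompt, n, seed=0):
    """Sample n responses and return the one with the highest safety score."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=safety_score)


def toy_generate(prompt, rng):
    """Hypothetical model: refuses an attack prompt with probability 0.3."""
    return "REFUSE" if rng.random() < 0.3 else "COMPLY"


def toy_safety_score(response):
    """Hypothetical judge: a refusal is safe, anything else is not."""
    return 1.0 if response == "REFUSE" else 0.0


def safe_rate(n, trials=200):
    """Share of trials in which best-of-n returns a safe response."""
    hits = sum(
        best_of_n(toy_generate, toy_safety_score, "attack", n, seed=t) == "REFUSE"
        for t in range(trials)
    )
    return hits / trials


rate_n1 = safe_rate(1)
rate_n8 = safe_rate(8)
```

Because a safe response only needs to appear once among the N samples, the safe rate cannot decrease as N grows in this setup. That is the sense in which improvement with larger N suggests the model already holds the relevant knowledge but does not elicit it reliably in a single sample.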

[NLP-25] Simulating the Emergence of Differential Case Marking with Communicating Neural-Network Agents

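[Quick read]: This paper investigates how Differential Case Marking (DCM) emerges in an artificial language. It uses a multi-agent reinforcement learning framework in which neural-network agents first learn an artificial language and then interact communicatively. Learning alone does not produce DCM, whereas differential use of markers arises once the agents communicate. This supports the findings of Smith and Culbertson (2020) on the critical role of communication in shaping DCM, and shows the potential of neural-agent models to complement experimental research on language evolution.

Link: https://arxiv.org/abs/2502.04038
Authors: Yuchen Lian,Arianna Bisazza,Tessa Verhoef
Affiliations: Faculty of Electronic and Information Engineering, Xi’an Jiaotong University; Leiden Institute of Advanced Computer Science, Leiden University; Center for Language and Cognition, University of Groningen
Categories: Computation and Language (cs.CL)
Comments: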

Abstract:Differential Case Marking (DCM) refers to the phenomenon where grammatical case marking is applied selectively based on semantic, pragmatic, or other factors. The emergence of DCM has been studied in artificial language learning experiments with human participants, which were specifically aimed at disentangling the effects of learning from those of communication (Smith & Culbertson, 2020). Multi-agent reinforcement learning frameworks based on neural networks have gained significant interest to simulate the emergence of human-like linguistic phenomena. In this study, we employ such a framework in which agents first acquire an artificial language before engaging in communicative interactions, enabling direct comparisons to human results. Using a very generic communication optimization algorithm and neural-network learners that have no prior experience with language or semantic preferences, our results demonstrate that learning alone does not lead to DCM, but when agents communicate, differential use of markers arises. This supports Smith and Culbertson (2020)'s findings that highlight the critical role of communication in shaping DCM and showcases the potential of neural-agent models to complement experimental research on language evolution.

[NLP-26] Exploring Imbalanced Annotations for Effective In-Context Learning

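[Quick read]: This paper addresses the significant performance drop that large language models (LLMs) suffer in in-context learning (ICL) on downstream tasks when the annotated dataset has an imbalanced class distribution, an issue that existing rebalancing methods fail to fix. The key is to decompose the distributional difference between the annotated and test datasets into two components, class-wise weights and conditional bias, to estimate the conditional bias by minimizing the empirical error on a balanced validation dataset, and to use the two-component weights to modify the original scoring function during selection. This prevents selecting too many demonstrations from a single class while preserving the effectiveness of the original selection method.

Link: https://arxiv.org/abs/2502.04037
Authors: Hongfu Gao,Feipeng Zhang,Hao Zeng,Deyu Meng,Bingyi Jing,Hongxin Wei
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: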

Abstract:Large language models (LLMs) have shown impressive performance on downstream tasks through in-context learning (ICL), which heavily relies on the demonstrations selected from annotated datasets. Existing selection methods may hinge on the distribution of annotated datasets, which can often be long-tailed in real-world scenarios. In this work, we show that imbalanced class distributions in annotated datasets significantly degrade the performance of ICL across various tasks and selection methods. Moreover, traditional rebalance methods fail to ameliorate the issue of class imbalance in ICL. Our method is motivated by decomposing the distributional differences between annotated and test datasets into two-component weights: class-wise weights and conditional bias. The key idea behind our method is to estimate the conditional bias by minimizing the empirical error on a balanced validation dataset and to employ the two-component weights to modify the original scoring functions during selection. Our approach can prevent selecting too many demonstrations from a single class while preserving the effectiveness of the original selection methods. Extensive experiments demonstrate the effectiveness of our method, improving the average accuracy by up to 5.46 on common benchmarks with imbalanced datasets.

[NLP-27] Quantification of Biodiversity from Historical Survey Text with LLM-based Best-Worst Scaling

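[Quick read]: This paper evaluates methods for determining species frequency via quantity estimation from historical survey text. The key is to frame the problem as a regression task using Best-Worst Scaling (BWS) with large language models (LLMs). DeepSeek-V3 and GPT-4 agree reasonably well with humans and with each other. The approach is more cost-effective than a fine-grained multi-class approach and similarly robust, enabling automated quantity estimation across species.

Link: https://arxiv.org/abs/2502.04022
Authors: Thomas Haider,Tobias Perschl,Malte Rehbein
Affiliations: University of Passau
Categories: Computation and Language (cs.CL)
Comments: NoDaLiDa 2025, EcoNLP Workshop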

Abstract:In this study, we evaluate methods to determine the frequency of species via quantity estimation from historical survey text. To that end, we formulate classification tasks and finally show that this problem can be adequately framed as a regression task using Best-Worst Scaling (BWS) with Large Language Models (LLMs). We test Ministral-8B, DeepSeek-V3, and GPT-4, finding that the latter two have reasonable agreement with humans and each other. We conclude that this approach is more cost-effective and similarly robust compared to a fine-grained multi-class approach, allowing automated quantity estimation across species.
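
Best-Worst Scaling turns repeated "pick the best and the worst item in this tuple" judgments into a real-valued score per item, which is what makes a regression framing possible. The sketch below uses the common counting estimator (times chosen best minus times chosen worst, divided by appearances). The text identifiers `t1` to `t4` and the judgments are invented for illustration, and whether the paper uses exactly this estimator is an assumption.

```python
from collections import defaultdict


def bws_scores(judgments):
    """Map each item to (best count - worst count) / appearances, a value in [-1, 1].

    judgments: iterable of (items, best_item, worst_item) tuples.
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    seen = defaultdict(int)
    for items, best_item, worst_item in judgments:
        for item in items:
            seen[item] += 1
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Hypothetical judgments: each tuple holds four survey-text snippets (by id), with the
# snippet judged to describe the highest quantity and the one judged to describe the lowest.
judgments = [
    (("t1", "t2", "t3", "t4"), "t1", "t4"),
    (("t1", "t2", "t3", "t4"), "t1", "t4"),
    (("t1", "t2", "t3", "t4"), "t2", "t3"),
]
scores = bws_scores(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In an LLM-based setup, each judgment would come from prompting the model with one tuple of texts; the counting step stays the same regardless of who or what produces the judgments.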

[NLP-28] Ontology-Guided Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering

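[Quick read]: This paper addresses the inability of existing Knowledge Graph Question Answering (KGQA) systems to transfer efficiently to unseen knowledge graphs (KGs). It proposes OntoSCPrompt, a two-stage architecture that separates semantic parsing from KG-dependent interactions: it first generates a SPARQL query structure (SPARQL keywords such as SELECT, ASK, and WHERE plus placeholders for missing tokens) and then fills in KG-specific information. An ontology-guided hybrid prompt learning strategy, which combines discrete and continuous vectors, improves understanding of the underlying KG, and several task-specific decoding strategies ensure that the generated SPARQL queries are correct and executable at each stage. Experiments show that OntoSCPrompt performs well on multiple KGQA datasets without retraining and generalizes well to unseen domain-specific KGs.

Link: https://arxiv.org/abs/2502.03992
Authors: Longquan Jiang,Junbo Huang,Cedric Möller,Ricardo Usbeck
Affiliations: Department of Computer Science, University of Hamburg, Germany; Institute for Information Systems, Leuphana University Lüneburg, Germany
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted By ICSC 2025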

Abstract:Most existing Knowledge Graph Question Answering (KGQA) approaches are designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the heterogeneity of the underlying graph schema, topology and assertions, most KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without resource-intensive training data. We present OntoSCPrompt, a novel Large Language Model (LLM)-based KGQA approach with a two-stage architecture that separates semantic parsing from KG-dependent interactions. OntoSCPrompt first generates a SPARQL query structure (including SPARQL keywords such as SELECT, ASK, WHERE and placeholders for missing tokens) and then fills them with KG-specific information. To enhance the understanding of the underlying KG, we present an ontology-guided, hybrid prompt learning strategy that integrates KG ontology into the learning process of hybrid prompts (e.g., discrete and continuous vectors). We also present several task-specific decoding strategies to ensure the correctness and executability of generated SPARQL queries in both stages. Experimental results demonstrate that OntoSCPrompt performs as well as SOTA approaches without retraining on a number of KGQA datasets such as CWQ, WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG. Code: this https URL
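
The two-stage idea, a KG-independent query skeleton followed by KG-specific filling, can be pictured with plain string templates. This is a toy sketch only: the placeholder tokens `[ent]` and `[rel]` and the `fill_skeleton` helper are hypothetical, the paper's actual placeholder vocabulary is not given in the abstract, and in the real system both stages are produced by a language model rather than by a lookup table.

```python
def fill_skeleton(skeleton, bindings):
    """Replace each placeholder token in a SPARQL skeleton with a KG-specific term."""
    query = skeleton
    for placeholder, value in bindings.items():
        query = query.replace(placeholder, value)
    return query


# Stage 1 (KG-independent): a query structure with placeholders for missing tokens.
skeleton = "SELECT ?x WHERE { [ent] [rel] ?x }"

# Stage 2 (KG-dependent): bindings for Wikidata, asking for the population of Berlin.
wikidata_bindings = {"[ent]": "wd:Q64", "[rel]": "wdt:P1082"}
query = fill_skeleton(skeleton, wikidata_bindings)
```

The appeal of the split is that the same skeleton could be reused against a different knowledge graph by swapping only the stage-2 bindings.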

[NLP-29] PGB: One-Shot Pruning for BERT via Weight Grouping and Permutation

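[Quick read]: This paper addresses the slow inference and high memory usage of large pretrained language models such as BERT, which stem from their size. It proposes Permutation and Grouping for BERT (PGB), a novel semi-structured one-shot pruning method that identifies important groups of weights by permutation and prunes all other weights as a structure in both the multi-head attention and feed-forward layers. If no important group forms in a layer, the entire layer is dropped. This yields high compression efficiency and sparsity while preserving accuracy.

Link: https://arxiv.org/abs/2502.03984
Authors: Hyemin Lim,Jaeyeon Lee,Dong-Wan Choi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: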

Abstract:Large pretrained language models such as BERT suffer from slow inference and high memory usage, due to their huge size. Recent approaches to compressing BERT rely on iterative pruning and knowledge distillation, which, however, are often too complicated and computationally intensive. This paper proposes a novel semi-structured one-shot pruning method for BERT, called Permutation and Grouping for BERT (PGB), which achieves high compression efficiency and sparsity while preserving accuracy. To this end, PGB identifies important groups of individual weights by permutation and prunes all other weights as a structure in both multi-head attention and feed-forward layers. Furthermore, if no important group is formed in a particular layer, PGB drops the entire layer to produce an even more compact model. Our experimental results on BERT$_{\text{BASE}}$ demonstrate that PGB outperforms the state-of-the-art structured pruning methods in terms of computational cost and accuracy preservation.

[NLP-30] MAQInstruct: Instruction-based Unified Event Relation Extraction WWW2025

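[Quick read]: This paper addresses the challenges that instruction-tuned large language models face in event relation extraction, namely the vast number of inference samples and the non-sequential nature of relations between events. It proposes MAQInstruct, an improved instruction-based event relation extraction framework. First, the task is changed from extracting event relations given event-event instructions to selecting events given event-relation instructions, which reduces the number of samples required for inference. Second, a bipartite matching loss reduces the dependence of the instruction-based method on the generation sequence.

Link: https://arxiv.org/abs/2502.03954
Authors: Jun Xu,Mengshu Sun,Zhiqiang Zhang,Jun Zhou
Affiliations: Ant Group, Hangzhou, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by WWW 2025 short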

Abstract:Extracting event relations that deviate from known schemas has proven challenging for previous methods based on multi-class classification, MASK prediction, or prototype matching. Recent advancements in large language models have shown impressive performance through instruction tuning. Nevertheless, in the task of event relation extraction, instruction-based methods face several challenges: there are a vast number of inference samples, and the relations between events are non-sequential. To tackle these challenges, we present an improved instruction-based event relation extraction framework named MAQInstruct. Firstly, we transform the task from extracting event relations using given event-event instructions to selecting events using given event-relation instructions, which reduces the number of samples required for inference. Then, by incorporating a bipartite matching loss, we reduce the dependency of the instruction-based method on the generation sequence. Our experimental results demonstrate that MAQInstruct significantly improves the performance of event relation extraction across multiple LLMs.
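
A bipartite matching loss scores a set of predictions against a set of gold items under the best one-to-one assignment, so the order in which items are generated stops mattering. The sketch below is a minimal, brute-force illustration with a 0/1 mismatch cost. The function names and event ids are hypothetical, and the paper's actual loss operates on model outputs rather than on string labels.

```python
from itertools import permutations


def bipartite_matching_loss(predicted, gold, cost):
    """Minimum total cost over all one-to-one assignments of predictions to gold items.

    Brute force over permutations, so only suitable for small sets of equal length.
    """
    if len(predicted) != len(gold):
        raise ValueError("this sketch assumes equally sized sets")
    return min(
        sum(cost(p, gold[j]) for p, j in zip(predicted, perm))
        for perm in permutations(range(len(gold)))
    )


def mismatch(predicted_item, gold_item):
    """Toy cost: 0 for an exact match, 1 otherwise."""
    return 0.0 if predicted_item == gold_item else 1.0


gold = ["e1", "e2", "e3"]
loss_same_order = bipartite_matching_loss(["e1", "e2", "e3"], gold, mismatch)
loss_shuffled = bipartite_matching_loss(["e3", "e1", "e2"], gold, mismatch)
loss_one_wrong = bipartite_matching_loss(["e3", "e9", "e2"], gold, mismatch)
```

The correctly predicted but reordered set incurs the same zero loss as the in-order one, while a genuinely wrong event costs 1. For larger sets, an assignment solver such as the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) would replace the permutation search.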

[NLP-31] Enhancing Online Learning Efficiency Through Heterogeneous Resource Integration with a Multi-Agent RAG System

[Quick read]: This paper addresses the need for efficient access to diverse resources (videos, code repositories, documentation, and general web content) in online learning. It develops a Multi-Agent Retrieval-Augmented Generation (RAG) system in which agents specialized for particular resource types (YouTube tutorials, GitHub repositories, documentation websites, and search engines) automate the retrieval and synthesis of relevant information, reducing manual effort and improving learning efficiency.

Link: https://arxiv.org/abs/2502.03948
Authors: Devansh Srivastav,Hasan Md Tusfiqur Alam,Afsaneh Asaei,Mahmoud Fazeli,Tanisha Sharma,Daniel Sonntag
Affiliations: German Research Center for Artificial Intelligence (DFKI); Saarbrücken, Germany; Saarland University; Digital Product School of UnternehmerTUM (Center for Innovation and Business Creation at Technical University of Munich); Munich, Germany; CROWDCONSULTANTS; Berlin, Germany; Microsoft; Hyderabad, India; University of Oldenburg; Oldenburg, Germany
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:Efficient online learning requires seamless access to diverse resources such as videos, code repositories, documentation, and general web content. This poster paper introduces early-stage work on a Multi-Agent Retrieval-Augmented Generation (RAG) System designed to enhance learning efficiency by integrating these heterogeneous resources. Using specialized agents tailored for specific resource types (e.g., YouTube tutorials, GitHub repositories, documentation websites, and search engines), the system automates the retrieval and synthesis of relevant information. By streamlining the process of finding and combining knowledge, this approach reduces manual effort and enhances the learning experience. A preliminary user study confirmed the system’s strong usability and moderate-high utility, demonstrating its potential to improve the efficiency of knowledge acquisition.

[NLP-32] Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond

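[Quick read]: This paper addresses the weak performance of automatic speech recognition (ASR) systems on African-accented English conversations. It introduces Afrispeech-Dialog, a dataset of 50 simulated medical and non-medical African-accented English conversations, and uses it to evaluate state-of-the-art ASR and speaker diarization systems on long-form accented speech against native accents, revealing a performance degradation of more than 10%. It also examines how well large language models (LLMs) summarize medical conversations, showing the impact of ASR errors on downstream medical summaries and offering insight into the challenges and opportunities for speech technologies in the Global South.

Link: https://arxiv.org/abs/2502.03945
Authors: Mardhiyah Sanni,Tassallah Abdullahi,Devendra D. Kayande,Emmanuel Ayodele,Naome A. Etori,Michael S. Mollel,Moshood Yekini,Chibuzor Okocha,Lukman E. Ismaila,Folafunmi Omofoye,Boluwatife A. Adewale,Tobi Olatunji
Affiliations: Intron(BioRAMP); BioRAMP; Brown University; Indian Institute of Information Technology Allahabad; University of Minnesota-Twin Cities; University of Glasgow; University of Florida; Johns Hopkins University; University of North Carolina at Chapel Hill; Georgia Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 5 figures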

Abstract:Speech technologies are transforming interactions across various sectors, from healthcare to call centers and robots, yet their performance on African-accented conversations remains underexplored. We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations, designed to evaluate automatic speech recognition (ASR) and related technologies. We assess state-of-the-art (SOTA) speaker diarization and ASR systems on long-form, accented speech, comparing their performance with native accents and discover a 10%+ performance degradation. Additionally, we explore medical conversation summarization capabilities of large language models (LLMs) to demonstrate the impact of ASR errors on downstream medical summaries, providing insights into the challenges and opportunities for speech technologies in the Global South. Our work highlights the need for more inclusive datasets to advance conversational AI in low-resource settings.

[NLP-33] Experiments with Large Language Models on Retrieval-Augmented Generation for Closed-Source Simulation Software

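[Quick read]: This paper addresses the difficulty large language models (LLMs) have with closed-source simulation software, whose internal knowledge is kept private to protect intellectual property or data privacy. It applies Retrieval-Augmented Generation (RAG), which combines external information retrieval with model generation, to improve LLMs' knowledge of the closed-source simulation software and their ability to create simulation models. The first experiments show promising results for RAG on tasks involving closed-source simulation software, but also reveal gaps in the applied information and open questions for further research.

Link: https://arxiv.org/abs/2502.03916
Authors: Andreas Baumann,Peter Eberhard
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 tables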

Abstract:Large Language Models (LLMs) are increasingly helpful in text generation, even writing code in programming languages based on user prompts written in natural language. They are even applied to generate simulation models for multibody systems from natural language. Research results suggest that LLMs surpass the mere replication of existing code examples, where some LLMs have been trained on an open-source multibody simulation code. However, for closed-source simulation software, such results are not to be expected as their ideas and concepts might differ from other publicly available ones. LLMs can hallucinate for knowledge-intensive tasks, such as model creation, which can lead to wrong responses. This is especially the case for the LLM unknown closed-source simulation software. The same applies to other internal knowledge kept private to protect intellectual property or data privacy. The Retrieval-Augmented Generation (RAG) approach might yield a solution for these knowledge-intensive tasks. This paper explores the application of RAG to closed-source simulation software and presents first experiments. After a brief introduction to LLMs, the RAG approach, and the simulation method applied by the close-source simulation software, several examples are provided to test LLMs’ knowledge of the simulation software and the creation of simulation models using two RAG systems. The examples show promising results indicating the benefits of applying RAG systems to closed-source simulation software, helping to access their knowledge. Nevertheless, they also reveal gaps in the applied information and open questions for further research.

[NLP-34] BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation

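[Quick read]: This paper addresses how to give large language models (LLMs) a long chain-of-thought (LongCoT) capability without knowledge distillation or expensive human annotation. The proposed approach, BOLT, bootstraps LongCoT from a standard instruct model in three stages: 1) LongCoT data bootstrapping with in-context learning; 2) LongCoT supervised finetuning; 3) online training to further refine LongCoT capacities. Only a few in-context examples are needed (10 in the paper's experiments), and the method is applied to models of several scales with impressive performance on a variety of benchmarks.

Link: https://arxiv.org/abs/2502.03860
Authors: Bo Pang,Hanze Dong,Jiacheng Xu,Silvio Savarese,Yingbo Zhou,Caiming Xiong
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 36 pages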

Abstract:Large language models (LLMs), such as o1 from OpenAI, have demonstrated remarkable reasoning capabilities. o1 generates a long chain-of-thought (LongCoT) before answering a question. LongCoT allows LLMs to analyze problems, devise plans, reflect, and backtrack effectively. These actions empower LLM to solve complex problems. After the release of o1, many teams have attempted to replicate its LongCoT and reasoning capabilities. In terms of methods, they primarily rely on knowledge distillation with data from existing models with LongCoT capacities (e.g., OpenAI-o1, Qwen-QwQ, DeepSeek-R1-Preview), leaving significant uncertainties on systematically developing such reasoning abilities. In terms of data domains, these works focus narrowly on math while a few others include coding, limiting their generalizability. This paper introduces a novel approach to enable LLM’s LongCoT capacity without distillation from o1-like models or expensive human annotations, where we bootstrap LongCoT (BOLT) from a standard instruct model. BOLT involves three stages: 1) LongCoT data bootstrapping with in-context learning on a standard instruct model; 2) LongCoT supervised finetuning; 3) online training to further refine LongCoT capacities. In BOLT, only a few in-context examples need to be constructed during the bootstrapping stage; in our experiments, we created 10 examples, demonstrating the feasibility of this approach. We use Llama-3.1-70B-Instruct to bootstrap LongCoT and apply our method to various model scales (7B, 8B, 70B). We achieve impressive performance on a variety of benchmarks, Arena-Hard, MT-Bench, WildBench, ZebraLogic, MATH500, which evaluate diverse task-solving and reasoning capabilities.

[NLP-35] Improving Natural Language Understanding for LLMs via Large-Scale Instruction Synthesis AAAI2025

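[Quick read]: This paper addresses the shortage of high-quality, large-scale instructions for natural language understanding (NLU). Existing instructions focus mainly on information extraction (IE) and neglect tasks such as machine reading comprehension, question answering, and text classification. The resulting lack of data diversity reduces the generalization of trained large language models (LLMs) to other NLU tasks and noticeably weakens the base model's general capabilities. The paper proposes Hum, a large-scale, high-quality synthetic instruction corpus covering IE (closed or open), machine reading comprehension, text classification, and instruction generalist tasks, to strengthen the NLU capabilities of LLMs. The key is a human-LLM collaborative mechanism for synthesizing instructions, which increases instruction diversity.

Link: https://arxiv.org/abs/2502.03843
Authors: Lin Yuan,Jun Xu,Honghao Gui,Mengshu Sun,Zhiqiang Zhang,Lei Liang,Jun Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2025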

Abstract:High-quality, large-scale instructions are crucial for aligning large language models (LLMs), however, there is a severe shortage of instruction in the field of natural language understanding (NLU). Previous works on constructing NLU instructions mainly focus on information extraction (IE), neglecting tasks such as machine reading comprehension, question answering, and text classification. Furthermore, the lack of diversity in the data has led to a decreased generalization ability of trained LLMs in other NLU tasks and a noticeable decline in the fundamental model’s general capabilities. To address this issue, we propose Hum, a large-scale, high-quality synthetic instruction corpus for NLU tasks, designed to enhance the NLU capabilities of LLMs. Specifically, Hum includes IE (either close IE or open IE), machine reading comprehension, text classification, and instruction generalist tasks, thereby enriching task diversity. Additionally, we introduce a human-LLMs collaborative mechanism to synthesize instructions, which enriches instruction diversity by incorporating guidelines, preference rules, and format variants. We conduct extensive experiments on 5 NLU tasks and 28 general capability evaluation datasets for LLMs. Experimental results show that Hum enhances the NLU capabilities of six LLMs by an average of 3.1%, with no significant decline observed in other general capabilities.

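[NLP-36] A comprehensive survey of contemporary Arabic sentiment analysis: Methods, Challenges, and Future Directions NAACL2025

[Quick read]: This paper surveys Arabic sentiment analysis, identifying the challenges and limitations of the existing literature and outlining directions for future research. Its core is a systematic review of Arabic sentiment analysis methods based on deep learning. It then situates Arabic sentiment analysis within the broader field to highlight research gaps relative to general sentiment analysis, and closes with the main challenges and promising future directions.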

Link: https://arxiv.org/abs/2502.03827
Authors: Zhiqiang Shi, Ruchit Agrawal
Affiliations: University of Edinburgh; University of Birmingham
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Paper accepted to NAACL 2025

Abstract:Sentiment Analysis, a popular subtask of Natural Language Processing, employs computational methods to extract sentiment, opinions, and other subjective aspects from linguistic data. Given its crucial role in understanding human sentiment, research in sentiment analysis has witnessed significant growth in the recent years. However, the majority of approaches are aimed at the English language, and research towards Arabic sentiment analysis remains relatively unexplored. This paper presents a comprehensive and contemporary survey of Arabic Sentiment Analysis, identifies the challenges and limitations of existing literature in this field and presents avenues for future research. We present a systematic review of Arabic sentiment analysis methods, focusing specifically on research utilizing deep learning. We then situate Arabic Sentiment Analysis within the broader context, highlighting research gaps in Arabic sentiment analysis as compared to general sentiment analysis. Finally, we outline the main challenges and promising future directions for research in Arabic sentiment analysis.

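[NLP-37] Syntriever: How to Train Your Retriever with Synthetic Data from LLMs NAACL

[Quick read]: This paper addresses how to distill the knowledge of large language models (LLMs) into information retrieval systems when output probabilities are unavailable, as with black-box LLMs. The key is Syntriever, a training framework that trains retrievers on synthetic data from black-box LLMs in two stages. In the distillation stage, chain-of-thought prompting is used to synthesize relevant passages, plausibly irrelevant passages, and augmented queries; the LLM self-verifies the synthetic data to reduce hallucinations, and the retriever is trained with a loss that clusters the embeddings of relevant passages. In the alignment stage, a partial Plackett-Luce ranking model learns the LLM's preferences, with regularization that keeps the model from deviating excessively from the one trained in the distillation stage. Experiments show that Syntriever achieves state-of-the-art nDCG@K on benchmark datasets from various domains.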

Link: https://arxiv.org/abs/2502.03824
Authors: Minsang Kim, Seungjun Baek
Affiliations: Korea University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to Findings of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL)

Abstract:LLMs have boosted progress in many AI applications. Recently, there were attempts to distill the vast knowledge of LLMs into information retrieval systems. Those distillation methods mostly use output probabilities of LLMs which are unavailable in the latest black-box LLMs. We propose Syntriever, a training framework for retrievers using synthetic data from black-box LLMs. Syntriever consists of two stages. Firstly in the distillation stage, we synthesize relevant and plausibly irrelevant passages and augmented queries using chain-of-thoughts for the given queries. LLM is asked to self-verify the synthetic data for possible hallucinations, after which retrievers are trained with a loss designed to cluster the embeddings of relevant passages. Secondly in the alignment stage, we align the retriever with the preferences of LLMs. We propose a preference modeling called partial Plackett-Luce ranking to learn LLM preferences with regularization which prevents the model from deviating excessively from that trained in the distillation stage. Experiments show that Syntriever achieves state-of-the-art performances on benchmark datasets from various domains in nDCG@K. The code is available at this https URL.
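
To make the ranking objective more concrete, here is a minimal sketch of the standard (full) Plackett-Luce negative log-likelihood. It is not the paper's partial variant, and it omits the regularizer, neither of which the abstract specifies. The function name and the idea of feeding retriever scores ordered by LLM preference are assumptions made for illustration.

```python
import math

def plackett_luce_nll(scores_in_preference_order):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    scores_in_preference_order: hypothetical retriever scores, listed from
    the most preferred passage to the least preferred one.
    """
    nll = 0.0
    n = len(scores_in_preference_order)
    for i in range(n - 1):  # the last position is forced, so it contributes 0
        remaining = scores_in_preference_order[i:]
        m = max(remaining)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in remaining))
        nll -= scores_in_preference_order[i] - log_z
    return nll
```

The loss is small when the scores decrease along the preferred order and large when they contradict it, which is the signal a retriever would be trained on.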

[NLP-38] PsyPlay: Personality-Infused Role-Playing Conversational Agents

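[Quick read]: This paper addresses the fact that existing research on Role-Playing Conversational Agents (RPCAs) built on Large Language Models (LLMs) focuses mainly on imitating specific speaking styles and using character backgrounds, while neglecting deeper personality traits. The key is PsyPlay, a dialogue generation framework that lets multiple LLM agents express rich personalities and consistently exhibit their designated traits throughout a dialogue. Validation on generated dialogue data shows that PsyPlay portrays the intended personality traits with an overall success rate of 80.31% on GPT-3.5, and that LLMs aligned with positive values portray positive personality roles more successfully than negative ones. The paper also builds PsyPlay-Bench, a corpus of 4745 correctly portrayed dialogues, to support research on personalized role-playing and dialogue personality detection.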

Link: https://arxiv.org/abs/2502.03821
Authors: Tao Yang, Yuhua Zhu, Xiaojun Quan, Cong Liu, Qifan Wang
Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, China; Meta AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The current research on Role-Playing Conversational Agents (RPCAs) with Large Language Models (LLMs) primarily focuses on imitating specific speaking styles and utilizing character backgrounds, neglecting the depiction of deeper personality traits. In this study, we introduce personality-infused role-playing for LLM agents, which encourages agents to accurately portray their designated personality traits during dialogues. We then propose PsyPlay, a dialogue generation framework that facilitates the expression of rich personalities among multiple LLM agents. Specifically, PsyPlay enables agents to assume roles with distinct personality traits and engage in discussions centered around specific topics, consistently exhibiting their designated personality traits throughout the interactions. Validation on generated dialogue data demonstrates that PsyPlay can accurately portray the intended personality traits, achieving an overall success rate of 80.31% on GPT-3.5. Notably, we observe that LLMs aligned with positive values are more successful in portraying positive personality roles compared to negative ones. Moreover, we construct a dialogue corpus for personality-infused role-playing, called PsyPlay-Bench. The corpus, which consists of 4745 instances of correctly portrayed dialogues using PsyPlay, aims to further facilitate research in personalized role-playing and dialogue personality detection.

[NLP-39] Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective

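[Quick read]: This paper addresses the high storage and runtime costs of large language models in long-sequence inference, which stem from the Transformer architecture's reliance on self-attention and, in particular, from the large Key-Value (KV) cache. Existing approaches prune less critical cache entries based on attention weights, but they are empirical and lack formal grounding. The key is a formal analysis of attention output perturbation, which shows that, beyond attention weights, the value states within KV entries and the pretrained parameter matrices also matter. Building on this, the paper proposes a perturbation-constrained selection algorithm that optimizes the worst-case output perturbation to identify critical entries. The algorithm improves state-of-the-art cache eviction methods and achieves lower output perturbation in over 92% of attention heads in the Llama model.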

Link: https://arxiv.org/abs/2502.03805
Authors: Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S Kevin Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture’s reliance on self-attention, particularly the large Key-Value (KV) cache for long-sequence inference. Recent efforts to reduce KV cache size by pruning less critical entries based on attention weights remain empirical and lack formal grounding. This paper presents a formal study on identifying critical KV cache entries by analyzing attention output perturbation. Our analysis reveals that, beyond attention weights, the value states within KV entries and pretrained parameter matrices are also crucial. Based on this, we propose a perturbation-constrained selection algorithm that optimizes the worst-case output perturbation to identify critical entries. Evaluations on the Needle-in-a-Haystack test and Longbench benchmark show our algorithm enhances state-of-the-art cache eviction methods. Further empirical analysis confirms that our algorithm achieves lower output perturbations in over 92% attention heads in Llama model, thereby providing a significant improvement over existing methods.
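
The claim that value states matter alongside attention weights can be illustrated with a toy single-head example. This is a sketch under stated assumptions: the scoring rule "attention weight times L1 norm of the value vector" is an illustrative heuristic, not the paper's perturbation-constrained algorithm, and all names and numbers are hypothetical.

```python
def attention_output(weights, values, keep):
    """Attention output restricted to the kept entries, with weights renormalised."""
    total = sum(weights[i] for i in keep)
    dim = len(values[0])
    return [sum(weights[i] * values[i][d] for i in keep) / total for d in range(dim)]

def perturbation(weights, values, keep):
    """L1 distance between the full attention output and the output after eviction."""
    full = attention_output(weights, values, range(len(weights)))
    kept = attention_output(weights, values, keep)
    return sum(abs(f - k) for f, k in zip(full, kept))

def top_k(scores, k):
    """Indices of the k highest-scoring cache entries, in ascending index order."""
    return sorted(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])

# Toy cache: the third entry has a small attention weight but a large value vector.
weights = [0.5, 0.3, 0.2]
values = [[0.1, 0.0], [0.1, 0.0], [5.0, 5.0]]

by_weight = top_k(weights, 2)
by_weight_and_value = top_k(
    [w * sum(abs(x) for x in v) for w, v in zip(weights, values)], 2
)
print(perturbation(weights, values, by_weight))
print(perturbation(weights, values, by_weight_and_value))
```

In this toy case, selecting by attention weight alone evicts the entry with the large value vector and perturbs the output more than the value-aware selection does.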

[NLP-40] Enhancing Hallucination Detection through Noise Injection

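[Quick read]: This paper addresses the tendency of large language models (LLMs) to generate plausible but incorrect responses (hallucinations), whose reliable detection is crucial for safe deployment. Prior work detects hallucinations by measuring the dispersion of answers across samples drawn from the model's token distribution; the paper argues that this sampling is sub-optimal for detection. The key is to account for model uncertainty in the Bayesian sense by perturbing an appropriate subset of model parameters, or equivalently hidden unit activations, during sampling, which significantly improves detection. The approach is effective across a wide range of datasets and model architectures.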

Link: https://arxiv.org/abs/2502.03799
Authors: Litian Liu, Reza Pourreza, Sunny Panchal, Apratim Bhattacharyya, Yao Qin, Roland Memisevic
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Systems and Control (eess.SY)
Comments:

Abstract:Large Language Models (LLMs) are prone to generating plausible yet incorrect responses, known as hallucinations. Effectively detecting hallucinations is therefore crucial for the safe deployment of LLMs. Recent research has linked hallucinations to model uncertainty, suggesting that hallucinations can be detected by measuring dispersion over answer distributions obtained from a set of samples drawn from a model. While drawing from the distribution over tokens defined by the model is a natural way to obtain samples, in this work, we argue that it is sub-optimal for the purpose of detecting hallucinations. We show that detection can be improved significantly by taking into account model uncertainty in the Bayesian sense. To this end, we propose a very simple and efficient approach that perturbs an appropriate subset of model parameters, or equivalently hidden unit activations, during sampling. We demonstrate its effectiveness across a wide range of datasets and model architectures.
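
The idea of measuring answer dispersion under injected noise can be sketched with a toy stand-in for a model. Everything here is hypothetical: `toy_model` is a one-unit threshold function rather than an LLM, and uniform noise on a single activation is an assumption. The sketch only illustrates that answers near a decision boundary scatter under perturbation while confident answers do not.

```python
import math
import random
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def toy_model(hidden, noise_scale, rng):
    """Hypothetical stand-in for an LLM: thresholds one noisy hidden activation."""
    perturbed = hidden + rng.uniform(-noise_scale, noise_scale)
    return "A" if perturbed > 0 else "B"

rng = random.Random(0)
confident = [toy_model(2.0, 0.5, rng) for _ in range(100)]
borderline = [toy_model(0.05, 0.5, rng) for _ in range(100)]
print(answer_entropy(confident), answer_entropy(borderline))
```

A detector built this way would flag an answer as a likely hallucination when the entropy exceeds some threshold; choosing that threshold is outside the scope of this sketch.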

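[NLP-41] It's All in The [MASK]: Simple Instruction-Tuning Enables BERT-like Masked Language Models As Generative Classifiers

[Quick read]: This paper addresses the conventional reliance of encoder-only models (such as BERT and ModernBERT) on task-specific classification heads, which can limit their applicability compared with decoder-based large language models (LLMs). The key is ModernBERT-Large-Instruct, a 0.4B-parameter encoder model that uses its masked language modelling (MLM) head for generative classification. The approach uses an intentionally simple training loop and inference mechanism, with no heavy pre-processing, heavily engineered prompting, or architectural modifications, yet shows strong zero-shot performance and outperforms similarly sized LLMs on MMLU.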

Link: https://arxiv.org/abs/2502.03793
Authors: Benjamin Clavié, Nathan Cooper, Benjamin Warner
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:While encoder-only models such as BERT and ModernBERT are ubiquitous in real-world NLP applications, their conventional reliance on task-specific classification heads can limit their applicability compared to decoder-based large language models (LLMs). In this work, we introduce ModernBERT-Large-Instruct, a 0.4B-parameter encoder model that leverages its masked language modelling (MLM) head for generative classification. Our approach employs an intentionally simple training loop and inference mechanism that requires no heavy pre-processing, heavily engineered prompting, or architectural modifications. ModernBERT-Large-Instruct exhibits strong zero-shot performance on both classification and knowledge-based tasks, outperforming similarly sized LLMs on MMLU and achieving 93% of Llama3-1B’s MMLU performance with 60% less parameters. We also demonstrate that, when fine-tuned, the generative approach using the MLM head matches or even surpasses traditional classification-head methods across diverse NLU tasks. This capability emerges specifically in models trained on contemporary, diverse data mixes, with models trained on lower volume, less-diverse data yielding considerably weaker performance. Although preliminary, these results demonstrate the potential of using the original generative masked language modelling head over traditional task-specific heads for downstream tasks. Our work suggests that further exploration into this area is warranted, highlighting many avenues for future improvements.
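
As a rough illustration of generative classification with an MLM head, the sketch below picks a label by comparing the logits of candidate answer tokens at the [MASK] position. The logits and token ids are made-up values, and the function is a hypothetical helper rather than the paper's implementation; in practice the logits would come from the model's MLM head.

```python
def classify_from_mask_logits(mask_logits, label_token_ids):
    """Pick the label whose answer token has the highest logit at the [MASK] position.

    mask_logits: vocabulary-sized list of logits at the masked position.
    label_token_ids: mapping from label name to the id of its answer token.
    """
    return max(label_token_ids, key=lambda label: mask_logits[label_token_ids[label]])

# Toy vocabulary of four tokens; ids 1 and 3 are the candidate answer tokens.
example_logits = [0.1, 2.3, -1.0, 1.7]
example_labels = {"positive": 1, "negative": 3}
print(classify_from_mask_logits(example_logits, example_labels))
```

Because the decision reuses the pretrained vocabulary head, no task-specific classification layer needs to be added.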

[NLP-42] Adaptive Semantic Prompt Caching with VectorQ

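[Quick read]: This paper addresses the poor behavior of static similarity thresholds in semantic prompt caches. Existing systems use one fixed threshold to decide whether a similarity score is high enough for a cache hit, and this one-size-fits-all threshold is insufficient across different prompts. The key is VectorQ, a framework that learns embedding-specific threshold regions that adapt to the complexity and uncertainty of each embedding, yielding up to 12x higher cache hit rates and error rate reductions of up to 92%.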

Link: https://arxiv.org/abs/2502.03771
Authors: Luis Gaspar Schroeder, Shu Liu, Alejandro Cuadron, Mark Zhao, Stephan Krusche, Alfons Kemper, Matei Zaharia, Joseph E. Gonzalez
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Semantic prompt caches reduce the latency and cost of large language model (LLM) inference by reusing cached LLM-generated responses for semantically similar prompts. Vector similarity metrics assign a numerical score to quantify the similarity between an embedded prompt and its nearest neighbor in the cache. Existing systems rely on a static threshold to classify whether the similarity score is sufficiently high to result in a cache hit. We show that this one-size-fits-all threshold is insufficient across different prompts. We propose VectorQ, a framework to learn embedding-specific threshold regions that adapt to the complexity and uncertainty of an embedding. Through evaluations on a combination of four diverse datasets, we show that VectorQ consistently outperforms state-of-the-art systems across all static thresholds, achieving up to 12x increases in cache hit rate and error rate reductions up to 92%.
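
The difference between one static threshold and per-embedding thresholds can be sketched with a tiny in-memory cache. This is an illustration only: the class, its methods, and especially the feedback rule in `report_wrong_hit` are hypothetical, since the abstract does not describe how VectorQ learns its threshold regions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    def __init__(self, default_threshold=0.9):
        self.entries = []  # each entry: [embedding, response, threshold]
        self.default_threshold = default_threshold

    def add(self, embedding, response):
        self.entries.append([embedding, response, self.default_threshold])

    def _nearest(self, embedding):
        return max(self.entries, key=lambda e: cosine(embedding, e[0]))

    def lookup(self, embedding):
        """Return the cached response if the nearest entry clears its own threshold."""
        if not self.entries:
            return None
        best = self._nearest(embedding)
        return best[1] if cosine(embedding, best[0]) >= best[2] else None

    def report_wrong_hit(self, embedding):
        """Hypothetical feedback rule: lift the nearest entry's threshold above the bad score."""
        best = self._nearest(embedding)
        best[2] = max(best[2], cosine(embedding, best[0]) + 1e-6)

cache = SemanticCache(default_threshold=0.9)
cache.add([1.0, 0.0], "cached answer")
first = cache.lookup([1.0, 0.1])      # hit under the default threshold
cache.report_wrong_hit([1.0, 0.1])    # the hit turned out to be wrong
second = cache.lookup([1.0, 0.1])     # now a miss for this entry
exact = cache.lookup([1.0, 0.0])      # an exact match still hits
```

Each entry carries its own threshold, so tightening one entry after a wrong hit does not affect how the rest of the cache behaves.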

[NLP-43] Hierarchical Contextual Manifold Alignment for Structuring Latent Representations in Large Language Models

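[Quick read]: This paper addresses how the organization of latent token representations affects the stability, generalization, and contextual consistency of language models. Conventional embedding refinement typically relies on parameter modifications, which add computational overhead. The paper proposes a hierarchical alignment method that restructures token embeddings without altering core model weights, keeping representational distributions coherent across linguistic contexts. The key is that the method preserves computational efficiency while improving rare token retrieval, adversarial robustness, and long-range dependency tracking; it also makes token proximity relationships more consistent and language generation more interpretable.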

Link: https://arxiv.org/abs/2502.03766
Authors: Meiquan Dong, Haoran Liu, Yan Huang, Zixuan Feng, Jianhong Tang, Ruoxi Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The organization of latent token representations plays a crucial role in determining the stability, generalization, and contextual consistency of language models, yet conventional approaches to embedding refinement often rely on parameter modifications that introduce additional computational overhead. A hierarchical alignment method was introduced to restructure token embeddings without altering core model weights, ensuring that representational distributions maintained coherence across different linguistic contexts. Experimental evaluations demonstrated improvements in rare token retrieval, adversarial robustness, and long-range dependency tracking, highlighting the advantages of hierarchical structuring in mitigating inconsistencies in latent space organization. The comparative analysis against conventional fine-tuning and embedding perturbation methods revealed that hierarchical restructuring maintained computational efficiency while achieving measurable gains in representation quality. Structural refinements introduced through the alignment process resulted in improved contextual stability across varied linguistic tasks, reducing inconsistencies in token proximity relationships and enhancing interpretability in language generation. A detailed computational assessment confirmed that the realignment process introduced minimal inference overhead, ensuring that representational improvements did not compromise model efficiency. The findings reinforced the broader significance of structured representation learning, illustrating that hierarchical embedding modifications could serve as an effective strategy for refining latent space distributions while preserving pre-learned semantic associations.

[NLP-44] Rethinking the Residual Distribution of Locate-then-Editing Methods in Model Editing

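[Quick read]: This paper addresses the degradation of original knowledge that occurs when locate-then-edit methods update the knowledge of large language models (LLMs). It argues that the cause is residual distribution, that is, spreading the residual evenly across the critical layers, which introduces editing errors and leads to inaccurate edits. To address this, the authors propose the Boundary Layer UpdatE (BLUE) strategy to enhance locate-then-edit methods. In sequential batch editing experiments on three LLMs and two datasets, BLUE delivers an average performance improvement of 35.59% and better preserves the LLMs' general capabilities.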

Link: https://arxiv.org/abs/2502.03748
Authors: Xiaopeng Li, Shanwen Wang, Shasha Li, Shezheng Song, Bin Ji, Jun Ma, Jie Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Preprint

Abstract:Model editing is a powerful technique for updating the knowledge of Large Language Models (LLMs). Locate-then-edit methods are a popular class of approaches that first identify the critical layers storing knowledge, then compute the residual of the last critical layer based on the edited knowledge, and finally perform multi-layer updates using a least-squares solution by evenly distributing the residual from the first critical layer to the last. Although these methods achieve promising results, they have been shown to degrade the original knowledge of LLMs. We argue that residual distribution leads to this issue. To explore this, we conduct a comprehensive analysis of residual distribution in locate-then-edit methods from both empirical and theoretical perspectives, revealing that residual distribution introduces editing errors, leading to inaccurate edits. To address this issue, we propose the Boundary Layer UpdatE (BLUE) strategy to enhance locate-then-edit methods. Sequential batch editing experiments on three LLMs and two datasets demonstrate that BLUE not only delivers an average performance improvement of 35.59%, significantly advancing the state of the art in model editing, but also enhances the preservation of LLMs’ general capabilities. Our code is available at this https URL.
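
The "even distribution of the residual" that the abstract criticizes can be sketched as follows. This is an idealized reading, not the paper's code: it assumes each layer's update removes exactly its share from the remaining residual, and the function name is hypothetical. It does not implement BLUE, whose details the abstract does not give.

```python
def distribute_residual(residual, num_layers):
    """Split a residual vector across critical layers, first layer to last.

    At layer i the remaining residual is divided by the number of layers
    still to be updated. Idealizing assumption: each update removes exactly
    its share, so every layer ends up with an equal portion.
    """
    remaining = list(residual)
    shares = []
    for i in range(num_layers):
        share = [r / (num_layers - i) for r in remaining]
        shares.append(share)
        remaining = [r - s for r, s in zip(remaining, share)]
    return shares

print(distribute_residual([1.0, -2.0], 4))
```

Under this idealization every critical layer receives the same fraction of the residual; the paper's argument is that, in practice, this spreading introduces editing errors.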

[NLP-45] MultiQA: An Analysis in Measuring Robustness via Automated Crowdsourcing of Question Perturbations and Answers AAAI2025

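[Quick read]: This paper addresses the propensity of Large Language Models (LLMs) to hallucinate in generated responses, a critical obstacle to institutional adoption. The key is MultiQA, a systematic approach for evaluating the robustness and consistency of LLM-generated answers. MultiQA crowdsources question perturbations and their answers at scale through independent LLM agents. The experiments examine 1.9 million question perturbations and 2.3 million answers and show that ensembled LLMs, such as gpt-3.5-turbo, remain relatively robust and consistent under perturbation. MultiQA offers an effective way to inspect disagreement and variability, and a potential framework for institutional LLM adoption that can measure confidence and consistency and quantify hallucinations.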

Link: https://arxiv.org/abs/2502.03711
Authors: Nicole Cho, William Watson
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: AAAI 2025 Workshop on Preventing and Detecting LLM Misinformation (PDLM) (Oral)

Abstract:One critical challenge in the institutional adoption journey of Large Language Models (LLMs) stems from their propensity to hallucinate in generated responses. To address this, we propose MultiQA, a systematic approach for evaluating the robustness and consistency of LLM-generated answers. We demonstrate MultiQA’s ability to crowdsource question perturbations and their respective answers through independent LLM agents at scale. Our experiments culminated in the examination of 1.9 million question perturbations and 2.3 million answers. Furthermore, MultiQA shows that ensembled LLMs, such as gpt-3.5-turbo, remain relatively robust and consistent under perturbations. MultiQA provides clarity in the response generation space, offering an effective method for inspecting disagreements and variability. Therefore, our system offers a potential framework for institutional LLM adoption with the ability to measure confidence, consistency, and the quantification of hallucinations.

[NLP-46] Aggregate and conquer: detecting and steering LLM concepts by combining nonlinear predictors over multiple layers

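[Quick read]: This paper addresses how to detect and steer semantic concepts inside a Large Language Model (LLM). Its innovations are (1) a nonlinear feature learning method that identifies important linear directions for predicting concepts from each layer, and (2) aggregation of features across layers to build powerful concept detectors and steering mechanisms. The approach attains state-of-the-art results for detecting hallucinations, harmfulness, toxicity, and untruthful content on seven benchmarks. It also steers LLMs toward new concepts, including semantic disambiguation, human languages, programming languages, hallucinated responses, science subjects, and poetic/Shakespearean English, and can even steer several concepts simultaneously. In addition, the method can steer concepts with numerical attributes, such as product reviews.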

Link: https://arxiv.org/abs/2502.03708
Authors: Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Mikhail Belkin
Affiliations: Broad Institute of MIT and Harvard; UC San Diego; Harvard SEAS; MIT Mathematics; Halıcıoğlu Data Science Institute; Harvard CMSA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:A trained Large Language Model (LLM) contains much of human knowledge. Yet, it is difficult to gauge the extent or accuracy of that knowledge, as LLMs do not always “know what they know” and may even be actively misleading. In this work, we give a general method for detecting semantic concepts in the internal activations of LLMs. Furthermore, we show that our methodology can be easily adapted to steer LLMs toward desirable outputs. Our innovations are the following: (1) we use a nonlinear feature learning method to identify important linear directions for predicting concepts from each layer; (2) we aggregate features across layers to build powerful concept detectors and steering mechanisms. We showcase the power of our approach by attaining state-of-the-art results for detecting hallucinations, harmfulness, toxicity, and untruthful content on seven benchmarks. We highlight the generality of our approach by steering LLMs towards new concepts that, to the best of our knowledge, have not been previously considered in the literature, including: semantic disambiguation, human languages, programming languages, hallucinated responses, science subjects, poetic/Shakespearean English, and even multiple concepts simultaneously. Moreover, our method can steer concepts with numerical attributes such as product reviews. We provide our code (including a simple API for our methods) at this https URL.

[NLP-47] LLM Alignment as Retriever Optimization: An Information Retrieval Perspective

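[Quick read]: This paper addresses the effective alignment of large language models (LLMs) so that they behave correctly, trustworthily, and ethically. Existing reinforcement learning (RL)-based alignment methods are complex, and direct optimization offers a simpler alternative. The key is a new direct optimization approach that draws on Information Retrieval (IR) principles: the authors map LLM generation and reward models onto IR's retriever-reranker paradigm and, on that basis, propose LLM Alignment as Retriever Preference Optimization (LarPO). Experiments show averaged improvements of 38.9% on AlpacaEval2 and 13.7% on MixEval-Hard.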

Link: https://arxiv.org/abs/2502.03699
Authors: Bowen Jin, Jinsung Yoon, Zhen Qin, Ziqi Wang, Wei Xiong, Yu Meng, Jiawei Han, Sercan O. Arik
Affiliations: Google; Google DeepMind; University of Illinois at Urbana-Champaign; University of Virginia; Google Cloud
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 26 pages

Abstract:Large Language Models (LLMs) have revolutionized artificial intelligence with capabilities in reasoning, coding, and communication, driving innovation across industries. Their true potential depends on effective alignment to ensure correct, trustworthy and ethical behavior, addressing challenges like misinformation, hallucinations, bias and misuse. While existing Reinforcement Learning (RL)-based alignment methods are notoriously complex, direct optimization approaches offer a simpler alternative. In this work, we introduce a novel direct optimization approach for LLM alignment by drawing on established Information Retrieval (IR) principles. We present a systematic framework that bridges LLM alignment and IR methodologies, mapping LLM generation and reward models to IR’s retriever-reranker paradigm. Building on this foundation, we propose LLM Alignment as Retriever Preference Optimization (LarPO), a new alignment method that enhances overall alignment quality. Extensive experiments validate LarPO’s effectiveness with 38.9% and 13.7% averaged improvement on AlpacaEval2 and MixEval-Hard respectively. Our work opens new avenues for advancing LLM alignment by integrating IR foundations, offering a promising direction for future research.

[NLP-48] DocMIA: Document-Level Membership Inference Attacks against DocVQA Models ICLR2025

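[Quick read]: This paper addresses privacy leakage from training Document Visual Question Answering (DocVQA) models, specifically the threat of membership inference attacks. The key is two novel membership inference attacks: one for a white-box setting, where the attacker has full access to the model architecture and parameters, and one for a black-box setting, where only the model's outputs are available. Without relying on auxiliary datasets, these unsupervised attacks determine whether a particular record was part of the model's training data and outperform existing state-of-the-art attacks, exposing the privacy risks of DocVQA models.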

Link: https://arxiv.org/abs/2502.03692
Authors: Khanh Nguyen, Raouf Kerkouche, Mario Fritz, Dimosthenis Karatzas
Affiliations: Computer Vision Center, Universitat Autònoma de Barcelona; CISPA Helmholtz Center for Information Security
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: ICLR 2025

Abstract:Document Visual Question Answering (DocVQA) has introduced a new paradigm for end-to-end document understanding, and quickly became one of the standard benchmarks for multimodal LLMs. Automating document processing workflows, driven by DocVQA models, presents significant potential for many business sectors. However, documents tend to contain highly sensitive information, raising concerns about privacy risks associated with training such DocVQA models. One significant privacy vulnerability, exploited by the membership inference attack, is the possibility for an adversary to determine if a particular record was part of the model’s training data. In this paper, we introduce two novel membership inference attacks tailored specifically to DocVQA models. These attacks are designed for two different adversarial scenarios: a white-box setting, where the attacker has full access to the model architecture and parameters, and a black-box setting, where only the model’s outputs are available. Notably, our attacks assume the adversary lacks access to auxiliary datasets, which is more realistic in practice but also more challenging. Our unsupervised methods outperform existing state-of-the-art membership inference attacks across a variety of DocVQA models and datasets, demonstrating their effectiveness and highlighting the privacy risks in this domain.

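[NLP-49] A Comparison of DeepSeek and Other LLMs

[Quick read]: This paper compares DeepSeek with other Large Language Models (LLMs) on the task of predicting an outcome from a short text. It considers two settings: authorship classification, where the goal is to determine whether a short text was written by a human or by AI, and citation classification, where the goal is to classify a citation into one of four types using its textual content. DeepSeek is compared with Claude, Gemini, GPT, and Llama. In classification accuracy, DeepSeek outperforms Gemini, GPT, and Llama in most cases but underperforms Claude; it is slower than the others but cheap to use. The paper also presents a fully labeled dataset and a recipe that uses LLMs and the MADStat dataset to generate new datasets for future studies.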

Link: https://arxiv.org/abs/2502.03688
Authors: Tianchen Gao, Jiashun Jin, Zheng Tracy Ke, Gabriel Moryoussef
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages, 5 figures, 6 tables

Abstract:Recently, DeepSeek has been the focus of attention in and beyond the AI community. An interesting problem is how DeepSeek compares to other large language models (LLMs). There are many tasks an LLM can do, and in this paper, we use the task of predicting an outcome using a short text for comparison. We consider two settings, an authorship classification setting and a citation classification setting. In the first one, the goal is to determine whether a short text is written by human or AI. In the second one, the goal is to classify a citation to one of four types using the textual content. For each experiment, we compare DeepSeek with 4 popular LLMs: Claude, Gemini, GPT, and Llama. We find that, in terms of classification accuracy, DeepSeek outperforms Gemini, GPT, and Llama in most cases, but underperforms Claude. We also find that DeepSeek is comparably slower than others but with a low cost to use, while Claude is much more expensive than all the others. Finally, we find that in terms of similarity, the output of DeepSeek is most similar to those of Gemini and Claude (and among all 5 LLMs, Claude and Gemini have the most similar outputs). In this paper, we also present a fully-labeled dataset collected by ourselves, and propose a recipe where we can use the LLMs and a recent data set, MADStat, to generate new data sets. The datasets in our paper can be used as benchmarks for future study on LLMs.

[NLP-50] Controlled LLM Decoding via Discrete Auto-regressive Biasing

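[Quick read]: This paper addresses the difficulty of balancing fluency and constraint satisfaction in controlled text generation. Energy-based decoding methods struggle with this balance, which the paper attributes to sampling in continuous space rather than in the natural discrete space of text tokens. The key is Discrete Auto-regressive Biasing, a controlled decoding algorithm that leverages gradients while operating entirely in the discrete text domain. It defines a joint distribution over the generated sequence and an auxiliary bias sequence and samples from it with a Langevin-within-Gibbs algorithm based on gradient-based discrete MCMC. The method significantly improves constraint satisfaction while maintaining comparable or better fluency, at lower computational cost.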

Link: https://arxiv.org/abs/2502.03685
Authors: Patrick Pynadath, Ruqi Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Controlled text generation allows for enforcing user-defined constraints on large language model outputs, an increasingly important field as LLMs become more prevalent in everyday life. One common approach uses energy-based decoding, which defines a target distribution through an energy function that combines multiple constraints into a weighted average. However, these methods often struggle to balance fluency with constraint satisfaction, even with extensive tuning of the energy function’s coefficients. In this paper, we identify that this suboptimal balance arises from sampling in continuous space rather than the natural discrete space of text tokens. To address this, we propose Discrete Auto-regressive Biasing, a controlled decoding algorithm that leverages gradients while operating entirely in the discrete text domain. Specifically, we introduce a new formulation for controlled text generation by defining a joint distribution over the generated sequence and an auxiliary bias sequence. To efficiently sample from this joint distribution, we propose a Langevin-within-Gibbs sampling algorithm using gradient-based discrete MCMC. Our method significantly improves constraint satisfaction while maintaining comparable or better fluency, all with even lower computational costs. We demonstrate the advantages of our controlled decoding method on sentiment control, language detoxification, and keyword-guided generation.

[NLP-51] Reflection-Window Decoding: Text Generation with Selective Refinement

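[Quick read]: This paper addresses the inherent sub-optimality of autoregressive decoding in large language models (LLMs), which has no built-in mechanism to refine or correct generated content. The key is a sliding reflection window combined with a pausing criterion, so that refinement and generation can be carried out interchangeably as decoding proceeds, balancing efficiency and optimality.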

Link: https://arxiv.org/abs/2502.03678
Authors: Zeyu Tang, Zhenhao Chen, Loka Li, Xiangchen Song, Yunlong Deng, Yifan Shen, Guangyi Chen, Peter Spirtes, Kun Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The autoregressive decoding for text generation in large language models (LLMs), while widely used, is inherently suboptimal due to the lack of a built-in mechanism to perform refinement and/or correction of the generated content. In this paper, we consider optimality in terms of the joint probability over the generated response, when jointly considering all tokens at the same time. We theoretically characterize the potential deviation of the autoregressively generated response from its globally optimal counterpart that is of the same length. Our analysis suggests that we need to be cautious when noticeable uncertainty arises during text generation, which may signal the sub-optimality of the generation history. To address the pitfall of autoregressive decoding for text generation, we propose an approach that incorporates a sliding reflection window and a pausing criterion, such that refinement and generation can be carried out interchangeably as the decoding proceeds. Our selective refinement framework strikes a balance between efficiency and optimality, and our extensive experimental results demonstrate the effectiveness of our approach.

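[NLP-52] Advancing Reasoning in Large Language Models: Promising Methods and Approaches

[Quick read]: This paper surveys techniques for improving reasoning in large language models (LLMs), whose complex reasoning often falls short of human expectations. It categorizes emerging methods into prompting strategies (e.g., Chain-of-Thought reasoning, Self-Consistency, and Tree-of-Thought reasoning), architectural innovations (e.g., retrieval-augmented models, modular reasoning networks, and neuro-symbolic integration), and learning paradigms (e.g., fine-tuning on reasoning-specific datasets, reinforcement learning, and self-supervised reasoning objectives). By synthesizing recent advances, it aims to inform future research and practical applications of reasoning-augmented LLMs.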

Link: https://arxiv.org/abs/2502.03671
Authors: Avinash Patil
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 Pages, 1 Figure, IEEE Format

Abstract:Large Language Models (LLMs) have succeeded remarkably in various natural language processing (NLP) tasks, yet their reasoning capabilities remain a fundamental challenge. While LLMs exhibit impressive fluency and factual recall, their ability to perform complex reasoning (spanning logical deduction, mathematical problem-solving, commonsense inference, and multi-step reasoning) often falls short of human expectations. This survey provides a comprehensive review of emerging techniques enhancing reasoning in LLMs. We categorize existing methods into key approaches, including prompting strategies (e.g., Chain-of-Thought reasoning, Self-Consistency, and Tree-of-Thought reasoning), architectural innovations (e.g., retrieval-augmented models, modular reasoning networks, and neuro-symbolic integration), and learning paradigms (e.g., fine-tuning with reasoning-specific datasets, reinforcement learning, and self-supervised reasoning objectives). Additionally, we explore evaluation frameworks used to assess reasoning in LLMs and highlight open challenges, such as hallucinations, robustness, and reasoning generalization across diverse tasks. By synthesizing recent advancements, this survey aims to provide insights into promising directions for future research and practical applications of reasoning-augmented LLMs.
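
Of the prompting strategies listed, Self-Consistency is the simplest to show in code: sample several reasoning paths and keep the most frequent final answer. The sketch below covers only the voting step; the sampled answers are made-up strings, and the function name is a hypothetical choice.

```python
from collections import Counter

def self_consistency(final_answers):
    """Return the most frequent final answer among sampled reasoning paths."""
    return Counter(final_answers).most_common(1)[0][0]

# Final answers extracted from five hypothetical sampled reasoning paths.
sampled = ["18", "18", "20", "18", "17"]
print(self_consistency(sampled))
```

In a real pipeline, each string would be the final answer parsed from one sampled Chain-of-Thought completion.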

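[NLP-53] Looking for the Inner Music: Probing LLMs' Understanding of Literary Style

[Quick read]: This paper examines how well large language models (LLMs) identify authorship and distinguish novel genres, and how they capture stylistic features. It probes one high-performing LLM for style-defining features with three methods: direct syntactic ablations of the input text and two analyses of model internals. The results indicate that authorial style is easier to define than genre-level style and is more affected by minor syntactic decisions and contextual word usage, while some traits, such as pronoun usage and word order, matter for both kinds of literary style.

Link: https://arxiv.org/abs/2502.03647
Authors: Rebecca M. M. Hicke,David Mimno
Affiliations: Cornell University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: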

Abstract:Recent work has demonstrated that language models can be trained to identify the author of much shorter literary passages than has been thought feasible for traditional stylometry. We replicate these results for authorship and extend them to a new dataset measuring novel genre. We find that LLMs are able to distinguish authorship and genre, but they do so in different ways. Some models seem to rely more on memorization, while others benefit more from training to learn author/genre characteristics. We then use three methods to probe one high-performing LLM for features that define style. These include direct syntactic ablations to input text as well as two methods that look at model internals. We find that authorial style is easier to define than genre-level style and is more impacted by minor syntactic decisions and contextual word usage. However, some traits like pronoun usage and word order prove significant for defining both kinds of literary style.

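[NLP-54] Context-Preserving Gradient Modulation for Large Language Models: A Novel Approach to Semantic Consistency in Long-Form Text Generation

[Quick read]: This paper addresses the problem of maintaining semantic consistency in long-form text generation, in particular preventing contextual drift and coherence degradation over extended sequences. The key contribution is a gradient modulation approach: a modulation function dynamically adjusts parameter updates, selectively amplifying or attenuating gradients based on learned contextual dependencies, so that generated text stays aligned with prior discourse without significant computational overhead.

Link: https://arxiv.org/abs/2502.03643
Authors: Nirola Kobanov,Edmund Weatherstone,Zachary Vanderpoel,Orlando Wetherby
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: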

Abstract:Maintaining semantic consistency over extended text sequences remains a fundamental challenge in long-form text generation, where conventional training methodologies often struggle to prevent contextual drift and coherence degradation. A novel gradient modulation approach is introduced, designed to adjust parameter updates dynamically in response to contextual relevance, ensuring that generated text remains aligned with prior discourse. By integrating a modulation function that selectively amplifies or attenuates gradients based on learned contextual dependencies, the proposed method enhances the stability of model-generated narratives without imposing significant computational overhead. Comparative evaluations against baseline models reveal improvements in coherence, contextual retention, and long-range dependency tracking, demonstrating the effectiveness of modifying the learning process at the gradient level. The results indicate that sentence structure variability and lexical diversity benefit from this approach, mitigating repetitive phrasing and improving adaptability across diverse linguistic contexts. Statistical validation of coherence metrics further substantiates the observed enhancements, with a significant reduction in inconsistencies emerging as a direct consequence of the modulation mechanism. Computational efficiency assessments confirm that the framework achieves these gains without requiring substantial modifications to the underlying architecture, ensuring compatibility with existing optimization workflows.

[NLP-55] REALEDIT: Reddit Edits As a Large-scale Empirical Dataset for Image Transformations

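[Quick read]: This paper addresses the gap between existing image editing models and real-world demands. Although these models excel on academic benchmarks, they have yet to be widely adopted for real user needs. The key contribution is REALEDIT, a large-scale image editing dataset of authentic user requests and human-made edits, which improves the realism and diversity of training data. Training the REALEDIT model on 48K training examples yields substantial gains: it outperforms competitors by up to 165 Elo points in human judgment and achieves a 92% relative improvement on the automated VIEScore metric. REALEDIT also shows potential for broader applications such as detecting edited images.

Link: https://arxiv.org/abs/2502.03629
Authors: Peter Sushko,Ayana Bharadwaj,Zhi Yang Lim,Vasily Ilin,Ben Caffee,Dongping Chen,Mohammadreza Salehi,Cheng-Yu Hsieh,Ranjay Krishna
Affiliations: University of Washington
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: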

Abstract:Existing image editing models struggle to meet real-world demands. Despite excelling in academic benchmarks, they have yet to be widely adopted for real user needs. Datasets that power these models use artificial edits, lacking the scale and ecological validity necessary to address the true diversity of user requests. We introduce REALEDIT, a large-scale image editing dataset with authentic user requests and human-made edits sourced from Reddit. REALEDIT includes a test set of 9300 examples to evaluate models on real user requests. Our results show that existing models fall short on these tasks, highlighting the need for realistic training data. To address this, we introduce 48K training examples and train our REALEDIT model, achieving substantial gains - outperforming competitors by up to 165 Elo points in human judgment and 92 percent relative improvement on the automated VIEScore metric. We deploy our model on Reddit, testing it on new requests, and receive positive feedback. Beyond image editing, we explore REALEDIT’s potential in detecting edited images by partnering with a deepfake detection non-profit. Finetuning their model on REALEDIT data improves its F1-score by 14 percentage points, underscoring the dataset’s value for broad applications.
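
To give a feel for what a 165-point Elo gap means, the sketch below converts a rating gap into an expected head-to-head win rate. It assumes the conventional logistic Elo formula with a 400-point scale; the abstract does not say which Elo variant the authors used, so both the formula and the helper name `elo_expected_score` are assumptions for illustration only.

```python
def elo_expected_score(rating_gap, scale=400.0):
    """Expected score of the higher-rated side under the standard Elo model.

    Assumption: the conventional logistic curve with a 400-point scale.
    The paper may compute its Elo ratings differently.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# A 165-point gap under this assumption is a preference rate of roughly 72%.
print(round(elo_expected_score(165), 3))
```

Under this assumption, a gap of zero corresponds to a 50% preference rate, and the curve flattens as the gap grows.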

[NLP-56] Sorting the Babble in Babel: Assessing the Performance of Language Detection Algorithms on the OpenAlex Database

[Quick read]: This paper aims to improve the quality of language metadata in the OpenAlex database by designing, applying, and evaluating language classification procedures built on recent, efficient automatic language detection algorithms. The key finding is that the best procedure depends on the use case: where precision is preferred, applying the LangID algorithm to article titles, abstracts, and journal names gives the best results; where recall matters more or processing time is a consideration, applying the FastSpell algorithm to article titles only performs best.

Link: https://arxiv.org/abs/2502.03627
Authors: Maxime Holmberg Sainte-Marie,Diego Kozlowski,Lucía Céspedes,Vincent Larivière
Affiliations: Chaire UNESCO sur la science ouverte, École de Bibliothéconomie et des Sciences de l’Information, Université de Montréal; Consortium Érudit, Université de Montréal; Observatoire des Sciences et des Technologies, Centre Interuniversitaire de recherche sur la science et la technologie, Université du Québec à Montréal
Categories: Computation and Language (cs.CL)
Comments: 33 pages, 4 figures

Abstract:Following a recent study on the quality of OpenAlex linguistic metadata (Céspedes et al., 2025), the present paper aims to optimize the latter through the design, use, and evaluation of various linguistic classification procedures based on the latest and most efficient automatic language detection algorithms. Starting from a multilingual set of manually-annotated samples of articles indexed in the database, different classification procedures are then designed, based on the application of a set of language detection algorithms on a series of corpora generated from different combinations of textual metadata of indexed articles. At sample level first, the performance of these different procedures for each of the main languages in the database is evaluated in terms of precision, recall, and processing time. Then, overall procedure performance is estimated at the database level by means of a probabilistic simulation of harmonically aggregated and weighted scores. Results show that procedure performance strongly depends on the importance given to each of the measures implemented: for contexts where precision is preferred, using the LangID algorithm on article titles, abstracts as well as journal names gives the best results; however, for all cases where recall is considered at least slightly more important than precision or as soon as processing times are given any kind of consideration, use of the FastSpell algorithm on article titles only outperforms all other alternatives. Given the lack of truly multilingual, large-scale bibliographic databases, it is hoped that these results help confirm and foster the unparalleled potential of the OpenAlex database for cross-linguistic, bibliometric-based research and analysis.
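
The abstract's point that the best procedure depends on how precision and recall are weighted can be illustrated with the standard F-beta score, a weighted harmonic mean of the two. This is only a sketch: the paper's own aggregation also involves processing time and a probabilistic simulation, neither of which is reproduced here, and the precision and recall values below are made up.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (standard F-beta).

    beta > 1 weights recall more heavily; beta < 1 favors precision.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical procedures with mirrored (made-up) scores.
high_precision = (0.9, 0.6)
high_recall = (0.6, 0.9)

for beta in (0.5, 1.0, 2.0):
    a = f_beta(*high_precision, beta=beta)
    b = f_beta(*high_recall, beta=beta)
    print(beta, round(a, 3), round(b, 3))
```

With these illustrative numbers the ranking of the two procedures flips as beta moves from below 1 to above 1, which is the kind of weight dependence the abstract describes.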

[NLP-57] Can Cross Encoders Produce Useful Sentence Embeddings?

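[Quick read]: This paper addresses a limitation of cross encoders (CEs) in information retrieval pipelines: because they require sentence pairs at inference, they are usually used only as re-rankers. The key finding is that embeddings from earlier layers of CEs can be used within an information retrieval pipeline, which makes it possible to distill a lighter-weight dual encoder (DE) with a 5.15x inference speedup.

Link: https://arxiv.org/abs/2502.03552
Authors: Haritha Ananthakrishnan,Julian Dolby,Harsha Kokel,Horst Samulowitz,Kavitha Srinivas
Affiliations: IBM
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: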

Abstract:Cross encoders (CEs) are trained with sentence pairs to detect relatedness. As CEs require sentence pairs at inference, the prevailing view is that they can only be used as re-rankers in information retrieval pipelines. Dual encoders (DEs) are instead used to embed sentences, where sentence pairs are encoded by two separate encoders with shared weights at training, and a loss function that ensures the pair’s embeddings lie close in vector space if the sentences are related. DEs however, require much larger datasets to train, and are less accurate than CEs. We report a curious finding that embeddings from earlier layers of CEs can in fact be used within an information retrieval pipeline. We show how to exploit CEs to distill a lighter-weight DE, with a 5.15x speedup in inference time.

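[NLP-58] An Empirical Exploration of ChatGPT's Ability to Support Problem Formulation Tasks for Mission Engineering and a Documentation of its Performance Variability

[Quick read]: This paper examines how well large language models (LLMs) support problem formulation in mission engineering (ME), focusing on stakeholder identification. It runs multiple parallel attempts and evaluates their quality and variability to assess ChatGPT-3.5 on a NASA space mission design challenge. The key findings are that the LLM identifies human-focused stakeholders well but recognizes external systems and environmental factors poorly, and that it struggles to preserve the desired level of abstraction, tending toward solution-specific outputs that are inappropriate for problem formulation. Outputs also vary greatly across attempts, so LLM outputs should be used with caution, ideally under a stochastic view of their abilities. Overall, ChatGPT could reduce some expert workload, but its lack of consistency and domain understanding may limit its reliability for problem formulation tasks.

Link: https://arxiv.org/abs/2502.03511
Authors: Max Ofsa,Taylan G. Topcu
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 3 figures, submitted to Conference on Systems Engineering Research (CSER)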

Abstract:Systems engineering (SE) is evolving with the availability of generative artificial intelligence (AI) and the demand for a systems-of-systems perspective, formalized under the purview of mission engineering (ME) in the US Department of Defense. Formulating ME problems is challenging because they are open-ended exercises that involve translation of ill-defined problems into well-defined ones that are amenable for engineering development. It remains to be seen to which extent AI could assist problem formulation objectives. To that end, this paper explores the quality and consistency of multi-purpose Large Language Models (LLM) in supporting ME problem formulation tasks, specifically focusing on stakeholder identification. We identify a relevant reference problem, a NASA space mission design challenge, and document ChatGPT-3.5’s ability to perform stakeholder identification tasks. We execute multiple parallel attempts and qualitatively evaluate LLM outputs, focusing on both their quality and variability. Our findings portray a nuanced picture. We find that the LLM performs well in identifying human-focused stakeholders but poorly in recognizing external systems and environmental factors, despite explicit efforts to account for these. Additionally, LLMs struggle with preserving the desired level of abstraction and exhibit a tendency to produce solution specific outputs that are inappropriate for problem formulation. More importantly, we document great variability among parallel threads, highlighting that LLM outputs should be used with caution, ideally by adopting a stochastic view of their abilities. Overall, our findings suggest that, while ChatGPT could reduce some expert workload, its lack of consistency and domain understanding may limit its reliability for problem formulation tasks.

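[NLP-59] Teaching Language Models to Critique via Reinforcement Learning

[Quick read]: This paper addresses the limited ability of large language models (LLMs) to critique and refine their own outputs. The key contribution is CTRL, a framework that uses reinforcement learning to train a critic model to generate feedback that maximizes correction performance, without human supervision. The approach significantly improves pass rates and mitigates compounding errors; the trained critics also act as accurate generative reward models and enable test-time scaling through iterative critique-revision, with relative improvements of up to 106.1%.

Link: https://arxiv.org/abs/2502.03492
Authors: Zhihui Xie,Jie chen,Liyu Chen,Weichao Mao,Jingjing Xu,Lingpeng Kong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: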

Abstract:Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose CTRL, a framework for Critic Training via Reinforcement Learning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with CTRL significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1% relative improvements across challenging code generation benchmarks.

[NLP-60] From Informal to Formal – Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs

[Quick read]: This paper addresses how to evaluate and improve large language models (LLMs) on formal verification tasks. It decomposes formal verification into six sub-tasks and distills GPT-4o to build 18k high-quality instruction-response pairs, split into a 14k+ fine-tuning dataset, FM-alpaca, and a 4k benchmark, FM-Bench. The authors find that LLMs write proof segments well when given either the code or a detailed description of the proof steps, and that fine-tuning yields up to a nearly threefold improvement. Fine-tuning on formal data also enhances mathematics, reasoning, and coding abilities.

Link: https://arxiv.org/abs/2501.16207
Authors: Jialun Cao,Yaojie Lu,Meiziniu Li,Haoyang Ma,Haokun Li,Mengda He,Cheng Wen,Le Sun,Hongyu Zhang,Shengchao Qin,Shing-Chi Cheung,Cong Tian
Affiliations: The Hong Kong University of Science and Technology; Institute of Software, Chinese Academy of Sciences; Fermat Labs, Huawei; Guangzhou Institute of Technology, Xidian University; Chongqing University; ICTT and ISN Laboratory, Xidian University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments: 13 pages

Abstract:The research in AI-based formal mathematical reasoning has shown an unstoppable growth trend. These studies have excelled in mathematical competitions like IMO, showing significant progress. However, these studies intertwined multiple skills simultaneously, i.e., problem-solving, reasoning, and writing formal specifications, making it hard to precisely identify the LLMs’ strengths and weaknesses in each task. This paper focuses on formal verification, an immediate application scenario of formal reasoning, and decomposes it into six sub-tasks. We constructed 18k high-quality instruction-response pairs across five mainstream formal specification languages (Coq, Lean4, Dafny, ACSL, and TLA+) in six formal-verification-related tasks by distilling GPT-4o. They are split into a 14k+ fine-tuning dataset FM-alpaca and a 4k benchmark FM-Bench. We found that LLMs are good at writing proof segments when given either the code, or the detailed description of proof steps. Also, the fine-tuning brought about a nearly threefold improvement at most. Interestingly, we observed that fine-tuning with formal data also enhances mathematics, reasoning, and coding abilities. We hope our findings inspire further research. Fine-tuned models are released to facilitate subsequent studies.

[NLP-61] How Should I Build A Benchmark? Revisiting Code-Related Benchmarks For LLMs

[Quick read]: This paper addresses the lack of systematic guidelines for developing code-related benchmarks for large language models (LLMs), guidelines that would ensure benchmark quality, reliability, and reproducibility. The key contribution is How2Bench, a 55-criteria checklist intended to govern the development of code-related benchmarks comprehensively. Applying How2Bench, the authors profiled 274 benchmarks released within the past decade and found concerning issues, including insufficient data quality assurance and incomplete open-sourcing.

Link: https://arxiv.org/abs/2501.10711
Authors: Jialun Cao,Yuk-Kit Chan,Zixuan Ling,Wenxuan Wang,Shuqing Li,Mingwei Liu,Ruixi Qiao,Yuting Han,Chaozheng Wang,Boxi Yu,Pinjia He,Shuai Wang,Zibin Zheng,Michael R. Lyu,Shing-Chi Cheung
Affiliations: The Hong Kong University of Science and Technology; The Chinese University of Hong Kong; Sun Yat-Sen University; Chinese Academy of Science, Institute of Automation; Beijing Language and Culture University; The Chinese University of Hong Kong, Shenzhen
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 42 pages

Abstract:Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios. We refer to them as code-related benchmarks. However, there are no systematic guidelines by which such a benchmark should be developed to ensure its quality, reliability, and reproducibility. We propose How2Bench, which is comprised of a 55-criteria checklist as a set of guidelines to govern the development of code-related benchmarks comprehensively. Using How2Bench, we profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70% of the benchmarks did not take measures for data quality assurance; over 10% did not even open source or only partially open source. Many highly cited benchmarks have loopholes, including duplicated samples, incorrect reference codes/tests/prompts, and unremoved sensitive/confidential information. Finally, we conducted a human study involving 49 participants, which revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency.

[NLP-62] Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

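[Quick read]: This paper addresses the complexity of current state-of-the-art text-to-speech (TTS) systems built on large language models (LLMs), which are often multi-stage. These systems typically require several separate models (for example, a diffusion model after the LLM), which complicates the decision of whether to scale a particular model during training or testing. The key contribution is Llasa, a simple framework that uses a single-layer vector quantizer (VQ) codec and a single Transformer architecture so that it aligns fully with standard LLMs such as Llama. Experiments show that scaling train-time compute for Llasa improves the naturalness of synthesized speech and yields more complex and accurate prosody patterns, while scaling inference-time compute improves emotional expressiveness, timbre consistency, and content accuracy. The authors also release checkpoints and training code for the Llasa TTS models (1B, 3B, 8B) and the codec model.

Link: https://arxiv.org/abs/2502.04128
Authors: Zhen Ye,Xinfa Zhu,Chi-Min Chan,Xinsheng Wang,Xu Tan,Jiahe Lei,Yi Peng,Haohe Liu,Yizhu Jin,Zheqi DAI,Hongzhan Lin,Jianyi Chen,Xingjian Du,Liumeng Xue,Yunlin Chen,Zhifei Li,Lei Xie,Qiuqiang Kong,Yike Guo,Wei Xue
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
Comments: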

Abstract:Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework Llasa for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we released the checkpoint and training code for our TTS model (1B, 3B, 8B) and codec model publicly available.

[NLP-63] DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation

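[Quick read]: This paper addresses autoregressive generation of continuous speech representations without discrete speech tokens. The key contribution is Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework that combines a language model with a diffusion transformer and generates patches with a divide-and-conquer strategy. This significantly improves the efficacy of autoregressive models on continuous tokens and reduces computational demands.

Link: https://arxiv.org/abs/2502.03930
Authors: Dongya Jia,Zhuo Chen,Jiawei Chen,Chenpeng Du,Jian Wu,Jian Cong,Xiaobin Zhuang,Chumin Li,Zhen Wei,Yuping Wang,Yuxuan Wang
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: 16 pages, 8 figures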

Abstract:Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of introducing noise during the reverse diffusion ODE to balance diversity and determinism. We also show in the extensive scaling analysis that DiTAR has superb scalability. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.

Computer Vision

[CV-0] SMART: Advancing Scalable Map Priors for Driving Topology Reasoning ICRA2025

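[Quick read]: This paper addresses the scalability of lane perception and topology reasoning in autonomous driving. Current methods rely on training data captured with consistent sensor configurations, which limits their use across different settings. The key contribution is SMART, which learns a map prior model from standard-definition (SD) maps and satellite maps and thereby removes the dependence on a specific sensor setup. Using only SD and satellite inputs, SMART achieves superior offline lane topology understanding, and it can be integrated seamlessly into existing online topology reasoning methods, improving performance on the OpenLane-V2 benchmark by up to 28%.

Link: https://arxiv.org/abs/2502.04329
Authors: Junjie Ye,David Paz,Hengyuan Zhang,Yuliang Guo,Xinyu Huang,Henrik I. Christensen,Yue Wang,Liu Ren
Affiliations: Bosch Research North America; Thomas Lord Department of Computer Science, University of Southern California; Bosch North America and Bosch Center for AI (BCAI); Contextual Robotics Institute, UC San Diego
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by ICRA 2025. Project page: this https URL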

Abstract:Topology reasoning is crucial for autonomous driving as it enables comprehensive understanding of connectivity and relationships between lanes and traffic elements. While recent approaches have shown success in perceiving driving topology using vehicle-mounted sensors, their scalability is hindered by the reliance on training data captured by consistent sensor configurations. We identify that the key factor in scalable lane perception and topology reasoning is the elimination of this sensor-dependent feature. To address this, we propose SMART, a scalable solution that leverages easily available standard-definition (SD) and satellite maps to learn a map prior model, supervised by large-scale geo-referenced high-definition (HD) maps independent of sensor settings. Attributed to scaled training, SMART alone achieves superior offline lane topology understanding using only SD and satellite inputs. Extensive experiments further demonstrate that SMART can be seamlessly integrated into any online topology reasoning methods, yielding significant improvements of up to 28% on the OpenLane-V2 benchmark.

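[CV-1] WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

[Quick read]: This paper addresses the lack of evaluation benchmarks for multi-modal video understanding. The key contribution is the WorldSense benchmark, which has three features. First, its tasks emphasize a strong coupling of audio and video, so models must make effective use of synergistic omni-modal perception. Second, it contains a diverse collection of 1,662 audio-visual synchronized videos, organized into 8 primary domains and 67 fine-grained subcategories, together with 3,172 multi-choice QA pairs across 26 distinct tasks for comprehensive evaluation. Third, all QA pairs were labeled by 80 expert annotators with multiple rounds of correction to ensure quality.

Link: https://arxiv.org/abs/2502.04326
Authors: Jack Hong,Shilin Yan,Jiayin Cai,Xiaolong Jiang,Yao Hu,Weidi Xie
Affiliations: Xiaohongshu Inc.; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: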

Abstract:In this paper, we introduce WorldSense, the first benchmark to assess the multi-modal video understanding, that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, our WorldSense has several features: (i) collaboration of omni-modality, we design the evaluation tasks to feature a strong coupling of audio and video, requiring models to effectively utilize the synergistic perception of omni-modality; (ii) diversity of videos and tasks, WorldSense encompasses a diverse collection of 1,662 audio-visual synchronised videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover the broad scenarios, and 3,172 multi-choice QA pairs across 26 distinct tasks to enable the comprehensive evaluation; (iii) high-quality annotations, all the QA pairs are manually labeled by 80 expert annotators with multiple rounds of correction to ensure quality. Based on our WorldSense, we extensively evaluate various state-of-the-art models. The experimental results indicate that existing models face significant challenges in understanding real-world scenarios (48.0% best accuracy). We hope our WorldSense can provide a platform for evaluating the ability in constructing and understanding coherent contexts from omni-modality.

[CV-2] ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

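[Quick read]: This paper asks whether the rich representations of multi-modal diffusion transformers (DiTs) have unique properties that enhance their interpretability. The key contribution is ConceptAttention, a method that uses the expressive power of DiT attention layers to produce high-quality saliency maps that precisely locate textual concepts within images. The central finding is that linear projections in the output space of DiT attention layers yield significantly sharper saliency maps than commonly used cross-attention mechanisms, with no additional training required. ConceptAttention also performs strongly on zero-shot image segmentation benchmarks, outperforming 11 other zero-shot interpretability methods.

Link: https://arxiv.org/abs/2502.04320
Authors: Alec Helbling,Tuna Han Salih Meral,Ben Hoover,Pinar Yanardag,Duen Horng Chau
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: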

Abstract:Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention mechanisms. Remarkably, ConceptAttention even achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 11 other zero-shot interpretability methods on the ImageNet-Segmentation dataset and on a single-class subset of PascalVOC. Our work contributes the first evidence that the representations of multi-modal DiT models like Flux are highly transferable to vision tasks like segmentation, even outperforming multi-modal foundation models like CLIP.

[CV-3] sshELF: Single-Shot Hierarchical Extrapolation of Latent Features for 3D Reconstruction from Sparse-Views

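[Quick read]: This paper addresses the reconstruction of unbounded outdoor scenes from sparse outward-facing views, where the main challenge is minimal view overlap. The key contribution is sshELF, a fast single-shot pipeline for sparse-view 3D scene reconstruction via hierarchical extrapolation of latent features. Its central idea is to disentangle information extrapolation from primitive decoding, which allows structural patterns to transfer efficiently across training scenes; a two-stage network design separates virtual view generation from 3D primitive decoding for efficient training and modular design. The method also integrates a pre-trained foundation model for joint inference of latent features and texture, further improving scene understanding and generalization.

Link: https://arxiv.org/abs/2502.04318
Authors: Eyvaz Najafli,Marius Kästingschäfer,Sebastian Bernhard,Thomas Brox,Andreas Geiger
Affiliations: Continental; University of Freiburg; University of Tübingen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Joint first authorship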

Abstract:Reconstructing unbounded outdoor scenes from sparse outward-facing views poses significant challenges due to minimal view overlap. Previous methods often lack cross-scene understanding and their primitive-centric formulations overload local features to compensate for missing global context, resulting in blurriness in unseen parts of the scene. We propose sshELF, a fast, single-shot pipeline for sparse-view 3D scene reconstruction via hierarchical extrapolation of latent features. Our key insight is that disentangling information extrapolation from primitive decoding allows efficient transfer of structural patterns across training scenes. Our method: (1) learns cross-scene priors to generate intermediate virtual views to extrapolate to unobserved regions, (2) offers a two-stage network design separating virtual view generation from 3D primitive decoding for efficient training and modular model design, and (3) integrates a pre-trained foundation model for joint inference of latent features and texture, improving scene understanding and generalization. sshELF can reconstruct 360-degree scenes from six sparse input views and achieves competitive results on synthetic and real-world datasets. We find that sshELF faithfully reconstructs occluded regions, supports real-time rendering, and provides rich latent features for downstream applications. The code will be released.

[CV-4] Factorized Implicit Global Convolution for Automotive Computational Fluid Dynamics Prediction

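[Quick read]: This paper addresses the high computational complexity of analyzing large 3D point clouds in Computational Fluid Dynamics (CFD) for automotive design, where existing deep learning methods struggle with high-resolution 3D data. The key contribution is Factorized Implicit Global Convolution (FIGConv), a new architecture that combines factorized implicit grids to approximate high-resolution domains, efficient global convolutions through 2D reparameterization, and a U-shaped architecture for effective information gathering and integration. This reduces complexity from the O(N^3) of existing models to O(N^2). On the DrivAerNet dataset, the method reaches an R^2 value of 0.95 for drag prediction, a significant gain over the previous state of the art: a 40% improvement in relative mean squared error and a 70% improvement in absolute mean squared error.

Link: https://arxiv.org/abs/2502.04317
Authors: Chris Choy,Alexey Kamenev,Jean Kossaifi,Max Rietmann,Jan Kautz,Kamyar Azizzadenesheli
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: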

Abstract:Computational Fluid Dynamics (CFD) is crucial for automotive design, requiring the analysis of large 3D point clouds to study how vehicle geometry affects pressure fields and drag forces. However, existing deep learning approaches for CFD struggle with the computational complexity of processing high-resolution 3D data. We propose Factorized Implicit Global Convolution (FIGConv), a novel architecture that efficiently solves CFD problems for very large 3D meshes with arbitrary input and output geometries. FIGConv achieves quadratic complexity O(N^2), a significant improvement over existing 3D neural CFD models that require cubic complexity O(N^3). Our approach combines Factorized Implicit Grids to approximate high-resolution domains, efficient global convolutions through 2D reparameterization, and a U-shaped architecture for effective information gathering and integration. We validate our approach on the industry-standard Ahmed body dataset and the large-scale DrivAerNet dataset. In DrivAerNet, our model achieves an R^2 value of 0.95 for drag prediction, outperforming the previous state-of-the-art by a significant margin. This represents a 40% improvement in relative mean squared error and a 70% improvement in absolute mean squared error over previous methods.
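
The gap between cubic and quadratic cost can be made concrete by counting grid cells. The sketch below compares a dense N x N x N grid with three factorized grids that each keep two axes at full resolution and compress the third to a small constant r. This layout is an assumption chosen only to illustrate how a factorization can reduce O(N^3) to O(N^2); the abstract does not specify FIGConv's actual grid shapes, and the function names are hypothetical.

```python
def dense_cells(n):
    """Cell count of a dense n x n x n grid: grows as O(n^3)."""
    return n ** 3


def factorized_cells(n, r):
    """Cell count of three grids, each with one axis compressed to r cells.

    Assumed layout for illustration: (r, n, n), (n, r, n), (n, n, r).
    With r fixed, the total 3 * r * n^2 grows as O(n^2).
    """
    return 3 * r * n * n


for n in (50, 100, 200):
    print(n, dense_cells(n), factorized_cells(n, r=4))
```

Doubling N multiplies the dense count by 8 but the factorized count by only 4, which is the practical meaning of moving from cubic to quadratic complexity.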

[CV-5] MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation

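[Quick read]: This paper addresses two challenges in enabling intuitive shot design in image-to-video (I2V) generation systems: first, effectively capturing the user's motion design intent, where camera movements and scene-space object motions must be specified jointly; and second, representing motion information in a form that a video diffusion model can use to synthesize the image animation. The key contribution is MotionCanvas, which integrates user-driven controls into I2V generation models so that users can control both object and camera motion in a scene-aware manner. By connecting insights from classical computer graphics and contemporary video generation, MotionCanvas achieves 3D-aware motion control without costly 3D-related training data. Its core idea is to translate the user's intuitive description of scene-space motion into spatiotemporal motion-conditioning signals for video diffusion models.

Link: https://arxiv.org/abs/2502.04299
Authors: Jinbo Xing,Long Mai,Cusuh Ham,Jiahui Huang,Aniruddha Mahapatra,Chi-Wing Fu,Tien-Tsin Wong,Feng Liu
Affiliations: The Chinese University of Hong Kong; Adobe Research; Monash University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: It is best viewed in Acrobat. Project page: this https URL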

Abstract:This paper presents a method that allows users to design cinematic video shots in the context of image-to-video generation. Shot design, a critical aspect of filmmaking, involves meticulously planning both camera movements and object motions in a scene. However, enabling intuitive shot design in modern image-to-video generation systems presents two main challenges: first, effectively capturing user intentions on the motion design, where both camera movements and scene-space object motions must be specified jointly; and second, representing motion information that can be effectively utilized by a video diffusion model to synthesize the image animations. To address these challenges, we introduce MotionCanvas, a method that integrates user-driven controls into image-to-video (I2V) generation models, allowing users to control both object and camera motions in a scene-aware manner. By connecting insights from classical computer graphics and contemporary video generation techniques, we demonstrate the ability to achieve 3D-aware motion control in I2V synthesis without requiring costly 3D-related training data. MotionCanvas enables users to intuitively depict scene-space motion intentions, and translates them into spatiotemporal motion-conditioning signals for video diffusion models. We demonstrate the effectiveness of our method on a wide range of real-world image content and shot-design scenarios, highlighting its potential to enhance the creative workflows in digital content creation and adapt to various image and video editing applications.

[CV-6] Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression

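[Quick read]: This paper addresses the difficulty of building interactive video world models and policies for robotics that handle diverse settings while staying efficient enough to run in real time. It proposes Heterogeneous Masked Autoregression (HMA). The key is heterogeneous pre-training on observation and action sequences from different robot embodiments, domains, and tasks, combined with masked autoregression that generates quantized or soft tokens for video prediction. This supports high-quality data generation and evaluation and runs 15 times faster than previous robotic video generation models.

Link: https://arxiv.org/abs/2502.04296
Authors: Lirui Wang,Kevin Zhao,Chaoqi Liu,Xinlei Chen
Affiliations: MIT; UIUC; Meta (FAIR)
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Website: this https URL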

Abstract:We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics to generate high-quality data and evaluation in scaling robot learning. Building interactive video world models and policies for robotics is difficult due to the challenge of handling diverse settings while maintaining computational efficiency to run in real time. HMA uses heterogeneous pre-training from observations and action sequences across different robotic embodiments, domains, and tasks. HMA uses masked autoregression to generate quantized or soft tokens for video predictions. HMA achieves better visual fidelity and controllability than the previous robotic video generation models with 15 times faster speed in the real world. After post-training, this model can be used as a video simulator from low-level action inputs for evaluating policies and generating synthetic data. See this link this https URL for more information.

[CV-7] GCE-Pose: Global Context Enhancement for Category-level Object Pose Estimation

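[Quick read]: This paper addresses the challenge in model-free category-level pose estimation of extracting contextual features that generalize across instances, in particular the failure of existing methods under partial visibility. The key is a first-complete-then-aggregate feature extraction strategy that uses class priors to overcome the limits of partial observation. The proposed method, GCE-Pose, improves pose estimation for novel instances by integrating a category-level global context prior: a Semantic Shape Reconstruction (SSR) module reconstructs global geometry and semantics, and a Global Context Enhanced (GCE) feature fusion module fuses features from the partial RGB-D observation with the reconstructed global context.

Link: https://arxiv.org/abs/2502.04293
Authors: Weihang Li,Hongli Xu,Junwen Huang,Hyunjun Jung,Peter KT Yu,Nassir Navab,Benjamin Busam
Affiliations: Technical University of Munich; Munich Center for Machine Learning; XYZ Robotics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: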

Abstract:A key challenge in model-free category-level pose estimation is the extraction of contextual object features that generalize across varying instances within a specific category. Recent approaches leverage foundational features to capture semantic and geometry cues from data. However, these approaches fail under partial visibility. We overcome this with a first-complete-then-aggregate strategy for feature extraction utilizing class priors. In this paper, we present GCE-Pose, a method that enhances pose estimation for novel instances by integrating category-level global context prior. GCE-Pose performs semantic shape reconstruction with a proposed Semantic Shape Reconstruction (SSR) module. Given an unseen partial RGB-D object instance, our SSR module reconstructs the instance’s global geometry and semantics by deforming category-specific 3D semantic prototypes through a learned deep Linear Shape Model. We further introduce a Global Context Enhanced (GCE) feature fusion module that effectively fuses features from partial RGB-D observations and the reconstructed global context. Extensive experiments validate the impact of our global context prior and the effectiveness of the GCE fusion module, demonstrating that GCE-Pose significantly outperforms existing methods on challenging real-world datasets HouseCat6D and NOCS-REAL275. Our project page is available at this https URL.

[CV-8] Point2RBox-v2: Rethinking Point-supervised Oriented Object Detection with Spatial Layout Among Instances

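[Quick read]: This paper addresses oriented object detection (OOD) in densely packed scenes under weak supervision from point annotations only. The proposed Point2RBox-v2 rests on three principles: 1) a Gaussian overlap loss, which learns an upper bound for each instance by treating objects as 2D Gaussian distributions and minimizing their overlap; 2) a Voronoi watershed loss, which learns a lower bound for each instance through watershed on a Voronoi tessellation; and 3) a consistency loss, which learns the size and rotation variation between the output sets for an input image and its augmented view. Auxiliary techniques such as edge loss and copy-paste further improve detection. According to the authors, Point2RBox-v2 is the first approach to exploit the spatial layout among instances for point-supervised OOD. The method is lightweight and suited to dense scenes, reporting 62.61%, 86.15%, and 34.71% on DOTA, HRSC, and FAIR1M respectively.

Link: https://arxiv.org/abs/2502.04268
Authors: Yi Yu,Botao Ren,Peiyuan Zhang,Mingxin Liu,Junwei Luo,Shaofeng Zhang,Feipeng Da,Junchi Yan,Xue Yang
Affiliations: Southeast University; Tsinghua University; Wuhan University; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures, 10 tables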

Abstract:With the rapidly increasing demand for oriented object detection (OOD), recent research involving weakly-supervised detectors for learning OOD from point annotations has gained great attention. In this paper, we rethink this challenging task setting with the layout among instances and present Point2RBox-v2. At the core are three principles: 1) Gaussian overlap loss. It learns an upper bound for each instance by treating objects as 2D Gaussian distributions and minimizing their overlap. 2) Voronoi watershed loss. It learns a lower bound for each instance through watershed on Voronoi tessellation. 3) Consistency loss. It learns the size/rotation variation between two output sets with respect to an input image and its augmented view. Supplemented by a few devised techniques, e.g. edge loss and copy-paste, the detector is further enhanced. To our best knowledge, Point2RBox-v2 is the first approach to explore the spatial layout among instances for learning point-supervised OOD. Our solution is elegant and lightweight, yet it is expected to give a competitive performance especially in densely packed scenes: 62.61%/86.15%/34.71% on DOTA/HRSC/FAIR1M. Code is available at this https URL.
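
As a rough illustration of treating objects as 2D Gaussians and measuring their overlap, the sketch below converts rotated boxes to Gaussians and computes the Bhattacharyya coefficient between them. Both the box-to-Gaussian convention and the choice of the Bhattacharyya coefficient are assumptions made here for illustration; the abstract does not state which overlap measure Point2RBox-v2 actually uses.

```python
import math


def box_to_gaussian(cx, cy, w, h, theta):
    """Convert a rotated box to a 2D Gaussian (assumed convention).

    Mean is the box centre; covariance is R @ diag((w/2)^2, (h/2)^2) @ R^T,
    returned as the three unique entries (sxx, sxy, syy).
    """
    c, s = math.cos(theta), math.sin(theta)
    a, b = (w / 2) ** 2, (h / 2) ** 2
    sxx = a * c * c + b * s * s
    sxy = (a - b) * c * s
    syy = a * s * s + b * c * c
    return (cx, cy), (sxx, sxy, syy)


def bhattacharyya_coefficient(g1, g2):
    """Overlap in (0, 1] between two 2D Gaussians; 1 means identical."""
    (mx1, my1), (a1, b1, d1) = g1
    (mx2, my2), (a2, b2, d2) = g2
    # Average covariance and the determinants needed by the closed form.
    a, b, d = (a1 + a2) / 2, (b1 + b2) / 2, (d1 + d2) / 2
    det = a * d - b * b
    det1 = a1 * d1 - b1 * b1
    det2 = a2 * d2 - b2 * b2
    dx, dy = mx1 - mx2, my1 - my2
    # delta^T Sigma^{-1} delta using the explicit 2x2 inverse.
    maha = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    distance = maha / 8 + 0.5 * math.log(det / math.sqrt(det1 * det2))
    return math.exp(-distance)


if __name__ == "__main__":
    ref = box_to_gaussian(0, 0, 4, 2, 0.0)
    near = box_to_gaussian(1, 0, 4, 2, 0.0)
    far = box_to_gaussian(100, 0, 4, 2, 0.0)
    print(bhattacharyya_coefficient(ref, near), bhattacharyya_coefficient(ref, far))
```

A loss that penalizes this coefficient between neighbouring instances would push predicted boxes to shrink until they stop overlapping, which matches the "upper bound" reading in the abstract.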

[CV-9] Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion ICLR2025

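[Quick read]: This paper addresses the poor performance that results from using the text or image encoder of a multi-modal pre-trained model such as CLIP on its own for intra-modal tasks such as image-to-image retrieval. The key observation is that the CLIP-style inter-modal contrastive loss enforces no intra-modal constraints, which leads to what the authors call intra-modal misalignment. To demonstrate this, they use optimization-based modality inversion to map representations from the input modality to the complementary one, without auxiliary data or additionally trained adapters. Experiments show that approaching image-to-image and text-to-text retrieval inter-modally significantly improves performance, while approaching a native inter-modal task such as zero-shot image classification intra-modally degrades it. Finally, adding an intra-modal term to the pre-training objective, or narrowing the modality gap between the text and image embedding spaces, reduces intra-modal misalignment.

Link: https://arxiv.org/abs/2502.04263
Authors: Marco Mistretta,Alberto Baldrati,Lorenzo Agnolucci,Marco Bertini,Andrew D. Bagdanov
Affiliations: University of Florence; Media Integration and Communication Center (MICC); University of Pisa
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication at ICLR 2025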

Abstract:Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multi-modal models is highly suboptimal for intra-modal tasks like image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal contrastive loss that does not enforce any intra-modal constraints, leading to what we call intra-modal misalignment. To demonstrate this, we leverage two optimization-based modality inversion techniques that map representations from their input modality to the complementary one without any need for auxiliary data or additional trained adapters. We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance with respect to intra-modal baselines on more than fifteen datasets. Additionally, we demonstrate that approaching a native inter-modal task (e.g. zero-shot image classification) intra-modally decreases performance, further validating our findings. Finally, we show that incorporating an intra-modal term in the pre-training objective or narrowing the modality gap between the text and image feature embedding spaces helps reduce the intra-modal misalignment. The code is publicly available at: this https URL.

[CV-10] An object detection approach for lane change and overtake detection from motion profiles

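[Quick read]: This paper addresses the efficient identification of overtake and lane-change maneuvers from dashcam footage in fleet management and driver monitoring, while minimizing the amount of information stored and analyzed. The key is a novel object detection approach applied to motion profiles, a compact representation that compresses a driving video into a single image. Adding CoordConv layers further improves mAP and F1 score, reaching state-of-the-art performance. The very low computational requirements make the solution well suited to running on device.

Link: https://arxiv.org/abs/2502.04244
Authors: Andrea Benericetti,Niccolò Bellaccini,Henrique Piñeiro Monteagudo,Matteo Simoncini,Francesco Sambo
Affiliations: Verizon Connect Research; DISI, University of Bologna; SMARTHEP project (funded by the EU Horizon 2020 research and innovation programme)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures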

Abstract:In the application domain of fleet management and driver monitoring, it is very challenging to obtain relevant driving events and activities from dashcam footage while minimizing the amount of information stored and analyzed. In this paper, we address the identification of overtake and lane change maneuvers with a novel object detection approach applied to motion profiles, a compact representation of driving video footage into a single image. To train and test our model we created an internal dataset of motion profile images obtained from a heterogeneous set of dashcam videos, manually labeled with overtake and lane change maneuvers by the ego-vehicle. In addition to a standard object-detection approach, we show how the inclusion of CoordConvolution layers further improves the model performance, in terms of mAP and F1 score, yielding state-of-the-art performance when compared to other baselines from the literature. The extremely low computational requirements of the proposed solution make it especially suitable to run on device.
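
One simple way to picture a motion profile is to take the same pixel row from every frame and stack those rows over time, so that a whole clip becomes one 2D image. The sketch below does exactly that with nested lists. The fixed-row sampling and the `motion_profile` helper are assumptions made for illustration; the abstract does not describe how the paper builds its profiles.

```python
def motion_profile(frames, row_index):
    """Stack one pixel row from every frame into a single 2D image.

    frames: list of frames; each frame is a list of rows of pixel values.
    Returns an image with one row per frame, so time runs top to bottom.
    """
    return [list(frame[row_index]) for frame in frames]


if __name__ == "__main__":
    # Three tiny 2x4 "frames" in which a bright pixel moves left to right.
    frames = [
        [[0, 0, 0, 0], [9, 0, 0, 0]],
        [[0, 0, 0, 0], [0, 9, 0, 0]],
        [[0, 0, 0, 0], [0, 0, 9, 0]],
    ]
    for row in motion_profile(frames, row_index=1):
        print(row)
```

In the resulting image a moving object traces a slanted streak, which is the kind of pattern an object detector could then localize.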

[CV-11] Keep It Light! Simplifying Image Clustering Via Text-Free Adapters

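[Quick read]: This paper addresses the problem that competitive clustering pipelines are multi-modal, complicated to train, and hard to use in real downstream applications. Such methods typically rely on large language models (LLMs) or other text encoders and on text-image pairs, which are often unavailable in practice, and they require substantial compute. The proposed method, Simple Clustering via Pre-trained models (SCP), trains only a small cluster head while using pre-trained vision model feature representations and positive data pairs, giving a text-free and highly simplified training pipeline. SCP achieves competitive performance on CIFAR-10, CIFAR-20, CIFAR-100, STL-10, ImageNet-10, and ImageNet-Dogs. A theoretical result explains why, at least under ideal conditions, additional text-based embeddings may not be necessary for strong clustering performance.

Link: https://arxiv.org/abs/2502.04226
Authors: Yicen Li,Haitz Sáez de Ocáriz Borde,Anastasis Kratsios,Paul D. McNicholas
Affiliations: Department of Mathematics and Statistics, McMaster University, Canada; Vector Institute, Toronto, Canada; University of Oxford, Oxford, United Kingdom
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Computation (stat.CO); Machine Learning (stat.ML)
Comments: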

Abstract:Many competitive clustering pipelines have a multi-modal design, leveraging large language models (LLMs) or other text encoders, and text-image pairs, which are often unavailable in real-world downstream applications. Additionally, such frameworks are generally complicated to train and require substantial computational resources, making widespread adoption challenging. In this work, we show that in deep clustering, competitive performance with more complex state-of-the-art methods can be achieved using a text-free and highly simplified training pipeline. In particular, our approach, Simple Clustering via Pre-trained models (SCP), trains only a small cluster head while leveraging pre-trained vision model feature representations and positive data pairs. Experiments on benchmark datasets including CIFAR-10, CIFAR-20, CIFAR-100, STL-10, ImageNet-10, and ImageNet-Dogs, demonstrate that SCP achieves highly competitive performance. Furthermore, we provide a theoretical result explaining why, at least under ideal conditions, additional text-based embeddings may not be necessary to achieve strong clustering performance in vision.

[CV-12] Eclair – Extracting Content and Layout with Integrated Reading Order for Documents

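[Quick read]: This paper addresses the need to fully understand complex documents beyond text extraction: their structure, including formatting, formulas, tables, and the reading order of multiple blocks and columns, and the semantic information needed to identify elements such as footnotes and image captions. It introduces Éclair, a tool that handles a wide range of document types and, given an image, extracts formatted text in reading order together with bounding boxes and their semantic classes. On a diverse human-annotated benchmark, Éclair achieves state-of-the-art accuracy on document-level OCR and semantic classification.

Link: https://arxiv.org/abs/2502.04223
Authors: Ilia Karmanov,Amala Sanjay Deshmukh,Lukas Voegtle,Philipp Fischer,Kateryna Chumachenko,Timo Roman,Jarno Seppänen,Jupinder Parmar,Joseph Jennings,Andrew Tao,Karan Sapra
Affiliations: NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: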

Abstract:Optical Character Recognition (OCR) technology is widely used to extract text from images of documents, facilitating efficient digitization and data retrieval. However, merely extracting text is insufficient when dealing with complex documents. Fully comprehending such documents requires an understanding of their structure – including formatting, formulas, tables, and the reading order of multiple blocks and columns across multiple pages – as well as semantic information for detecting elements like footnotes and image captions. This comprehensive understanding is crucial for downstream tasks such as retrieval, document question answering, and data curation for training Large Language Models (LLMs) and Vision Language Models (VLMs). To address this, we introduce Éclair, a general-purpose text-extraction tool specifically designed to process a wide range of document types. Given an image, Éclair is able to extract formatted text in reading order, along with bounding boxes and their corresponding semantic classes. To thoroughly evaluate these novel capabilities, we introduce our diverse human-annotated benchmark for document-level OCR and semantic classification. Éclair achieves state-of-the-art accuracy on this benchmark, outperforming other methods across key metrics. Additionally, we evaluate Éclair on established benchmarks, demonstrating its versatility and strength across several evaluation standards.

[CV-13] Enhanced Feature-based Image Stitching for Endoscopic Videos in Pediatric Eosinophilic Esophagitis

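[Quick read]: This paper addresses the frequent adjustment and reorientation required when reviewing endoscopy videos, which is time-consuming and error-prone. It proposes a preprocessing pipeline that improves endoscopic image stitching in four steps: keyframe selection, image rotation adjustment, surface unwrapping by polar coordinate transformation to produce a flat image, and feature point matching enhanced by Adaptive Histogram Equalization. Experiments show that the method significantly improves image alignment and stitching quality compared with traditional techniques.

Link: https://arxiv.org/abs/2502.04207
Authors: Juming Xiong,Muyang Li,Ruining Deng,Tianyuan Yao,Regina N Tyree,Girish Hiremath,Yuankai Huo
Affiliations: Vanderbilt University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: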

Abstract:Video endoscopy represents a major advance in the investigation of gastrointestinal diseases. Reviewing endoscopy videos often involves frequent adjustments and reorientations to piece together a complete view, which can be both time-consuming and prone to errors. Image stitching techniques address this issue by providing a continuous and complete visualization of the examined area. However, endoscopic images, particularly those of the esophagus, present unique challenges. The smooth surface, lack of distinct feature points, and non-horizontal orientation complicate the stitching process, rendering traditional feature-based methods often ineffective for these types of images. In this paper, we propose a novel preprocessing pipeline designed to enhance endoscopic image stitching through advanced computational techniques. Our approach converts endoscopic video data into continuous 2D images by following four key steps: (1) keyframe selection, (2) image rotation adjustment to correct distortions, (3) surface unwrapping using polar coordinate transformation to generate a flat image, and (4) feature point matching enhanced by Adaptive Histogram Equalization for improved feature detection. We evaluate stitching quality through the assessment of valid feature point match pairs. Experiments conducted on 20 pediatric endoscopy videos demonstrate that our method significantly improves image alignment and stitching quality compared to traditional techniques, laying a robust foundation for more effective panoramic image creation.
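
Step (3) of the pipeline, surface unwrapping by polar coordinate transformation, can be sketched as sampling the image along circles around a centre and laying each circle out as one row. The version below is a minimal nearest-neighbour sketch in plain Python. The `unwrap_polar` helper, its parameters, and the choice of centre and radii are assumptions for illustration, not the authors' implementation.

```python
import math


def unwrap_polar(image, center, r_min, r_max, n_radii, n_angles):
    """Unwrap an annulus around `center` into a flat n_radii x n_angles image.

    image: list of rows of pixel values; center: (cx, cy) in pixel coordinates.
    Uses nearest-neighbour sampling; samples outside the image become 0.
    """
    height, width = len(image), len(image[0])
    cx, cy = center
    flat = []
    for i in range(n_radii):
        # Radii are spaced evenly from r_min to r_max inclusive.
        r = r_min + (r_max - r_min) * i / max(n_radii - 1, 1)
        row = []
        for j in range(n_angles):
            theta = 2 * math.pi * j / n_angles
            x = int(round(cx + r * math.cos(theta)))
            y = int(round(cy + r * math.sin(theta)))
            inside = 0 <= x < width and 0 <= y < height
            row.append(image[y][x] if inside else 0)
        flat.append(row)
    return flat


if __name__ == "__main__":
    # 5x5 test image whose pixel value encodes its position as 10*y + x.
    image = [[10 * y + x for x in range(5)] for y in range(5)]
    for row in unwrap_polar(image, (2, 2), 1, 2, n_radii=2, n_angles=4):
        print(row)
```

Each output row corresponds to one radius and each column to one angle, so a tubular view such as the esophagus becomes a flat strip that feature matching can work on.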

[CV-14] Safeguarding connected autonomous vehicle communication: Protocols intra- and inter-vehicular attacks and defenses

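[Quick read]: This paper addresses the significant security vulnerabilities in the communications of Connected Autonomous Vehicles (CAVs). Its key contributions are a new classification system for CAV security threats, a set of practical security protocols for strengthening CAV communication, and use cases that show how these protocols can be integrated into real-world CAV applications. The authors present these contributions as important for secure CAV adoption and for the safe integration of autonomous vehicles into intelligent transportation systems.

Link: https://arxiv.org/abs/2502.04201
Authors: Mohammed Aledhari,Rehma Razzak,Mohamed Rahouti,Abbas Yazdinejad,Reza M. Parizi,Basheer Qolomany,Mohsen Guizani,Junaid Qadir,Ala Al-Fuqaha
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
Comments: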

Abstract:The advancements in autonomous driving technology, coupled with the growing interest from automotive manufacturers and tech companies, suggest a rising adoption of Connected Autonomous Vehicles (CAVs) in the near future. Despite some evidence of higher accident rates in AVs, these incidents tend to result in less severe injuries compared to traditional vehicles due to cooperative safety measures. However, the increased complexity of CAV systems exposes them to significant security vulnerabilities, potentially compromising their performance and communication integrity. This paper contributes by presenting a detailed analysis of existing security frameworks and protocols, focusing on intra- and inter-vehicle communications. We systematically evaluate the effectiveness of these frameworks in addressing known vulnerabilities and propose a set of best practices for enhancing CAV communication security. The paper also provides a comprehensive taxonomy of attack vectors in CAV ecosystems and suggests future research directions for designing more robust security mechanisms. Our key contributions include the development of a new classification system for CAV security threats, the proposal of practical security protocols, and the introduction of use cases that demonstrate how these protocols can be integrated into real-world CAV applications. These insights are crucial for advancing secure CAV adoption and ensuring the safe integration of autonomous vehicles into intelligent transportation systems.

[CV-15] PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?

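[Quick read]: This paper examines the weak visual question answering ability of pixel-level multi-modal large language models (MLLMs). Methods trained with pixel-level grounding supervision perform well on existing benchmarks, but on recent challenging vision-centric benchmarks they show weak visual question answering ability, and some even degrade the grounding ability of MLLMs that were never trained with such supervision. The paper proposes two new challenging benchmarks and simple baselines, called PixFoundation, that extract grounding information and can be plugged into any MLLM, so that MLLMs without pixel-level supervision outperform the state of the art when both pixel-level grounding and visual question answering are evaluated. It also studies when grounding emerges in MLLMs that are not trained with pixel-level grounding supervision, and finds that grounding can coincide with object parts or location/appearance information.

Link: https://arxiv.org/abs/2502.04192
Authors: Mennatullah Siam
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review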

Abstract:Multiple works have emerged to push the boundaries on multi-modal large language models (MLLMs) towards pixel-level understanding. Such approaches have shown strong performance on benchmarks for referring expression segmentation and grounded conversation generation. The current trend in pixel-level MLLMs is to train with pixel-level grounding supervision on large-scale labelled data. However, we show that such MLLMs, when evaluated on recent challenging vision-centric benchmarks, exhibit a weak ability in visual question answering. Surprisingly, some of these methods even downgrade the grounding ability of MLLMs that were never trained with such supervision. In this work, we propose two novel challenging benchmarks and show that MLLMs without pixel-level grounding supervision can outperform the state of the art in such tasks when evaluating both the pixel-level grounding and visual question answering. We propose simple baselines to extract the grounding information that can be plugged into any MLLM, which we call PixFoundation. More importantly, we study the research question of "When does grounding emerge in MLLMs that are not trained with pixel-level grounding supervision?" We show that grounding can coincide with object parts or location/appearance information. Code repository is at this https URL.

[CV-16] YOLOv4: A Breakthrough in Real-Time Object Detection

[Quick read]: This paper describes how YOLOv4 improves object detection in complex scenes by combining advanced techniques for regression (bounding box positioning) and classification (object class identification). The key components are Cross mini-Batch Normalization, Cross-Stage-Partial connections, Self-Adversarial Training, and Weighted-Residual-Connections, together with the CIoU loss, Mosaic data augmentation, and DropBlock regularization. Together these give YOLOv4 strong detection results across diverse scenarios while keeping inference fast.

Link: https://arxiv.org/abs/2502.04161
Authors: Athulya Sundaresan Geetha
Affiliations: Huddersfield University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:YOLOv4 achieved the best performance on the COCO dataset by combining advanced techniques for regression (bounding box positioning) and classification (object class identification) using the Darknet framework. To enhance accuracy and adaptability, it employs Cross mini-Batch Normalization, Cross-Stage-Partial-connections, Self-Adversarial-Training, and Weighted-Residual-Connections, as well as CIoU loss, Mosaic data augmentation, and DropBlock regularization. With Mosaic augmentation and multi-resolution training, YOLOv4 achieves superior detection in diverse scenarios, attaining 43.5% AP (in contrast, 65.7% AP50) on a Tesla V100 at ~65 frames per second, ensuring efficiency, affordability, and adaptability for real-world environments.
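
Of the components listed, the CIoU loss is compact enough to write out. The sketch below follows the commonly published definition (IoU, plus a normalized centre-distance penalty, plus an aspect-ratio consistency term) for axis-aligned boxes in plain Python. It is an illustrative reimplementation under that assumption, not code from Darknet, and the `eps` guard is added here only to avoid division by zero.

```python
import math


def ciou_loss(box_a, box_b, eps=1e-9):
    """CIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Plain IoU from the intersection rectangle.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    iou = inter / (union + eps)

    # Squared centre distance over squared diagonal of the enclosing box.
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (
        math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


if __name__ == "__main__":
    print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
    print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```

Unlike plain IoU, the loss still varies with the distance between two boxes when they do not overlap at all, which is why it is useful for box regression.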

[CV-17] HD-EPIC: A Highly-Detailed Egocentric Video Dataset

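[Quick read]: This paper presents HD-EPIC, a new, highly detailed dataset of kitchen-based egocentric videos for evaluating and advancing vision-language models (VLMs). The key is that all annotations are grounded in 3D through digital twinning of the scene, fixtures, and object locations, and primed with gaze, while the footage comes from unscripted recordings in real home environments. A VQA benchmark of 26K questions assesses the ability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction. Gemini Pro reaches only 38.5% on this benchmark, which highlights the limits of current models on complex real-world scenes.

Link: https://arxiv.org/abs/2502.04144
Authors: Toby Perrett,Ahmad Darkhalil,Saptarshi Sinha,Omar Emara,Sam Pollard,Kranti Parida,Kaiting Liu,Prajwal Gatti,Siddhant Bansal,Kevin Flanagan,Jacob Chalk,Zhifan Zhu,Rhodri Guerrier,Fahd Abdelazim,Bin Zhu,Davide Moltisanti,Michael Wray,Hazel Doughty,Dima Damen
Affiliations: Uni. of Bristol; Leiden Uni.; Singapore Management Uni.; Uni. of Bath
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages. Project Webpage and Dataset: this http URL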

Abstract:We present a validation dataset of newly-collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations. Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, object locations, and primed with gaze. Footage is collected from unscripted recordings in diverse home environments, making HD-EPIC the first dataset collected in-the-wild but with detailed annotations matching those in controlled lab environments. We show the potential of our highly-detailed annotations through a challenging VQA benchmark of 26K questions assessing the capability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction. The powerful long-context Gemini Pro only achieves 38.5% on this benchmark, showcasing its difficulty and highlighting shortcomings in current VLMs. We additionally assess action recognition, sound recognition, and long-term video-object segmentation on HD-EPIC. HD-EPIC is 41 hours of video in 9 kitchens with digital twins of 413 kitchen fixtures, capturing 69 recipes, 59K fine-grained actions, 51K audio events, 20K object movements and 37K object masks lifted to 3D. On average, we have 263 annotations per minute of our unscripted videos.

[CV-18] Beyond the Final Layer: Hierarchical Query Fusion Transformer with Agent-Interpolation Initialization for 3D Instance Segmentation

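[Quick read]: This paper addresses two problems of transformer-based 3D instance segmentation: the difficulty of keeping both strong position and strong content information during query initialization, and the disappearance of objects as decoder layers deepen. It proposes Beyond the Final Layer: Hierarchical Query Fusion Transformer with Agent-Interpolation Initialization (BFL). An Agent-Interpolation Initialization Module generates resilient queries that balance foreground coverage and content learning, and a Hierarchical Query Fusion Decoder retains low-overlap queries, mitigating the drop in recall as depth increases.

Link: https://arxiv.org/abs/2502.04139
Authors: Jiahao Lu,Jiacheng Deng,Tianzhu Zhang
Affiliations: University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review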

Abstract:3D instance segmentation aims to predict a set of object instances in a scene and represent them as binary foreground masks with corresponding semantic labels. Currently, transformer-based methods are gaining increasing attention due to their elegant pipelines, reduced manual selection of geometric properties, and superior performance. However, transformer-based methods fail to simultaneously maintain strong position and content information during query initialization. Additionally, due to supervision at each decoder layer, there exists a phenomenon of object disappearance with the deepening of layers. To overcome these hurdles, we introduce Beyond the Final Layer: Hierarchical Query Fusion Transformer with Agent-Interpolation Initialization for 3D Instance Segmentation (BFL). Specifically, an Agent-Interpolation Initialization Module is designed to generate resilient queries capable of achieving a balance between foreground coverage and content learning. Additionally, a Hierarchical Query Fusion Decoder is designed to retain low overlap queries, mitigating the decrease in recall with the deepening of layers. Extensive experiments on ScanNetV2, ScanNet200, ScanNet++ and S3DIS datasets demonstrate the superior performance of BFL.

[CV-19] Generative Adversarial Networks Bridging Art and Machine Intelligence

[Quick read]: This book systematically introduces the fundamental principles and historical development of generative adversarial networks (GANs), contrasts them with traditional generative models, and explains the core adversarial mechanism. Its mathematical and theoretical foundations, covering probability theory, statistics, and game theory, provide a framework for understanding the objectives, loss functions, and optimization challenges of GAN training. It reviews classic and advanced GAN variants and training methods, including Conditional GANs, DCGANs, Wasserstein GANs, and GANs with gradient penalty, and shows practical implementations in high-resolution image generation, artistic style transfer, video synthesis, text-to-image generation, and other multimedia applications. It closes with emerging research trends, including self-attention mechanisms and transformer-based generative models.

Link: https://arxiv.org/abs/2502.04116
Authors: Junhao Song,Yichao Zhang,Ziqian Bi,Tianyang Wang,Keyu Chen,Ming Li,Qian Niu,Junyu Liu,Benji Peng,Sen Zhang,Ming Liu,Jiawei Xu,Xuanhe Pan,Jinlang Wang,Pohsun Feng,Yizhu Wen,Lawrence K.Q. Yan,Hong-Ming Tseng,Xinyuan Song,Jintao Ren,Silin Chen,Yunze Wang,Weiche Hsieh,Bowen Jing,Junjie Yang,Jun Zhou,Zheyu Yao,Chia Xin Liang
Affiliations: Imperial College London; The University of Texas at Dallas; Indiana University; Xi’an Jiaotong-Liverpool University; Georgia Institute of Technology; Kyoto University; AppCubic; Rutgers University; Purdue University; University of Wisconsin-Madison; National Taiwan Normal University; University of Hawaii; Hong Kong University of Science and Technology; School of Visual Arts; Emory University; Aarhus University; Zhejiang University; University of Edinburgh; National Tsing Hua University; University of Manchester; Pingtan Research Institute of Xiamen University; University of Liverpool; JTB Technology Corp.
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This book begins with a detailed introduction to the fundamental principles and historical development of GANs, contrasting them with traditional generative models and elucidating the core adversarial mechanisms through illustrative Python examples. The text systematically addresses the mathematical and theoretical underpinnings including probability theory, statistics, and game theory providing a solid framework for understanding the objectives, loss functions, and optimisation challenges inherent to GAN training. Subsequent chapters review classic variants such as Conditional GANs, DCGANs, InfoGAN, and LAPGAN before progressing to advanced training methodologies like Wasserstein GANs, GANs with gradient penalty, least squares GANs, and spectral normalisation techniques. The book further examines architectural enhancements and task-specific adaptations in generators and discriminators, showcasing practical implementations in high resolution image generation, artistic style transfer, video synthesis, text to image generation and other multimedia applications. The concluding sections offer insights into emerging research trends, including self-attention mechanisms, transformer-based generative models, and a comparative analysis with diffusion models, thus charting promising directions for future developments in both academic and applied settings.
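
The objectives and loss functions mentioned above can be illustrated with the standard GAN losses computed on discriminator outputs. The sketch below uses the original minimax discriminator loss and the widely used non-saturating generator loss. It is a generic illustration in plain Python, not an example taken from the book, and the function names are chosen here.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Negative of mean log D(x) plus mean log(1 - D(G(z))).

    d_real and d_fake are lists of discriminator outputs in (0, 1) for
    real samples and generated samples respectively.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss_non_saturating(d_fake):
    """Non-saturating generator loss: negative mean log D(G(z))."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


if __name__ == "__main__":
    # A discriminator that cannot tell real from fake outputs 0.5 everywhere.
    print(discriminator_loss([0.5, 0.5], [0.5, 0.5]))
    print(generator_loss_non_saturating([0.5, 0.5]))
```

When the discriminator outputs 0.5 for every sample, its loss equals 2 ln 2, the value at the theoretical equilibrium of the original GAN game.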

[CV-20] Adaptive Margin Contrastive Learning for Ambiguity-aware 3D Semantic Segmentation

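[Quick read]: This paper addresses the weakly discriminative features and unreliable labels that transition regions cause in 3D point cloud semantic segmentation. It proposes an adaptive margin contrastive learning method, AMContrast3D, which estimates the ambiguity of each point from position embeddings and designs adaptive objectives accordingly, ensuring correctness for low-ambiguity points while tolerating mistakes on high-ambiguity points. A margin generator shifts the decision boundaries of the contrastive feature embeddings so that margins narrow as ambiguity increases, becoming negative for extremely ambiguous points. The method outperforms state-of-the-art methods on the large-scale S3DIS and ScanNet datasets.

Link: https://arxiv.org/abs/2502.04111
Authors: Yang Chen,Yueqi Duan,Runzhong Zhang,Yap-Peng Tan
Affiliations: Nanyang Technological University; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: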

Abstract:In this paper, we propose an adaptive margin contrastive learning method for 3D point cloud semantic segmentation, namely AMContrast3D. Most existing methods use equally penalized objectives, which ignore per-point ambiguities and less discriminated features stemming from transition regions. However, as highly ambiguous points may be indistinguishable even for humans, their manually annotated labels are less reliable, and hard constraints over these points would lead to sub-optimal models. To address this, we design adaptive objectives for individual points based on their ambiguity levels, aiming to ensure the correctness of low-ambiguity points while allowing mistakes for high-ambiguity points. Specifically, we first estimate ambiguities based on position embeddings. Then, we develop a margin generator to shift decision boundaries for contrastive feature embeddings, so margins are narrowed due to increasing ambiguities with even negative margins for extremely high-ambiguity points. Experimental results on large-scale datasets, S3DIS and ScanNet, demonstrate that our method outperforms state-of-the-art methods.

[CV-21] Efficient Few-Shot Continual Learning in Vision-Language Models

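[Quick read]: This paper addresses the performance limits that vision-language models (VLMs) inherit from pretrained image encoders such as CLIP, together with the need to update models frequently as new data keeps arriving. The key solution is LoRSU (Low-Rank Adaptation with Structured Updates), which selectively updates parameters in the image encoder, substantially improving computational and resource efficiency while preserving the model's overall robustness. Specifically, LoRSU uses theoretical insights to identify and update only the most critical parameters, reducing computational overhead by more than 25x without sacrificing performance.

Link: https://arxiv.org/abs/2502.04098
Authors: Aristeidis Panos,Rahaf Aljundi,Daniel Olmeda Reino,Richard E. Turner
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: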

Abstract:Vision-language models (VLMs) excel in tasks such as visual question answering and image captioning. However, VLMs are often limited by their use of pretrained image encoders, like CLIP, leading to image understanding errors that hinder overall performance. On top of that, real-world applications often require the model to be continuously adapted as new and often limited data continuously arrive. To address this, we propose LoRSU (Low-Rank Adaptation with Structured Updates), a robust and computationally efficient method for selectively updating image encoders within VLMs. LoRSU introduces structured and localized parameter updates, effectively correcting performance on previously error-prone data while preserving the model’s general robustness. Our approach leverages theoretical insights to identify and update only the most critical parameters, achieving significant resource efficiency. Specifically, we demonstrate that LoRSU reduces computational overhead by over 25x compared to full VLM updates, without sacrificing performance. Experimental results on VQA tasks in the few-shot continual learning setting, validate LoRSU’s scalability, efficiency, and effectiveness, making it a compelling solution for image encoder adaptation in resource-constrained environments.

[CV-22] Automatic quantification of breast cancer biomarkers from multiple 18F-FDG PET image segmentation

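[Quick read]: This paper uses positron emission tomography (PET) imaging to build an automated system that accurately segments breast lesions and extracts key biomarkers from them, providing insight into the evolution of breast cancer after the first course of neoadjuvant chemotherapy (NAC). The key solution is a deep learning-based breast tumor segmentation method: the best baseline model is fine-tuned and adapted with active learning so that tumor regions can be segmented in both baseline (PET_Bl) and follow-up (PET_Fu) PET scans. The system also computes biomarkers such as maximum standardized uptake value (SUVmax), metabolic tumor volume (MTV), and total lesion glycolysis (TLG) to assess tumor evolution between the follow-up and baseline scans.

Link: https://arxiv.org/abs/2502.04083
Authors: Tewele W. Tareke(1),Neree Payan(1,2),Alexandre Cochet(1,2),Laurent Arnould(3),Benoit Presles(1),Jean-Marc Vrigneaud(1,2),Fabrice Meriaudeau(1),Alain Lalande(1,4) ((1) ICMUB laboratory, UMR CNRS 6302, Universite de Bourgogne Europe, Dijon, France, (2) Nuclear Medicine Department, Centre Georges-Francois Leclerc, Dijon, France, (3) Department of Biology and Pathology of the Tumors, Centre Georges-Francois Leclerc, Dijon, France, (4) Department of Medical Imaging, University Hospital of Dijon, Dijon, France)
Affiliations: u-bourgogne.fr (University of Burgundy); cgfl.fr (unknown)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submit soon to EJNMMI Research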

Abstract:Neoadjuvant chemotherapy (NAC) has become a standard clinical practice for tumor downsizing in breast cancer with 18F-FDG Positron Emission Tomography (PET). Our work aims to leverage PET imaging for the segmentation of breast lesions. The focus is on developing an automated system that accurately segments primary tumor regions and extracts key biomarkers from these areas to provide insights into the evolution of breast cancer following the first course of NAC. 243 baseline 18F-FDG PET scans (PET_Bl) and 180 follow-up 18F-FDG PET scans (PET_Fu) were acquired before and after the first course of NAC, respectively. Firstly, a deep learning-based breast tumor segmentation method was developed. The optimal baseline model (model trained on baseline exams) was fine-tuned on 15 follow-up exams and adapted using active learning to segment tumor areas in PET_Fu. The pipeline computes biomarkers such as maximum standardized uptake value (SUVmax), metabolic tumor volume (MTV), and total lesion glycolysis (TLG) to evaluate tumor evolution between PET_Fu and PET_Bl. Quality control measures were employed to exclude aberrant outliers. The nnUNet deep learning model gave the best tumor segmentation on PET_Bl, achieving a Dice similarity coefficient (DSC) of 0.89 and a Hausdorff distance (HD) of 3.52 mm. After fine-tuning, the model demonstrated a DSC of 0.78 and an HD of 4.95 mm on PET_Fu exams. Biomarker analysis revealed very strong correlations between manually segmented and automatically predicted regions for every biomarker. The significant average decreases in SUVmax, MTV and TLG were 5.22, 11.79 cm3 and 19.23 cm3, respectively. The presented approach demonstrates an automated system for breast tumor segmentation from 18F-FDG PET. Thanks to the extracted biomarkers, our method enables the automatic assessment of cancer progression.
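For readers unfamiliar with the three biomarkers, the sketch below computes them from a segmented region using their standard definitions (SUVmax is the highest uptake in the lesion, MTV is the lesion volume, TLG is mean SUV times MTV). It is not the authors' pipeline: the function name `pet_biomarkers`, the flat-list inputs, and the `voxel_volume_cm3` parameter are assumptions made for illustration.

```python
def pet_biomarkers(suv, mask, voxel_volume_cm3):
    """Return (SUVmax, MTV, TLG) for the voxels where mask is truthy.

    suv and mask are flat sequences of equal length; voxel_volume_cm3 is the
    volume of one voxel in cm^3 (assumed to come from the image spacing).
    """
    inside = [s for s, m in zip(suv, mask) if m]
    if not inside:
        return 0.0, 0.0, 0.0
    suv_max = max(inside)                      # peak uptake in the lesion
    mtv = len(inside) * voxel_volume_cm3       # metabolic tumor volume
    tlg = (sum(inside) / len(inside)) * mtv    # mean SUV times MTV
    return suv_max, mtv, tlg


# Toy example: four voxels, two of them inside the predicted tumor mask.
suv_max, mtv, tlg = pet_biomarkers(
    suv=[1.0, 4.0, 6.0, 2.0], mask=[0, 1, 1, 0], voxel_volume_cm3=0.5
)
```

Running the same function on the baseline and follow-up segmentations and subtracting the results gives the kind of per-biomarker change the abstract reports.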

[CV-23] Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency

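[Quick read]: This paper addresses the challenges that next-generation video generation models such as Sora pose for video quality assessment (VQA) of AI-generated content (AIGC). These models substantially reduce the flickering artifacts of earlier models, support longer and more complex text prompts, and generate longer videos with intricate, diverse motion patterns, which conventional VQA methods struggle to evaluate. The paper therefore proposes CRAVE (Content-Rich AIGC Video Evaluator), designed for Sora-era AIGC videos. The key to CRAVE is a multi-granularity text-temporal fusion mechanism that aligns long, complex textual semantics with video dynamics, combined with hybrid motion-fidelity modeling to assess temporal artifacts. Because prompts and content in current AIGC VQA datasets are simple, the paper also introduces CRAVE-DB, a benchmark of content-rich videos from next-generation models paired with elaborate prompts. Experiments show that CRAVE achieves excellent results on multiple AIGC VQA benchmarks and aligns closely with human perception.

Link: https://arxiv.org/abs/2502.04076
Authors: Shangkun Sun,Xiaoyu Liang,Bowen Qu,Wei Gao
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: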

Abstract:The advent of next-generation video generation models like Sora poses challenges for AI-generated content (AIGC) video quality assessment (VQA). These models substantially mitigate flickering artifacts prevalent in prior models, enable longer and complex text prompts and generate longer videos with intricate, diverse motion patterns. Conventional VQA methods designed for simple text and basic motion patterns struggle to evaluate these content-rich videos. To this end, we propose CRAVE (Content-Rich AIGC Video Evaluator), specifically for the evaluation of Sora-era AIGC videos. CRAVE proposes the multi-granularity text-temporal fusion that aligns long-form complex textual semantics with video dynamics. Additionally, CRAVE leverages the hybrid motion-fidelity modeling to assess temporal artifacts. Furthermore, given the straightforward prompts and content in current AIGC VQA datasets, we introduce CRAVE-DB, a benchmark featuring content-rich videos from next-generation models paired with elaborate prompts. Extensive experiments have shown that the proposed CRAVE achieves excellent results on multiple AIGC VQA benchmarks, demonstrating a high degree of alignment with human perception. All data and code will be publicly available at this https URL.

[CV-24] 3D Prior is All You Need: Cross-Task Few-shot 2D Gaze Estimation

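[Quick read]: This paper addresses cross-task few-shot 2D gaze estimation, that is, adapting a pre-trained 3D gaze estimation network to predict 2D gaze on unseen devices. The key solution is a novel framework with a physics-based differentiable projection module that models the screen pose and projects 3D gaze into 2D gaze. It also introduces a dynamic pseudo-labelling strategy for flipped images: to cope with unknown screen poses, 2D labels are converted into a 3D space that is not aligned with the camera coordinate system, and a dynamic transformation matrix is learned to compensate for the misalignment.

Link: https://arxiv.org/abs/2502.04074
Authors: Yihua Cheng,Hengfei Wang,Zhongqun Zhang,Yang Yue,Bo Eun Kim,Feng Lu,Hyung Jin Chang
Affiliations: University of Birmingham; Beihang University; Dankook University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: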

Abstract:3D and 2D gaze estimation share the fundamental objective of capturing eye movements but are traditionally treated as two distinct research domains. In this paper, we introduce a novel cross-task few-shot 2D gaze estimation approach, aiming to adapt a pre-trained 3D gaze estimation network for 2D gaze prediction on unseen devices using only a few training images. This task is highly challenging due to the domain gap between 3D and 2D gaze, unknown screen poses, and limited training data. To address these challenges, we propose a novel framework that bridges the gap between 3D and 2D gaze. Our framework contains a physics-based differentiable projection module with learnable parameters to model screen poses and project 3D gaze into 2D gaze. The framework is fully differentiable and can integrate into existing 3D gaze networks without modifying their original architecture. Additionally, we introduce a dynamic pseudo-labelling strategy for flipped images, which is particularly challenging for 2D labels due to unknown screen poses. To overcome this, we reverse the projection process by converting 2D labels to 3D space, where flipping is performed. Notably, this 3D space is not aligned with the camera coordinate system, so we learn a dynamic transformation matrix to compensate for this misalignment. We evaluate our method on MPIIGaze, EVE, and GazeCapture datasets, collected respectively on laptops, desktop computers, and mobile devices. The superior performance highlights the effectiveness of our approach, and demonstrates its strong potential for real-world applications.
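The geometric core of projecting 3D gaze onto a screen is a ray-plane intersection, sketched below in plain Python. This is an assumption-laden illustration rather than the paper's module: the function name, the convention that the screen is the plane z = 0 in its own coordinate frame, and the fixed (non-learnable) `rotation` and `translation` are all choices made here. In an autodiff framework the same arithmetic could be made differentiable with respect to the pose.

```python
def project_gaze_to_screen(origin, direction, rotation, translation):
    """Intersect a 3D gaze ray (camera coordinates) with the screen plane z = 0.

    rotation (3x3 nested lists) and translation map screen coordinates to
    camera coordinates: p_cam = rotation @ p_screen + translation.
    Returns the (x, y) hit point in screen coordinates, or None if the ray is
    parallel to the screen or points away from it.
    """
    def to_screen(v, is_point):
        if is_point:
            v = [v[i] - translation[i] for i in range(3)]
        # Multiply by the transpose of rotation, which is its inverse.
        return [sum(rotation[k][i] * v[k] for k in range(3)) for i in range(3)]

    o = to_screen(origin, True)
    d = to_screen(direction, False)
    if abs(d[2]) < 1e-12:
        return None
    t = -o[2] / d[2]  # ray parameter at which z reaches 0
    if t < 0:
        return None
    return (o[0] + t * d[0], o[1] + t * d[1])


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hit = project_gaze_to_screen(
    origin=[0.1, 0.2, 0.5], direction=[0.0, 0.0, -1.0],
    rotation=identity, translation=[0.0, 0.0, 0.0],
)
```

In the toy call the eye sits 0.5 units in front of a screen aligned with the camera and looks straight at it, so the hit point has the same x and y as the eye position.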

[CV-25] Inteligencia artificial para la multi-clasificacion de fauna en fotografias automaticas utilizadas en investigacion cientifica

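[Quick read]: This paper addresses the scalability challenge of manually interpreting the large volume of images collected by camera traps in Tierra del Fuego, Argentina. The key to the solution is developing neural network models that classify the animal species photographed by camera traps, addressing large-scale challenges in scientific research. The approach uses deep learning to extract important information from the images, improving the efficiency and accuracy of ecological and wildlife conservation research.

Link: https://arxiv.org/abs/2502.04064
Authors: Federico Gonzalez,Leonel Viera,Rosina Soler,Lucila Chiarvetto Peralta,Matias Gel,Gimena Bustamante,Abril Montaldo,Brian Rigoni,Ignacio Perez
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: in Spanish language, XXIV Workshop de Investigadores en Ciencias de la Computación (WICC 2022, Mendoza)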

Abstract:The management of natural environments, whether for conservation or production, requires a deep understanding of wildlife. The number, location, and behavior of wild animals are among the main subjects of study in ecology and wildlife research. The use of camera traps offers the opportunity to quickly collect large quantities of photographs that capture wildlife in its natural habitat, avoiding factors that could alter their behavior. In Tierra del Fuego, Argentina, research is being conducted on forest use by different herbivores (guanacos, cows, sheep) to optimize management and protect these natural ecosystems. Although camera traps allow for the collection of millions of images, interpreting such photographs presents a scalability challenge for manual processing. As a result, much of the valuable knowledge stored in these vast data repositories remains untapped. Neural Networks and Deep Learning are areas of study within Artificial Intelligence. Over the past decade, these two disciplines have made significant contributions to image recognition on a global scale. Ecological and wildlife conservation studies can be combined with these new technologies to extract important information from the photographs obtained by camera traps, contributing to the understanding of various natural processes and improving the management of the involved wild areas. Our project aims to develop neural network models to classify animal species in photographs taken with camera traps, addressing large-scale challenges in scientific research.

[CV-26] PartEdit: Fine-Grained Image Editing using Pre-Trained Diffusion Models

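[Quick read]: This paper addresses the insufficient understanding of object parts in existing diffusion models, which limits the fine-grained edits users ask for. It proposes extending the knowledge of pre-trained diffusion models by learning special textual tokens, so that the model understands different object parts and can perform finer edits. These tokens are optimized to produce reliable localization masks at each inference step, precisely localizing the editing region. The key is that the optimization process gives the model the ability to identify and edit specific object parts.

Link: https://arxiv.org/abs/2502.04050
Authors: Aleksandar Cvejic(KAUST),Abdelrahman Eldesokey(KAUST),Peter Wonka(KAUST)
Affiliations: KAUST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL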

Abstract:We present the first text-based image editing approach for object parts based on pre-trained diffusion models. Diffusion-based image editing approaches capitalized on the deep understanding of diffusion models of image semantics to perform a variety of edits. However, existing diffusion models lack sufficient understanding of many object parts, hindering fine-grained edits requested by users. To address this, we propose to expand the knowledge of pre-trained diffusion models to allow them to understand various object parts, enabling them to perform fine-grained edits. We achieve this by learning special textual tokens that correspond to different object parts through an efficient token optimization process. These tokens are optimized to produce reliable localization masks at each inference step to localize the editing region. Leveraging these masks, we design feature-blending and adaptive thresholding strategies to execute the edits seamlessly. To evaluate our approach, we establish a benchmark and an evaluation protocol for part editing. Experiments show that our approach outperforms existing editing methods on all metrics and is preferred by users 77-90% of the time in conducted user studies.

[CV-27] Enhancing people localisation in drone imagery for better crowd management by utilising every pixel in high-resolution images

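[Quick read]: This paper addresses accurate people localisation with drones, in particular the limited precision and efficiency of tiny-object localisation in high-resolution images, which stem mainly from the constraints of image scaling and sliding-window techniques. To tackle these challenges, the paper proposes a new point-oriented object localisation approach and introduces the Pixel Distill module, which improves high-definition image processing by extracting spatial information from individual pixels at once. It also shares a new dataset, UP-COUNT, that covers a range of challenges in drone imagery, such as simultaneous camera and object movement during image acquisition, advancing crowd management applications. The key solution is the point-oriented localisation approach combined with the Pixel Distill module.

Link: https://arxiv.org/abs/2502.04014
Authors: Bartosz Ptak,Marek Kraft
Affiliations: Institute of Robotics and Machine Intelligence, Poznań University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: This is the pre-print. The article is submitted to the Engineering Applications of Artificial Intelligence journal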

Abstract:Accurate people localisation using drones is crucial for effective crowd management, not only during massive events and public gatherings but also for monitoring daily urban crowd flow. Traditional methods for tiny object localisation using high-resolution drone imagery often face limitations in precision and efficiency, primarily due to constraints in image scaling and sliding window techniques. To address these challenges, a novel approach dedicated to point-oriented object localisation is proposed. Along with this approach, the Pixel Distill module is introduced to enhance the processing of high-definition images by extracting spatial information from individual pixels at once. Additionally, a new dataset named UP-COUNT, tailored to contemporary drone applications, is shared. It addresses a wide range of challenges in drone imagery, such as simultaneous camera and object movement during the image acquisition process, pushing forward the capabilities of crowd management applications. A comprehensive evaluation of the proposed method on the proposed dataset and the commonly used DroneCrowd dataset demonstrates the superiority of our approach over existing methods and highlights its efficacy in drone-based crowd object localisation tasks. These improvements markedly increase the algorithm’s applicability to operate in real-world scenarios, enabling more reliable localisation and counting of individuals in dynamic environments.

[CV-28] CAD-Editor: A Locate-then-Infill Framework with Automated Training Data Synthesis for Text-Based CAD Editing

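[Quick read]: This paper addresses the automated modification of computer-aided design (CAD) models from textual instructions. Existing methods focus on design variation generation or text-based CAD generation, and either lack support for text-based control or ignore existing CAD models as constraints. To meet the need for triplet training data with accurate correspondence, the paper proposes an automated data synthesis pipeline that uses design variation models to generate pairs of original and edited CAD models and uses Large Vision-Language Models (LVLMs) to summarize their differences into editing instructions. The key solution is a locate-then-infill framework that decomposes the task into two sub-tasks: locating the regions that require modification and infilling those regions with appropriate edits. Large Language Models (LLMs) serve as the backbone for both sub-tasks, drawing on their natural language understanding and CAD knowledge.

Link: https://arxiv.org/abs/2502.03997
Authors: Yu Yuan,Shizhao Sun,Qi Liu,Jiang Bian
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: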

Abstract:Computer Aided Design (CAD) is indispensable across various industries. Text-based CAD editing, which automates the modification of CAD models based on textual instructions, holds great potential but remains underexplored. Existing methods primarily focus on design variation generation or text-based CAD generation, either lacking support for text-based control or neglecting existing CAD models as constraints. We introduce CAD-Editor, the first framework for text-based CAD editing. To address the challenge of demanding triplet data with accurate correspondence for training, we propose an automated data synthesis pipeline. This pipeline utilizes design variation models to generate pairs of original and edited CAD models and employs Large Vision-Language Models (LVLMs) to summarize their differences into editing instructions. To tackle the composite nature of text-based CAD editing, we propose a locate-then-infill framework that decomposes the task into two focused sub-tasks: locating regions requiring modification and infilling these regions with appropriate edits. Large Language Models (LLMs) serve as the backbone for both sub-tasks, leveraging their capabilities in natural language understanding and CAD knowledge. Experiments show that CAD-Editor achieves superior performance both quantitatively and qualitatively.

[CV-29] RWKV-UI: UI Understanding with Enhanced Perception and Reasoning

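[Quick read]: This paper addresses the information loss and limited reasoning ability of existing visual language models on high-resolution web interfaces, especially in tasks that require webpage layout comprehension and multi-step interactive reasoning. The key solution is RWKV-UI, a model based on the RWKV architecture and designed specifically for high-resolution user interface images. It introduces layout detection as a visual prompt to help the model understand webpage layout structure, and designs a visual prompt based on the Chain-of-Thought (CoT) mechanism to strengthen the model's understanding of and reasoning about webpage content. Experimental results show that RWKV-UI delivers significant performance improvements on high-resolution UI understanding and interactive reasoning tasks.

Link: https://arxiv.org/abs/2502.03971
Authors: Jiaxi Yang,Haowen Hou
Affiliations: Sun Yat-sen University; Shenzhen Yuanshi Intelligence Co., Ltd; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 10 pages, 5 figures, conference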

Abstract:Existing Visual Language Models often struggle with information loss and limited reasoning abilities when handling high-resolution web interfaces that combine complex visual, textual, and interactive elements. These challenges are particularly evident in tasks requiring webpage layout comprehension and multi-step interactive reasoning. To address these challenges, we propose RWKV-UI, a Visual Language Model based on the RWKV architecture, specifically designed to handle high-resolution UI images. During model training, we introduce layout detection as a visual prompt to help the model better understand the webpage layout structures. Additionally, we design a visual prompt based on the Chain-of-Thought (CoT) mechanism, which enhances the model’s ability to understand and reason about webpage content through reasoning chains. Experimental results show that RWKV-UI demonstrates significant performance improvements in high-resolution UI understanding and interactive reasoning tasks.

[CV-30] MultiFloodSynth: Multi-Annotated Flood Synthetic Dataset Generation AAAI2025

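[Quick read]: This paper addresses synthetic data generation for flood hazard detection systems. For high fidelity and quality, it brings several real-world properties into a virtual environment and simulates flood scenarios by controlling them. The key to the solution is using recent generative models for image-to-3D and urban city synthesis to compose flood environments efficiently, avoiding the data bias introduced by hand-crafting. Based on this framework, the authors build MultiFloodSynth, a synthetic flood dataset with five levels that provides rich annotation types such as normal maps, segmentation masks, and 3D bounding boxes for a variety of downstream tasks. Experiments show that the dataset is on par with real datasets in realism while improving flood hazard detection performance.

Link: https://arxiv.org/abs/2502.03966
Authors: YoonJe Kang,Yonghoon Jung,Wonseop Shin,Bumsoo Kim,Sanghyun Seo
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 6 figures. Accepted as Oral Presentation to AAAI 2025 Workshop on Good-Data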

Abstract:In this paper, we present a synthetic data generation framework for flood hazard detection systems. For high fidelity and quality, we characterize several real-world properties in a virtual world and simulate the flood situation by controlling them. For the sake of efficiency, recent generative models in image-to-3D and urban city synthesis are leveraged to easily composite flood environments so that we avoid data bias due to the hand-crafted manner. Based on our framework, we build a flood synthetic dataset with 5 levels, dubbed MultiFloodSynth, which contains rich annotation types like normal map, segmentation, and 3D bounding box for a variety of downstream tasks. In experiments, our dataset demonstrates enhanced performance of flood hazard detection with on-par realism compared with real datasets.

[CV-31] Improving the Perturbation-Based Explanation of Deepfake Detectors Through the Use of Adversarially-Generated Samples WACV2025

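[Quick read]: This paper aims to improve visual explanation methods by using adversarially generated samples of input images that a deepfake detector classified as fake to form perturbation masks. The key solution is to generate these samples with Natural Evolution Strategies, aiming to flip the original detector's decision so that it classifies them as real. The paper applies this idea to four perturbation-based explanation methods (LIME, SHAP, SOBOL and RISE) and evaluates the modified methods using a state-of-the-art deepfake detection model, a benchmark dataset (FaceForensics++), and a corresponding explanation evaluation framework. The quantitative evaluation shows that the proposed perturbation approach contributes positively to the performance of the explanation methods; the qualitative analysis shows that the modified methods demarcate manipulated image regions more accurately and therefore provide more useful explanations.

Link: https://arxiv.org/abs/2502.03957
Authors: Konstantinos Tsigos,Evlampios Apostolidis,Vasileios Mezaris
Affiliations: Information Technologies Institute - CERTH, Thessaloniki, Greece, 57001
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted for publication, AI4MFDD Workshop @ IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, Feb. 2025. This is the authors’ “accepted version”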

Abstract:In this paper, we introduce the idea of using adversarially-generated samples of the input images that were classified as deepfakes by a detector, to form perturbation masks for inferring the importance of different input features and produce visual explanations. We generate these samples based on Natural Evolution Strategies, aiming to flip the original deepfake detector’s decision and classify these samples as real. We apply this idea to four perturbation-based explanation methods (LIME, SHAP, SOBOL and RISE) and evaluate the performance of the resulting modified methods using a SOTA deepfake detection model, a benchmarking dataset (FaceForensics++) and a corresponding explanation evaluation framework. Our quantitative assessments document the mostly positive contribution of the proposed perturbation approach in the performance of explanation methods. Our qualitative analysis shows the capacity of the modified explanation methods to demarcate the manipulated image regions more accurately, and thus to provide more useful explanations.

[CV-32] LR0.FM: Low-Resolution Zero-shot Classification Benchmark For Foundation Models ICLR2025

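[Quick read]: This paper addresses the robustness of vision-language foundation models (FMs) on low-resolution images, a common challenge in real-world scenarios. It introduces a comprehensive low-resolution benchmark that evaluates how low resolution affects the zero-shot classification performance of 10 FMs across 66 backbones and 15 datasets. To better evaluate models across resolutions and datasets, it proposes a new metric, Weighted Aggregated Robustness. The key solution is a simple strategy, LR-TK0, that improves robustness to low-resolution images without compromising the pre-trained weights.

Link: https://arxiv.org/abs/2502.03950
Authors: Priyank Pathak,Shyam Marjit,Shruti Vyas,Yogesh S Rawat
Affiliations: University of Central Florida; IIIT Guwahati
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2025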

Abstract:Visual-language foundation Models (FMs) exhibit remarkable zero-shot generalization across diverse tasks, largely attributed to extensive pre-training on large-scale datasets. However, their robustness on low-resolution/pixelated (LR) images, a common challenge in real-world scenarios, remains underexplored. We introduce LR0.FM, a comprehensive benchmark evaluating the impact of low resolution on the zero-shot classification performance of 10 FM(s) across 66 backbones and 15 datasets. We propose a novel metric, Weighted Aggregated Robustness, to address the limitations of existing metrics and better evaluate model performance across resolutions and datasets. Our key findings show that: (i) model size positively correlates with robustness to resolution degradation, (ii) pre-training dataset quality is more important than its size, and (iii) fine-tuned and higher-resolution models are less robust against LR. Our analysis further reveals that the model makes semantically reasonable predictions at LR, and the lack of fine-grained details in input adversely impacts the model’s initial layers more than the deeper layers. We use these insights and introduce a simple strategy, LR-TK0, to enhance the robustness of models without compromising their pre-trained weights. We demonstrate the effectiveness of LR-TK0 for robustness against low-resolution across several datasets and its generalization capability across backbones and other approaches. Code is available at this https URL

[CV-33] No Free Lunch in Annotation either: An objective evaluation of foundation models for streamlining annotation in animal tracking

[Quick read]: This paper addresses the tedious task of annotating animal tracking data and examines how combining automated and manual annotations can improve the robustness of tracking models. The key is ensuring the quality of automated annotations so that they do not introduce noise and inaccuracies. The paper shows that combining automated and manual annotations improves performance markedly, reaching an IDF1 score of 80.8 compared with 65.6 when the automated annotation tool is used alone.

Link: https://arxiv.org/abs/2502.03907
Authors: Emil Mededovic,Valdy Laurentius,Yuli Wu,Marcin Kopaczka,Zhu Chen,Mareike Schulz,René Tolba,Johannes Stegmaier
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract:We analyze the capabilities of foundation models addressing the tedious task of generating annotations for animal tracking. Annotating a large amount of data is vital and can be a make-or-break factor for the robustness of a tracking model. Robustness is particularly crucial in animal tracking, as accurate tracking over long time horizons is essential for capturing the behavior of animals. However, generating additional annotations using foundation models can be counterproductive, as the quality of the annotations is just as important. Poorly annotated data can introduce noise and inaccuracies, ultimately compromising the performance and accuracy of the trained model. Over-reliance on automated annotations without ensuring precision can lead to diminished results, making careful oversight and quality control essential in the annotation process. Ultimately, we demonstrate that a thoughtful combination of automated annotations and manually annotated data is a valuable strategy, yielding an IDF1 score of 80.8 against blind usage of SAM2 video with an IDF1 score of 65.6.
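As a reminder of what the reported metric measures, IDF1 is the harmonic mean of identification precision and recall, which reduces to 2·IDTP / (2·IDTP + IDFP + IDFN). The short sketch below evaluates that formula; the function name and the example counts are made up for illustration and are not taken from the paper.

```python
def idf1(idtp, idfp, idfn):
    """IDF1 from identity-level true positives, false positives and false negatives."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Hypothetical counts: 80 correctly identified detections,
# 10 identity false positives, 30 identity false negatives.
score = idf1(idtp=80, idfp=10, idfn=30)
```

Tracking benchmarks usually report this value as a percentage, which is the scale of the 80.8 and 65.6 figures in the abstract.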

[CV-34] LeAP: Consistent multi-domain 3D labeling using Foundation Models ICRA25

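[Quick read]: This paper addresses automatic semantic labeling of 3D point cloud data. Unlabeled 3D point clouds are easy to obtain, but adding semantic labels manually is time-consuming and costly. Existing approaches mostly adapt 2D Vision Foundation Models (VFMs) to 3D, which can produce inconsistent labels. The key solution is Label Any Pointcloud (LeAP), which uses 2D VFMs to label 3D data automatically with any set of classes and uses a Bayesian update to keep labels consistent. A novel 3D Consistency Network (3D-CN) further improves label quality. Experiments show that the method generates high-quality 3D semantic labels across diverse domains without any manual labeling.

Link: https://arxiv.org/abs/2502.03901
Authors: Simon Gebraad,Andras Palffy,Holger Caesar
Affiliations: Delft University of Technology; PercivAI
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 9 pages, 4 figures. ICRA25 preprint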

Abstract:Availability of datasets is a strong driver for research on 3D semantic understanding, and whilst obtaining unlabeled 3D point cloud data is straightforward, manually annotating this data with semantic labels is time-consuming and costly. Recently, Vision Foundation Models (VFMs) enable open-set semantic segmentation on camera images, potentially aiding automatic labeling. However, VFMs for 3D data have been limited to adaptations of 2D models, which can introduce inconsistencies to 3D labels. This work introduces Label Any Pointcloud (LeAP), leveraging 2D VFMs to automatically label 3D data with any set of classes in any kind of application whilst ensuring label consistency. Using a Bayesian update, point labels are combined into voxels to improve spatio-temporal consistency. A novel 3D Consistency Network (3D-CN) exploits 3D information to further improve label quality. Through various experiments, we show that our method can generate high-quality 3D semantic labels across diverse fields without any manual labeling. Further, models adapted to new domains using our labels show up to a 34.2 mIoU increase in semantic segmentation tasks.
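One generic way to combine noisy per-point labels inside a voxel with a Bayesian update is to treat each point's class scores as a likelihood and multiply them into a running class belief. The sketch below does exactly that; it is a textbook categorical update, not LeAP's implementation, and the function names, the uniform prior, and the assumption that observations are independent are all introduced here.

```python
def bayes_update(prior, likelihood):
    """One categorical Bayesian update: posterior is proportional to prior * likelihood."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnorm)
    if total == 0.0:
        return list(prior)  # uninformative observation, keep the prior
    return [u / total for u in unnorm]


def fuse_voxel_labels(observations, num_classes):
    """Fuse per-point class scores that fall into one voxel, from a uniform prior."""
    belief = [1.0 / num_classes] * num_classes
    for scores in observations:
        belief = bayes_update(belief, scores)
    return belief


# Three points in the same voxel with two candidate classes; the third
# observation disagrees strongly with the first two.
belief = fuse_voxel_labels([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]], num_classes=2)
voxel_label = max(range(len(belief)), key=belief.__getitem__)
```

The fused belief gives one label per voxel, so points observed at different times or from different views end up with the same class.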

[CV-35] UniForm: A Unified Diffusion Transformer for Audio-Video Generation

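[Quick read]: This paper addresses a limitation of existing diffusion-based audio-video generation systems: the modules for each modality are relatively independent and so fail to exploit the intrinsic correlations between audio and visual modalities. It proposes UniForm, a unified diffusion transformer. The key is generating audio and video simultaneously within a unified latent space, using concatenated auditory and visual information to strengthen cross-modal consistency and produce high-quality, well-aligned audio-video pairs.

Link: https://arxiv.org/abs/2502.03897
Authors: Lei Zhao,Linfeng Feng,Dongxu Ge,Fangqiu Yi,Chi Zhang,Xiao-Lei Zhang,Xuelong Li
Affiliations: unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: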

Abstract:As a natural multimodal content, audible video delivers an immersive sensory experience. Consequently, audio-video generation systems have substantial potential. However, existing diffusion-based studies mainly employ relatively independent modules for generating each modality, which lack exploration of shared-weight generative modules. This approach may under-use the intrinsic correlations between audio and visual modalities, potentially resulting in sub-optimal generation quality. To address this, we propose UniForm, a unified diffusion transformer designed to enhance cross-modal consistency. By concatenating auditory and visual information, UniForm learns to generate audio and video simultaneously within a unified latent space, facilitating the creation of high-quality and well-aligned audio-visual pairs. Extensive experiments demonstrate the superior performance of our method in joint audio-video generation, audio-guided video generation, and video-guided audio generation tasks. Our demos are available at this https URL.

[CV-36] Rule-Based Modeling of Low-Dimensional Data with PCA and Binary Particle Swarm Optimization (BPSO) in ANFIS

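[Quick read]: This paper addresses rule explosion in fuzzy rule-based systems on high-dimensional data and aims to improve on the interpretability of deep learning on low-dimensional data. The key is a strategic rule-reduction model that applies Principal Component Analysis (PCA) to normalized firing strengths to obtain linearly uncorrelated components. Binary Particle Swarm Optimization (BPSO) then selectively refines these components, significantly reducing the number of rules while preserving decision precision. A custom parameter update mechanism dynamically adjusts the BPSO parameters to avoid local optima, further fine-tuning specific ANFIS layers. The method is validated on several standard datasets and a real-world ischemic stroke dataset, demonstrating its adaptability and practicality.

Link: https://arxiv.org/abs/2502.03895
Authors: Afnan Al-Ali,Uvais Qidwai
Affiliations: Computer Science and Engineering Department, Qatar University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 41 pages, 9 figures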

Abstract:Fuzzy rule-based systems interpret data in low-dimensional domains, providing transparency and interpretability. In contrast, deep learning excels in complex tasks like image and speech recognition but is prone to overfitting in sparse, unstructured, or low-dimensional data. This interpretability is crucial in fields like healthcare and finance. Traditional rule-based systems, especially ANFIS with grid partitioning, suffer from exponential rule growth as dimensionality increases. We propose a strategic rule-reduction model that applies Principal Component Analysis (PCA) on normalized firing strengths to obtain linearly uncorrelated components. Binary Particle Swarm Optimization (BPSO) selectively refines these components, significantly reducing the number of rules while preserving precision in decision-making. A custom parameter update mechanism fine-tunes specific ANFIS layers by dynamically adjusting BPSO parameters, avoiding local minima. We validated our approach on standard UCI respiratory, keel classification, regression datasets, and a real-world ischemic stroke dataset, demonstrating adaptability and practicality. Results indicate fewer rules, shorter training, and high accuracy, underscoring the method's effectiveness for low-dimensional interpretability and complex data scenarios. This synergy of fuzzy logic and optimization fosters robust solutions. Our method contributes a powerful framework for interpretable AI in multiple domains. It addresses dimensionality, ensuring a rule base.

[CV-37] Advanced Object Detection and Pose Estimation with Hybrid Task Cascade and High-Resolution Networks

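[Quick read]: This paper targets high-accuracy 6D object detection and pose estimation, which are critical in computer vision for applications such as robotics, augmented reality, and autonomous driving. Traditional methods often struggle to achieve accurate object detection and precise pose estimation at the same time. The key is an improved pipeline built on the existing 6D-VNet framework and enhanced by integrating a Hybrid Task Cascade (HTC) with a High-Resolution Network (HRNet) backbone. The approach combines HTC's multi-stage refinement with HRNet's ability to maintain high-resolution representations, improving both detection accuracy and pose-estimation precision. The paper also introduces advanced post-processing techniques and a new model integration strategy, which together yield strong performance on public and private benchmarks.

Link: https://arxiv.org/abs/2502.03877
Authors: Yuhui Jin, Yaqiong Zhang, Zheyuan Xu, Wenqing Zhang, Jingyu Xu
Affiliations: California Institute of Technology; University of Michigan, Ann Arbor; University of Washington; Washington University in St. Louis; Northern Arizona University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: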

Abstract:In the field of computer vision, 6D object detection and pose estimation are critical for applications such as robotics, augmented reality, and autonomous driving. Traditional methods often struggle with achieving high accuracy in both object detection and precise pose estimation simultaneously. This study proposes an improved 6D object detection and pose estimation pipeline based on the existing 6D-VNet framework, enhanced by integrating a Hybrid Task Cascade (HTC) and a High-Resolution Network (HRNet) backbone. By leveraging the strengths of HTC’s multi-stage refinement process and HRNet’s ability to maintain high-resolution representations, our approach significantly improves detection accuracy and pose estimation precision. Furthermore, we introduce advanced post-processing techniques and a novel model integration strategy that collectively contribute to superior performance on public and private benchmarks. Our method demonstrates substantial improvements over state-of-the-art models, making it a valuable contribution to the domain of 6D object detection and pose estimation.

[CV-38] Taking A Closer Look at Interacting Objects: Interaction-Aware Open Vocabulary Scene Graph Generation

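[Quick read]: This paper addresses two main problems in Open Vocabulary Scene Graph Generation (OVSGG): existing methods do not explicitly model interacting objects, and they treat all objects equally, which leads to mismatched relation pairs. The key is INOVA, an interaction-aware framework. During pre-training, it uses an interaction-aware target generation strategy to distinguish interacting objects from non-interacting ones; during supervised fine-tuning, it uses an interaction-guided query selection strategy to prioritize interacting objects. INOVA also applies interaction-consistent knowledge distillation to improve robustness. Experiments on two benchmarks (VG and GQA) show that INOVA achieves state-of-the-art performance.

Link: https://arxiv.org/abs/2502.03856
Authors: Lin Li, Chuhan Zhang, Dong Zhang, Chong Sun, Chen Li, Long Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: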

Abstract:Today’s open vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge from pre-trained large-scale models. Most existing methods adopt a two-stage pipeline: weakly supervised pre-training with image captions and supervised fine-tuning (SFT) on fully annotated scene graphs. Nonetheless, they omit explicit modeling of interacting objects and treat all objects equally, resulting in mismatched relation pairs. To this end, we propose an interaction-aware OVSGG framework INOVA. During pre-training, INOVA employs an interaction-aware target generation strategy to distinguish interacting objects from non-interacting ones. In SFT, INOVA devises an interaction-guided query selection tactic to prioritize interacting objects during bipartite graph matching. Besides, INOVA is equipped with an interaction-consistent knowledge distillation to enhance the robustness by pushing interacting object pairs away from the background. Extensive experiments on two benchmarks (VG and GQA) show that INOVA achieves state-of-the-art performance, demonstrating the potential of interaction-aware mechanisms for real-world applications.

[CV-39] Semi-rPPG: Semi-Supervised Remote Physiological Measurement with Curriculum Pseudo-Labeling

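[Quick read]: This paper addresses the difficulty of collecting labeled data for remote photoplethysmography (rPPG) physiological signal monitoring, which limits the generalization and scale of AI models. The key is Semi-rPPG, a new semi-supervised learning method that combines curriculum pseudo-labeling with consistency regularization for quasi-periodic signals. The curriculum pseudo-labeling strategy annotates unlabeled data using a signal-to-noise ratio (SNR) criterion and adaptively filters out low-quality data; the consistency regularization is computed over weakly and strongly augmented clips, so that intrinsic physiological features can be extracted without being corrupted by noise.

Link: https://arxiv.org/abs/2502.03855
Authors: Bingjie Wu, Zitong Yu, Yiping Xie, Wei Liu, Chaoqi Luo, Yong Liu, Rick Siow Mong Goh
Affiliations: Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore; School of Computing and Information Technology, Great Bay University, Dongguan, 523000, China; Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University, Shenzhen, 518060, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Instrumentation and Measurement (TIM)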

Abstract:Remote Photoplethysmography (rPPG) is a promising technique to monitor physiological signals such as heart rate from facial videos. However, the labeled facial videos in this research are challenging to collect. Current rPPG research is mainly based on several small public datasets collected in simple environments, which limits the generalization and scale of the AI models. Semi-supervised methods that leverage a small amount of labeled data and abundant unlabeled data can fill this gap for rPPG learning. In this study, a novel semi-supervised learning method named Semi-rPPG that combines curriculum pseudo-labeling and consistency regularization is proposed to extract intrinsic physiological features from unlabelled data without impairing the model from noises. Specifically, a curriculum pseudo-labeling strategy with signal-to-noise ratio (SNR) criteria is proposed to annotate the unlabelled data while adaptively filtering out the low-quality unlabelled data. Besides, a novel consistency regularization term for quasi-periodic signals is proposed through weak and strong augmented clips. To benefit the research on semi-supervised rPPG measurement, we establish a novel semi-supervised benchmark for rPPG learning through intra-dataset and cross-dataset evaluation on four public datasets. The proposed Semi-rPPG method achieves the best results compared with three classical semi-supervised methods under different protocols. Ablation studies are conducted to prove the effectiveness of the proposed methods.
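
The SNR-based filtering of pseudo-labels can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the linear threshold schedule, the decibel values, and the names `snr_threshold` and `select_pseudo_labels` are hypothetical, and the paper's actual curriculum may differ.

```python
def snr_threshold(epoch: int, total_epochs: int,
                  start_db: float = 6.0, end_db: float = 0.0) -> float:
    """Assumed schedule: linearly move the SNR cutoff from start_db to end_db."""
    if total_epochs <= 1:
        return end_db
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start_db + frac * (end_db - start_db)


def select_pseudo_labels(clips, epoch, total_epochs):
    """Keep the ids of (clip_id, snr_db) pairs whose SNR meets the cutoff."""
    cutoff = snr_threshold(epoch, total_epochs)
    return [clip_id for clip_id, snr_db in clips if snr_db >= cutoff]


# Hypothetical unlabeled clips with the SNR of their predicted rPPG signal.
clips = [("a", 7.5), ("b", 3.0), ("c", -1.0)]
print(select_pseudo_labels(clips, epoch=0, total_epochs=11))
print(select_pseudo_labels(clips, epoch=10, total_epochs=11))
```

Under this assumed schedule only the cleanest clip is used at the start, and noisier clips are admitted later once the model is presumably more reliable.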

[CV-40] Pursuing Better Decision Boundaries for Long-Tailed Object Detection via Category Information Amount ICLR2025

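[Quick read]: This paper addresses category bias in object detection, which persists even in datasets where instance counts are relatively balanced. The key is the concept of category information amount and a way to measure it; the authors observe a significant negative correlation between category information amount and accuracy. Based on this observation, they propose the Information Amount-Guided Angular Margin (IGAM) loss, whose core idea is to dynamically adjust the decision space of each category according to its information amount, thereby reducing category bias in long-tailed datasets.

Link: https://arxiv.org/abs/2502.03852
Authors: Yanbiao Ma, Wei Dai, Jiayi Chen
Affiliations: Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Published as a conference paper at ICLR 2025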

Abstract:In object detection, the instance count is typically used to define whether a dataset exhibits a long-tail distribution, implicitly assuming that models will underperform on categories with fewer instances. This assumption has led to extensive research on category bias in datasets with imbalanced instance counts. However, models still exhibit category bias even in datasets where instance counts are relatively balanced, clearly indicating that instance count alone cannot explain this phenomenon. In this work, we first introduce the concept and measurement of category information amount. We observe a significant negative correlation between category information amount and accuracy, suggesting that category information amount more accurately reflects the learning difficulty of a category. Based on this observation, we propose Information Amount-Guided Angular Margin (IGAM) Loss. The core idea of IGAM is to dynamically adjust the decision space of each category based on its information amount, thereby reducing category bias in long-tail datasets. IGAM Loss not only performs well on long-tailed benchmark datasets such as LVIS v1.0 and COCO-LT but also shows significant improvement for underrepresented categories in the non-long-tailed dataset Pascal VOC. Comprehensive experiments demonstrate the potential of category information amount as a tool and the generality of our proposed method.

[CV-41] Adapting Human Mesh Recovery with Vision-Language Feedback

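[Quick read]: This paper addresses accurate 3D perception and image consistency in monocular human mesh recovery. The key is to use large vision-language models (VLMs) to generate interactive body-part descriptions, which serve as implicit constraints that enhance 3D perception and limit the optimization space. Specifically, the authors formulate monocular human mesh recovery as a distribution adaptation task that integrates 2D observations and language descriptions. They first train a text encoder and a pose VQ-VAE (vector-quantized variational autoencoder), using contrastive learning to align text and body poses in a shared latent space. A diffusion-based framework then refines the initial parameters, guided by gradients derived from both the 2D observations and the text descriptions. The resulting model produces poses with accurate 3D perception and image consistency.

Link: https://arxiv.org/abs/2502.03836
Authors: Chongyang Xu, Buzhen Huang, Chengfang Zhang, Ziliang Feng, Yangang Wang
Affiliations: College of Computer Science, Sichuan University, China; Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education, School of Automation, Southeast University, China; Intelligent Policing Key Laboratory of Sichuan Province, Sichuan Police College, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 7 figures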

Abstract:Human mesh recovery can be approached using either regression-based or optimization-based methods. Regression models achieve high pose accuracy but struggle with model-to-image alignment due to the lack of explicit 2D-3D correspondences. In contrast, optimization-based methods align 3D models to 2D observations but are prone to local minima and depth ambiguity. In this work, we leverage large vision-language models (VLMs) to generate interactive body part descriptions, which serve as implicit constraints to enhance 3D perception and limit the optimization space. Specifically, we formulate monocular human mesh recovery as a distribution adaptation task by integrating both 2D observations and language descriptions. To bridge the gap between text and 3D pose signals, we first train a text encoder and a pose VQ-VAE, aligning texts to body poses in a shared latent space using contrastive learning. Subsequently, we employ a diffusion-based framework to refine the initial parameters guided by gradients derived from both 2D observations and text descriptions. Finally, the model can produce poses with accurate 3D perception and image consistency. Experimental results on multiple benchmarks validate its effectiveness. The code will be made publicly available.

[CV-42] Single-Domain Generalized Object Detection by Balancing Domain Diversity and Invariance

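[Quick read]: This paper addresses single-domain generalized object detection (S-DGOD), where knowledge is transferred from a single source domain to unseen target domains and existing models overemphasize feature invariance at the cost of ignoring real differences between images. The key is the Diversity Invariance Detection Model (DIDM), which balances the diversity of domain-specific features against invariance across domains. A Diversity Learning Module (DLM) preserves the diversity of domain-specific information while limiting category semantics in those features, and a Weighted Aligning Module (WAM) maintains domain invariance without compromising feature diversity.

Link: https://arxiv.org/abs/2502.03835
Authors: Zhenwei He, Hongsu Ni
Affiliations: Chongqing University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: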

Abstract:Single-domain generalization for object detection (S-DGOD) aims to transfer knowledge from a single source domain to unseen target domains. In recent years, many models have focused primarily on achieving feature invariance to enhance robustness. However, due to the inherent diversity across domains, an excessive emphasis on invariance can cause the model to overlook the actual differences between images. This overemphasis may complicate the training process and lead to a loss of valuable information. To address this issue, we propose the Diversity Invariance Detection Model (DIDM), which focuses on the balance between the diversity of domain-specific features and invariance across domains. Recognizing that domain diversity introduces variations in domain-specific features, we introduce a Diversity Learning Module (DLM). The DLM is designed to preserve the diversity of domain-specific information with the proposed feature diversity loss while limiting the category semantics in the features. In addition, to maintain domain invariance, we incorporate a Weighted Aligning Module (WAM), which aligns features without compromising feature diversity. We evaluated our model on five distinct datasets, and the results illustrate the superior performance and effectiveness of the proposed model.

[CV-43] FE-UNet: Frequency Domain Enhanced U-Net with Segment Anything Capability for Versatile Image Segmentation

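[Quick read]: This paper addresses how different architectures extract visual features for image segmentation: Convolutional Neural Networks (CNNs) tend to capture high-frequency features, whereas Transformers focus on low-frequency features. The key is the Wavelet-Guided Spectral Pooling Module (WSPM), which enhances and balances image features across the frequency domain, together with the Frequency Domain Enhanced Receptive Field Block (FE-RFB), which integrates WSPM to further emulate the human visual system and extract enriched frequency-domain features. Building on these, the authors develop FE-UNet, which uses SAM2 as its backbone and Hiera-Large as a pre-trained block to improve generalization while keeping segmentation accuracy high. Experiments show that FE-UNet achieves state-of-the-art performance on diverse tasks, including marine animal and polyp segmentation.

Link: https://arxiv.org/abs/2502.03829
Authors: Guohao Huo, Ruiting Dai, Ling Shao, Hao Tang
Affiliations: University of Electronic Science and Technology of China; University of Chinese Academy of Sciences; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: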

Abstract:Image segmentation is a critical task in visual understanding. Convolutional Neural Networks (CNNs) are predisposed to capture high-frequency features in images, while Transformers exhibit a contrasting focus on low-frequency features. In this paper, we experimentally quantify the contrast sensitivity function of CNNs and compare it with that of the human visual system, informed by the seminal experiments of Mannos and Sakrison. Leveraging these insights, we propose the Wavelet-Guided Spectral Pooling Module (WSPM) to enhance and balance image features across the frequency domain. To further emulate the human visual system, we introduce the Frequency Domain Enhanced Receptive Field Block (FE-RFB), which integrates WSPM to extract enriched features from the frequency domain. Building on these innovations, we develop FE-UNet, a model that utilizes SAM2 as its backbone and incorporates Hiera-Large as a pre-trained block, designed to enhance generalization capabilities while ensuring high segmentation accuracy. Experimental results demonstrate that FE-UNet achieves state-of-the-art performance in diverse tasks, including marine animal and polyp segmentation, underscoring its versatility and effectiveness.

[CV-44] FairT2I: Mitigating Social Bias in Text-to-Image Generation via Large Language Model-Assisted Detection and Attribute Rebalancing

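[Quick read]: This paper addresses social bias in the outputs of Text-to-Image (T2I) models. The key is the FairT2I framework, which has two main components: a large language model (LLM)-based bias detection module that identifies potential social biases in images generated from text prompts, and an attribute rebalancing module that fine-tunes sensitive attributes to mitigate the identified biases. Together they significantly reduce bias while maintaining high-quality image generation.

Link: https://arxiv.org/abs/2502.03826
Authors: Jinya Sakurai, Issei Sato
Affiliations: The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: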

Abstract:The proliferation of Text-to-Image (T2I) models has revolutionized content creation, providing powerful tools for diverse applications ranging from artistic expression to educational material development and marketing. Despite these technological advancements, significant ethical concerns arise from these models’ reliance on large-scale datasets that often contain inherent societal biases. These biases are further amplified when AI-generated content is included in training data, potentially reinforcing and perpetuating stereotypes in the generated outputs. In this paper, we introduce FairT2I, a novel framework that harnesses large language models to detect and mitigate social biases in T2I generation. Our framework comprises two key components: (1) an LLM-based bias detection module that identifies potential social biases in generated images based on text prompts, and (2) an attribute rebalancing module that fine-tunes sensitive attributes within the T2I model to mitigate identified biases. Our extensive experiments across various T2I models and datasets show that FairT2I can significantly reduce bias while maintaining high-quality image generation. We conducted both qualitative user studies and quantitative non-parametric analyses in the generated image feature space, building upon the occupational dataset introduced in the Stable Bias study. Our results show that FairT2I successfully mitigates social biases and enhances the diversity of sensitive attributes in generated images. We further demonstrate, using the P2 dataset, that our framework can detect subtle biases that are challenging for human observers to perceive, extending beyond occupation-related prompts. On the basis of these findings, we introduce a new benchmark dataset for evaluating bias in T2I models.

[CV-45] Optimized Unet with Attention Mechanism for Multi-Scale Semantic Segmentation

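[Quick read]: This paper addresses the limitations of the traditional Unet model when handling complex backgrounds, long-distance dependencies, and multi-scale targets. The key is to combine Unet with an attention mechanism: channel attention and spatial attention modules strengthen the model's focus on important features, and a multi-scale feature fusion strategy optimizes the skip connections, improving the combination of global semantic information and fine-grained features. On the Cityscapes dataset, the improved model reaches a mean intersection over union (mIoU) of 76.5% and a pixel accuracy (PA) of 95.3%, supporting the method's advantage in complex scenes and on blurred target boundaries.

Link: https://arxiv.org/abs/2502.03813
Authors: Xuan Li, Quanchao Lu, Yankaiqi Li, Muqing Li, Yijiashun Qi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: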

Abstract:Semantic segmentation is one of the core tasks in the field of computer vision, and its goal is to accurately classify each pixel in an image. The traditional Unet model achieves efficient feature extraction and fusion through an encoder-decoder structure, but it still has certain limitations when dealing with complex backgrounds, long-distance dependencies, and multi-scale targets. To this end, this paper proposes an improved Unet model combined with an attention mechanism, introduces channel attention and spatial attention modules, enhances the model’s ability to focus on important features, and optimizes skip connections through a multi-scale feature fusion strategy, thereby improving the combination of global semantic information and fine-grained features. The experiment is based on the Cityscapes dataset and compared with classic models such as FCN, SegNet, DeepLabv3+, and PSPNet. The improved model performs well in terms of mIoU and pixel accuracy (PA), reaching 76.5% and 95.3% respectively. The experimental results verify the superiority of this method in dealing with complex scenes and blurred target boundaries. In addition, this paper discusses the potential of the improved model in practical applications and future expansion directions, indicating that it has broad application value in fields such as autonomous driving, remote sensing image analysis, and medical image processing.
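
For readers unfamiliar with the two reported metrics, the sketch below shows how mIoU and pixel accuracy are conventionally computed from a confusion matrix. It is a generic illustration on toy labels, not the paper's evaluation code; the helper names are illustrative, and skipping classes that never occur is one common convention among several.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels with ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Fraction of all pixels whose predicted class equals the ground truth."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def mean_iou(cm):
    """Average of per-class TP / (TP + FP + FN), skipping absent classes."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy example: four pixels, two classes (flattened label maps).
cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(pixel_accuracy(cm), mean_iou(cm))
```

Pixel accuracy is dominated by large classes, whereas mIoU weights every class equally, which is why the two numbers in the abstract differ so much.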

[CV-46] DeblurDiff: Real-World Image Deblurring with Generative Diffusion Models

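[Quick read]: This paper addresses real-world image deblurring. The key is a joint training scheme in latent space: a Latent Kernel Prediction Network (LKPN) is co-trained with a conditional diffusion model. The LKPN learns a spatially variant kernel that guides the restoration of sharp images in the latent space, and element-wise adaptive convolution (EAC) applies this kernel so that the structural information of the input is preserved. This guides the generative process of Stable Diffusion (SD) more effectively, improving both deblurring and detail reconstruction. In addition, the result of each diffusion step is used to iteratively re-estimate the kernels in the LKPN, which further improves the accuracy and robustness of deblurring.

Link: https://arxiv.org/abs/2502.03810
Authors: Lingshun Kong, Jiawei Zhang, Dongqing Zou, Jimmy Ren, Xiaohe Wu, Jiangxin Dong, Jinshan Pan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: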

Abstract:Diffusion models have achieved significant progress in image generation. The pre-trained Stable Diffusion (SD) models are helpful for image deblurring by providing clear image priors. However, directly using a blurry image or pre-deblurred one as a conditional control for SD will either hinder accurate structure extraction or make the results overly dependent on the deblurring network. In this work, we propose a Latent Kernel Prediction Network (LKPN) to achieve robust real-world image deblurring. Specifically, we co-train the LKPN in latent space with conditional diffusion. The LKPN learns a spatially variant kernel to guide the restoration of sharp images in the latent space. By applying element-wise adaptive convolution (EAC), the learned kernel is utilized to adaptively process the input feature, effectively preserving the structural information of the input. This process thereby more effectively guides the generative process of Stable Diffusion (SD), enhancing both the deblurring efficacy and the quality of detail reconstruction. Moreover, the results at each diffusion step are utilized to iteratively estimate the kernels in LKPN to better restore the sharp latent by EAC. This iterative refinement enhances the accuracy and robustness of the deblurring process. Extensive experimental results demonstrate that the proposed method outperforms state-of-the-art image deblurring methods on both benchmark and real-world images.
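
A spatially variant kernel means that every position is filtered with its own weights instead of one shared filter. The sketch below illustrates that idea on a single-channel 2D feature map with zero padding; it is an assumed, simplified reading of element-wise adaptive convolution, and the function `adaptive_conv` is hypothetical rather than the authors' implementation.

```python
def adaptive_conv(feature, kernels):
    """Filter each pixel with its own k x k kernel (k odd), zero-padded.

    feature: H x W list of floats.
    kernels: H x W list whose entries are k x k lists of weights.
    """
    h, w = len(feature), len(feature[0])
    r = len(kernels[0][0]) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernels[y][x][dy + r][dx + r] * feature[yy][xx]
            out[y][x] = acc
    return out


identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
box = [[1.0] * 3 for _ in range(3)]
feature = [[1.0, 2.0], [3.0, 4.0]]
kernels = [[identity, identity], [identity, identity]]
kernels[0][0] = box  # only the top-left pixel gets a different kernel
print(adaptive_conv(feature, kernels))
```

In the paper's setting the per-pixel kernels would be predicted by the LKPN from the blurry latent rather than written by hand as they are here.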

[CV-47] Gaze-Assisted Human-Centric Domain Adaptation for Cardiac Ultrasound Image Segmentation

[Quick read]: This paper addresses domain adaptation (DA) for cardiac ultrasound image segmentation, focusing on the way existing methods suffer from incomplete pseudo-labels and low-quality target-to-source images. The key is gaze-assisted human-centric domain adaptation (GAHCDA), which has two main modules: Gaze Augment Alignment (GAA), which lets the model acquire general features of human cognition so that it can recognize segmentation targets in cardiac ultrasound images from different domains; and Gaze Balance Loss (GBL), which fuses gaze heatmaps with the outputs so that segmentation results are structurally closer to the target domain. Experiments show that the framework segments cardiac ultrasound images in the target domain more effectively than GAN-based and other self-training methods, indicating potential for clinical application.

Link: https://arxiv.org/abs/2502.03781
Authors: Ruiyi Li, Yuting He, Rongjun Ge, Chong Wang, Daoqiang Zhang, Yang Chen, Shuo Li
Affiliations: College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics; School of Computer Science and Engineering, Southeast University; School of Instrument Science and Engineering, Southeast University; Department of Biomedical Engineering, Case Western Reserve University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Domain adaptation (DA) for cardiac ultrasound image segmentation is clinically significant and valuable. However, previous domain adaptation methods are prone to be affected by incomplete pseudo-labels and low-quality target-to-source images. Human-centric domain adaptation has the great advantage of human cognitive guidance, which helps the model adapt to the target domain and reduces reliance on labels. Doctor gaze trajectories contain a large amount of cross-domain human guidance. To leverage gaze information and human cognition for guiding domain adaptation, we propose gaze-assisted human-centric domain adaptation (GAHCDA), which reliably guides the domain adaptation of cardiac ultrasound images. GAHCDA includes the following modules: (1) Gaze Augment Alignment (GAA): GAA enables the model to obtain general features of human cognition to recognize segmentation targets in different domains of cardiac ultrasound images like humans. (2) Gaze Balance Loss (GBL): GBL fuses the gaze heatmap with the outputs, which makes the segmentation result structurally closer to the target domain. The experimental results illustrate that our proposed framework is able to segment cardiac ultrasound images more effectively in the target domain than GAN-based methods and other self-training-based methods, showing great potential in clinical application.

[CV-48] Multi-Label Test-Time Adaptation with Bound Entropy Minimization ICLR2025

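[Quick read]: This paper addresses test-time adaptation (TTA) in multi-label scenarios. Conventional TTA methods mitigate distribution shift through entropy minimization, which typically raises the probability of the most confident class; on multi-label instances this undermines the adaptation of the other positive labels. The key is the Bound Entropy Minimization (BEM) objective, which increases the confidence of multiple top predicted labels simultaneously. For each augmented view, the method retrieves a paired caption with textual labels to determine the number of labels k, and assigns these labels to the view and the caption as a weak label set and a strong label set. BEM then treats the top-k predicted labels from the view and from the caption each as a single entity and learns the view and caption prompts concurrently, overcoming the limitation of optimizing only the most confident class.

Link: https://arxiv.org/abs/2502.03777
Authors: Xiangyu Wu, Feng Yu, Qing-Guo Chen, Yang Yang, Jianfeng Lu
Affiliations: Nanjing University of Science and Technology; Alibaba International Digital Commerce Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at ICLR 2025; 17 pages; 3 figures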

Abstract:Mainstream test-time adaptation (TTA) techniques endeavor to mitigate distribution shifts via entropy minimization for multi-class classification, inherently increasing the probability of the most confident class. However, when encountering multi-label instances, the primary challenge stems from the varying number of labels per image, and prioritizing only the highest probability class inevitably undermines the adaptation of other positive labels. To address this issue, we investigate TTA within multi-label scenario (ML–TTA), developing Bound Entropy Minimization (BEM) objective to simultaneously increase the confidence of multiple top predicted labels. Specifically, to determine the number of labels for each augmented view, we retrieve a paired caption with yielded textual labels for that view. These labels are allocated to both the view and caption, called weak label set and strong label set with the same size k. Following this, the proposed BEM considers the highest top-k predicted labels from view and caption as a single entity, respectively, learning both view and caption prompts concurrently. By binding top-k predicted labels, BEM overcomes the limitation of vanilla entropy minimization, which exclusively optimizes the most confident class. Across the MSCOCO, VOC, and NUSWIDE multi-label datasets, our ML–TTA framework equipped with BEM exhibits superior performance compared to the latest SOTA methods, across various model architectures, prompt initialization, and varying label scenarios. The code is available at this https URL.
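
To make the contrast with vanilla entropy minimization concrete, here is a toy sketch of one plausible reading of "binding" the top-k labels: their probability mass is merged into a single entry before the entropy is taken. This is an assumption for illustration only; the paper defines its objective on the outputs of view and caption prompts and may formulate the binding differently, and `bound_entropy` is a hypothetical name.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def bound_entropy(probs, k):
    """Entropy after merging the k largest probabilities into one entry."""
    ranked = sorted(probs, reverse=True)
    return entropy([sum(ranked[:k])] + ranked[k:])


# Two equally likely positive labels plus two unlikely ones.
probs = [0.4, 0.4, 0.1, 0.1]
print(entropy(probs))           # penalizes the tie between the two positives
print(bound_entropy(probs, 2))  # treats the two positives as a single entity
```

Minimizing the vanilla entropy here would push one of the two 0.4 labels up at the expense of the other, whereas the bound version is already low because it does not penalize mass shared among the top-k labels.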

[CV-49] A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma

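[Quick read]: This paper addresses the insufficient sensitivity of ultrasound in early screening for hepatocellular carcinoma (HCC) and its heavy dependence on radiologists' expertise. The key is the Hierarchical Sparse Query Transformer (HSQformer), which combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) and uses sparse latent space representations to capture details at multiple granularities, improving HCC diagnostic accuracy without complex adjustments. HSQformer follows a modular, plug-and-play design for versatility and ease of use. Across three clinical scenarios (single-center, multi-center, and high-risk patient testing), HSQformer outperforms existing state-of-the-art models; its diagnostic ability matches that of senior radiologists and comprehensively surpasses that of junior radiologists.

Link: https://arxiv.org/abs/2502.03772
Authors: Chaoyin She, Ruifang Lu, Danni He, Jiayi Lv, Yadan Lin, Meiqing Cheng, Hui Huang, Lida Chen, Wei Wang, Qinghua Huang
Affiliations: Northwestern Polytechnical University; The First Affiliated Hospital of Sun Yat-Sen University; The Seventh Affiliated Hospital of Sun Yat-Sen University; The First Affiliated Hospital of Guangxi Medical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: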

Abstract:Hepatocellular carcinoma (HCC) ranks as the third leading cause of cancer-related mortality worldwide, with early detection being crucial for improving patient survival rates. However, early screening for HCC using ultrasound suffers from insufficient sensitivity and is highly dependent on the expertise of radiologists for interpretation. Leveraging the latest advancements in artificial intelligence (AI) in medical imaging, this study proposes an innovative Hierarchical Sparse Query Transformer (HSQformer) model that combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to enhance the accuracy of HCC diagnosis in ultrasound screening. The HSQformer leverages sparse latent space representations to capture hierarchical details at various granularities without the need for complex adjustments, and adopts a modular, plug-and-play design philosophy, ensuring the model’s versatility and ease of use. The HSQformer’s performance was rigorously tested across three distinct clinical scenarios: single-center, multi-center, and high-risk patient testing. In each of these settings, it consistently outperformed existing state-of-the-art models, such as ConvNext and SwinTransformer. Notably, the HSQformer even matched the diagnostic capabilities of senior radiologists and comprehensively surpassed those of junior radiologists. The experimental results from this study strongly demonstrate the effectiveness and clinical potential of AI-assisted tools in HCC screening. The full code is available at this https URL.

[CV-50] RAMOTS: A Real-Time System for Aerial Multi-Object Tracking based on Deep Learning and Big Data Technology

[Quick read]: This paper addresses multi-object tracking (MOT) in unmanned aerial vehicle (UAV) video, which is challenging because of viewpoint variation, low resolution, and small objects. The key is a real-time MOT framework that integrates Apache Kafka and Apache Spark for efficient, fault-tolerant video stream processing, together with the deep learning models YOLOv8/YOLOv10 and BYTETRACK/BoTSORT for accurate object detection and tracking. The system achieves a HOTA of 48.14 and a MOTA of 43.51 on the Visdrone2019-MOT test set while running in real time at 28 FPS on a single GPU.

Link: https://arxiv.org/abs/2502.03760
Authors: Nhat-Tan Do, Nhi Ngoc-Yen Nguyen, Dieu-Phuong Nguyen, Trong-Hop Do
Affiliations: Faculty of Information Science and Engineering, University of Information Technology; Vietnam National University, Ho Chi Minh City, Vietnam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-object tracking (MOT) in UAV-based video is challenging due to variations in viewpoint, low resolution, and the presence of small objects. While other research on MOT dedicated to aerial videos primarily focuses on the academic aspect by developing sophisticated algorithms, there is a lack of attention to the practical aspect of these systems. In this paper, we propose a novel real-time MOT framework that integrates Apache Kafka and Apache Spark for efficient and fault-tolerant video stream processing, along with state-of-the-art deep learning models YOLOv8/YOLOv10 and BYTETRACK/BoTSORT for accurate object detection and tracking. Our work highlights the importance of not only the advanced algorithms but also the integration of these methods with scalable and distributed systems. By leveraging these technologies, our system achieves a HOTA of 48.14 and a MOTA of 43.51 on the Visdrone2019-MOT test set while maintaining a real-time processing speed of 28 FPS on a single GPU. Our work demonstrates the potential of big data technologies and deep learning for addressing the challenges of MOT in UAV applications.

[CV-51] Improving Adversarial Robustness via Phase and Amplitude-aware Prompting

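[Quick read]: This paper addresses the vulnerability of deep neural networks to adversarial noise. Existing prompt-based defenses mainly use mixed prompt patterns and do not focus sufficiently on the critical patterns that are closely tied to object semantics. To address this, the paper proposes a Phase and Amplitude-aware Prompting (PAP) defense. The key is to construct phase-level and amplitude-level prompts for each class and, during training, to adjust the prompt weights according to the model's robust performance under these prompts.

Link: https://arxiv.org/abs/2502.03758
Authors: Yibo Xu, Dawei Zhou, Decheng Liu, Nannan Wang
Affiliations: Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: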

Abstract:Deep neural networks are found to be vulnerable to adversarial noises. The prompt-based defense has been increasingly studied due to its high efficiency. However, existing prompt-based defenses mainly exploited mixed prompt patterns, where critical patterns closely related to object semantics lack sufficient focus. The phase and amplitude spectra have been proven to be highly related to specific semantic patterns and crucial for robustness. To this end, in this paper, we propose a Phase and Amplitude-aware Prompting (PAP) defense. Specifically, we construct phase-level and amplitude-level prompts for each class, and adjust weights for prompting according to the model’s robust performance under these prompts during training. During testing, we select prompts for each image using its predicted label to obtain the prompted image, which is inputted to the model to get the final prediction. Experimental results demonstrate the effectiveness of our method.
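
The phase and amplitude spectra referred to above come from the Fourier transform of the input. The sketch below shows the decomposition and its inverse on a short 1D signal using a naive DFT, purely for illustration; the paper operates on images (2D spectra), and the helper names `split_spectrum` and `merge_spectrum` are hypothetical.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                for t in range(n)) for f in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n)
                for f in range(n)) / n for t in range(n)]


def split_spectrum(signal):
    """Return (amplitude, phase) lists of the signal's spectrum."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def merge_spectrum(amplitude, phase):
    """Rebuild a real signal from amplitude and phase spectra."""
    spectrum = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return [c.real for c in idft(spectrum)]


signal = [1.0, 2.0, 0.5, -1.0]
amplitude, phase = split_spectrum(signal)
print([round(v, 6) for v in merge_spectrum(amplitude, phase)])
```

Because the two spectra can be edited independently before merging, a prompt could in principle act on one of them at a time, which is the separation the PAP defense exploits.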

[CV-52] Brain Tumor Identification using Improved YOLOv8

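[Quick read]: This paper addresses brain tumor boundary detection, in particular accurately identifying tumor extent in magnetic resonance imaging (MRI); manual boundary detection is time-consuming and requires expertise. The paper proposes a modified You Only Look Once (YOLOv8) model with three key changes: replacing the Non-Maximum Suppression (NMS) algorithm in the detection head with a Real-Time Detection Transformer (RT-DETR); replacing regular convolution blocks with Ghost Convolution to reduce computational and memory costs; and adding a vision transformer block to the YOLOv8 backbone to extract context-aware features. On a publicly available brain tumor dataset, the model outperforms other object detectors and reaches 0.91 mAP@0.5.

Link: https://arxiv.org/abs/2502.03746
Authors: Rupesh Dulal, Rabin Dulal
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: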

Abstract:Identifying the extent of brain tumors is a significant challenge in brain cancer treatment. The main difficulty is in the approximate detection of tumor size. Magnetic resonance imaging (MRI) has become a critical diagnostic tool. However, manually detecting the boundaries of brain tumors from MRI scans is a labor-intensive task that requires extensive expertise. Deep learning and computer-aided detection techniques have led to notable advances in machine learning for this purpose. In this paper, we propose a modified You Only Look Once (YOLOv8) model to accurately detect the tumors within the MRI images. The proposed model replaced the Non-Maximum Suppression (NMS) algorithm with a Real-Time Detection Transformer (RT-DETR) in the detection head. NMS filters out redundant or overlapping bounding boxes in the detected tumors, but they are hand-designed and pre-set. RT-DETR removes hand-designed components. The second improvement was made by replacing the normal convolution block with ghost convolution. Ghost Convolution reduces computational and memory costs while maintaining high accuracy and enabling faster inference, making it ideal for resource-constrained environments and real-time applications. The third improvement was made by introducing a vision transformer block in the backbone of YOLOv8 to extract context-aware features. We used a publicly available dataset of brain tumors in the proposed model. The proposed model performed better than the original YOLOv8 model and also performed better than other object detectors (Faster R-CNN, Mask R-CNN, YOLO, YOLOv3, YOLOv4, YOLOv5, SSD, RetinaNet, EfficientDet, and DETR). The proposed model achieved 0.91 mAP (mean Average Precision)@0.5.
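
For context on the hand-designed step the abstract says is replaced, the sketch below shows classic greedy NMS with a fixed IoU threshold. It is the textbook procedure, not the RT-DETR head and not the authors' code; the box format `(x1, y1, x2, y2)` and the default threshold of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

The preset threshold is exactly the kind of hand-tuned component that end-to-end detectors aim to remove.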

[CV-53] Scaling Laws in Patchification: An Image Is Worth 50176 Tokens And More

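Quick read: This paper examines the information loss caused by the patchification-based compressive encoding paradigm and how it affects visual understanding. The key finding is that models consistently benefit from smaller patch sizes, with predictive performance improving all the way down to the minimum patch size of 1x1, i.e. pixel tokenization. This holds across vision tasks, input scales, and architectures, including Vision Transformer (ViT) and the recent Mamba models. The study also finds that with smaller patches, task-specific decoder heads matter less for dense prediction. The approach is to vary the patch size systematically and characterize its effect on vision model performance.

Link: https://arxiv.org/abs/2502.03738
Authors: Feng Wang, Yaodong Yu, Guoyizhe Wei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: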

Abstract:Since the introduction of Vision Transformer (ViT), patchification has long been regarded as a de facto image tokenization approach for plain visual architectures. By compressing the spatial size of images, this approach can effectively shorten the token sequence and reduce the computational cost of ViT-like plain architectures. In this work, we aim to thoroughly examine the information loss caused by this patchification-based compressive encoding paradigm and how it affects visual understanding. We conduct extensive patch size scaling experiments and excitedly observe an intriguing scaling law in patchification: the models can consistently benefit from decreased patch sizes and attain improved predictive performance, until it reaches the minimum patch size of 1x1, i.e., pixel tokenization. This conclusion is broadly applicable across different vision tasks, various input scales, and diverse architectures such as ViT and the recent Mamba models. Moreover, as a by-product, we discover that with smaller patches, task-specific decoder heads become less critical for dense prediction. In the experiments, we successfully scale up the visual sequence to an exceptional length of 50,176 tokens, achieving a competitive test accuracy of 84.6% with a base-sized model on the ImageNet-1k benchmark. We hope this study can provide insights and theoretical foundations for future works of building non-compressive vision models. Code is available at this https URL.
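
To make the sequence lengths concrete, the small sketch below counts tokens for a square image split into non-overlapping square patches. The 224x224 input size is an assumption: the abstract does not state it, but 224 x 224 = 50,176 matches the quoted token count at patch size 1x1. The function name is hypothetical.

```python
def num_tokens(image_size, patch_size):
    """Number of tokens for a square image cut into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size")
    side = image_size // patch_size
    return side * side


# Token count grows quadratically as the patch side shrinks.
for patch in (16, 8, 4, 2, 1):
    print(f"patch {patch:>2}x{patch:<2} -> {num_tokens(224, patch)} tokens")
```

Halving the patch side quadruples the sequence length, which is why pixel-level tokenization is expensive for attention-based models whose cost grows with the square of the sequence length.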

[CV-54] DICE: Distilling Classifier-Free Guidance into Text Embeddings

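Quick read: This paper addresses the problem that text-to-image diffusion models produce high-quality images that often align poorly with the given text prompt. It proposes DIstilling CFG by enhancing text Embeddings (DICE), which refines text embeddings so that they replicate the directions produced by classifier-free guidance (CFG), keeping CFG's benefits without relying on CFG at generation time. This avoids the computational and theoretical drawbacks of CFG, gives high-quality, well-aligned images at a fast sampling speed, and supports negative prompts to improve image quality further.

Link: https://arxiv.org/abs/2502.03726
Authors: Zhenyu Zhou, Defang Chen, Can Wang, Chun Chen, Siwei Lyu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: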

Abstract:Text-to-image diffusion models are capable of generating high-quality images, but these images often fail to align closely with the given text prompts. Classifier-free guidance (CFG) is a popular and effective technique for improving text-image alignment in the generative process. However, using CFG introduces significant computational overhead and deviates from the established theoretical foundations of diffusion models. In this paper, we present DIstilling CFG by enhancing text Embeddings (DICE), a novel approach that removes the reliance on CFG in the generative process while maintaining the benefits it provides. DICE distills a CFG-based text-to-image diffusion model into a CFG-free version by refining text embeddings to replicate CFG-based directions. In this way, we avoid the computational and theoretical drawbacks of CFG, enabling high-quality, well-aligned image generation at a fast sampling speed. Extensive experiments on multiple Stable Diffusion v1.5 variants, SDXL and PixArt-α demonstrate the effectiveness of our method. Furthermore, DICE supports negative prompts for image editing to improve image quality further. Code will be available soon.
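
For context on the overhead mentioned above, the sketch below shows the standard CFG combination of an unconditional and a conditional noise prediction. It illustrates CFG itself, not the DICE distillation procedure, whose details the abstract does not give. Noise predictions are plain Python lists here, and the function name is hypothetical.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor of guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]   # prediction with an empty prompt
eps_cond = [1.0, 3.0]     # prediction with the text prompt
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=2.0)
```

Each sampling step needs both predictions, so CFG roughly doubles the number of network evaluations; a scale of 1.0 recovers the plain conditional prediction.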

[CV-55] MD-BERT: Action Recognition in Dark Videos via Dynamic Multi-Stream Fusion and Temporal Modeling

Quick read: This paper addresses action recognition in low-light or noisy videos, where degraded visibility obscures key spatiotemporal details. It proposes MD-BERT, a multi-stream approach that combines complementary pre-processing techniques, such as gamma correction and histogram equalization, with the raw dark frames. A Dynamic Feature Fusion (DFF) module extends existing attentional fusion methods to a three-stream setting, capturing fine-grained and global context across different brightness and contrast enhancements. A BERT-based temporal model then captures long-range dependencies and contextual relationships across frames.

Link: https://arxiv.org/abs/2502.03724
Authors: Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C.-W. Phan
Affiliations: School of Information Technology, Monash University, Malaysia; Faculty of Information Technology, Monash University, Australia; Robotics and Autonomous Systems Lab, TCS Research, India
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
Notes:

Abstract:Action recognition in dark, low-light (under-exposed) or noisy videos is a challenging task due to visibility degradation, which can hinder critical spatiotemporal details. This paper proposes MD-BERT, a novel multi-stream approach that integrates complementary pre-processing techniques such as gamma correction and histogram equalization alongside raw dark frames to address these challenges. We introduce the Dynamic Feature Fusion (DFF) module, extending existing attentional fusion methods to a three-stream setting, thereby capturing fine-grained and global contextual information across different brightness and contrast enhancements. The fused spatiotemporal features are then processed by a BERT-based temporal model, which leverages its bidirectional self-attention to effectively capture long-range dependencies and contextual relationships across frames. Extensive experiments on the ARID V1.0 and ARID V1.5 dark video datasets show that MD-BERT outperforms existing methods, establishing a new state-of-the-art performance. Ablation studies further highlight the individual contributions of each input stream and the effectiveness of the proposed DFF and BERT modules. The official website of this work is available at: this https URL
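
The two pre-processing techniques named in the abstract are standard image operations. The sketch below applies them to an 8-bit grayscale frame stored as a list of rows and returns the three input streams. The gamma value of 0.5, the function names, and the grayscale simplification are assumptions for illustration; the paper's actual settings are not given in the abstract.

```python
def gamma_correct(frame, gamma=0.5):
    """Brighten an 8-bit grayscale frame (gamma < 1 lifts dark values)."""
    lut = [round(255 * (v / 255) ** gamma) for v in range(256)]
    return [[lut[p] for p in row] for row in frame]


def equalize_histogram(frame):
    """Spread intensities over the full 0..255 range using the cumulative histogram."""
    pixels = [p for row in frame for p in row]
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # constant frame: nothing to equalize
        return [row[:] for row in frame]
    lut = [max(0, round(255 * (c - cdf_min) / (total - cdf_min))) for c in cdf]
    return [[lut[p] for p in row] for row in frame]


def three_streams(frame):
    """Raw, gamma-corrected, and histogram-equalized views of one frame."""
    return frame, gamma_correct(frame), equalize_histogram(frame)


dark = [[0, 10], [20, 30]]
raw, brightened, equalized = three_streams(dark)
```

Gamma correction lifts dark values smoothly, while histogram equalization stretches contrast, so the two views emphasize different details of the same dark frame.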

[CV-56] Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment

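Quick read: This paper addresses interpretability across multiple pretrained deep neural networks and proposes Universal Sparse Autoencoders (USAEs). The key idea is to train a single, overcomplete sparse autoencoder (SAE) that ingests activations from any model and decodes them to approximate the activations of any other model under consideration. By optimizing a shared objective, the learned dictionary captures common factors of variation, i.e. concepts, across tasks, architectures, and datasets. USAEs thereby discover semantically coherent and important universal concepts in vision models, from low-level features to high-level structures, offering a new method for interpretable cross-model analysis and opening avenues for deeper insight into multi-model AI systems.

Link: https://arxiv.org/abs/2502.03714
Authors: Harrish Thasarathan, Julian Forsyth, Thomas Fel, Matthew Kowal, Konstantinos Derpanis
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: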

Abstract:We present Universal Sparse Autoencoders (USAEs), a framework for uncovering and aligning interpretable concepts spanning multiple pretrained deep neural networks. Unlike existing concept-based interpretability methods, which focus on a single model, USAEs jointly learn a universal concept space that can reconstruct and interpret the internal activations of multiple models at once. Our core insight is to train a single, overcomplete sparse autoencoder (SAE) that ingests activations from any model and decodes them to approximate the activations of any other model under consideration. By optimizing a shared objective, the learned dictionary captures common factors of variation (concepts) across different tasks, architectures, and datasets. We show that USAEs discover semantically coherent and important universal concepts across vision models, ranging from low-level features (e.g., colors and textures) to higher-level structures (e.g., parts and objects). Overall, USAEs provide a powerful new method for interpretable cross-model analysis and offer novel applications, such as coordinated activation maximization, that open avenues for deeper insights in multi-model AI systems.

[CV-57] Conditional Diffusion Models are Medical Image Classifiers that Provide Explainability and Uncertainty for Free

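Quick read: This paper explores the potential of class-conditional diffusion models for 2D medical image classification and proposes a novel majority voting scheme that improves the performance of medical diffusion classifiers. Experiments on the CheXpert and ISIC Melanoma skin cancer datasets show that both foundation and trained-from-scratch diffusion models are competitive. The paper also shows that diffusion classifiers are intrinsically explainable and can quantify the uncertainty of their predictions, which increases their reliability and trustworthiness in clinical settings.

Link: https://arxiv.org/abs/2502.03687
Authors: Gian Mario Favero, Parham Saremi, Emily Kaczmarek, Brennan Nichyporuk, Tal Arbel
Affiliations: McGill University; Mila – Quebec AI Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: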

Abstract:Discriminative classifiers have become a foundational tool in deep learning for medical imaging, excelling at learning separable features of complex data distributions. However, these models often need careful design, augmentation, and training techniques to ensure safe and reliable deployment. Recently, diffusion models have become synonymous with generative modeling in 2D. These models showcase robustness across a range of tasks including natural image classification, where classification is performed by comparing reconstruction errors across images generated for each possible conditioning input. This work presents the first exploration of the potential of class conditional diffusion models for 2D medical image classification. First, we develop a novel majority voting scheme shown to improve the performance of medical diffusion classifiers. Next, extensive experiments on the CheXpert and ISIC Melanoma skin cancer datasets demonstrate that foundation and trained-from-scratch diffusion models achieve competitive performance against SOTA discriminative classifiers without the need for explicit supervision. In addition, we show that diffusion classifiers are intrinsically explainable, and can be used to quantify the uncertainty of their predictions, increasing their trustworthiness and reliability in safety-critical, clinical contexts. Further information is available on our project page: this https URL
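
The abstract says classification works by comparing reconstruction errors across the possible conditioning inputs, and that a majority voting scheme improves results. The paper's actual voting scheme is not described here, so the sketch below is only one plausible reading: each trial (for example one noise sample) votes for the class with the lowest error, and the vote share serves as a crude confidence. The function name, labels, and error values are hypothetical.

```python
from collections import Counter


def classify_by_vote(errors_per_trial):
    """errors_per_trial: list of dicts mapping class label -> reconstruction error.

    Each trial votes for its lowest-error class; the winner is returned with
    the fraction of trials that agreed (assumed voting rule, not the paper's).
    """
    votes = Counter(min(trial, key=trial.get) for trial in errors_per_trial)
    label, count = votes.most_common(1)[0]
    return label, count / len(errors_per_trial)


trials = [
    {"benign": 0.2, "melanoma": 0.5},
    {"benign": 0.4, "melanoma": 0.3},
    {"benign": 0.1, "melanoma": 0.6},
]
label, agreement = classify_by_vote(trials)
```

A low agreement value flags inputs on which the trials disagree, which is one simple way the per-trial errors could feed an uncertainty estimate.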

[CV-58] Variational Control for Guidance in Diffusion Models

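Quick read: This paper addresses the limitations of guidance methods for diffusion models, which often require additional model training or are restricted to specific tasks. It revisits guidance from the perspective of variational inference and control and proposes Diffusion Trajectory Matching (DTM), which guides pretrained diffusion trajectories to satisfy a terminal cost without additional training or model modification. Within this framework, the paper introduces a new method that achieves state-of-the-art results on several linear and non-linear inverse problems. For example, on ImageNet non-linear deblurring it obtains an FID of 34.31, well ahead of the best pretrained-method baseline (FID 78.07).

Link: https://arxiv.org/abs/2502.03686
Authors: Kushagra Pandey, Farrin Marouf Sofian, Felix Draxler, Theofanis Karaletsos, Stephan Mandt
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Notes: 8 pages in main text. Total of 20 pages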

Abstract:Diffusion models exhibit excellent sample quality, but existing guidance methods often require additional model training or are limited to specific tasks. We revisit guidance in diffusion models from the perspective of variational inference and control, introducing Diffusion Trajectory Matching (DTM) that enables guiding pretrained diffusion trajectories to satisfy a terminal cost. DTM unifies a broad class of guidance methods and enables novel instantiations. We introduce a new method within this framework that achieves state-of-the-art results on several linear and (blind) non-linear inverse problems without requiring additional model training or modifications. For instance, in ImageNet non-linear deblurring, our model achieves an FID score of 34.31, significantly improving over the best pretrained-method baseline (FID 78.07). We will make the code available in a future update.

[CV-59] An Empirical Study of Methods for Small Object Detection from Satellite Imagery

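Quick read: This paper reviews methods for detecting small objects in remote sensing imagery and empirically evaluates four state-of-the-art methods to gain insight into their performance and technical challenges. Using public high-resolution satellite image datasets, it studies several top-performing methods in two application scenarios: car detection in urban satellite images and bee box detection in satellite images of agricultural land.

Link: https://arxiv.org/abs/2502.03674
Authors: Xiaohui Yuan, Aniv Chakravarty, Lichuan Gu, Zhenchun Wei, Elinor Lichtenberg, Tian Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: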

Abstract:This paper reviews object detection methods for finding small objects from remote sensing imagery and provides an empirical evaluation of four state-of-the-art methods to gain insights into method performance and technical challenges. In particular, we use car detection from urban satellite images and bee box detection from satellite images of agricultural lands as application scenarios. Drawing from the existing surveys and literature, we identify several top-performing methods for the empirical study. Public, high-resolution satellite image datasets are used in our experiments.

[CV-60] Advancing Weight and Channel Sparsification with Enhanced Saliency WACV2025

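Quick read: This paper addresses the performance drop that comes with accelerating and compressing models by pruning. Existing methods typically rely on imperfect importance scores to identify redundant parameters, and removal is irreversible, which degrades the pruned model. Dynamic sparse training adjusts the sparse structure during training for continual reassessment and refinement, but it suffers from inconsistent criteria between pruning and growth, unsuitability for structured sparsity, and short-sighted growth strategies.

The proposed solution is an efficient paradigm that enhances a given importance criterion for either unstructured or structured sparsity. The method splits the model into an active structure for exploitation and an exploration space for potential updates. During exploitation it optimizes the active structure; during exploration it reevaluates and reintegrates parameters from the exploration space through a pruning and growing step guided by the same importance criterion. Before exploration, all parameters in the exploration space are briefly "reactivated" and trained for a few iterations, which previews the potential performance gain from reintegrating them.

Link: https://arxiv.org/abs/2502.03658
Authors: Xinglong Sun, Maying Shen, Hongxu Yin, Lei Mao, Pavlo Molchanov, Jose M. Alvarez
Affiliations: NVIDIA
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted at WACV 2025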

Abstract:Pruning aims to accelerate and compress models by removing redundant parameters, identified by specifically designed importance scores which are usually imperfect. This removal is irreversible, often leading to subpar performance in pruned models. Dynamic sparse training, while attempting to adjust sparse structures during training for continual reassessment and refinement, has several limitations including criterion inconsistency between pruning and growth, unsuitability for structured sparsity, and short-sighted growth strategies. Our paper introduces an efficient, innovative paradigm to enhance a given importance criterion for either unstructured or structured sparsity. Our method separates the model into an active structure for exploitation and an exploration space for potential updates. During exploitation, we optimize the active structure, whereas in exploration, we reevaluate and reintegrate parameters from the exploration space through a pruning and growing step consistently guided by the same given importance criterion. To prepare for exploration, we briefly “reactivate” all parameters in the exploration space and train them for a few iterations while keeping the active part frozen, offering a preview of the potential performance gains from reintegrating these parameters. We show on various datasets and configurations that existing importance criteria, even ones as simple as magnitude, can be enhanced with our method to achieve state-of-the-art performance and training cost reductions. Notably, on ImageNet with ResNet50, our method achieves a +1.3 increase in Top-1 accuracy over prior art at 90% ERK sparsity. Compared with the SOTA latency pruning method HALP, we reduced its training cost by over 70% while attaining a faster and more accurate pruned model.
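
The pruning and growing step described above can be pictured with a toy example. The sketch below assumes weight magnitude as the importance criterion (the abstract names magnitude as one simple option) and re-ranks all weights, active and reactivated alike, under that single criterion. The function name, weights, and 50% sparsity are hypothetical, and the brief reactivation training is represented only by the example weight values.

```python
def select_by_magnitude(weights, sparsity):
    """Return a keep-mask that retains the largest-magnitude weights."""
    n_keep = round(len(weights) * (1 - sparsity))
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(weights))]


# Mask before exploration: the first three weights are active.
old_mask = [True, True, True, False, False, False]

# Weights after the pruned ones were briefly reactivated and trained (made-up values).
weights = [0.9, -0.05, 0.4, 0.02, -0.7, 0.1]

new_mask = select_by_magnitude(weights, sparsity=0.5)
regrown = [i for i, (new, old) in enumerate(zip(new_mask, old_mask)) if new and not old]
dropped = [i for i, (new, old) in enumerate(zip(new_mask, old_mask)) if old and not new]
```

Because pruning and growth share one ranking, a previously removed weight re-enters only when it outranks an active one, which is how this toy avoids the criterion inconsistency the abstract criticizes.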

[CV-61] A Study in Dataset Distillation for Image Super-Resolution

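Quick read: This paper addresses the underexplored application of dataset distillation to image Super-Resolution (SR). It studies multiple dataset distillation techniques, including pixel-space and latent-space approaches, and achieves a 91.12% reduction in dataset size while maintaining SR performance comparable to the full dataset. It also analyzes initialization strategies and distillation methods to optimize memory efficiency and computational cost.

Link: https://arxiv.org/abs/2502.03656
Authors: Tobias Dietz, Brian B. Moser, Tobias Nauen, Federico Raue, Stanislav Frolov, Andreas Dengel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: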

Abstract:Dataset distillation is the concept of condensing large datasets into smaller but highly representative synthetic samples. While previous research has primarily focused on image classification, its application to image Super-Resolution (SR) remains underexplored. This exploratory work studies multiple dataset distillation techniques applied to SR, including pixel- and latent-space approaches under different aspects. Our experiments demonstrate that a 91.12% dataset size reduction can be achieved while maintaining comparable SR performance to the full dataset. We further analyze initialization strategies and distillation methods to optimize memory efficiency and computational costs. Our findings provide new insights into dataset distillation for SR and set the stage for future advancements.

[CV-62] Gompertz Linear Units: Leveraging Asymmetry for Enhanced Learning Dynamics

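Quick read: This paper addresses how activation functions shape training dynamics in deep learning architectures, in particular the dying neuron problem caused by ReLU. It introduces a new self-gated activation function, the Gompertz Linear Unit (GoLU). GoLU exploits the asymmetry of the Gompertz function to reduce variance in the latent space more effectively while preserving robust gradient flow, and it outperforms state-of-the-art activation functions across multiple tasks.

Link: https://arxiv.org/abs/2502.03654
Authors: Indrashis Das, Mahmoud Safari, Steven Adriaensen, Frank Hutter
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: 8 pages, excluding references and appendix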

Abstract:Activation functions are fundamental elements of deep learning architectures as they significantly influence training dynamics. ReLU, while widely used, is prone to the dying neuron problem, which has been mitigated by variants such as LeakyReLU, PReLU, and ELU that better handle negative neuron outputs. Recently, self-gated activations like GELU and Swish have emerged as state-of-the-art alternatives, leveraging their smoothness to ensure stable gradient flow and prevent neuron inactivity. In this work, we introduce the Gompertz Linear Unit (GoLU), a novel self-gated activation function defined as $\mathrm{GoLU}(x) = x \, \mathrm{Gompertz}(x)$, where $\mathrm{Gompertz}(x) = e^{-e^{-x}}$. The GoLU activation leverages the asymmetry in the Gompertz function to reduce variance in the latent space more effectively compared to GELU and Swish, while preserving robust gradient flow. Extensive experiments across diverse tasks, including Image Classification, Language Modeling, Semantic Segmentation, Object Detection, Instance Segmentation, and Diffusion, highlight GoLU’s superior performance relative to state-of-the-art activation functions, establishing GoLU as a robust alternative to existing activation functions.
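
The definition in the abstract translates directly into code. Below is a minimal scalar sketch in plain Python; the overflow guard and the function names are additions for this example, and a real implementation would operate on tensors inside a deep learning framework.

```python
import math


def gompertz(x):
    """Gompertz gate exp(-exp(-x)), as defined in the abstract."""
    if x < -700.0:
        # exp(-x) would overflow a float; the gate is 0 to double precision here.
        return 0.0
    return math.exp(-math.exp(-x))


def golu(x):
    """GoLU(x) = x * Gompertz(x)."""
    return x * gompertz(x)


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x = {x:+.1f}  gate = {gompertz(x):.4f}  GoLU = {golu(x):+.4f}")
```

The gate equals exp(-1), about 0.368, at x = 0 rather than the 0.5 of a sigmoid gate, which is one way to see the asymmetry the abstract refers to. For large positive x the gate approaches 1 and GoLU behaves like the identity.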

[CV-63] All-in-One Image Compression and Restoration WACV2025

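Quick read: This paper addresses the fact that images encountered in practical image compression are often corrupted by degradations of various types and levels. Most existing compression methods are designed for clean images and struggle on degraded ones. The paper proposes a unified framework for all-in-one image compression and restoration. Its key elements are content information aggregation and degradation representation aggregation, which distinguish authentic image content from degradations and flexibly remove various degradations without prior knowledge.

Link: https://arxiv.org/abs/2502.03649
Authors: Huimin Zeng, Jiacheng Li, Ziqiang Zheng, Zhiwei Xiong
Affiliations: University of Science and Technology of China; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to WACV 2025 (oral)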

Abstract:Visual images corrupted by various types and levels of degradations are commonly encountered in practical image compression. However, most existing image compression methods are tailored for clean images, therefore struggling to achieve satisfying results on these images. Joint compression and restoration methods typically focus on a single type of degradation and fail to address a variety of degradations in practice. To this end, we propose a unified framework for all-in-one image compression and restoration, which incorporates the image restoration capability against various degradations into the process of image compression. The key challenges involve distinguishing authentic image content from degradations, and flexibly eliminating various degradations without prior knowledge. Specifically, the proposed framework approaches these challenges from two perspectives: i.e., content information aggregation, and degradation representation aggregation. Extensive experiments demonstrate the following merits of our model: 1) superior rate-distortion (RD) performance on various degraded inputs while preserving the performance on clean data; 2) strong generalization ability to real-world and unseen scenarios; 3) higher computing efficiency over compared methods. Our code is available at this https URL.

[CV-64] Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach

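Quick read: This paper addresses common problems in video generation models caused by a lack of 3D shape awareness, such as object deformation and unnatural morphing. It introduces a video generation framework that augments 2D videos with 3D point trajectories aligned in pixel space, yielding a 3D-aware video dataset called PointVid. A latent diffusion model fine-tuned on this dataset can track 2D objects with 3D Cartesian coordinates. Regularizing the shape and motion of objects then removes artifacts such as nonphysical deformation, which improves the quality of generated RGB videos, promotes the 3D consistency of moving objects, and reduces abrupt changes in shape and motion.

Link: https://arxiv.org/abs/2502.03639
Authors: Yunuo Chen, Junli Cao, Anil Kag, Vidit Goel, Sergei Korolev, Chenfanfu Jiang, Sergey Tulyakov, Jian Ren
Affiliations: University of California, Los Angeles (UCLA); Snap Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL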

Abstract:We present a novel video generation framework that integrates 3-dimensional geometry and dynamic awareness. To achieve this, we augment 2D videos with 3D point trajectories and align them in pixel space. The resulting 3D-aware video dataset, PointVid, is then used to fine-tune a latent diffusion model, enabling it to track 2D objects with 3D Cartesian coordinates. Building on this, we regularize the shape and motion of objects in the video to eliminate undesired artifacts, e.g., nonphysical deformation. Consequently, we enhance the quality of generated RGB videos and alleviate common issues like object morphing, which are prevalent in current video models due to a lack of shape awareness. With our 3D augmentation and regularization, our model is capable of handling contact-rich scenarios such as task-oriented videos. These videos involve complex interactions of solids, where 3D information is essential for perceiving deformation and contact. Furthermore, our model improves the overall quality of video generation by promoting the 3D consistency of moving objects and reducing abrupt changes in shape and motion.

[CV-65] The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering

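Quick read: This paper addresses the tendency of Large Vision-Language Models (LVLMs) to generate content that is syntactically coherent but visually ungrounded. By analyzing token logit rankings throughout generation, it identifies three patterns in how these models process information: gradual visual information loss, early excitation, and hidden genuine information. Building on these insights, it proposes VISTA (Visual Information Steering with Token-logit Augmentation), a training-free inference-time intervention framework that needs no external supervision and works with various decoding strategies. VISTA reinforces visual information in activation space and leverages early-layer activations to promote semantically meaningful decoding, which reduces hallucination and promotes genuine information.

Link: https://arxiv.org/abs/2502.03628
Authors: Zhuowei Li, Haizhou Shi, Yunhe Gao, Di Liu, Zhenting Wang, Yuxiao Chen, Ting Liu, Long Zhao, Hao Wang, Dimitris N. Metaxas
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: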

Abstract:Large Vision-Language Models (LVLMs) can reason effectively over both textual and visual inputs, but they tend to hallucinate syntactically coherent yet visually ungrounded contents. In this paper, we investigate the internal dynamics of hallucination by examining the token logit rankings throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss – visually grounded tokens gradually become less favored throughout generation; (2) early excitation – semantically meaningful tokens achieve peak activation in layers earlier than the final layer; and (3) hidden genuine information – visually grounded tokens, though not eventually decoded, still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free inference-time intervention framework that reduces hallucination while promoting genuine information. VISTA works by combining two complementary approaches: reinforcing visual information in activation space and leveraging early layer activations to promote semantically meaningful decoding. Compared to existing methods, VISTA requires no external supervision and is applicable to various decoding strategies. Extensive experiments show that VISTA on average reduces hallucination by about 40% on the evaluated open-ended generation task, and it consistently outperforms existing methods on four benchmarks across four architectures under three decoding strategies.

[CV-66] DynVFX: Augmenting Real Videos with Dynamic Content

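Quick read: This paper addresses how to augment real-world videos with dynamic content using a simple text instruction. It proposes a zero-shot, training-free framework that uses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained vision-language model to envision the augmented scene in detail. In particular, it introduces an inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene.

Link: https://arxiv.org/abs/2502.03621
Authors: Danah Yatim, Rafail Fridman, Omer Bar-Tal, Tali Dekel
Affiliations: Weizmann Institute of Science, Israel; Pika Labs, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL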

Abstract:We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained Vision Language Model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion.

[CV-67] Solar Panel Mapping via Oriented Object Detection

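Quick read: This paper addresses the tedious manual work of annotating solar panels when building detailed maps of solar power plants, a task that does not scale as solar capacity grows. It proposes an end-to-end deep learning framework that detects individual solar panels with a rotated object detection architecture, improving efficiency and scalability. On a diverse dataset of solar power plants from across the United States, the approach reaches a mAP of 83.3%.

Link: https://arxiv.org/abs/2502.03592
Authors: Conor Wallace, Isaac Corley, Jonathan Lwowski
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: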

Abstract:Maintaining the integrity of solar power plants is a vital component in dealing with the current climate crisis. This process begins with analysts creating a detailed map of a plant with the coordinates of every solar panel, making it possible to quickly locate and mitigate potential faulty solar panels. However, this task is extremely tedious and is not scalable for the ever increasing capacity of solar power across the globe. Therefore, we propose an end-to-end deep learning framework for detecting individual solar panels using a rotated object detection architecture. We evaluate our approach on a diverse dataset of solar power plants collected from across the United States and report a mAP score of 83.3%.
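
Rotated (oriented) detectors predict boxes that can be tilted to follow a panel's orientation. As a small illustration of that representation, the sketch below converts a rotated box into its four corner points. The `(cx, cy, w, h, angle)` parameterization with the angle in degrees is a common convention assumed here; the abstract does not say which encoding the paper uses, and the function name is hypothetical.

```python
import math


def rotated_box_corners(cx, cy, w, h, angle_deg):
    """Corners of a box centred at (cx, cy), rotated by angle_deg about its centre.

    The rotation is counter-clockwise in a y-up coordinate system
    (it appears clockwise in image coordinates where y points down).
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)):
        # Rotate the offset from the centre, then translate back.
        corners.append((cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t))
    return corners


axis_aligned = rotated_box_corners(0, 0, 4, 2, 0)
tilted = rotated_box_corners(0, 0, 4, 2, 30)
```

An axis-aligned box around a tilted panel would also cover parts of its neighbours, which is why a rotated representation suits densely packed rows of panels.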

[CV-68] Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function

Quick read: This paper addresses clinical interpretability in multi-label chest X-ray (CXR) image classification while keeping a single-model, single-run training pipeline. It introduces a custom hierarchical binary cross-entropy (HBCE) loss that enforces label dependencies using either fixed or data-driven penalty types, capturing clinically meaningful relationships between diagnoses. The model achieves a mean area under the receiver operating characteristic curve (AUROC) of 0.903 on the test set.

Link: https://arxiv.org/abs/2502.03591
Authors: Mehrdad Asadi, Komi Sodoké, Ian J. Gerard, Marta Kersten-Oertel
Affiliations: Gina Cody School of Engineering and Computer Science, Concordia University, Montreal, QC, Canada; YULCOM Technologies, Montreal, QC, Canada; Division of Radiation Oncology, McGill University Health Centre, Montreal, QC, Canada
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 9 pages with 3 figures, for associated implementation see this https URL

Abstract:In this work, we present a novel approach to multi-label chest X-ray (CXR) image classification that enhances clinical interpretability while maintaining a streamlined, single-model, single-run training pipeline. Leveraging the CheXpert dataset and VisualCheXbert-derived labels, we incorporate hierarchical label groupings to capture clinically meaningful relationships between diagnoses. To achieve this, we designed a custom hierarchical binary cross-entropy (HBCE) loss function that enforces label dependencies using either fixed or data-driven penalty types. Our model achieved a mean area under the receiver operating characteristic curve (AUROC) of 0.903 on the test set. Additionally, we provide visual explanations and uncertainty estimations to further enhance model interpretability. All code, model configurations, and experiment details are made available.
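
The abstract does not spell out the HBCE formula, so the sketch below is only a guess at the general idea: ordinary binary cross-entropy per label, scaled up by a fixed penalty whenever a child label is predicted positive while its parent is predicted negative. The penalty rule, the 0.5 threshold, the function names, and the example parent-child grouping are all assumptions for illustration, not the authors' loss.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and target y."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def hierarchical_bce(probs, targets, parent_of, penalty=1.0, threshold=0.5):
    """Mean BCE over labels with a fixed penalty on hierarchy violations.

    probs, targets: dicts keyed by label name.
    parent_of: dict mapping a child label to its parent label (assumed grouping).
    """
    total = 0.0
    for label, p in probs.items():
        loss = bce(p, targets[label])
        parent = parent_of.get(label)
        # Violation: child predicted positive while its parent is predicted negative.
        if parent is not None and p >= threshold and probs[parent] < threshold:
            loss *= 1.0 + penalty
        total += loss
    return total / len(probs)


parent_of = {"Pneumonia": "Lung Opacity"}  # illustrative grouping only
targets = {"Lung Opacity": 0, "Pneumonia": 0}
violating = {"Lung Opacity": 0.2, "Pneumonia": 0.8}
loss_with_penalty = hierarchical_bce(violating, targets, parent_of, penalty=1.0)
loss_plain = hierarchical_bce(violating, targets, parent_of, penalty=0.0)
```

A data-driven variant, which the abstract also mentions, would replace the fixed `penalty` with a value estimated from the training data; how the paper does this is not described here.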

[CV-69] CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally

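Quick read: This paper addresses a weakness of CLIP (Contrastive Language-Image Pretraining) in representing compositional concepts, namely its failure to bind attributes to the right objects in multi-object scenes. It finds that correct binding information is already present within each modality and traces the problem to the cross-modal alignment, which relies on cosine similarity. To fix this, the authors propose LABCLIP (Linear Attribute Binding CLIP), which applies a linear transformation to text embeddings before computing cosine similarity. This significantly improves CLIP's attribute binding in multi-object scenes and thereby its compositional understanding.

Link: https://arxiv.org/abs/2502.03566
Authors: Darina Koishigarina, Arnas Uselis, Seong Joon Oh
Affiliations: University of Tübingen; Tübingen AI Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: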

Abstract:CLIP (Contrastive Language-Image Pretraining) has become a popular choice for various downstream tasks. However, recent studies have questioned its ability to represent compositional concepts effectively. These works suggest that CLIP often acts like a bag-of-words (BoW) model, interpreting images and text as sets of individual concepts without grasping the structural relationships. In particular, CLIP struggles to correctly bind attributes to their corresponding objects when multiple objects are present in an image or text. In this work, we investigate why CLIP exhibits this BoW-like behavior. We find that the correct attribute-object binding information is already present in individual text and image modalities. Instead, the issue lies in the cross-modal alignment, which relies on cosine similarity. To address this, we propose Linear Attribute Binding CLIP or LABCLIP. It applies a linear transformation to text embeddings before computing cosine similarity. This approach significantly improves CLIP’s ability to bind attributes to correct objects, thereby enhancing its compositional understanding.
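
The scoring change described in the abstract is small enough to sketch directly: transform the text embedding with a matrix, then take the cosine similarity with the image embedding. In the paper the matrix is presumably learned; here it is fixed by hand on toy 2-D vectors purely for illustration, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def apply_linear(matrix, vec):
    """Matrix-vector product with the matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def labclip_score(image_emb, text_emb, matrix):
    """Cosine similarity after a linear map on the text embedding."""
    return cosine(image_emb, apply_linear(matrix, text_emb))


image_emb = [1.0, 0.0]
text_emb = [0.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]   # recovers plain CLIP-style scoring
swap = [[0.0, 1.0], [1.0, 0.0]]       # hand-picked stand-in for a learned matrix

plain_score = labclip_score(image_emb, text_emb, identity)
remapped_score = labclip_score(image_emb, text_emb, swap)
```

With the identity matrix the score is the usual cosine similarity, so the transformation adds no overhead beyond one matrix-vector product per text embedding.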

[CV-70] Efficient Global Neural Architecture Search

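Quick read: This paper addresses the high computational cost of neural architecture search (NAS), in particular the training overhead of evaluating many candidate architectures to find the best one. It designs a navigable yet architecturally diverse macro-micro search space and proposes an architecture-aware approximation that uses variable training schemes for different networks. By decoupling macro and micro network design, it obtains an efficient search strategy that yields architectures competitive in both accuracy and size, runs 2-4x faster than the fastest global search methods, and sets a new state of the art on EMNIST and KMNIST.

Link: https://arxiv.org/abs/2502.03553
Authors: Shahid Siddiqui, Christos Kyrkou, Theocharis Theocharides
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: CAIP2023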

Abstract:Neural architecture search (NAS) has shown promise towards automating neural network design for a given task, but it is computationally demanding due to training costs associated with evaluating a large number of architectures to find the optimal one. To speed up NAS, recent works limit the search to network building blocks (modular search) instead of searching the entire architecture (global search), approximate candidates’ performance evaluation in lieu of complete training, and use gradient descent rather than naturally suitable discrete optimization approaches. However, modular search does not determine network’s macro architecture i.e. depth and width, demanding manual trial and error post-search, hence lacking automation. In this work, we revisit NAS and design a navigable, yet architecturally diverse, macro-micro search space. In addition, to determine relative rankings of candidates, existing methods employ consistent approximations across entire search spaces, whereas different networks may not be fairly comparable under one training protocol. Hence, we propose an architecture-aware approximation with variable training schemes for different networks. Moreover, we develop an efficient search strategy by disjoining macro-micro network design that yields competitive architectures in terms of both accuracy and size. Our proposed framework achieves a new state-of-the-art on EMNIST and KMNIST, while being highly competitive on the CIFAR-10, CIFAR-100, and FashionMNIST datasets and being 2-4x faster than the fastest global search methods. Lastly, we demonstrate the transferability of our framework to real-world computer vision problems by discovering competitive architectures for face recognition applications.

[CV-71] Kronecker Mask and Interpretive Prompts are Language-Action Video Learners

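[Quick read]: This paper addresses how to effectively adapt contrastive language-image pretraining (CLIP) models to the video domain, specifically for action recognition. Its key contribution is CLAVER, a method that adapts both the visual and the textual branch of CLIP so that the model aligns dynamic actions with abstract verbs rather than static objects with concrete nouns. On the visual side, CLAVER introduces a Kronecker mask attention for temporal modeling, which expands the temporal receptive field of each token and acts as a spatiotemporal heterogeneity inductive bias that mitigates spatiotemporal homogenization. On the textual side, it uses large language models to generate rich interpretive descriptions of actions, which strengthens the model's comprehension of action verbs. Together, these changes significantly improve performance on multiple benchmarks.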

Link: https://arxiv.org/abs/2502.03549
Authors: Jingyi Yang,Zitong Yu,Xiuming Ni,Jia He,Hui Li
Affiliation: University of Science and Technology of China; Great Bay University; Dongguan Key Laboratory for Intelligence and Information Technology; Anhui Tsinglink Information Technology Co.,Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Contrastive language-image pretraining (CLIP) has significantly advanced image-based vision learning. A pressing topic subsequently arises: how can we effectively adapt CLIP to the video domain? Recent studies have focused on adjusting either the textual or visual branch of CLIP for action recognition. However, we argue that adaptations of both branches are crucial. In this paper, we propose CLAVER: a Contrastive Language-Action Video Learner, designed to shift CLIP’s focus from the alignment of static visual objects and concrete nouns to the alignment of dynamic action behaviors and abstract verbs. Specifically, we introduce a novel Kronecker mask attention for temporal modeling. Our tailored Kronecker mask offers three benefits: 1) it expands the temporal receptive field for each token, 2) it serves as an effective spatiotemporal heterogeneity inductive bias, mitigating the issue of spatiotemporal homogenization, and 3) it can be seamlessly plugged into transformer-based models. Regarding the textual branch, we leverage large language models to generate diverse, sentence-level and semantically rich interpretive prompts of actions, which shift the model’s focus towards the verb comprehension. Extensive experiments on various benchmarks and learning scenarios demonstrate the superiority and generality of our approach. The code will be available soon.
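
As a rough illustration of how a Kronecker product can define a spatiotemporal attention mask, the sketch below builds two masks over T frames of N patch tokens each. This is not the CLAVER mask itself, which the abstract does not specify. The frame-major token ordering, the function names, and both mask patterns are assumptions chosen for illustration only.

```python
import numpy as np

def same_position_mask(num_frames, num_patches):
    """Hypothetical baseline: token (t, n) attends only to tokens (t', n)
    that share its patch index n, across all frames t'."""
    time_part = np.ones((num_frames, num_frames))
    space_part = np.eye(num_patches)
    return np.kron(time_part, space_part).astype(bool)

def cross_frame_mask(num_frames, num_patches):
    """Hypothetical wider mask: token (t, n) attends to itself and to
    every patch of every *other* frame."""
    other_frames = np.ones((num_frames, num_frames)) - np.eye(num_frames)
    all_patches = np.ones((num_patches, num_patches))
    mask = np.kron(other_frames, all_patches) + np.eye(num_frames * num_patches)
    return mask.astype(bool)

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention where masked-out pairs get zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With 3 frames of 4 patches, each row of the first mask allows 3 tokens and each row of the second allows 9, which is one way to see how the choice of Kronecker factors changes the temporal receptive field.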

[CV-72] Mapping and Localization Using LiDAR Fiducial Markers

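[Quick read]: This dissertation addresses the fact that LiDAR fiducial markers (LFMs) lag behind visual fiducial markers (VFMs) in adoption and utility within robotics and computer vision. Its key contributions are an Intensity Image-based LiDAR Fiducial Marker (IFM) system, an enhanced algorithm that extends detection to 3D maps, and a new LFM-based mapping and localization method. By combining intensity and geometric information and optimizing point cloud and marker poses, these methods improve tasks such as 3D map merging.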

Link: https://arxiv.org/abs/2502.03510
Authors: Yibo Liu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: PhD thesis

Abstract:LiDAR sensors are essential for autonomous systems, yet LiDAR fiducial markers (LFMs) lag behind visual fiducial markers (VFMs) in adoption and utility. Bridging this gap is vital for robotics and computer vision but challenging due to the sparse, unstructured nature of 3D LiDAR data and 2D-focused fiducial marker designs. This dissertation proposes a novel framework for mapping and localization using LFMs to benefit a variety of real-world applications, including the collection of 3D assets and training data for point cloud registration, 3D map merging, Augmented Reality (AR), and many more. First, an Intensity Image-based LiDAR Fiducial Marker (IFM) system is introduced, using thin, letter-sized markers compatible with VFMs. A detection method locates 3D fiducials from intensity images, enabling LiDAR pose estimation. Second, an enhanced algorithm extends detection to 3D maps, increasing marker range and facilitating tasks like 3D map merging. This method leverages both intensity and geometry, overcoming limitations of geometry-only detection approaches. Third, a new LFM-based mapping and localization method registers unordered, low-overlap point clouds. It employs adaptive threshold detection and a two-level graph framework to solve a maximum a-posteriori (MAP) problem, optimizing point cloud and marker poses. Additionally, the Livox-3DMatch dataset is introduced, improving learning-based multiview point cloud registration methods. Extensive experiments with various LiDAR models in diverse indoor and outdoor scenes demonstrate the effectiveness and superiority of the proposed framework.

[CV-73] Expanding Training Data for Endoscopic Phenotyping of Eosinophilic Esophagitis

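[Quick read]: This paper aims to reduce the reliance on invasive histological assessment when diagnosing eosinophilic esophagitis (EoE), a chronic esophageal disorder. The key step is to expand the training set from 435 to 7050 images by adding diverse images from online platforms, public datasets, and electronic textbooks. The authors classify images with the Data-efficient Image Transformer and add attention map visualizations to improve interpretability. They report that the expanded dataset and model enhancements improve the diagnostic accuracy, robustness, and comprehensiveness of EoE phenotype classification, which in turn should benefit patient outcomes.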

Link: https://arxiv.org/abs/2502.04199
Authors: Juming Xiong,Hou Xiong,Quan Liu,Ruining Deng,Regina N Tyree,Girish Hiremath,Yuankai Huo
Affiliation: Vanderbilt University; University of California, Santa Barbara
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Eosinophilic esophagitis (EoE) is a chronic esophageal disorder marked by eosinophil-dominated inflammation. Diagnosing EoE usually involves endoscopic inspection of the esophageal mucosa and obtaining esophageal biopsies for histologic confirmation. Recent advances have seen AI-assisted endoscopic imaging, guided by the EREFS system, emerge as a potential alternative to reduce reliance on invasive histological assessments. Despite these advancements, significant challenges persist due to the limited availability of data for training AI models - a common issue even in the development of AI for more prevalent diseases. This study seeks to improve the performance of deep learning-based EoE phenotype classification by augmenting our training data with a diverse set of images from online platforms, public datasets, and electronic textbooks increasing our dataset from 435 to 7050 images. We utilized the Data-efficient Image Transformer for image classification and incorporated attention map visualizations to boost interpretability. The findings show that our expanded dataset and model enhancements improved diagnostic accuracy, robustness, and comprehensive analysis, enhancing patient outcomes.

[CV-74] DEALing with Image Reconstruction: Deep Attentive Least Squares

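[Quick read]: This paper addresses the reliance of state-of-the-art image reconstruction on complex, highly parameterized deep architectures. The key to the solution is a data-driven reconstruction method inspired by classic Tikhonov regularization. The method iteratively refines intermediate reconstructions by solving a sequence of quadratic problems, with two main components: (i) learned filters that extract salient image features, and (ii) an attention mechanism that locally adjusts the penalty on the filter responses. It performs on par with leading plug-and-play and learned-regularizer approaches while offering interpretability, robustness, and convergent behavior.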

Link: https://arxiv.org/abs/2502.04079
Authors: Mehrsa Pourya,Erich Kobler,Michael Unser,Sebastian Neumayer
Affiliation: Biomedical Imaging Group, EPFL, Lausanne, Switzerland; Institute for Machine Learning, LIT AI lab, Institute for Virtual Morphology, Johannes Kepler University, Linz; Faculty of Mathematics, TU Chemnitz, Germany
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: none

Abstract:State-of-the-art image reconstruction often relies on complex, highly parameterized deep architectures. We propose an alternative: a data-driven reconstruction method inspired by the classic Tikhonov regularization. Our approach iteratively refines intermediate reconstructions by solving a sequence of quadratic problems. These updates have two key components: (i) learned filters to extract salient image features, and (ii) an attention mechanism that locally adjusts the penalty of filter responses. Our method achieves performance on par with leading plug-and-play and learned regularizer approaches while offering interpretability, robustness, and convergent behavior. In effect, we bridge traditional regularization and deep learning with a principled reconstruction approach.
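
The idea of refining a reconstruction through a sequence of quadratic (Tikhonov-style) problems with locally adjusted penalties can be sketched on a 1D denoising toy problem. The code below is not the DEAL method: it swaps in a fixed finite-difference filter for the learned filters and a classic iteratively reweighted least squares (IRLS) rule for the learned attention. The function names and the parameters `lam`, `num_iters`, and `eps` are hypothetical.

```python
import numpy as np

def finite_difference_matrix(n):
    """Fixed filter D with (D x)[i] = x[i + 1] - x[i]; stands in for learned filters."""
    d = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    d[idx, idx] = -1.0
    d[idx, idx + 1] = 1.0
    return d

def reweighted_tikhonov_denoise(y, lam=1.0, num_iters=10, eps=1e-3):
    """Each iteration solves the quadratic problem
        min_x ||x - y||^2 / 2 + (lam / 2) * sum_i w_i * (D x)_i^2
    with weights w_i recomputed from the previous iterate."""
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    d = finite_difference_matrix(n)
    x = y.copy()
    for _ in range(num_iters):
        # Small weight where the filter response is large, so edges are
        # penalized less (a hand-made stand-in for an attention mechanism).
        w = 1.0 / (np.abs(d @ x) + eps)
        system = np.eye(n) + lam * d.T @ (w[:, None] * d)
        x = np.linalg.solve(system, y)
    return x
```

For example, `reweighted_tikhonov_denoise(np.array([0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 1.0]))` returns a signal with the same sum as the input and no larger total variation, since every iteration is a weighted Tikhonov solve with a difference penalty.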

[CV-75] A Self-supervised Multimodal Deep Learning Approach to Differentiate Post-radiotherapy Progression from Pseudoprogression in Glioblastoma

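[Quick read]: This paper addresses the difficulty of distinguishing pseudoprogression (PsP) from true progression (TP) after radiotherapy (RT) in glioblastoma (GBM) patients. The key is a multimodal deep learning approach that improves predictive accuracy by exploiting complementary information from routine anatomical MR images, clinical parameters, and RT treatment planning information. Specifically, a self-supervised Vision Transformer (ViT) encodes multi-sequence MR brain volumes to capture both global and local context in the high-dimensional input, and guided cross-modal attention integrates the encoded MR inputs with the clinical data and RT treatment planning information to improve progression classification.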

Link: https://arxiv.org/abs/2502.03999
Authors: Ahmed Gomaa,Yixing Huang,Pluvio Stephan,Katharina Breininger,Benjamin Frey,Arnd Dörfler,Oliver Schnell,Daniel Delev,Roland Coras,Charlotte Schmitter,Jenny Stritzelberger,Sabine Semrau,Andreas Maier,Siming Bayer,Stephan Schönecker,Dieter H Heiland,Peter Hau,Udo S. Gaipl,Christoph Bert,Rainer Fietkau,Manuel A. Schmidt,Florian Putz
Affiliation: University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg; Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN); Bavarian Cancer Research Center (BZKF); Institute of Neuroradiology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg; Department of Neurosurgery, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg; Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg; Translational Neurosurgery, Alexander-Friedrich-Universität Erlangen-Nürnberg; Department of Neurology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg; Department of Neurology, University Hospital Regensburg; Wilhelm Sander-NeuroOncology Unit, University Hospital Regensburg; Universität Würzburg, Center for Artificial Intelligence and Data Science
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Accurate differentiation of pseudoprogression (PsP) from True Progression (TP) following radiotherapy (RT) in glioblastoma (GBM) patients is crucial for optimal treatment planning. However, this task remains challenging due to the overlapping imaging characteristics of PsP and TP. This study therefore proposes a multimodal deep-learning approach utilizing complementary information from routine anatomical MR images, clinical parameters, and RT treatment planning information for improved predictive accuracy. The approach utilizes a self-supervised Vision Transformer (ViT) to encode multi-sequence MR brain volumes to effectively capture both global and local context from the high dimensional input. The encoder is trained in a self-supervised upstream task on unlabeled glioma MRI datasets from the open BraTS2021, UPenn-GBM, and UCSF-PDGM datasets to generate compact, clinically relevant representations from FLAIR and T1 post-contrast sequences. These encoded MR inputs are then integrated with clinical data and RT treatment planning information through guided cross-modal attention, improving progression classification accuracy. This work was developed using two datasets from different centers: the Burdenko Glioblastoma Progression Dataset (n = 59) for training and validation, and the GlioCMV progression dataset from the University Hospital Erlangen (UKER) (n = 20) for testing. The proposed method achieved an AUC of 75.3%, outperforming the current state-of-the-art data-driven approaches. Importantly, the proposed approach relies on readily available anatomical MRI sequences, clinical data, and RT treatment planning information, enhancing its clinical feasibility. The proposed approach addresses the challenge of limited data availability for PsP and TP differentiation and could allow for improved clinical decision-making and optimized treatment plans for GBM patients.

[CV-76] Synthetic Poisoning Attacks: The Impact of Poisoned MRI Image on U-Net Brain Tumor Segmentation

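[Quick read]: This paper studies how synthetic magnetic resonance imaging (MRI) data affects the robustness and accuracy of U-Net-based brain tumor segmentation. The key step is to train U-Net models on progressively "poisoned" datasets, with synthetic data proportions ranging from 16.67% to 83.33%, in order to quantify the damage caused by synthetic data contamination. The Dice coefficient drops from 0.8937 at 33.33% synthetic data to 0.7474 at 83.33%, which shows that synthetic data harms segmentation robustness and underscores the need for strict quality control when integrating synthetic data.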

Link: https://arxiv.org/abs/2502.03825
Authors: Tianhao Li,Tianyu Zeng,Yujia Zheng,Chulong Zhang,Jingyu Lu,Haotian Huang,Chuangxin Chu,Fang-Fang Yin,Zhenyu Yang
Affiliation: Duke University; Hong Kong Polytechnic University; North China University of Technology; State Key Laboratory of Intelligent Game, Institute of Software Chinese Academy of Sciences; Australian National University; Nanyang Technological University; Duke Kunshan University
Categories: Image and Video Processing (eess.IV); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Deep learning-based medical image segmentation models, such as U-Net, rely on high-quality annotated datasets to achieve accurate predictions. However, the increasing use of generative models for synthetic data augmentation introduces potential risks, particularly in the absence of rigorous quality control. In this paper, we investigate the impact of synthetic MRI data on the robustness and segmentation accuracy of U-Net models for brain tumor segmentation. Specifically, we generate synthetic T1-contrast-enhanced (T1-Ce) MRI scans using a GAN-based model with a shared encoding-decoding framework and shortest-path regularization. To quantify the effect of synthetic data contamination, we train U-Net models on progressively “poisoned” datasets, where synthetic data proportions range from 16.67% to 83.33%. Experimental results on a real MRI validation set reveal a significant performance degradation as synthetic data increases, with Dice coefficients dropping from 0.8937 (33.33% synthetic) to 0.7474 (83.33% synthetic). Accuracy and sensitivity exhibit similar downward trends, demonstrating the detrimental effect of synthetic data on segmentation robustness. These findings underscore the importance of quality control in synthetic data integration and highlight the risks of unregulated synthetic augmentation in medical image analysis. Our study provides critical insights for the development of more reliable and trustworthy AI-driven medical imaging systems.
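
For readers unfamiliar with the metric, the Dice coefficient reported above can be computed from two binary masks as sketched below. This is a minimal NumPy sketch, not the authors' evaluation code. The helper names and the `eps` smoothing term are assumptions. The abstract also does not say how the training sets were mixed, so `synthetic_fraction` only shows that the quoted percentages are consistent with mixing ratios in sixths.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps keeps the ratio defined (and equal to 1) when both masks are empty.
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def synthetic_fraction(num_real, num_synthetic):
    """Share of synthetic samples in a mixed training set."""
    return num_synthetic / (num_real + num_synthetic)
```

A prediction that covers the true region plus an equal-sized false-positive area scores about 0.667, and a mix of 1 real to 5 synthetic parts gives a synthetic fraction of 5/6, which rounds to 83.33%.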

[CV-77] UltraBones100k: An Ultrasound Image Dataset with CT-Derived Labels for Lower Extremity Long Bone Surface Segmentation

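[Quick read]: This paper addresses the difficulty of segmenting bone surfaces in ultrasound images, which are hard to interpret because of their low signal-to-noise ratio and acoustic shadowing. Because existing deep learning models depend on costly manual labeling and generalize poorly, the authors propose a methodology for collecting ex-vivo ultrasound datasets with automatically generated bone labels, including anechoic regions. The key step is to generate initial labels by accurately superimposing tracked bone CT models onto the tracked ultrasound images, then refine them to account for ultrasound physics. A clinical evaluation indicates that the method significantly improves the quality of the bone labels. A neural network trained on the resulting dataset outperforms manual labeling on all metrics, particularly in low-intensity regions (a 320% improvement in completeness).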

Link: https://arxiv.org/abs/2502.03783
Authors: Luohong Wu,Nicola A. Cavalcanti,Matthias Seibold,Giuseppe Loggia,Lisa Reissner,Jonas Hein,Silvan Beeler,Arnd Viehöfer,Stephan Wirth,Lilian Calvet,Philipp Fürnstahl
Affiliation: balgrist.ch (Balgrist University Hospital); inf.ethz.ch (ETH Zurich)
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures

Abstract:Ultrasound-based bone surface segmentation is crucial in computer-assisted orthopedic surgery. However, ultrasound images have limitations, including a low signal-to-noise ratio and acoustic shadowing, which make interpretation difficult. Existing deep learning models for bone segmentation rely primarily on costly manual labeling by experts, limiting dataset size and model generalizability. Additionally, the complexity of ultrasound physics and acoustic shadow makes the images difficult for humans to interpret, leading to incomplete labels in anechoic regions and limiting model performance. To advance ultrasound bone segmentation and establish effective model benchmarks, larger and higher-quality datasets are needed. We propose a methodology for collecting ex-vivo ultrasound datasets with automatically generated bone labels, including anechoic regions. The proposed labels are derived by accurately superimposing tracked bone CT models onto the tracked ultrasound images. These initial labels are refined to account for ultrasound physics. A clinical evaluation is conducted by an expert physician specialized in orthopedic sonography to assess the quality of the generated bone labels. A neural network for bone segmentation is trained on the collected dataset and its predictions are compared to expert manual labels, evaluating accuracy, completeness, and F1-score. We collected the largest known dataset of 100k ultrasound images of human lower limbs with bone labels, called UltraBones100k. A Wilcoxon signed-rank test with Bonferroni correction confirmed that the bone alignment after our method significantly improved the quality of bone labeling (p < 0.001). The model trained on UltraBones100k consistently outperforms manual labeling in all metrics, particularly in low-intensity regions (320% improvement in completeness at a distance threshold of 0.5 mm).

[CV-78] MetaFE-DE: Learning Meta Feature Embedding for Depth Estimation from Monocular Endoscopic Images

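[Quick read]: This paper addresses the challenges of depth estimation from monocular endoscopic images, such as the irregular shapes of human soft tissue and varying lighting conditions, which limit the interpretability and accuracy of existing methods. The key idea is a concept called meta feature embedding (MetaFE): the physical entities in endoscopic surgery (for example, tissues and surgical instruments) are represented by shared features that can be decoded into either RGB or depth images. Building on this concept, the paper proposes a two-stage self-supervised learning paradigm. The first stage learns temporal representations with diffusion models and aligns them with spatial information through cross normalization to construct the MetaFE. The second stage applies self-supervised monocular depth estimation with brightness calibration to decode the meta features into depth images. Experiments on several endoscopic datasets show that the method outperforms the state of the art, with higher accuracy and better generalization.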

Link: https://arxiv.org/abs/2502.03493
Authors: Dawei Lu,Deqiang Xiao,Danni Ai,Jingfan Fan,Tianyu Fu,Yucong Lin,Hong Song,Xujiong Ye,Lei Zhang,Jian Yang
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Depth estimation from monocular endoscopic images presents significant challenges due to the complexity of endoscopic surgery, such as irregular shapes of human soft tissues, as well as variations in lighting conditions. Existing methods primarily estimate the depth information from RGB images directly, and often suffer from limited interpretability and accuracy. Given that RGB and depth images are two views of the same endoscopic surgery scene, in this paper, we introduce a novel concept referred to as "meta feature embedding (MetaFE)", in which the physical entities (e.g., tissues and surgical instruments) of endoscopic surgery are represented using the shared features that can be alternatively decoded into RGB or depth image. With this concept, we propose a two-stage self-supervised learning paradigm for the monocular endoscopic depth estimation. In the first stage, we propose a temporal representation learner using diffusion models, which are aligned with the spatial information through the cross normalization to construct the MetaFE. In the second stage, self-supervised monocular depth estimation with the brightness calibration is applied to decode the meta features into the depth image. Extensive evaluation on diverse endoscopic datasets demonstrates that our approach outperforms the state-of-the-art method in depth estimation, achieving superior accuracy and generalization. The source code will be publicly available.

[CV-79] Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis

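[Quick read]: This paper examines the role of AI-assisted decision making in radiologists' prostate cancer diagnosis and asks whether performance feedback can improve human-AI collaboration. It is built around two experimental workflows. In Study 1, clinicians first make an independent initial diagnosis, then view the AI's prediction and finalize their decision. In Study 2, participants receive performance statistics from the earlier study and then see the AI's prediction directly before diagnosing. The human-AI teams outperform the clinicians alone but still underperform the AI alone, mainly because clinicians under-rely on the AI. Performance feedback did not significantly improve team performance, although showing the AI's decision in advance nudged clinicians to follow the AI more. The paper also observes that an ensemble of human-AI teams can outperform the AI alone, which suggests a promising direction for future collaboration.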

Link: https://arxiv.org/abs/2502.03482
Authors: Chacha Chen,Han Liu,Jiamin Yang,Benjamin M. Mervak,Bora Kalaycioglu,Grace Lee,Emre Cakmakli,Matteo Bonatti,Sridhar Pudu,Osman Kahraman,Gul Gizem Pamuk,Aytekin Oto,Aritrick Chatterjee,Chenhao Tan
Affiliation: University of Chicago; Toyota Technological Institute at Chicago; University of Michigan; Bagcilar Training and Research Hospital; Hospital of Bolzano (SABES-ASDAA); Radiology Associates of North Texas; İstanbul Medipol University Hospital
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: none

Abstract:Despite the growing interest in human-AI decision making, experimental studies with domain experts remain rare, largely due to the complexity of working with domain experts and the challenges in setting up realistic experiments. In this work, we conduct an in-depth collaboration with radiologists in prostate cancer diagnosis based on MRI images. Building on existing tools for teaching prostate cancer diagnosis, we develop an interface and conduct two experiments to study how AI assistance and performance feedback shape the decision making of domain experts. In Study 1, clinicians were asked to provide an initial diagnosis (human), then view the AI’s prediction, and subsequently finalize their decision (human-AI team). In Study 2 (after a memory wash-out period), the same participants first received aggregated performance statistics from Study 1, specifically their own performance, the AI’s performance, and their human-AI team performance, and then directly viewed the AI’s prediction before making their diagnosis (i.e., no independent initial diagnosis). These two workflows represent realistic ways that clinical AI tools might be used in practice, where the second study simulates a scenario where doctors can adjust their reliance and trust on AI based on prior performance feedback. Our findings show that, while human-AI teams consistently outperform humans alone, they still underperform the AI due to under-reliance, similar to prior studies with crowdworkers. Providing clinicians with performance feedback did not significantly improve the performance of human-AI teams, although showing AI decisions in advance nudges people to follow AI more. Meanwhile, we observe that the ensemble of human-AI teams can outperform AI alone, suggesting promising directions for human-AI collaboration.

Artificial Intelligence

[AI-0] HOG-Diff: Higher-Order Guided Diffusion for Graph Generation

Link: https://arxiv.org/abs/2502.04308
Authors: Yiming Huang,Tolga Birdal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
Comments: none

Abstract:Graph generation is a critical yet challenging task as empirical analyses require a deep understanding of complex, non-Euclidean structures. Although diffusion models have recently made significant achievements in graph generation, these models typically adapt from the frameworks designed for image generation, making them ill-suited for capturing the topological properties of graphs. In this work, we propose a novel Higher-order Guided Diffusion (HOG-Diff) model that follows a coarse-to-fine generation curriculum and is guided by higher-order information, enabling the progressive generation of plausible graphs with inherent topological structures. We further prove that our model exhibits a stronger theoretical guarantee than classical diffusion frameworks. Extensive experiments on both molecular and generic graph generation tasks demonstrate that our method consistently outperforms or remains competitive with state-of-the-art baselines. Our code is available at this https URL.

[AI-1] DexterityGen: Foundation Controller for Unprecedented Dexterity

Link: https://arxiv.org/abs/2502.04307
Authors: Zhao-Heng Yin,Changhao Wang,Luis Pineda,Francois Hogan,Krishna Bodduluri,Akash Sharma,Patrick Lancaster,Ishita Prasad,Mrinal Kalakrishnan,Jitendra Malik,Mike Lambeta,Tingfan Wu,Pieter Abbeel,Mustafa Mukadam
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Project: this https URL

Abstract:Teaching robots dexterous manipulation skills, such as tool use, presents a significant challenge. Current approaches can be broadly categorized into two strategies: human teleoperation (for imitation learning) and sim-to-real reinforcement learning. The first approach is difficult as it is hard for humans to produce safe and dexterous motions on a different embodiment without touch feedback. The second RL-based approach struggles with the domain gap and involves highly task-specific reward engineering on complex tasks. Our key insight is that RL is effective at learning low-level motion primitives, while humans excel at providing coarse motion commands for complex, long-horizon tasks. Therefore, the optimal solution might be a combination of both approaches. In this paper, we introduce DexterityGen (DexGen), which uses RL to pretrain large-scale dexterous motion primitives, such as in-hand rotation or translation. We then leverage this learned dataset to train a dexterous foundational controller. In the real world, we use human teleoperation as a prompt to the controller to produce highly dexterous behavior. We evaluate the effectiveness of DexGen in both simulation and real world, demonstrating that it is a general-purpose controller that can realize input dexterous manipulation commands and significantly improves stability by 10-100x measured as duration of holding objects across diverse tasks. Notably, with DexGen we demonstrate unprecedented dexterous skills including diverse object reorientation and dexterous tool use such as pen, syringe, and screwdriver for the first time.

[AI-2] Strong Equivalence in Answer Set Programming with Constraints

Link: https://arxiv.org/abs/2502.04302
Authors: Pedro Cabalar,Jorge Fandinno,Torsten Schaub,Philipp Wanko
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 30 pages

Abstract:We investigate the concept of strong equivalence within the extended framework of Answer Set Programming with constraints. Two groups of rules are considered strongly equivalent if, informally speaking, they have the same meaning in any context. We demonstrate that, under certain assumptions, strong equivalence between rule sets in this extended setting can be precisely characterized by their equivalence in the logic of Here-and-There with constraints. Furthermore, we present a translation from the language of several clingo-based answer set solvers that handle constraints into the language of Here-and-There with constraints. This translation enables us to leverage the logic of Here-and-There to reason about strong equivalence within the context of these solvers. We also explore the computational complexity of determining strong equivalence in this context.

[AI-3] Every Call is Precious: Global Optimization of Black-Box Functions with Unknown Lipschitz Constants AISTATS2025

Link: https://arxiv.org/abs/2502.04290
Authors: Fares Fourati,Salma Kharrat,Vaneet Aggarwal,Mohamed-Slim Alouini
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: Accepted at AISTATS 2025

Abstract:Optimizing expensive, non-convex, black-box Lipschitz continuous functions presents significant challenges, particularly when the Lipschitz constant of the underlying function is unknown. Such problems often demand numerous function evaluations to approximate the global optimum, which can be prohibitive in terms of time, energy, or resources. In this work, we introduce Every Call is Precious (ECP), a novel global optimization algorithm that minimizes unpromising evaluations by strategically focusing on potentially optimal regions. Unlike previous approaches, ECP eliminates the need to estimate the Lipschitz constant, thereby avoiding additional function evaluations. ECP guarantees no-regret performance for infinite evaluation budgets and achieves minimax-optimal regret bounds within finite budgets. Extensive ablation studies validate the algorithm’s robustness, while empirical evaluations show that ECP outperforms 10 benchmark algorithms including Lipschitz, Bayesian, bandits, and evolutionary methods across 30 multi-dimensional non-convex synthetic and real-world optimization problems, which positions ECP as a competitive approach for global optimization.
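
The notion of skipping "unpromising" evaluations can be illustrated with the standard Lipschitz acceptance test: for a maximization problem with an assumed Lipschitz constant k, a candidate x is worth evaluating only if the upper bound min_i (f(x_i) + k * ||x - x_i||) reaches the best value seen so far. The sketch below is a loose illustration and not the ECP algorithm from the paper. The growth schedule for the constant and the parameters `k0`, `growth`, and `patience` are hypothetical.

```python
import numpy as np

def is_potential_maximizer(x, points, values, lipschitz):
    """True if a function with the given Lipschitz constant, consistent with
    the observed (points, values), could still exceed the current best at x."""
    dists = np.linalg.norm(points - x, axis=1)
    upper_bound = np.min(values + lipschitz * dists)
    return upper_bound >= np.max(values)

def lipschitz_filtered_search(f, low, high, budget, k0=1e-2, growth=2.0,
                              patience=50, seed=0):
    """Random search over a box that only spends evaluations on candidates
    passing the test above. The assumed constant starts small and is
    enlarged after `patience` consecutive rejections (hypothetical schedule)."""
    rng = np.random.default_rng(seed)
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    x0 = rng.uniform(low, high)
    points, values = [x0], [f(x0)]
    k, rejections = k0, 0
    while len(values) < budget:
        x = rng.uniform(low, high)
        if is_potential_maximizer(x, np.array(points), np.array(values), k):
            points.append(x)
            values.append(f(x))  # the only place the expensive function is called
            rejections = 0
        else:
            rejections += 1
            if rejections >= patience:
                k *= growth
                rejections = 0
    best = int(np.argmax(values))
    return points[best], values[best]
```

The function `f` is called exactly `budget` times, and rejected candidates cost only a distance computation, which is the trade-off the abstract describes in general terms.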

[AI-4] Free Energy Risk Metrics for Systemically Safe AI: Gatekeeping Multi-Agent Study

Link: https://arxiv.org/abs/2502.04249
Authors: Michael Walters,Rafael Kaufmann,Justice Sefas,Thomas Kopinski
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Comments: 9 pages, 1 figure

Abstract:We investigate the Free Energy Principle as a foundation for measuring risk in agentic and multi-agent systems. From these principles we introduce a Cumulative Risk Exposure metric that is flexible to differing contexts and needs. We contrast this to other popular theories for safe AI that hinge on massive amounts of data or describing arbitrarily complex world models. In our framework, stakeholders need only specify their preferences over system outcomes, providing straightforward and transparent decision rules for risk governance and mitigation. This framework naturally accounts for uncertainty in both world model and preference model, allowing for decision-making that is epistemically and axiologically humble, parsimonious, and future-proof. We demonstrate this novel approach in a simplified autonomous vehicle environment with multi-agent vehicles whose driving policies are mediated by gatekeepers that evaluate, in an online fashion, the risk to the collective safety in their neighborhood, and intervene through each vehicle’s policy when appropriate. We show that the introduction of gatekeepers in an AV fleet, even at low penetration, can generate significant positive externalities in terms of increased system safety.

[AI-5] A Theoretical Framework for Data Efficient Multi-Source Transfer Learning Based on Cramer-Rao Bound

Link: https://arxiv.org/abs/2502.04242
Authors: Qingyue Zhang,Haohao Fu,Guanbo Huang,Yaoyuan Liang,Chang Chu,Tianren Peng,Yanru Wu,Qi Li,Yang Li,Shao-Lun Huang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Multi-source transfer learning provides an effective solution to data scarcity in real-world supervised learning scenarios by leveraging multiple source tasks. In this field, existing works typically use all available samples from sources in training, which constrains their training efficiency and may lead to suboptimal results. To address this, we propose a theoretical framework that answers the question: what is the optimal quantity of source samples needed from each source task to jointly train the target model? Specifically, we introduce a generalization error measure that aligns with cross-entropy loss, and minimize it based on the Cramér-Rao Bound to determine the optimal transfer quantity for each source task. Additionally, we develop an architecture-agnostic and data-efficient algorithm OTQMS to implement our theoretical results for training deep multi-source transfer learning models. Experimental studies on diverse architectures and two real-world benchmark datasets show that our proposed algorithm significantly outperforms state-of-the-art approaches in both accuracy and data efficiency. The code and supplementary materials are available in this https URL.

[AI-6] XAttnMark: Learning Robust Audio Watermarking with Cross-Attention

Link: https://arxiv.org/abs/2502.04230
Authors: Yixin Liu,Lie Lu,Jihui Jin,Lichao Sun,Andrea Fanelli
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 24 pages, 10 figures

Abstract:The rapid proliferation of generative audio synthesis and editing technologies has raised significant concerns about copyright infringement, data provenance, and the spread of misinformation through deepfake audio. Watermarking offers a proactive solution by embedding imperceptible, identifiable, and traceable marks into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to achieve both robust detection and accurate attribution simultaneously. This paper introduces Cross-Attention Robust Audio Watermark (XAttnMark), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned temporal-frequency masking loss that captures fine-grained auditory masking effects, enhancing watermark imperceptibility. Our approach achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing with strong editing strength. The project webpage is available at this https URL.

[AI-7] Dark Distillation: Backdooring Distilled Datasets without Accessing Raw Data

Link: https://arxiv.org/abs/2502.04229
Authors: Ziyuan Yang,Ming Yan,Yi Zhang,Joey Tianyi Zhou
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Dataset distillation (DD) enhances training efficiency and reduces bandwidth by condensing large datasets into smaller synthetic ones. It enables models to achieve performance comparable to those trained on the raw full dataset and has become a widely adopted method for data sharing. However, security concerns in DD remain underexplored. Existing studies typically assume that malicious behavior originates from dataset owners during the initial distillation process, where backdoors are injected into raw datasets. In contrast, this work is the first to address a more realistic and concerning threat: attackers may intercept the dataset distribution process, inject backdoors into the distilled datasets, and redistribute them to users. While distilled datasets were previously considered resistant to backdoor attacks, we demonstrate that they remain vulnerable to such attacks. Furthermore, we show that attackers do not even require access to any raw data to inject the backdoors successfully. Specifically, our approach reconstructs conceptual archetypes for each class from the model trained on the distilled dataset. Backdoors are then injected into these archetypes to update the distilled dataset. Moreover, we ensure the updated dataset not only retains the backdoor but also preserves the original optimization trajectory, thus maintaining the knowledge of the raw dataset. To achieve this, a hybrid loss is designed to integrate backdoor information along the benign optimization trajectory, ensuring that previously learned information is not forgotten. Extensive experiments demonstrate that distilled datasets are highly vulnerable to backdoor attacks, with risks pervasive across various raw datasets, distillation methods, and downstream training strategies. Moreover, our attack method is efficient, capable of synthesizing a malicious distilled dataset in under one minute in certain cases.

[AI-8] NLP-Based .NET CLR Event Logs Analyzer

Link: https://arxiv.org/abs/2502.04219
Authors: Maxim Stavtsev,Sergey Shershakov
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we present a tool for analyzing .NET CLR event logs based on a novel method inspired by Natural Language Processing (NLP) approaches. Our research addresses the growing need for effective monitoring and optimization of software systems through detailed event log analysis. We utilize a BERT-based architecture with an enhanced tokenization process customized to event logs. The tool, developed using Python, its libraries, and an SQLite database, allows both conducting experiments for academic purposes and efficiently solving industry-emerging tasks. Our experiments demonstrate the efficacy of our approach in compressing event sequences, detecting recurring patterns, and identifying anomalies. The trained model shows promising results, with a high accuracy rate in anomaly detection, which demonstrates the potential of NLP methods to improve the reliability and stability of software systems.

[AI-9] Algorithmic causal structure emerging through compression

Link: https://arxiv.org/abs/2502.04210
Authors: Liang Wendong,Simon Buchholz,Bernhard Schölkopf
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Information Theory (cs.IT)
Comments:

Abstract:We explore the relationship between causality, symmetry, and compression. We build on and generalize the known connection between learning and compression to a setting where causal models are not identifiable. We propose a framework where causality emerges as a consequence of compressing data across multiple environments. We define algorithmic causality as an alternative definition of causality when traditional assumptions for causal identifiability do not hold. We demonstrate how algorithmic causal and symmetric structures can emerge from minimizing upper bounds on Kolmogorov complexity, without knowledge of intervention targets. We hypothesize that these insights may also provide a novel perspective on the emergence of causality in machine learning models, such as large language models, where causal relationships may not be explicitly identifiable.
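Kolmogorov complexity is uncomputable, but the output length of any fixed compressor bounds it from above up to an additive constant (the size of the decompressor). The sketch below illustrates only that general idea with zlib; it is not the paper's method, and the helper names `compressed_length` and `pseudo_random_bytes` are hypothetical.

```python
import hashlib
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of zlib-compressed data: a crude, computable
    upper-bound proxy for Kolmogorov complexity (up to a constant)."""
    return len(zlib.compress(data, 9))


def pseudo_random_bytes(n: int, seed: bytes = b"seed") -> bytes:
    """Deterministic, hard-to-compress bytes built by chaining SHA-256."""
    out = bytearray()
    block = seed
    while len(out) < n:
        block = hashlib.sha256(block).digest()
        out.extend(block)
    return bytes(out[:n])


structured = b"ab" * 2048          # 4096 bytes with an obvious regularity
noisy = pseudo_random_bytes(4096)  # 4096 bytes with no pattern zlib can exploit

print(compressed_length(structured), compressed_length(noisy))
```

The regular sequence compresses to a small fraction of its size while the hash-chained one does not, which is the sense in which shorter descriptions reveal structure.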

[AI-10] Archetypal Analysis for Binary Data ICASSP2025

Link: https://arxiv.org/abs/2502.04172
Authors: A. Emilie J. Wedenborg,Morten Mørup
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 5 pages, Accepted at ICASSP 2025

Abstract:Archetypal analysis (AA) is a matrix decomposition method that identifies distinct patterns using convex combinations of the data points denoted archetypes with each data point in turn reconstructed as convex combinations of the archetypes. AA thereby forms a polytope representing trade-offs of the distinct aspects in the data. Most existing methods for AA are designed for continuous data and do not exploit the structure of the data distribution. In this paper, we propose two new optimization frameworks for archetypal analysis for binary data. i) A second order approximation of the AA likelihood based on the Bernoulli distribution with efficient closed-form updates using an active set procedure for learning the convex combinations defining the archetypes, and a sequential minimal optimization strategy for learning the observation specific reconstructions. ii) A Bernoulli likelihood based version of the principal convex hull analysis (PCHA) algorithm originally developed for least squares optimization. We compare these approaches with the only existing binary AA procedure relying on multiplicative updates and demonstrate their superiority on both synthetic and real binary data. Notably, the proposed optimization frameworks for AA can easily be extended to other data distributions providing generic efficient optimization frameworks for AA based on tailored likelihood functions reflecting the underlying data distribution.
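The decomposition described in the abstract is commonly written as follows. The notation is assumed here rather than taken from the paper: X is the M x N data matrix, the columns c_k of C define K archetypes as convex combinations of data points, and the columns s_n of S reconstruct each data point from the archetypes. The first line is the usual least-squares objective; the second is one plausible Bernoulli log-likelihood for binary X, included as an assumption and not necessarily the exact objective the authors optimize.

```latex
\min_{C,\,S}\; \lVert X - X C S \rVert_F^{2}
\quad \text{s.t.}\quad
c_k \ge 0,\ \mathbf{1}^{\top} c_k = 1,\qquad
s_n \ge 0,\ \mathbf{1}^{\top} s_n = 1

\max_{C,\,S}\; \sum_{m,n} \Bigl[ x_{mn} \log p_{mn} + (1 - x_{mn}) \log (1 - p_{mn}) \Bigr],
\qquad P = X C S
```

For binary X, every entry of XCS is a convex combination of zeros and ones and therefore lies in [0, 1], so it can be read directly as a Bernoulli probability.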

[AI-11] Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs

Link: https://arxiv.org/abs/2502.04140
Authors: Jost Arndt,Utku Isil,Michael Detzel,Wojciech Samek,Jackie Ma
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Currently under review

Abstract:Many physical processes can be expressed through partial differential equations (PDEs). Real-world measurements of such processes are often collected at irregularly distributed points in space, which can be effectively represented as graphs; however, there are currently only a few existing datasets. Our work aims to make advancements in the field of PDE-modeling accessible to the temporal graph machine learning community, while addressing the data scarcity problem, by creating and utilizing datasets based on PDEs. In this work, we create and use synthetic datasets based on PDEs to support spatio-temporal graph modeling in machine learning for different applications. More precisely, we showcase three equations to model different types of disasters and hazards in the fields of epidemiology, atmospheric particles, and tsunami waves. Further, we show how such created datasets can be used by benchmarking several machine learning models on the epidemiological dataset. Additionally, we show how pre-training on this dataset can improve model performance on real-world epidemiological data. The presented methods enable others to create datasets and benchmarks customized to individual requirements. The source code for our methodology and the three created datasets can be found on this https URL.

[AI-12] Ancient Greek Technology: An Immersive Learning Use Case Described Using a Co-Intelligent Custom ChatGPT Assistant

Link: https://arxiv.org/abs/2502.04110
Authors: Vlasis Kasapakis,Leonel Morgado
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 5 pages, presented at the 2024 IEEE 3rd International Conference on Intelligent Reality (ICIR 2024), 6th of December, 2024

Abstract:Achieving consistency in immersive learning case descriptions is essential but challenging due to variations in research focus, methodology, and researchers’ background. We address these challenges by leveraging the Immersive Learning Case Sheet (ILCS), a methodological instrument to standardize case descriptions, which we applied to an immersive learning case on ancient Greek technology in VRChat. Research team members had differing levels of familiarity with the ILCS and the case content, so we developed a custom ChatGPT assistant to facilitate consistent terminology and process alignment across the team. This paper constitutes an example of how structured case reports can be a novel contribution to immersive learning literature. Our findings demonstrate how the ILCS supports structured reflection and interpretation of the case. Further, we report that the use of a ChatGPT assistant significantly supports the coherence and quality of the team members’ development of the final ILCS. This exposes the potential of employing AI-driven tools to enhance collaboration and standardization of research practices in qualitative educational research. However, we also discuss the limitations and challenges, including reliance on AI for interpretive tasks and managing varied levels of expertise within the team. This study thus provides insights into the practical application of AI in standardizing immersive learning research processes.

[AI-13] VTutor: An Open-Source SDK for Generative AI-Powered Animated Pedagogical Agents with Multi-Media Output

Link: https://arxiv.org/abs/2502.04103
Authors: Eason Chen,Chengyu Lin,Xinyi Tang,Aprille Xi,Canwen Wang,Jionghao Lin,Kenneth R Koedinger
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:The rapid evolution of large language models (LLMs) has transformed human-computer interaction (HCI), but interaction with LLMs currently focuses mainly on text, while other multimodal approaches remain under-explored. This paper introduces VTutor, an open-source Software Development Kit (SDK) that combines generative AI with advanced animation technologies to create engaging, adaptable, and realistic animated pedagogical agents (APAs) for human-AI multi-media interactions. VTutor leverages LLMs for real-time personalized feedback, advanced lip synchronization for natural speech alignment, and WebGL rendering for seamless web integration. Supporting various 2D and 3D character models, VTutor enables researchers and developers to design emotionally resonant, contextually adaptive learning agents. This toolkit enhances learner engagement, feedback receptivity, and human-AI interaction while promoting trustworthy AI principles in education. VTutor sets a new standard for next-generation APAs, offering an accessible, scalable solution for fostering meaningful and immersive human-AI interaction experiences. The VTutor project is open-sourced and welcomes community-driven contributions and showcases.

[AI-14] Strategic Learning with Local Explanations as Feedback

Link: https://arxiv.org/abs/2502.04058
Authors: Kiet Q. H. Vo,Siu Lun Chau,Masahiro Kato,Yixin Wang,Krikamol Muandet
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We investigate algorithmic decision problems where agents can respond strategically to the decision maker’s (DM) models. The demand for clear and actionable explanations from DMs to (potentially strategic) agents continues to rise. While prior work often treats explanations as full model disclosures, explanations in practice might convey only partial information, which can lead to misinterpretations and harmful responses. When full disclosure of the predictive model is neither feasible nor desirable, a key open question is how DMs can use explanations to maximise their utility without compromising agent welfare. In this work, we explore well-known local and global explanation methods, and establish a necessary condition to prevent explanations from misleading agents into self-harming actions. Moreover, with conditional homogeneity, we establish that action recommendation (AR)-based explanations are sufficient for non-harmful responses, akin to the revelation principle in information design. To operationalise AR-based explanations, we propose a simple algorithm to jointly optimise the predictive model and AR policy to balance DM outcomes with agent welfare. Our empirical results demonstrate the benefits of this approach as a more refined strategy for safe and effective partial model disclosure in algorithmic decision-making.

[AI-15] Probe-Free Low-Rank Activation Intervention NAACL2025

Link: https://arxiv.org/abs/2502.04043
Authors: Chonghe Jiang,Bao Nguyen,Anthony Man-Cho So,Viet Anh Nguyen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by NAACL 2025

Abstract:Language models (LMs) can produce texts that appear accurate and coherent but contain untruthful or toxic content. Inference-time interventions that edit the hidden activations have shown promising results in steering the LMs towards desirable generations. Existing activation intervention methods often comprise an activation probe to detect undesirable generation, triggering the activation modification to steer subsequent generation. This paper proposes a probe-free intervention method FLORAIN for all attention heads in a specific activation layer. It eliminates the need to train classifiers for probing purposes. The intervention function is parametrized by a sample-wise nonlinear low-rank mapping, which is trained by minimizing the distance between the modified activations and their projection onto the manifold of desirable content. Under specific constructions of the manifold and projection distance, we show that the intervention strategy can be computed efficiently by solving a smooth optimization problem. The empirical results, benchmarked on multiple base models, demonstrate that FLORAIN consistently outperforms several baseline methods in enhancing model truthfulness and quality across generation and multiple-choice tasks.

[AI-16] Generalize Drug Response Prediction by Latent Independent Projection for Asymmetric Constrained Domain Generalization

Link: https://arxiv.org/abs/2502.04034
Authors: Ran Song,Yinpu Bai,Hui Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The accurate prediction of drug responses remains a formidable challenge, particularly at the single-cell level and in clinical treatment contexts. Some studies employ transfer learning techniques to predict drug responses in individual cells and patients, but they require access to target-domain data during training, which is often unavailable or only obtainable in the future. In this study, we propose a novel domain generalization framework, termed panCancerDR, to address this challenge. We conceptualize each cancer type as a distinct source domain, with its cell lines serving as domain-specific samples. Our primary objective is to extract domain-invariant features from the expression profiles of cell lines across diverse cancer types, thereby generalizing the predictive capacity to out-of-distribution samples. To enhance robustness, we introduce a latent independence projection (LIP) module that encourages the encoder to extract informative yet non-redundant features. We also propose an asymmetric adaptive clustering constraint, which clusters drug-sensitive samples into a compact group while dispersing resistant samples across separate clusters in the latent space. Our empirical experiments demonstrate that panCancerDR effectively learns task-relevant features from diverse source domains and achieves accurate predictions of drug response for cancer types unseen during training. Furthermore, when evaluated on single-cell and patient-level prediction tasks, our model, trained solely on in vitro cell line data without access to target-domain information, consistently matches or outperforms current state-of-the-art methods. These findings highlight the potential of our method for real-world clinical applications.

[AI-17] Fine, I'll Merge It Myself: A Multi-Fidelity Framework for Automated Model Merging

Link: https://arxiv.org/abs/2502.04030
Authors: Guinan Su,Jonas Geiping
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Reasoning capabilities represent a critical frontier for large language models (LLMs), but developing them requires extensive proprietary datasets and computational resources. One way to efficiently supplement capabilities is model merging, which offers a promising alternative by combining multiple models without retraining. However, current merging approaches rely on manually-designed strategies for merging hyperparameters, limiting the exploration of potential model combinations and requiring significant human effort. We propose an Automated Model Merging Framework that enables fine-grained exploration of merging strategies while reducing costs through multi-fidelity approximations. We support both single and multi-objective optimization and introduce two novel search spaces: layerwise fusion (LFS) and depth-wise integration (DIS). Evaluating across a number of benchmarks, we find that the search autonomously finds 1) Merges that further boost single-objective performance, even on tasks the model has already been finetuned on, and 2) Merges that optimize multi-objective frontiers across tasks. Effective merges are found with limited compute, e.g. in fewer than 500 search steps.
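To make "combining multiple models without retraining" concrete, here is a minimal sketch of the simplest merging baseline, per-layer linear interpolation of weights. It is not the paper's LFS or DIS search space; the function `merge_linear`, the toy checkpoints, and the per-layer coefficients are all hypothetical. Coefficients like `alpha` are the kind of merging hyperparameter that an automated search could tune.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def merge_linear(a: Weights, b: Weights, alpha: Dict[str, float]) -> Weights:
    """Interpolate two checkpoints layer by layer:
    merged = alpha[name] * a[name] + (1 - alpha[name]) * b[name]."""
    if a.keys() != b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: Weights = {}
    for name in a:
        t = alpha.get(name, 0.5)  # fall back to a plain average
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [t * x + (1.0 - t) * y for x, y in zip(a[name], b[name])]
    return merged


# Toy checkpoints with flattened parameters (illustrative values only).
model_a = {"layer0": [1.0, 2.0], "layer1": [0.0, 4.0]}
model_b = {"layer0": [3.0, 0.0], "layer1": [2.0, 0.0]}

merged = merge_linear(model_a, model_b, {"layer0": 1.0, "layer1": 0.25})
print(merged)
```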

[AI-18] Automating a Complete Software Test Process Using LLMs: An Automotive Case Study ICSE

Link: https://arxiv.org/abs/2502.04008
Authors: Shuai Wang,Yinan Yu,Robert Feldt,Dhasarathy Parthasarathy
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted by International Conference on Software Engineering (ICSE) 2025

Abstract:Vehicle API testing verifies whether the interactions between a vehicle’s internal systems and external applications meet expectations, ensuring that users can access and control various vehicle functions and data. However, this task is inherently complex, requiring the alignment and coordination of API systems, communication protocols, and even vehicle simulation systems to develop valid test cases. In practical industrial scenarios, inconsistencies, ambiguities, and interdependencies across various documents and system specifications pose significant challenges. This paper presents a system designed for the automated testing of in-vehicle APIs. By clearly defining and segmenting the testing process, we enable Large Language Models (LLMs) to focus on specific tasks, ensuring a stable and controlled testing workflow. Experiments conducted on over 100 APIs demonstrate that our system effectively automates vehicle API testing. The results also confirm that LLMs can efficiently handle mundane tasks requiring human judgment, making them suitable for complete automation in similar industrial contexts.

[AI-19] Online Learning of Counter Categories and Ratings in PvP Games

Link: https://arxiv.org/abs/2502.03998
Authors: Chiu-Chou Lin,I-Chen Wu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:

Abstract:In competitive games, strength ratings like Elo are widely used to quantify player skill and support matchmaking by accounting for skill disparities better than simple win rate statistics. However, scalar ratings cannot handle complex intransitive relationships, such as counter strategies seen in Rock-Paper-Scissors. To address this, recent work introduced Neural Rating Table and Neural Counter Table, which combine scalar ratings with discrete counter categories to model intransitivity. While effective, these methods rely on neural network training and cannot perform real-time updates. In this paper, we propose an online update algorithm that extends Elo principles to incorporate real-time learning of counter categories. Our method dynamically adjusts both ratings and counter relationships after each match, preserving the explainability of scalar ratings while addressing intransitivity. Experiments on zero-sum competitive games demonstrate its practicality, particularly in scenarios without complex team compositions.
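For reference, the classic Elo update that this line of work extends can be sketched as below. This is the standard scalar rule only, not the paper's counter-category algorithm; the K-factor of 32 and the function names are assumptions chosen for illustration.

```python
from typing import Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Return updated (r_a, r_b) after one match.
    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta  # zero-sum: B loses what A gains


new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)
```

Because each player is summarized by a single number, this rule always implies a transitive ordering, so it cannot represent a Rock-Paper-Scissors cycle in which every strategy beats one opponent and loses to another.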

[AI-20] Towards Unified Music Emotion Recognition across Dimensional and Categorical Models

Link: https://arxiv.org/abs/2502.03979
Authors: Jaeyong Kang,Dorien Herremans
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:One of the most significant challenges in Music Emotion Recognition (MER) comes from the fact that emotion labels can be heterogeneous across datasets with regard to the emotion representation, including categorical (e.g., happy, sad) versus dimensional labels (e.g., valence-arousal). In this paper, we present a unified multitask learning framework that combines these two types of labels and is thus able to be trained on multiple datasets. This framework uses an effective input representation that combines musical features (i.e., key and chords) and MERT embeddings. Moreover, knowledge distillation is employed to transfer the knowledge of teacher models trained on individual datasets to a student model, enhancing its ability to generalize across multiple tasks. To validate our proposed framework, we conducted extensive experiments on a variety of datasets, including MTG-Jamendo, DEAM, PMEmo, and EmoMusic. According to our experimental results, the inclusion of musical features, multitask learning, and knowledge distillation significantly enhances performance. In particular, our model outperforms the state-of-the-art models, including the best-performing model from the MediaEval 2021 competition on the MTG-Jamendo dataset. Our work makes a significant contribution to MER by allowing the combination of categorical and dimensional emotion labels in one unified framework, thus enabling training across datasets.

[AI-21] Adaptation of Task Goal States from Prior Knowledge

Link: https://arxiv.org/abs/2502.03918
Authors: Andrei Costinescu,Darius Burschka
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents a framework to define a task with freedom and variability in its goal state. A robot could use this to observe the execution of a task and target a different goal from the observed one; a goal that is still compatible with the task description but would be easier for the robot to execute. We define the model of an environment state and an environment variation, and present experiments on how to interactively create the variation from a single task demonstration and how to use this variation to create an execution plan for bringing any environment into the goal state.

[AI-22] Rank Also Matters: Hierarchical Configuration for Mixture of Adapter Experts in LLM Fine-Tuning

Link: https://arxiv.org/abs/2502.03884
Authors: Peizhuang Cong,Wenpu Liu,Wenhan Yu,Haochen Zhao,Tong Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have demonstrated remarkable success across various tasks, accompanied by a continuous increase in their parameter size. Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), address the challenges of fine-tuning LLMs by significantly reducing the number of trainable parameters. Recent studies have integrated LoRA with Mixture of Experts (MoE) architectures, leveraging multiple adapter experts and gating mechanisms to further improve fine-tuning performance. However, existing approaches primarily focus on adjusting the allocations of adapter experts per layer to optimize the introduced trainable parameter size, while neglecting a critical factor of adapters’ rank. To this end, we propose a hierarchical scheme for expert allocation and rank configuration, HILO, which dynamically adjusts the number and rank of adapter experts across layers, matching the varying representational complexity of model layers in adapter-granularity. Extensive experiments on multiple benchmark tasks demonstrate that HILO outperforms existing methods in accuracy while introducing fewer trainable parameters, providing an efficient and practical solution for fine-tuning LLMs.
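The role of rank can be seen from a quick parameter count. A LoRA adapter on a d_out x d_in weight matrix trains two factors of shapes d_out x r and r x d_in, so it adds r * (d_in + d_out) parameters. The sketch below compares a uniform expert layout with a varied one; both layouts and the 4096-wide layer are hypothetical examples, not HILO's actual configuration, and gating parameters are ignored.

```python
from typing import List


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_in + d_out)


def total_adapter_params(ranks_per_layer: List[List[int]], d_in: int, d_out: int) -> int:
    """Sum adapter parameters over layers, one list of expert ranks per layer."""
    return sum(lora_params(d_in, d_out, r) for layer in ranks_per_layer for r in layer)


D = 4096  # hypothetical hidden size

# Uniform: 4 layers, 4 experts each, all rank 8.
uniform = [[8, 8, 8, 8] for _ in range(4)]
# Varied: both the number of experts and their ranks change across layers.
varied = [[4, 4], [8, 8], [8, 8, 8], [16, 16, 16, 16]]

print(D * D)                                # full fine-tuning of one matrix
print(total_adapter_params(uniform, D, D))  # uniform adapter experts
print(total_adapter_params(varied, D, D))   # varied adapter experts
```

Since the count is linear in rank, moving rank between layers changes the parameter budget exactly as much as adding or removing experts does, which is why the two can be tuned jointly.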

[AI-23] Large Language Models for Multi-Robot Systems: A Survey

Link: https://arxiv.org/abs/2502.03814
Authors: Peihan Li,Zijian An,Shams Abrar,Lifeng Zhou
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid advancement of Large Language Models (LLMs) has opened new possibilities in Multi-Robot Systems (MRS), enabling enhanced communication, task planning, and human-robot interaction. Unlike traditional single-robot and multi-agent systems, MRS poses unique challenges, including coordination, scalability, and real-world adaptability. This survey provides the first comprehensive exploration of LLM integration into MRS. It systematically categorizes their applications across high-level task allocation, mid-level motion planning, low-level action generation, and human intervention. We highlight key applications in diverse domains, such as household robotics, construction, formation control, target tracking, and robot games, showcasing the versatility and transformative potential of LLMs in MRS. Furthermore, we examine the challenges that limit adapting LLMs in MRS, including mathematical reasoning limitations, hallucination, latency issues, and the need for robust benchmarking systems. Finally, we outline opportunities for future research, emphasizing advancements in fine-tuning, reasoning techniques, and task-specific models. This survey aims to guide researchers in the intelligence and real-world deployment of MRS powered by LLMs. Based on the fast-evolving nature of research in the field, we keep updating the papers in the open-source Github repository.

[AI-24] Understanding and Supporting Formal Email Exchange by Answering AI-Generated Questions

Link: https://arxiv.org/abs/2502.03804
Authors: Yusuke Miura,Chi-Lan Yang,Masaki Kuribayashi,Keigo Matsumoto,Hideaki Kuzuoka,Shigeo Morishima
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Replying to formal emails is time-consuming and cognitively demanding, as it requires polite phrasing and ensuring an adequate response to the sender’s demands. Although systems with Large Language Models (LLMs) were designed to simplify the email replying process, users still needed to provide detailed prompts to obtain the expected output. Therefore, we proposed and evaluated an LLM-powered question-and-answer (QA)-based approach for users to reply to emails by answering a set of simple and short questions generated from the incoming email. We developed a prototype system, ResQ, and conducted controlled and field experiments with 12 and 8 participants, respectively. Our results demonstrated that the QA-based approach improves the efficiency of replying to emails and reduces workload while maintaining email quality compared to a conventional prompt-based approach that requires users to craft appropriate prompts to obtain email drafts. We discuss how the QA-based approach influences the email reply process and interpersonal relationship dynamics, as well as the opportunities and challenges associated with using a QA-based approach in AI-mediated communication.

[AI-25] SoK: Benchmarking Poisoning Attacks and Defenses in Federated Learning

Link: https://arxiv.org/abs/2502.03801
Authors: Heyi Zhang,Yule Liu,Xinlei He,Jun Wu,Tianshuo Cong,Xinyi Huang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Federated learning (FL) enables collaborative model training while preserving data privacy, but its decentralized nature exposes it to client-side data poisoning attacks (DPAs) and model poisoning attacks (MPAs) that degrade global model performance. While numerous proposed defenses claim substantial effectiveness, their evaluation is typically done in isolation with limited attack strategies, raising concerns about their validity. Additionally, existing studies overlook the mutual effectiveness of defenses against both DPAs and MPAs, causing fragmentation in this field. This paper aims to provide a unified benchmark and analysis of defenses against DPAs and MPAs, clarifying the distinction between these two similar but slightly distinct domains. We present a systematic taxonomy of poisoning attacks and defense strategies, outlining their design, strengths, and limitations. Then, a unified comparative evaluation across FL algorithms and data heterogeneity is conducted to validate their individual and mutual effectiveness and derive key insights for design principles and future research. Along with the analysis, we frame our work to a unified benchmark, FLPoison, with high modularity and scalability to evaluate 15 representative poisoning attacks and 17 defense strategies, facilitating future research in this domain. Code is available at this https URL.

[AI-26] ExpProof: Operationalizing Explanations for Confidential Models with ZKPs

Link: https://arxiv.org/abs/2502.03773
Authors: Chhavi Yadav,Evan Monroe Laufer,Dan Boneh,Kamalika Chaudhuri
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:In principle, explanations are intended as a way to increase trust in machine learning models and are often obligated by regulations. However, many circumstances where these are demanded are adversarial in nature, meaning the involved parties have misaligned interests and are incentivized to manipulate explanations for their purpose. As a result, explainability methods fail to be operational in such settings despite the demand (Bordt et al., 2022). In this paper, we take a step towards operationalizing explanations in adversarial scenarios with Zero-Knowledge Proofs (ZKPs), a cryptographic primitive. Specifically, we explore ZKP-amenable versions of the popular explainability algorithm LIME and evaluate their performance on Neural Networks and Random Forests.

[AI-27] PRISM: A Robust Framework for Skill-based Meta-Reinforcement Learning with Noisy Demonstrations ICML2025

Link: https://arxiv.org/abs/2502.03752
Authors: Sanghyeon Lee,Sangjun Bae,Yisak Park,Seungyul Han
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages main, 19 pages appendix with reference. Submitted to ICML 2025

Abstract:Meta-reinforcement learning (Meta-RL) facilitates rapid adaptation to unseen tasks but faces challenges in long-horizon environments. Skill-based approaches tackle this by decomposing state-action sequences into reusable skills and employing hierarchical decision-making. However, these methods are highly susceptible to noisy offline demonstrations, resulting in unstable skill learning and degraded performance. To overcome this, we propose Prioritized Refinement for Skill-Based Meta-RL (PRISM), a robust framework that integrates exploration near noisy data to generate online trajectories and combines them with offline data. Through prioritization, PRISM extracts high-quality data to learn task-relevant skills effectively. By addressing the impact of noise, our method ensures stable skill learning and achieves superior performance in long-horizon tasks, even with noisy and sub-optimal data.

[AI-28] Principal Curvatures Estimation with Applications to Single Cell Data ICASSP2025

Link: https://arxiv.org/abs/2502.03750
Authors: Yanlei Zhang,Lydia Mezrag,Xingzhi Sun,Charles Xu,Kincaid Macdonald,Dhananjay Bhaskar,Smita Krishnaswamy,Guy Wolf,Bastian Rieck
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To be published in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract:The rapidly growing field of single-cell transcriptomic sequencing (scRNAseq) presents challenges for data analysis due to its massive datasets. A common assumption in manifold learning is that datasets lie on a lower-dimensional manifold. This makes it possible to study the geometry of point clouds by extracting meaningful descriptors like curvature. In this work, we present Adaptive Local PCA (AdaL-PCA), a data-driven method for accurately estimating various notions of intrinsic curvature on data manifolds, in particular principal curvatures for surfaces. The model relies on local PCA to estimate the tangent spaces. The evaluation of AdaL-PCA on sampled surfaces shows state-of-the-art results. Combined with a PHATE embedding, the model applied to single-cell RNA sequencing data allows us to identify key variations in the cellular differentiation.

[AI-29] Multiple Invertible and Partial-Equivariant Function for Latent Vector Transformation to Enhance Disentanglement in VAEs

Link: https://arxiv.org/abs/2502.03740
Authors: Hee-Jun Jung,Jaehyoung Jeong,Kangil Kim
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages, 21 figures

Abstract:Disentanglement learning is a core issue for understanding and re-using trained information in Variational AutoEncoder (VAE), and effective inductive bias has been reported as a key factor. However, the actual implementation of such bias is still vague. In this paper, we propose a novel method, called Multiple Invertible and partial-equivariant transformation (MIPE-transformation), to inject inductive bias by 1) guaranteeing the invertibility of latent-to-latent vector transformation while preserving a certain portion of equivariance of input-to-latent vector transformation, called Invertible and partial-equivariant transformation (IPE-transformation), 2) extending the form of prior and posterior in VAE frameworks to an unrestricted form through a learnable conversion to an approximated exponential family, called Exponential Family conversion (EF-conversion), and 3) integrating multiple units of IPE-transformation and EF-conversion, and their training. In experiments on 3D Cars, 3D Shapes, and dSprites datasets, MIPE-transformation improves the disentanglement performance of state-of-the-art VAEs.

[AI-30] Action-Free Reasoning for Policy Generalization

Link: https://arxiv.org/abs/2502.03729
Authors: Jaden Clark,Suvir Mirchandani,Dorsa Sadigh,Suneel Belkhale
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 13 pages, 10 figures

Abstract:End-to-end imitation learning offers a promising approach for training robot policies. However, generalizing to new settings remains a significant challenge. Although large-scale robot demonstration datasets have shown potential for inducing generalization, they are resource-intensive to scale. In contrast, human video data is abundant and diverse, presenting an attractive alternative. Yet, these human-video datasets lack action labels, complicating their use in imitation learning. Existing methods attempt to extract grounded action representations (e.g., hand poses), but resulting policies struggle to bridge the embodiment gap between human and robot actions. We propose an alternative approach: leveraging language-based reasoning from human videos, which is essential for guiding robot actions, to train generalizable robot policies. Building on recent advances in reasoning-based policy architectures, we introduce Reasoning through Action-free Data (RAD). RAD learns from both robot demonstration data (with reasoning and action labels) and action-free human video data (with only reasoning labels). The robot data teaches the model to map reasoning to low-level actions, while the action-free data enhances reasoning capabilities. Additionally, we will release a new dataset of 3,377 human-hand demonstrations with reasoning annotations compatible with the Bridge V2 benchmark and aimed at facilitating future research on reasoning-driven robot learning. Our experiments show that RAD enables effective transfer across the embodiment gap, allowing robots to perform tasks seen only in action-free data. Furthermore, scaling up action-free reasoning data significantly improves policy performance and generalization to novel tasks. These results highlight the promise of reasoning-driven learning from action-free datasets for advancing generalizable robot control. Project page: this https URL

[AI-31] Efficiently Generating Expressive Quadruped Behaviors via Language-Guided Preference Learning

Link: https://arxiv.org/abs/2502.03717
Authors: Jaden Clark,Joey Hejna,Dorsa Sadigh
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures

Abstract:Expressive robotic behavior is essential for the widespread acceptance of robots in social environments. Recent advancements in learned legged locomotion controllers have enabled more dynamic and versatile robot behaviors. However, determining the optimal behavior for interactions with different users across varied scenarios remains a challenge. Current methods either rely on natural language input, which is efficient but low-resolution, or learn from human preferences, which, although high-resolution, is sample inefficient. This paper introduces a novel approach that leverages priors generated by pre-trained LLMs alongside the precision of preference learning. Our method, termed Language-Guided Preference Learning (LGPL), uses LLMs to generate initial behavior samples, which are then refined through preference-based feedback to learn behaviors that closely align with human expectations. Our core insight is that LLMs can guide the sampling process for preference learning, leading to a substantial improvement in sample efficiency. We demonstrate that LGPL can quickly learn accurate and expressive behaviors with as few as four queries, outperforming both purely language-parameterized models and traditional preference learning approaches. Website with videos: this https URL

[AI-32] Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models

Link: https://arxiv.org/abs/2502.03715
Authors: Rui Cai,Chao Wang,Qianyi Cai,Dazhong Shen,Hui Xiong
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Knowledge Graph-based recommendations have gained significant attention due to their ability to leverage rich semantic relationships. However, constructing and maintaining Knowledge Graphs (KGs) is resource-intensive, and the accuracy of KGs can suffer from noisy, outdated, or irrelevant triplets. Recent advancements in Large Language Models (LLMs) offer a promising way to improve the quality and relevance of KGs for recommendation tasks. Despite this, integrating LLMs into KG-based systems presents challenges, such as efficiently augmenting KGs, addressing hallucinations, and developing effective joint learning methods. In this paper, we propose the Confidence-aware KG-based Recommendation Framework with LLM Augmentation (CKG-LLMA), a novel framework that combines KGs and LLMs for recommendation tasks. The framework includes: (1) an LLM-based subgraph augmenter for enriching KGs with high-quality information, (2) a confidence-aware message propagation mechanism to filter noisy triplets, and (3) a dual-view contrastive learning method to integrate user-item interactions and KG data. Additionally, we employ a confidence-aware explanation generation process to guide LLMs in producing realistic explanations for recommendations. Finally, extensive experiments demonstrate the effectiveness of CKG-LLMA across multiple public datasets.

[AI-33] Unrealized Expectations: Comparing AI Methods vs Classical Algorithms for Maximum Independent Set

Link: https://arxiv.org/abs/2502.03669
Authors: Yikai Wu,Haoyu Zhao,Sanjeev Arora
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 24 pages, 7 figures, 8 tables

Abstract:AI methods, such as generative models and reinforcement learning, have recently been applied to combinatorial optimization (CO) problems, especially NP-hard ones. This paper compares such GPU-based methods with classical CPU-based methods on Maximum Independent Set (MIS). Experiments on standard graph families show that AI-based algorithms fail to outperform and, in many cases, to match the solution quality of the state-of-the-art classical solver KaMIS running on a single CPU. Some GPU-based methods even perform similarly to the simplest heuristic, degree-based greedy. Even with post-processing techniques like local search, AI-based methods still perform worse than CPU-based solvers. We develop a new mode of analysis to reveal that non-backtracking AI methods, e.g. LTFT (which is based on GFlowNets), end up reasoning similarly to the simplest degree-based greedy approach, and thus worse than KaMIS. We also find that CPU-based algorithms, notably KaMIS, have strong performance on sparse random graphs, which appears to refute a well-known conjectured upper bound for efficient algorithms from Coja-Oghlan and Efthymiou (2015).
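
The degree-based greedy heuristic mentioned in this abstract is simple enough to sketch. The following is a minimal illustration of the textbook heuristic, not code from the paper; the function name `greedy_mis`, the adjacency-dict input format, and the tie-breaking by vertex label are assumptions made for the example.

```python
def greedy_mis(adj):
    """Degree-based greedy heuristic for Maximum Independent Set.

    adj maps each vertex to an iterable of its neighbours (undirected graph).
    Repeatedly pick a vertex of minimum remaining degree, add it to the set,
    and delete it together with its neighbours.
    """
    # Work on a private copy so the caller's graph is not modified.
    remaining = {v: set(ns) for v, ns in adj.items()}
    chosen = set()
    while remaining:
        # Ties are broken by vertex label (an arbitrary choice for this sketch).
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        chosen.add(v)
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u, None)
        # Drop edges that pointed at deleted vertices.
        for ns in remaining.values():
            ns -= removed
    return chosen


# Hypothetical toy input: the path graph 0-1-2-3-4.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
independent_set = greedy_mis(path_graph)
```

The result is always an independent set, but the heuristic gives no optimality guarantee on general graphs, which is why it serves as the "simplest" baseline in comparisons like the one above.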

[AI-34] AdaPhish: AI-Powered Adaptive Defense and Education Resource Against Deceptive Emails

Link: https://arxiv.org/abs/2502.03622
Authors: Rei Meguro,Ng S. T. Chong
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 7 pages, 3 figures, 2 tables, accepted in 4th IEEE International Conference on AI in Cybersecurity (ICAIC)

Abstract:Phishing attacks remain a significant threat in the digital age, yet organizations lack effective methods to tackle phishing attacks without leaking sensitive information. Phish bowl initiatives are a vital part of cybersecurity efforts against these attacks. However, traditional phish bowls require manual anonymization and are often limited to internal use. To overcome these limitations, we introduce AdaPhish, an AI-powered phish bowl platform that automatically anonymizes and analyzes phishing emails using large language models (LLMs) and vector databases. AdaPhish achieves real-time detection and adaptation to new phishing tactics while enabling long-term tracking of phishing trends. Through automated reporting, adaptive analysis, and real-time alerts, AdaPhish presents a scalable, collaborative solution for phishing detection and cybersecurity education.

[AI-35] A Novel Zero-Touch Zero-Trust AI/ML Enablement Framework for IoT Network Security

Link: https://arxiv.org/abs/2502.03614
Authors: Sushil Shakya,Robert Abbas,Sasa Maric
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:The IoT facilitates a connected, intelligent, and sustainable society; therefore, it is imperative to protect the IoT ecosystem. IoT-based 5G and 6G networks will rely more heavily on machine learning and artificial intelligence (ML/AI) to pave the way for autonomous and collaborative secure IoT networks. Zero-touch, zero-trust IoT security with AI and machine learning (ML) enablement frameworks offers a powerful approach to securing the expanding landscape of Internet of Things (IoT) devices. This paper presents a novel framework that integrates Zero Trust, Zero Touch, and AI/ML for the detection, mitigation, and prevention of DDoS attacks in modern IoT ecosystems. The focus will be on the new integrated framework by establishing zero trust for all IoT traffic, fixed and mobile 5G/6G IoT network traffic, and data security (quarantine-zero touch and dynamic policy enforcement). We perform a comparative analysis of five machine learning models, namely, XGBoost, Random Forest, K-Nearest Neighbors, Stochastic Gradient Descent, and Naive Bayes, by comparing these models based on accuracy, precision, recall, F1-score, and ROC-AUC. Results show that the best performance in detecting and mitigating different DDoS vectors comes from the ensemble-based approaches.

[AI-36] (GG) MoE vs. MLP on Tabular Data

Link: https://arxiv.org/abs/2502.03608
Authors: Andrei Chernov
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In recent years, significant efforts have been directed toward adapting modern neural network architectures for tabular data. However, despite their larger number of parameters and longer training and inference times, these models often fail to consistently outperform vanilla multilayer perceptron (MLP) neural networks. Moreover, MLP-based ensembles have recently demonstrated superior performance and efficiency compared to advanced deep learning methods. Therefore, rather than focusing on building deeper and more complex deep learning models, we propose investigating whether MLP neural networks can be replaced with more efficient architectures without sacrificing performance. In this paper, we first introduce GG MoE, a mixture-of-experts (MoE) model with a Gumbel-Softmax gating function. We then demonstrate that GG MoE with an embedding layer achieves the highest performance across 38 datasets compared to standard MoE and MLP models. Finally, we show that both MoE and GG MoE utilize significantly fewer parameters than MLPs, making them a promising alternative for scaling and ensemble methods.
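
For readers unfamiliar with Gumbel-Softmax gating, here is a rough sketch of the general idea using only the standard library. It shows the standard Gumbel-Softmax relaxation applied to gate logits and a weighted mixture of expert outputs; the function names, the temperature `tau`, and the toy numbers are assumptions for illustration, and the abstract does not specify how GG MoE actually implements its gate.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Sample relaxed one-hot gate weights: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) sample via inverse transform; clamp u away from 0.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_output(expert_outputs, gates):
    """Combine scalar expert outputs with the sampled gate weights."""
    return sum(g * y for g, y in zip(gates, expert_outputs))


# Hypothetical example: three experts with scalar outputs and gate logits.
rng = random.Random(0)
gates = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=rng)
prediction = mixture_output([1.0, 2.0, 3.0], gates)
```

Lower values of `tau` push the sampled weights closer to a one-hot choice of expert, while higher values give a smoother mixture.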

[AI-37] Simultaneous Multi-Robot Motion Planning with Projected Diffusion Models

Link: https://arxiv.org/abs/2502.03607
Authors: Jinhao Liang,Jacob K Christopher,Sven Koenig,Ferdinando Fioretto
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent advances in diffusion models hold significant potential in robotics, enabling the generation of diverse and smooth trajectories directly from raw representations of the environment. Despite this promise, applying diffusion models to motion planning remains challenging due to their difficulty in enforcing critical constraints, such as collision avoidance and kinematic feasibility. These limitations become even more pronounced in Multi-Robot Motion Planning (MRMP), where multiple robots must coordinate in shared spaces. To address this challenge, this work proposes Simultaneous MRMP Diffusion (SMD), a novel approach integrating constrained optimization into the diffusion sampling process to produce collision-free, kinematically feasible trajectories. Additionally, the paper introduces a comprehensive MRMP benchmark to evaluate trajectory planning algorithms across scenarios with varying robot densities, obstacle complexities, and motion constraints. Experimental results show SMD consistently outperforms classical and learning-based motion planners, achieving higher success rates and efficiency in complex multi-robot environments.

[AI-38] A Multi-Task Learning Approach to Linear Multivariate Forecasting

Link: https://arxiv.org/abs/2502.03571
Authors: Liran Nochumsohn,Hedi Zisling,Omri Azencot
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Accurate forecasting of multivariate time series data is important in many engineering and scientific applications. Recent state-of-the-art works ignore the inter-relations between variates, using their model on each variate independently. This raises several research questions related to proper modeling of multivariate data. In this work, we propose to view multivariate forecasting as a multi-task learning problem, facilitating the analysis of forecasting by considering the angle between task gradients and their balance. To do so, we analyze linear models to characterize the behavior of tasks. Our analysis suggests that tasks can be defined by grouping similar variates together, which we achieve via a simple clustering that depends on correlation-based similarities. Moreover, to balance tasks, we scale gradients with respect to their prediction error. Then, each task is solved with a linear model within our MTLinear framework. We evaluate our approach on challenging benchmarks in comparison to strong baselines, and we show it obtains on-par or better results on multivariate forecasting problems. The implementation is available at: this https URL
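
The "angle between task gradients" used in this abstract can be made concrete with a small sketch. The code below computes the angle between two gradient vectors from their cosine similarity; the function name and the toy vectors are assumptions for illustration and are not taken from the MTLinear implementation.

```python
import math


def gradient_angle_degrees(grad_a, grad_b):
    """Angle between two task gradients, in degrees.

    Angles above 90 degrees mean the two tasks pull the shared parameters
    in conflicting directions; angles near 0 mean they agree.
    """
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))


# Hypothetical gradients for three forecasting tasks.
orthogonal = gradient_angle_degrees([1.0, 0.0], [0.0, 1.0])
conflicting = gradient_angle_degrees([1.0, 0.0], [-1.0, 0.0])
aligned = gradient_angle_degrees([1.0, 1.0], [2.0, 2.0])
```

The function assumes both gradients are non-zero; a zero gradient has no direction and would need separate handling.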

[AI-39] Code Simulation as a Proxy for High-order Tasks in Large Language Models

Link: https://arxiv.org/abs/2502.03568
Authors: Emanuele La Malfa,Christoph Weinhuber,Orazio Torre,Fangru Lin,X. Angelo Huang,Samuele Marro,Anthony Cohn,Nigel Shadbolt,Michael Wooldridge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2401.09074

Abstract:Many reasoning, planning, and problem-solving tasks share an intrinsic algorithmic nature: correctly simulating each step is a sufficient condition to solve them correctly. We collect pairs of naturalistic and synthetic reasoning tasks to assess the capabilities of Large Language Models (LLM). While naturalistic tasks often require careful human handcrafting, we show that synthetic data is, in many cases, a good proxy that is much easier to collect at scale. We leverage common constructs in programming as the counterpart of the building blocks of naturalistic reasoning tasks, such as straight-line programs, code that contains critical paths, and approximate and redundant instructions. We further assess the capabilities of LLMs on sorting problems and repeated operations via sorting algorithms and nested loops. Our synthetic datasets further reveal that while the most powerful LLMs exhibit relatively strong execution capabilities, the process is fragile: it is negatively affected by memorisation and seems to rely heavily on pattern recognition. Our contribution builds upon synthetically testing the reasoning capabilities of LLMs as a scalable complement to handcrafted human-annotated problems.

[AI-40] Proportional Selection in Networks

Link: https://arxiv.org/abs/2502.03545
Authors: Georgios Papasotiropoulos,Oskar Skibski,Piotr Skowron,Tomasz Wąs
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)

Abstract:We address the problem of selecting k representative nodes from a network, aiming to achieve two objectives: identifying the most influential nodes and ensuring the selection proportionally reflects the network’s diversity. We propose two approaches to accomplish this, analyze them theoretically, and demonstrate their effectiveness through a series of experiments.

[AI-41] Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2

Link: https://arxiv.org/abs/2502.03544
Authors: Yuri Chervonyi,Trieu H. Trinh,Miroslav Olšák,Xiaomeng Yang,Hoang Nguyen,Marcelo Menegali,Junehyuk Jung,Vikas Verma,Quoc V. Le,Thang Luong
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 28 pages, 16 figures

Abstract:We present AlphaGeometry2, a significantly improved version of AlphaGeometry introduced in Trinh et al. (2024), which has now surpassed an average gold medalist in solving Olympiad geometry problems. To achieve this, we first extend the original AlphaGeometry language to tackle harder problems involving movements of objects, and problems containing linear equations of angles, ratios, and distances. This, together with other additions, has markedly improved the coverage rate of the AlphaGeometry language on International Math Olympiads (IMO) 2000-2024 geometry problems from 66% to 88%. The search process of AlphaGeometry2 has also been greatly improved through the use of Gemini architecture for better language modeling, and a novel knowledge-sharing mechanism that combines multiple search trees. Together with further enhancements to the symbolic engine and synthetic data generation, we have significantly boosted the overall solving rate of AlphaGeometry2 to 84% for all geometry problems over the last 25 years, compared to 54% previously. AlphaGeometry2 was also part of the system that achieved silver-medal standard at IMO 2024 this https URL. Last but not least, we report progress towards using AlphaGeometry2 as a part of a fully automated system that reliably solves geometry problems directly from natural language input.

[AI-42] Path Planning for Masked Diffusion Model Sampling

Link: https://arxiv.org/abs/2502.03540
Authors: Fred Zhangzhi Peng,Zachary Bezemek,Sawan Patel,Sherwood Yao,Jarrid Rector-Brooks,Alexander Tong,Pranam Chatterjee
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this paper, we investigate how the order in which tokens are unmasked during masked diffusion model (MDM) inference affects generative quality. We derive an expanded evidence lower bound (ELBO) that introduces a planner, responsible for selecting which tokens to unmask at each step. Our analysis suggests that alternative unmasking strategies can improve generative performance. Based on these insights, we propose Path Planning (P2), a sampling framework that leverages pre-trained BERT or the denoiser itself to guide unmasking decisions. P2 generalizes all known MDM sampling strategies and enables significant improvements across diverse domains including language generation (in-context learning, code generation, story infilling, mathematical reasoning, reverse curse correction) and biological sequence generation (protein and RNA sequences).

[AI-43] YINYANG-ALIGN: Benchmarking Contradictory Objectives and Proposing Multi-Objective Optimization based DPO for Text-to-Image Alignment

Link: https://arxiv.org/abs/2502.03512
Authors: Amitava Das,Yaswanth Narsupalli,Gurpreet Singh,Vinija Jain,Vasu Sharma,Suranjana Trivedy,Aman Chadha,Amit Sheth
Categories: Artificial Intelligence (cs.AI)

Abstract:Precise alignment in Text-to-Image (T2I) systems is crucial to ensure that generated visuals not only accurately encapsulate user intents but also conform to stringent ethical and aesthetic benchmarks. Incidents like the Google Gemini fiasco, where misaligned outputs triggered significant public backlash, underscore the critical need for robust alignment mechanisms. In contrast, Large Language Models (LLMs) have achieved notable success in alignment. Building on these advancements, researchers are eager to apply similar alignment techniques, such as Direct Preference Optimization (DPO), to T2I systems to enhance image generation fidelity and reliability. We present YinYangAlign, an advanced benchmarking framework that systematically quantifies the alignment fidelity of T2I systems, addressing six fundamental and inherently contradictory design objectives. Each pair represents fundamental tensions in image generation, such as balancing adherence to user prompts with creative modifications or maintaining diversity alongside visual coherence. YinYangAlign includes detailed axiom datasets featuring human prompts, aligned (chosen) responses, misaligned (rejected) AI-generated outputs, and explanations of the underlying contradictions.

[AI-44] Examining Two Hop Reasoning Through Information Content Scaling

Link: https://arxiv.org/abs/2502.03490
Authors: David Johnston,Nora Belrose
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Prior work has found that transformers have an inconsistent ability to learn to answer latent two-hop questions – questions of the form “Who is Bob’s mother’s boss?” We study why this is the case by examining how transformers’ capacity to learn datasets of two-hop questions and answers (two-hop QA) scales with their size, motivated by prior work on transformer knowledge capacity for simple factual memorization. We find that capacity scaling and generalization both support the hypothesis that latent two-hop QA requires transformers to learn each fact twice, while two-hop QA with chain of thought does not. We also show that with appropriate dataset parameters, it is possible to “trap” very small models in a regime where they memorize answers to two-hop questions independently, even though they would perform better if they could learn to answer them with function composition. Our findings show that measurement of capacity scaling can complement existing interpretability methods, though there are challenges in using it for this purpose.

[AI-45] Artificial Intelligence and Legal Analysis: Implications for Legal Education and the Profession

Link: https://arxiv.org/abs/2502.03487
Authors: Lee Peoples
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:This article reports the results of a study examining the ability of legal and non-legal Large Language Models to perform legal analysis using the Issue-Rule-Application-Conclusion framework. LLMs were tested on legal reasoning tasks involving rule analysis and analogical reasoning. The results show that LLMs can conduct basic IRAC analysis, but are limited by brief responses lacking detail, an inability to commit to answers, false confidence, and hallucinations. The study compares legal and non-legal LLMs, identifies shortcomings, and explores traits that may hinder their ability to think like a lawyer. It also discusses the implications for legal education and practice, highlighting the need for critical thinking skills in future lawyers and the potential pitfalls of overreliance on artificial intelligence (AI) resulting in a loss of logic, reasoning, and critical thinking skills.

[AI-46] A Capability Approach to AI Ethics

Link: https://arxiv.org/abs/2502.03469
Authors: Emanuele Ratti,Mark Graves
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:We propose a conceptualization and implementation of AI ethics via the capability approach. We aim to show that conceptualizing AI ethics through the capability approach has two main advantages for AI ethics as a discipline. First, it helps clarify the ethical dimension of AI tools. Second, it provides guidance to implementing ethical considerations within the design of AI tools. We illustrate these advantages in the context of AI tools in medicine, by showing how ethics-based auditing of AI tools in medicine can greatly benefit from our capability-based approach.

[AI-47] Where AI Assurance Might Go Wrong: Initial lessons from engineering of critical systems

Link: https://arxiv.org/abs/2502.03467
Authors: Robin Bloomfield,John Rushby
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Presented at UK AI Safety Institute (AISI) Conference on Frontier AI Safety Frameworks (FAISC 24), Berkeley CA, November 2024

Abstract:We draw on our experience working on system and software assurance and evaluation for systems important to society to summarise how safety engineering is performed in traditional critical systems, such as aircraft flight control. We analyse how this critical systems perspective might support the development and implementation of AI Safety Frameworks. We present the analysis in terms of: system engineering, safety and risk analysis, and decision analysis and support. We consider four key questions: What is the system? How good does it have to be? What is the impact of criticality on system development? and How much should we trust it? We identify topics worthy of further discussion. In particular, we are concerned that system boundaries are not broad enough, that the tolerability and nature of the risks are not sufficiently elaborated, and that the assurance methods lack theories that would allow behaviours to be adequately assured. We advocate the use of assurance cases based on Assurance 2.0 to support decision making in which the criticality of the decision as well as the criticality of the system are evaluated. We point out the orders of magnitude difference in confidence needed in critical rather than everyday systems and how everyday techniques do not scale in rigour. Finally we map our findings in detail to two of the questions posed by the FAISC organisers and we note that the engineering of critical systems has evolved through open and diverse discussion. We hope that topics identified here will support the post-FAISC dialogues. 

[AI-48] Quantum Circuit Design using a Progressive Widening Monte Carlo Tree Search

Link: https://arxiv.org/abs/2502.03962
Authors: Vincenzo Lipardi,Domenica Dibenedetto,Georgios Stamoulis,Mark H.M. Winands
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)

Abstract:The performance of Variational Quantum Algorithms (VQAs) strongly depends on the choice of the parameterized quantum circuit to optimize. One of the biggest challenges in VQAs is designing quantum circuits tailored to the particular problem and to the quantum hardware. This article proposes a gradient-free Monte Carlo Tree Search (MCTS) technique to automate the process of quantum circuit design. It introduces a novel formulation of the action space based on a sampling scheme and a progressive widening technique to explore the space dynamically. When testing our MCTS approach on the domain of random quantum circuits, MCTS approximates unstructured circuits under different values of stabilizer Rényi entropy. It turns out that MCTS manages to approximate the benchmark quantum states independently from their degree of nonstabilizerness. Next, our technique exhibits robustness across various application domains, including quantum chemistry and systems of linear equations. Compared to previous MCTS research, our technique reduces the number of quantum circuit evaluations by a factor of 10 to 100 while achieving equal or better results. In addition, the resulting quantum circuits have up to three times fewer CNOT gates.
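
Progressive widening is a general MCTS device for large or continuous action spaces: a node may only receive a new child while its child count stays below a slowly growing function of its visit count. The sketch below shows one common form of that rule; the function name and the constants `c` and `alpha` are assumptions for illustration, and the abstract does not state which variant or constants the paper uses.

```python
import math


def may_add_child(num_children, visits, c=1.0, alpha=0.5):
    """Progressive-widening test for an MCTS node.

    A new action may be sampled and added as a child only while
    num_children < ceil(c * visits ** alpha). Otherwise the search
    should select among the existing children.
    """
    limit = math.ceil(c * max(visits, 1) ** alpha)
    return num_children < limit


# Hypothetical trace: how many children a node is allowed as visits grow.
allowed_children = []
children = 0
for visit in range(1, 17):
    if may_add_child(children, visit):
        children += 1
    allowed_children.append(children)
```

With `alpha` between 0 and 1 the branching factor grows sublinearly in the number of visits, so the search widens only at nodes it keeps returning to.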

[AI-49] Energy Force Regression on DFT Trajectories is Not Enough for Universal Machine Learning Interatomic Potentials

Link: https://arxiv.org/abs/2502.03660
Authors: Santiago Miret,Kin Long Kelvin Lee,Carmelo Gonzales,Sajid Mannan,N. M. Anoop Krishnan
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Universal Machine Learning Interatomic Potentials (MLIPs) enable accelerated simulations for materials discovery. However, current research efforts fail to impactfully utilize MLIPs due to: 1. Overreliance on Density Functional Theory (DFT) for MLIP training data creation; 2. MLIPs’ inability to reliably and accurately perform large-scale molecular dynamics (MD) simulations for diverse materials; 3. Limited understanding of MLIPs’ underlying capabilities. To address these shortcomings, we argue that MLIP research efforts should prioritize: 1. Employing more accurate simulation methods for large-scale MLIP training data creation (e.g. Coupled Cluster Theory) that cover a wide range of materials design spaces; 2. Creating MLIP metrology tools that leverage large-scale benchmarking, visualization, and interpretability analyses to provide a deeper understanding of MLIPs’ inner workings; 3. Developing computationally efficient MLIPs to execute MD simulations that accurately model a broad set of materials properties. Together, these interdisciplinary research directions can help further the real-world application of MLIPs to accurately model complex materials at device scale.

[AI-50] Elucidation of the Concept of Consciousness from the Theory of Non-Human Communication Agents

Link: https://arxiv.org/abs/2502.03508
Authors: Julian Tagnin
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: February 2025 version, originally written in December 2024

Abstract:This article focuses on elucidating the concept of consciousness from a relational and post-phenomenological theory of non-human communication agents (ANHC). Specifically, we explore the contributions of Thomas Metzinger's Self Model Theory, Katherine Hayles' conceptualizations of non-conscious cognitive processes centered on knowledge processing phenomena shared between biological and technical systems, and Lenore and Manuel Blum's theoretical perspective on computation, which defines consciousness as an emergent phenomenon of complex computational systems, arising from the appropriate organization of their inorganic materiality. Building on interactions with non-human cognitive agents, among other factors, the explainability of sociotechnical systems challenges the humanistic common sense of modern philosophy and science. This critical integration of various approaches ultimately questions other concepts associated with consciousness, such as autonomy, freedom, and mutual responsibility. The aim is to contribute to a necessary discussion for designing new frameworks of understanding that pave the way toward an ethical and pragmatic approach to addressing contemporary challenges in the design, regulation, and interaction with ANHC. Such frameworks, in turn, enable a more inclusive and relational understanding of agency in an interconnected world.

[AI-51] Enhancing Free-hand 3D Photoacoustic and Ultrasound Reconstruction using Deep Learning

Link: https://arxiv.org/abs/2502.03505
Authors: SiYeoul Lee,SeonHo Kim,Minkyung Seo,SeongKyu Park,Salehin Imrus,Kambaluru Ashok,DongEon Lee,Chunsu Park,SeonYeong Lee,Jiye Kim,Jae-Heung Yoo,MinWoo Kim
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This study introduces a motion-based learning network with a global-local self-attention module (MoGLo-Net) to enhance 3D reconstruction in handheld photoacoustic and ultrasound (PAUS) imaging. Standard PAUS imaging is often limited by a narrow field of view and the inability to effectively visualize complex 3D structures. The 3D freehand technique, which aligns sequential 2D images for 3D reconstruction, faces significant challenges in accurate motion estimation without relying on external positional sensors. MoGLo-Net addresses these limitations through an innovative adaptation of the self-attention mechanism, which effectively exploits the critical regions, such as fully-developed speckle area or high-echogenic tissue area within successive ultrasound images to accurately estimate motion parameters. This facilitates the extraction of intricate features from individual frames. Additionally, we designed a patch-wise correlation operation to generate a correlation volume that is highly correlated with the scanning motion. A custom loss function was also developed to ensure robust learning with minimized bias, leveraging the characteristics of the motion parameters. Experimental evaluations demonstrated that MoGLo-Net surpasses current state-of-the-art methods in both quantitative and qualitative performance metrics. Furthermore, we expanded the application of 3D reconstruction technology beyond simple B-mode ultrasound volumes to incorporate Doppler ultrasound and photoacoustic imaging, enabling 3D visualization of vasculature. The source code for this study is publicly available at: this https URL

[AI-52] Immersion for AI: Immersive Learning with Artificial Intelligence

Link: https://arxiv.org/abs/2502.03504
Authors: Leonel Morgado (1 and 2) ((1) Universidade Aberta, (2) INESC TEC)
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 16 pages. To be published in the Proceedings of the 11th Annual International Conference of the Immersive Learning Research Network (iLRN2025)

Abstract:This work reflects upon what Immersion can mean from the perspective of an Artificial Intelligence (AI). Applying the lens of immersive learning theory, it seeks to understand whether this new perspective supports ways for AI participation in cognitive ecologies. By treating AI as a participant rather than a tool, it explores what other participants (humans and other AIs) need to consider in environments where AI can meaningfully engage and contribute to the cognitive ecology, and what the implications are for designing such learning environments. Drawing from the three conceptual dimensions of immersion - System, Narrative, and Agency - this work reinterprets AIs in immersive learning contexts. It outlines practical implications for designing learning environments where AIs are surrounded by external digital services, can interpret a narrative of origins, changes, and structural developments in data, and dynamically respond, making operational and tactical decisions that shape human-AI collaboration. Finally, this work suggests how these insights might influence the future of AI training, proposing that immersive learning theory can inform the development of AIs capable of evolving beyond static models. This paper paves the way for understanding AI as an immersive learner and participant in evolving human-AI cognitive ecosystems.

[AI-53] Two in context learning tasks with complex functions

Link: https://arxiv.org/abs/2502.03503
Authors: Omar Naim,Nicholas Asher
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We examine two in context learning (ICL) tasks with mathematical functions in several train and test settings for transformer models. Our study generalizes work on linear functions by showing that small transformers, even models with attention layers only, can approximate arbitrary polynomial functions and hence continuous functions under certain conditions. Our models also can approximate previously unseen classes of polynomial functions, as well as the zeros of complex functions. Our models perform far better on this task than LLMs like GPT4 and involve complex reasoning when provided with suitable training data and methods. Our models also have important limitations; they fail to generalize outside of training distributions and so don’t learn class forms of functions. We explain why this is so.

[AI-54] DC-VSR: Spatially and Temporally Consistent Video Super-Resolution with Video Diffusion Prior

Link: https://arxiv.org/abs/2502.03502
Authors: Janghyeok Han,Gyujin Sim,Geonung Kim,Hyunseung Lee,Kyuha Choi,Youngseok Han,Sunghyun Cho
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Equal contributions from first two authors

Abstract:Video super-resolution (VSR) aims to reconstruct a high-resolution (HR) video from a low-resolution (LR) counterpart. Achieving successful VSR requires producing realistic HR details and ensuring both spatial and temporal consistency. To restore realistic details, diffusion-based VSR approaches have recently been proposed. However, the inherent randomness of diffusion, combined with their tile-based approach, often leads to spatio-temporal inconsistencies. In this paper, we propose DC-VSR, a novel VSR approach to produce spatially and temporally consistent VSR results with realistic textures. To achieve spatial and temporal consistency, DC-VSR adopts a novel Spatial Attention Propagation (SAP) scheme and a Temporal Attention Propagation (TAP) scheme that propagate information across spatio-temporal tiles based on the self-attention mechanism. To enhance high-frequency details, we also introduce Detail-Suppression Self-Attention Guidance (DSSAG), a novel diffusion guidance scheme. Comprehensive experiments demonstrate that DC-VSR achieves spatially and temporally consistent, high-quality VSR results, outperforming previous approaches.

[AI-55] Efficient Image Restoration via Latent Consistency Flow Matching

Link: https://arxiv.org/abs/2502.03500
Authors: Elad Cohen,Idan Achituve,Idit Diamant,Arnon Netzer,Hai Victor Habi
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: 21 pages, 11 figures

Abstract:Recent advances in generative image restoration (IR) have demonstrated impressive results. However, these methods are hindered by their substantial size and computational demands, rendering them unsuitable for deployment on edge devices. This work introduces ELIR, an Efficient Latent Image Restoration method. ELIR operates in latent space by first predicting the latent representation of the minimum mean square error (MMSE) estimator and then transporting this estimate to high-quality images using a latent consistency flow-based model. Consequently, ELIR is more than 4x faster compared to the state-of-the-art diffusion and flow-based approaches. Moreover, ELIR is also more than 4x smaller, making it well-suited for deployment on resource-constrained edge devices. Comprehensive evaluations of various image restoration tasks show that ELIR achieves competitive results, effectively balancing distortion and perceptual quality metrics while offering improved efficiency in terms of memory and computation.

[AI-56] Omni-DNA: A Unified Genomic Foundation Model for Cross-Modal and Multi-Task Learning

Link: https://arxiv.org/abs/2502.03499
Authors: Zehui Li,Vallijah Subasri,Yifei Shen,Dongsheng Li,Yiren Zhao,Guy-Bart Stan,Caihua Shan
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) demonstrate remarkable generalizability across diverse tasks, yet genomic foundation models (GFMs) still require separate finetuning for each downstream application, creating significant overhead as model sizes grow. Moreover, existing GFMs are constrained by rigid output formats, limiting their applicability to various genomic tasks. In this work, we revisit the transformer-based auto-regressive models and introduce Omni-DNA, a family of cross-modal multi-task models ranging from 20 million to 1 billion parameters. Our approach consists of two stages: (i) pretraining on DNA sequences with next token prediction objective, and (ii) expanding the multi-modal task-specific tokens and finetuning for multiple downstream tasks simultaneously. When evaluated on the Nucleotide Transformer and GB benchmarks, Omni-DNA achieves state-of-the-art performance on 18 out of 26 tasks. Through multi-task finetuning, Omni-DNA addresses 10 acetylation and methylation tasks at once, surpassing models trained on each task individually. Finally, we design two complex genomic tasks, DNA2Function and Needle-in-DNA, which map DNA sequences to textual functional descriptions and images, respectively, indicating Omni-DNA’s cross-modal capabilities to broaden the scope of genomic applications. All the models are available through this https URL

Machine Learning

[LG-0] Value-Based Deep RL Scales Predictably

Link: https://arxiv.org/abs/2502.04327
Authors: Oleh Rybkin,Michal Nauman,Preston Fu,Charlie Snell,Pieter Abbeel,Sergey Levine,Aviral Kumar
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Scaling data and compute is critical to the success of machine learning. However, scaling demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from small-scale runs, without running the large-scale experiment. In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. First, we show that data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can predict this data requirement when given more compute, and this compute requirement when given more data. Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling behavior is enabled by first estimating predictable relationships between hyperparameters, which is used to manage effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.

[LG-1] The Uniformly Rotated Mondrian Kernel AISTATS

Link: https://arxiv.org/abs/2502.04323
Authors: Calvin Osborne,Eliza O’Reilly
Categories: Machine Learning (cs.LG); Probability (math.PR)
Comments: 22 pages, 4 figures, postprint for 28th International Conference on Artificial Intelligence and Statistics (AISTATS) 2025

Abstract:First proposed by Rahimi and Recht, random features are used to decrease the computational cost of kernel machines in large-scale problems. The Mondrian kernel is one such example of a fast random feature approximation of the Laplace kernel, generated by a computationally efficient hierarchical random partition of the input space known as the Mondrian process. In this work, we study a variation of this random feature map by using uniformly randomly rotated Mondrian processes to approximate a kernel that is invariant under rotations. We obtain a closed-form expression for this isotropic kernel, as well as a uniform convergence rate of the uniformly rotated Mondrian kernel to this limit. To this end, we utilize techniques from the theory of stationary random tessellations in stochastic geometry and prove a new result on the geometry of the typical cell of the superposition of uniformly random rotations of Mondrian tessellations. Finally, we test the empirical performance of this random feature map on both synthetic and real-world datasets, demonstrating its improved performance over the Mondrian kernel on a debiased dataset.
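
As background for the random-feature idea this abstract builds on, here is a minimal pure-Python sketch of Rahimi and Recht style random Fourier features for the Laplace kernel k(x, y) = exp(-lam * ||x - y||_1). This is not the Mondrian or rotated Mondrian construction studied in the paper; it is the classical Fourier-based alternative, used here only to show how an inner product of random features approximates a kernel. The function names and the parameter `lam` are illustrative assumptions.

```python
import math
import random


def sample_features(dim, num_features, lam, rng):
    """Draw frequencies and phases for k(x, y) = exp(-lam * ||x - y||_1).
    The spectral density of this kernel is a product of Cauchy(0, lam)
    distributions, sampled here with the inverse-CDF method."""
    freqs = [
        [lam * math.tan(math.pi * (rng.random() - 0.5)) for _ in range(dim)]
        for _ in range(num_features)
    ]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return freqs, phases


def feature_map(x, freqs, phases):
    """Map x to z(x) so that the dot product z(x) . z(y) approximates k(x, y)."""
    scale = math.sqrt(2.0 / len(freqs))
    return [
        scale * math.cos(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(freqs, phases)
    ]


def laplace_kernel(x, y, lam):
    """Exact Laplace kernel, for comparison with the approximation."""
    return math.exp(-lam * sum(abs(a - b) for a, b in zip(x, y)))
```

The approximation error shrinks roughly like one over the square root of the number of features, which is the cost and accuracy trade-off that makes random features attractive for large-scale kernel machines.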

[LG-2] Consistency of augmentation graph and network approximability in contrastive learning

Link: https://arxiv.org/abs/2502.04312
Authors: Chenghui Li,A. Martina Neuman
Categories: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Spectral Theory (math.SP)
Comments:

Abstract:Contrastive learning leverages data augmentation to develop feature representation without relying on large labeled datasets. However, despite its empirical success, the theoretical foundations of contrastive learning remain incomplete, with many essential guarantees left unaddressed, particularly the realizability assumption concerning neural approximability of an optimal spectral contrastive loss solution. In this work, we overcome these limitations by analyzing the pointwise and spectral consistency of the augmentation graph Laplacian. We establish that, under specific conditions for data generation and graph connectivity, as the augmented dataset size increases, the augmentation graph Laplacian converges to a weighted Laplace-Beltrami operator on the natural data manifold. These consistency results ensure that the graph Laplacian spectrum effectively captures the manifold geometry. Consequently, they give way to a robust framework for establishing neural approximability, directly resolving the realizability assumption in a current paradigm.

[LG-3] Finding Pegasus: Enhancing Unsupervised Anomaly Detection in High-Dimensional Data using a Manifold-Based Approach

Link: https://arxiv.org/abs/2502.04310
Authors: R. P. Nathan,Nikolaos Nikolaou,Ofer Lahav
Categories: Machine Learning (cs.LG)
Comments: 21 pages, 14 figures

Abstract:Unsupervised machine learning methods are well suited to searching for anomalies at scale but can struggle with the high-dimensional representation of many modern datasets, hence dimensionality reduction (DR) is often performed first. In this paper we analyse unsupervised anomaly detection (AD) from the perspective of the manifold created in DR. We present an idealised illustration, “Finding Pegasus”, and a novel formal framework with which we categorise AD methods and their results into “on manifold” and “off manifold”. We define these terms and show how they differ. We then use this insight to develop an approach of combining AD methods which significantly boosts AD recall without sacrificing precision in situations employing high DR. When tested on MNIST data, our approach of combining AD methods improves recall by as much as 16 percent compared with simply combining with the best standalone AD method (Isolation Forest), a result which shows great promise for its application to real-world data.

[LG-4] Targeted Learning for Data Fairness

Link: https://arxiv.org/abs/2502.04309
Authors: Alexander Asemota,Giles Hooker
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Data and algorithms have the potential to produce and perpetuate discrimination and disparate treatment. As such, significant effort has been invested in developing approaches to defining, detecting, and eliminating unfair outcomes in algorithms. In this paper, we focus on performing statistical inference for fairness. Prior work in fairness inference has largely focused on inferring the fairness properties of a given predictive algorithm. Here, we expand fairness inference by evaluating fairness in the data generating process itself, referred to here as data fairness. We perform inference on data fairness using targeted learning, a flexible framework for nonparametric inference. We derive estimators for demographic parity, equal opportunity, and conditional mutual information. Additionally, we find that our estimators for probabilistic metrics exploit double robustness. To validate our approach, we perform several simulations and apply our estimators to real data.
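
To make one of the metrics named in this abstract concrete, the following Python sketch computes a simple plug-in estimate of the demographic parity difference, P(Y=1 | A=1) - P(Y=1 | A=0), from binary outcomes and a binary group label. This is only the naive sample-proportion version; it is not the targeted-learning estimator derived in the paper and does not adjust for covariates. The function name and argument names are illustrative assumptions.

```python
def demographic_parity_difference(outcomes, groups):
    """Plug-in estimate of P(Y=1 | A=1) - P(Y=1 | A=0).

    outcomes: iterable of 0/1 outcomes Y.
    groups:   iterable of 0/1 group labels A, same length as outcomes.
    """
    in_group = [y for y, a in zip(outcomes, groups) if a == 1]
    out_group = [y for y, a in zip(outcomes, groups) if a == 0]
    if not in_group or not out_group:
        raise ValueError("both groups must contain at least one observation")
    return sum(in_group) / len(in_group) - sum(out_group) / len(out_group)
```

A value of zero indicates equal positive-outcome rates across the two groups; the sign shows which group has the higher rate.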

[LG-5] Statistical guarantees for continuous-time policy evaluation: blessing of ellipticity and new tradeoffs

Link: https://arxiv.org/abs/2502.04297
Authors: Wenlong Mou
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
Comments:

Abstract:We study the estimation of the value function for continuous-time Markov diffusion processes using a single, discretely observed ergodic trajectory. Our work provides non-asymptotic statistical guarantees for the least-squares temporal-difference (LSTD) method, with performance measured in the first-order Sobolev norm. Specifically, the estimator attains an $O(1/\sqrt{T})$ convergence rate when using a trajectory of length $T$; notably, this rate is achieved as long as $T$ scales nearly linearly with both the mixing time of the diffusion and the number of basis functions employed. A key insight of our approach is that the ellipticity inherent in the diffusion process ensures robust performance even as the effective horizon diverges to infinity. Moreover, we demonstrate that the Markovian component of the statistical error can be controlled by the approximation error, while the martingale component grows at a slower rate relative to the number of basis functions. By carefully balancing these two sources of error, our analysis reveals novel trade-offs between approximation and statistical errors.

[LG-6] Leveraging Geolocation in Clinical Records to Improve Alzheimer's Disease Diagnosis Using DMV Framework

Link: https://arxiv.org/abs/2502.04288
Authors: Peng Zhang,Divya Chaudhary
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Early detection of Alzheimer’s Disease (AD) is critical for enabling timely intervention and improving patient outcomes. This paper presents a DMV framework using Llama3-70B and GPT-4o as embedding models to analyze clinical notes and predict a continuous risk score associated with early AD onset. Framing the task as a regression problem, we model the relationship between linguistic features in clinical notes (inputs) and a target variable (data value) that answers specific questions related to AD risk within certain topic categories. By leveraging a multi-faceted feature set that includes geolocation data, we capture additional environmental context potentially linked to AD. Our results demonstrate that the integration of the geolocation information significantly decreases the error of predicting early AD risk scores over prior models by 28.57% (Llama3-70B) and 33.47% (GPT-4o). Our findings suggest that this combined approach can enhance the predictive accuracy of AD risk assessment, supporting early diagnosis and intervention in clinical settings. Additionally, the framework’s ability to incorporate geolocation data provides a more comprehensive risk assessment model that could help healthcare providers better understand and address environmental factors contributing to AD development.

[LG-7] DECAF: Learning to be Fair in Multi-agent Resource Allocation

Link: https://arxiv.org/abs/2502.04281
Authors: Ashwin Kumar,William Yeoh
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments:

Abstract:A wide variety of resource allocation problems operate under resource constraints that are managed by a central arbitrator, with agents who evaluate and communicate preferences over these resources. We formulate this broad class of problems as Distributed Evaluation, Centralized Allocation (DECA) problems and propose methods to learn fair and efficient policies in centralized resource allocation. Our methods are applied to learning long-term fairness in a novel and general framework for fairness in multi-agent systems. We show three different methods based on Double Deep Q-Learning: (1) A joint weighted optimization of fairness and utility, (2) a split optimization, learning two separate Q-estimators for utility and fairness, and (3) an online policy perturbation to guide existing black-box utility functions toward fair solutions. Our methods outperform existing fair MARL approaches on multiple resource allocation domains, even when evaluated using diverse fairness functions, and allow for flexible online trade-offs between utility and fairness.

[LG-8] Orthogonal Representation Learning for Estimating Causal Quantities

Link: https://arxiv.org/abs/2502.04274
Authors: Valentyn Melnychuk,Dennis Frauen,Jonas Schweisthal,Stefan Feuerriegel
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Representation learning is widely used for estimating causal quantities (e.g., the conditional average treatment effect) from observational data. While existing representation learning methods have the benefit of allowing for end-to-end learning, they do not have favorable theoretical properties of Neyman-orthogonal learners, such as double robustness and quasi-oracle efficiency. Also, such representation learning methods often employ additional constraints, like balancing, which may even lead to inconsistent estimation. In this paper, we propose a novel class of Neyman-orthogonal learners for causal quantities defined at the representation level, which we call OR-learners. Our OR-learners have several practical advantages: they allow for consistent estimation of causal quantities based on any learned representation, while offering favorable theoretical properties including double robustness and quasi-oracle efficiency. In multiple experiments, we show that, under certain regularity conditions, our OR-learners improve existing representation learning methods and achieve state-of-the-art performance. To the best of our knowledge, our OR-learners are the first work to offer a unified framework of representation learning methods and Neyman-orthogonal learners for causal quantities estimation.

[LG-9] Electrical Impedance Tomography for Anisotropic Media: a Machine Learning Approach to Classify Inclusions

Link: https://arxiv.org/abs/2502.04273
Authors: Romina Gaburro,Patrick Healy,Shraddha Naidu,Clifford Nolan
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 27 pages, 17 figures

Abstract:We consider the problem in Electrical Impedance Tomography (EIT) of identifying one or multiple inclusions in a background-conducting body $\Omega\subset\mathbb{R}^2$, from the knowledge of a finite number of electrostatic measurements taken on its boundary $\partial\Omega$ and modelled by the Dirichlet-to-Neumann (D-N) matrix. Once the presence of one inclusion in $\Omega$ is established, our model, combined with the machine learning techniques of Artificial Neural Networks (ANN) and Support Vector Machines (SVM), may be used to determine the size of the inclusion, the presence of multiple inclusions, and also that of anisotropy within the inclusion(s). Utilising both real and simulated datasets within a 16-electrode setup, we achieve a high rate of inclusion detection and show that two measurements are sufficient to achieve a good level of accuracy when predicting the size of an inclusion. This underscores the substantial potential of integrating machine learning approaches with the more classical analysis of EIT and the inverse inclusion problem to extract critical insights, such as the presence of anisotropy.

[LG-10] PILAF: Optimal Human Preference Sampling for Reward Modeling

Link: https://arxiv.org/abs/2502.04270
Authors: Yunzhen Feng,Ariel Kwiatkowski,Kunhao Zheng,Julia Kempe,Yaqi Duan
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:As large language models increasingly drive real-world applications, aligning them with human values becomes paramount. Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique, translating preference data into reward models when oracle human values remain inaccessible. In practice, RLHF mostly relies on approximate reward models, which may not consistently guide the policy toward maximizing the underlying human values. We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling that explicitly aligns preference learning with maximizing the underlying oracle reward. PILAF is theoretically grounded, demonstrating optimality from both an optimization and a statistical perspective. The method is straightforward to implement and demonstrates strong performance in iterative and online RLHF settings where feedback curation is critical.

[LG-11] Efficient Randomized Experiments Using Foundation Models

Link: https://arxiv.org/abs/2502.04262
Authors: Piersilvio De Bartolomeis,Javier Abad,Guanbo Wang,Konstantin Donhauser,Raymond M. Duch,Fanny Yang,Issa J. Dahabreh
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Abstract:Randomized experiments are the preferred approach for evaluating the effects of interventions, but they are costly and often yield estimates with substantial uncertainty. On the other hand, in silico experiments leveraging foundation models offer a cost-effective alternative that can potentially attain higher statistical precision. However, the benefits of in silico experiments come with a significant risk: statistical inferences are not valid if the models fail to accurately predict experimental responses to interventions. In this paper, we propose a novel approach that integrates the predictions from multiple foundation models with experimental data while preserving valid statistical inference. Our estimator is consistent and asymptotically normal, with asymptotic variance no larger than the standard estimator based on experimental data alone. Importantly, these statistical properties hold even when model predictions are arbitrarily biased. Empirical results across several randomized experiments show that our estimator offers substantial precision gains, equivalent to a reduction of up to 20% in the sample size needed to match the same precision as the standard estimator based on experimental data alone.
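
The claim that valid inference can survive arbitrarily biased model predictions can be illustrated with a standard augmented (AIPW-style) estimator for a randomized experiment with a known treatment probability. The sketch below is an assumption-laden illustration, not the paper's estimator, which combines several foundation models; here a single pair of prediction vectors (`pred_treated`, `pred_control`) plays the role of model output, and all names are hypothetical.

```python
def augmented_ate(outcomes, treatments, pred_treated, pred_control, p_treat):
    """Augmented estimate of the average treatment effect in a randomized
    experiment where each unit is treated with known probability p_treat.

    The model predictions enter as a baseline, and the inverse-probability
    weighted residuals correct whatever bias the predictions carry.
    """
    if not 0.0 < p_treat < 1.0:
        raise ValueError("p_treat must lie strictly between 0 and 1")
    total = 0.0
    for y, a, m1, m0 in zip(outcomes, treatments, pred_treated, pred_control):
        # Model-based contrast for this unit.
        total += m1 - m0
        # Residual correction, weighted by the known assignment probability.
        if a == 1:
            total += (y - m1) / p_treat
        else:
            total -= (y - m0) / (1.0 - p_treat)
    return total / len(outcomes)
```

Because the assignment probability is known by design, the correction term has mean zero error for any fixed predictions; accurate predictions only reduce variance, which is the source of the precision gains such approaches aim for.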

[LG-12] Realistic Image-to-Image Machine Unlearning via Decoupling and Knowledge Retention

Link: https://arxiv.org/abs/2502.04260
Authors: Ayush K. Varshney,Vicenç Torra
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Machine Unlearning allows participants to remove their data from a trained machine learning model in order to preserve their privacy and security. However, the machine unlearning literature for generative models is rather limited. The literature for image-to-image generative models (I2I models) treats minimizing the distance between Gaussian noise and the output of the I2I model for forget samples as machine unlearning. However, we argue that a machine learning model performs fairly well on unseen data, i.e., a retrained model will be able to catch generic patterns in the data and hence will not generate an output that is equivalent to Gaussian noise. In this paper, we consider that the model after unlearning should treat forget samples as out-of-distribution (OOD) data, i.e., the unlearned model should no longer recognize or encode the specific patterns found in the forget samples. To achieve this, we propose a framework which decouples the model parameters with gradient ascent, ensuring that forget samples are OOD for the unlearned model with a theoretical guarantee. We also provide an $(\epsilon, \delta)$-unlearning guarantee for model updates with gradient ascent. The unlearned model is further fine-tuned on the remaining samples to maintain its performance. We also propose an attack model to ensure that the unlearned model has effectively removed the influence of forget samples. Extensive empirical evaluation on two large-scale datasets, ImageNet-1K and Places365, highlights the superiority of our approach. To show comparable performance with a retrained model, we also show the comparison of a simple AutoEncoder on various baselines on the CIFAR-10 dataset.

[LG-13] Combining Language and App UI Analysis for the Automated Assessment of Bug Reproduction Steps

Link: https://arxiv.org/abs/2502.04251
Authors: Junayed Mahmud,Antu Saha,Oscar Chaparro,Kevin Moran,Andrian Marcus
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 12 pages, to appear in the Proceedings of the 33rd IEEE/ACM International Conference on Program Comprehension (ICPC’25)

Abstract:Bug reports are essential for developers to confirm software problems, investigate their causes, and validate fixes. Unfortunately, reports often miss important information or are written unclearly, which can cause delays, increased issue resolution effort, or even the inability to solve issues. One of the most common components of reports that are problematic is the steps to reproduce the bug(s) (S2Rs), which are essential to replicate the described program failures and reason about fixes. Given the proclivity for deficiencies in reported S2Rs, prior work has proposed techniques that assist reporters in writing or assessing the quality of S2Rs. However, automated understanding of S2Rs is challenging, and requires linking nuanced natural language phrases with specific, semantically related program information. Prior techniques often struggle to form such language to program connections - due to issues in language variability and limitations of information gleaned from program analyses. To more effectively tackle the problem of S2R quality annotation, we propose a new technique called AstroBR, which leverages the language understanding capabilities of LLMs to identify and extract the S2Rs from bug reports and map them to GUI interactions in a program state model derived via dynamic analysis. We compared AstroBR to a related state-of-the-art approach and we found that AstroBR annotates S2Rs 25.2% better (in terms of F1 score) than the baseline. Additionally, AstroBR suggests more accurate missing S2Rs than the baseline (by 71.4% in terms of F1 score).

[LG-14] Adapting to Evolving Adversaries with Regularized Continual Robust Training

Link: https://arxiv.org/abs/2502.04248
Authors: Sihui Dai, Christian Cianfarani, Arjun Bhagoji, Vikash Sehwag, Prateek Mittal
Categories: Machine Learning (cs.LG)

Abstract:Robust training methods typically defend against specific attack types, such as Lp attacks with fixed budgets, and rarely account for the fact that defenders may encounter new attacks over time. A natural solution is to adapt the defended model to new adversaries as they arise via fine-tuning, a method which we call continual robust training (CRT). However, when implemented naively, fine-tuning on new attacks degrades robustness on previous attacks. This raises the question: how can we improve the initial training and fine-tuning of the model to simultaneously achieve robustness against previous and new attacks? We present theoretical results which show that the gap in a model’s robustness against different attacks is bounded by how far each attack perturbs a sample in the model’s logit space, suggesting that regularizing with respect to this logit space distance can help maintain robustness against previous attacks. Extensive experiments on 3 datasets (CIFAR-10, CIFAR-100, and ImageNette) and over 100 attack combinations demonstrate that the proposed regularization improves robust accuracy with little overhead in training time. Our findings and open-source code lay the groundwork for the deployment of models robust to evolving attacks.

[LG-15] Graph machine learning for flight delay prediction due to holding manouver

Link: https://arxiv.org/abs/2502.04233
Authors: Jorge L. Franco, Manoel V. Machado Neto, Filipe A. N. Verri, Diego R. Amancio
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Abstract:Flight delays due to holding maneuvers are a critical and costly phenomenon in aviation, driven by the need to manage air traffic congestion and ensure safety. Holding maneuvers occur when aircraft are instructed to circle in designated airspace, often due to factors such as airport congestion, adverse weather, or air traffic control restrictions. This study models the prediction of flight delays due to holding maneuvers as a graph problem, leveraging advanced Graph Machine Learning (Graph ML) techniques to capture complex interdependencies in air traffic networks. Holding maneuvers, while crucial for safety, cause increased fuel usage, emissions, and passenger dissatisfaction, making accurate prediction essential for operational efficiency. Traditional machine learning models, typically using tabular data, often overlook spatial-temporal relations within air traffic data. To address this, we model the problem of predicting holding as edge feature prediction in a directed (multi)graph where we apply both CatBoost, enriched with graph features capturing network centrality and connectivity, and Graph Attention Networks (GATs), which excel in relational data contexts. Our results indicate that CatBoost outperforms GAT in this imbalanced dataset, effectively predicting holding events and offering interpretability through graph-based feature importance. Additionally, we discuss the model’s potential operational impact through a web-based tool that allows users to simulate real-time delay predictions. This research underscores the viability of graph-based approaches for predictive analysis in aviation, with implications for enhancing fuel efficiency, reducing delays, and improving passenger experience.

[LG-16] Ensuring Reliability via Hyperparameter Selection: Review and Advances

Link: https://arxiv.org/abs/2502.04206
Authors: Amirmohammad Farzaneh, Osvaldo Simeone
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)

Abstract:Hyperparameter selection is a critical step in the deployment of artificial intelligence (AI) models, particularly in the current era of foundational, pre-trained, models. By framing hyperparameter selection as a multiple hypothesis testing problem, recent research has shown that it is possible to provide statistical guarantees on population risk measures attained by the selected hyperparameter. This paper reviews the Learn-Then-Test (LTT) framework, which formalizes this approach, and explores several extensions tailored to engineering-relevant scenarios. These extensions encompass different risk measures and statistical guarantees, multi-objective optimization, the incorporation of prior knowledge and dependency structures into the hyperparameter selection process, as well as adaptivity. The paper also includes illustrative applications for communication systems.

[LG-17] “Short-length” Adversarial Training Helps LLMs Defend “Long-length” Jailbreak Attacks: Theoretical and Empirical Evidence

Link: https://arxiv.org/abs/2502.04204
Authors: Shaopeng Fu, Liang Ding, Di Wang
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)

Abstract:Jailbreak attacks against large language models (LLMs) aim to induce harmful behaviors in LLMs through carefully crafted adversarial prompts. To mitigate attacks, one way is to perform adversarial training (AT)-based alignment, i.e., training LLMs on some of the most adversarial prompts to help them learn how to behave safely under attacks. During AT, the length of adversarial prompts plays a critical role in the robustness of aligned LLMs. This paper focuses on adversarial suffix jailbreak attacks and unveils that to defend against a jailbreak attack with an adversarial suffix of length $\Theta(M)$, it is enough to align LLMs on prompts with adversarial suffixes of length $\Theta(\sqrt{M})$. Theoretically, we analyze the adversarial in-context learning of linear transformers on linear regression tasks and prove a robust generalization bound for trained transformers. The bound depends on the term $\Theta(\sqrt{M_{\text{test}}}/M_{\text{train}})$, where $M_{\text{train}}$ and $M_{\text{test}}$ are the number of adversarially perturbed in-context samples during training and testing. Empirically, we conduct AT on popular open-source LLMs and evaluate their robustness against jailbreak attacks of different adversarial suffix lengths. Results confirm a positive correlation between the attack success rate and the ratio of the square root of the adversarial suffix length during jailbreaking to the suffix length during AT. Our findings show that it is practical to defend “long-length” jailbreak attacks via efficient “short-length” AT. The code is available at this https URL.
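
The scaling claim in this abstract follows from a one-line substitution into the stated term. The constant $c > 0$ below is introduced purely for illustration and does not come from the paper; the numbers in the comment are likewise only an example.

```latex
\[
M_{\text{train}} = c\,\sqrt{M_{\text{test}}}
\quad\Longrightarrow\quad
\frac{\sqrt{M_{\text{test}}}}{M_{\text{train}}}
  = \frac{\sqrt{M_{\text{test}}}}{c\,\sqrt{M_{\text{test}}}}
  = \frac{1}{c}
  = \Theta(1).
\]
% Example: M_test = 100 and M_train = 10 give sqrt(100)/10 = 1,
% so the term stays constant although the test-time suffix is 10x longer.
```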

[LG-18] MRAMG-Bench: A BeyondText Benchmark for Multimodal Retrieval-Augmented Multimodal Generation

Link: https://arxiv.org/abs/2502.04176
Authors: Qinhan Yu, Zhiyou Xiao, Binghui Li, Zhengren Wang, Chong Chen, Wentao Zhang
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 11 pages

Abstract:Recent advancements in Retrieval-Augmented Generation (RAG) have shown remarkable performance in enhancing response accuracy and relevance by integrating external knowledge into generative models. However, existing RAG methods primarily focus on providing text-only answers, even in multimodal retrieval-augmented generation scenarios. In this work, we introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task, which aims to generate answers that combine both text and images, fully leveraging the multimodal data within a corpus. Despite the importance of this task, there is a notable absence of a comprehensive benchmark to effectively evaluate MRAMG performance. To bridge this gap, we introduce the MRAMG-Bench, a carefully curated, human-annotated dataset comprising 4,346 documents, 14,190 images, and 4,800 QA pairs, sourced from three categories: Web Data, Academic Papers, and Lifestyle. The dataset incorporates diverse difficulty levels and complex multi-image scenarios, providing a robust foundation for evaluating multimodal generation tasks. To facilitate rigorous evaluation, our MRAMG-Bench incorporates a comprehensive suite of both statistical and LLM-based metrics, enabling a thorough analysis of the performance of popular generative models in the MRAMG task. Besides, we propose an efficient multimodal answer generation framework that leverages both LLMs and MLLMs to generate multimodal responses. Our datasets are available at: this https URL.

[LG-19] Making Sense of Touch: Unsupervised Shapelet Learning in Bag-of-words Sense

Link: https://arxiv.org/abs/2502.04167
Authors: Zhicong Xian, Tabish Chaudhary, Jürgen Bock
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:This paper introduces NN-STNE, a neural network using t-distributed stochastic neighbor embedding (t-SNE) as a hidden layer to reduce input dimensions by mapping long time-series data into shapelet membership probabilities. A Gaussian kernel-based mean square error preserves local data structure, while K-means initializes shapelet candidates due to the non-convex optimization challenge. Unlike existing methods, our approach uses t-SNE to address crowding in low-dimensional space and applies L1-norm regularization to optimize shapelet length. Evaluations on the UCR dataset and an electrical component manipulation task, like switching on, demonstrate improved clustering accuracy over state-of-the-art feature-learning methods in robotics.

[LG-20] Efficient Distributed Optimization under Heavy-Tailed Noise

Link: https://arxiv.org/abs/2502.04164
Authors: Su Hyeong Lee, Manzil Zaheer, Tian Li
Categories: Machine Learning (cs.LG)

Abstract:Distributed optimization has become the default training paradigm in modern machine learning due to the growing scale of models and datasets. To mitigate communication overhead, local updates are often applied before global aggregation, resulting in a nested optimization approach with inner and outer steps. However, heavy-tailed stochastic gradient noise remains a significant challenge, particularly in attention-based models, hindering effective training. In this work, we propose TailOPT, an efficient framework designed to address heavy-tailed noise by leveraging adaptive optimization or clipping techniques. We establish convergence guarantees for the TailOPT framework under heavy-tailed noise with potentially unbounded gradient variance and local updates. Among its variants, we highlight a memory and communication efficient instantiation which we call $Bi^2Clip$, which performs coordinate-wise clipping at both the inner and outer optimizers, achieving adaptive-like performance (e.g., Adam) without the cost of maintaining or transmitting additional gradient statistics. Empirically, TailOPT, including $Bi^2Clip$, demonstrates superior performance on several language tasks and models, outperforming state-of-the-art methods.
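
The idea of clipping each coordinate at both the inner (local) and outer (aggregation) step can be sketched roughly as follows. This is not the authors' implementation: the function names, thresholds, learning rates, the averaging of worker displacements, and the toy noisy quadratic objective are all assumptions made for illustration.

```python
import random

def clip_coordinates(vec, threshold):
    """Clip every coordinate of vec to the interval [-threshold, threshold]."""
    return [max(-threshold, min(threshold, v)) for v in vec]

def local_steps(weights, grad_fn, steps, lr, inner_clip):
    """Run coordinate-wise clipped SGD on one worker; return its final weights."""
    w = list(weights)
    for _ in range(steps):
        g = clip_coordinates(grad_fn(w), inner_clip)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def outer_round(weights, grad_fns, steps, inner_lr, outer_lr, inner_clip, outer_clip):
    """One communication round: clipped local steps, then a clipped outer update."""
    local_models = [local_steps(weights, f, steps, inner_lr, inner_clip) for f in grad_fns]
    n = len(local_models)
    # Pseudo-gradient: average displacement of the workers from the shared model.
    delta = [sum(m[i] for m in local_models) / n - weights[i] for i in range(len(weights))]
    delta = clip_coordinates(delta, outer_clip)
    return [wi + outer_lr * di for wi, di in zip(weights, delta)]

def noisy_quadratic_grad(rng):
    """Gradient of 0.5 * ||w||^2 plus symmetric Pareto noise (infinite variance)."""
    def grad(w):
        return [wi + rng.choice([-1, 1]) * (rng.paretovariate(1.5) - 1.0) for wi in w]
    return grad

rng = random.Random(0)
workers = [noisy_quadratic_grad(rng) for _ in range(4)]
w = [5.0, -3.0]
for _ in range(200):
    w = outer_round(w, workers, steps=5, inner_lr=0.1, outer_lr=1.0,
                    inner_clip=1.0, outer_clip=0.5)
```

Only the clipped displacement is communicated in this sketch, so no per-coordinate moment estimates (as in Adam) need to be stored or transmitted.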

[LG-21] A data-driven two-microphone method for in-situ sound absorption measurements

Link: https://arxiv.org/abs/2502.04143
Authors: Leon Emmerich, Patrik Aste, Eric Brandão, Mélanie Nolan, Jacques Cuenca, U. Peter Svensson, Marcus Maeder, Steffen Marburg, Elias Zea
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 41 pages, 8 figures

Abstract:This work presents a data-driven approach to estimating the sound absorption coefficient of an infinite porous slab using a neural network and a two-microphone measurement on a finite porous sample. A 1D-convolutional network predicts the sound absorption coefficient from the complex-valued transfer function between the sound pressure measured at the two microphone positions. The network is trained and validated with numerical data generated by a boundary element model using the Delany-Bazley-Miki model, demonstrating accurate predictions for various numerical samples. The method is experimentally validated with baffled rectangular samples of a fibrous material, where sample size and source height are varied. The results show that the neural network offers the possibility to reliably predict the in-situ sound absorption of a porous material using the traditional two-microphone method as if the sample were infinite. The normal-incidence sound absorption coefficient obtained by the network compares well with that obtained theoretically and in an impedance tube. The proposed method has promising perspectives for estimating the sound absorption coefficient of acoustic materials after installation and in realistic operational conditions.

[LG-22] Behavioral Entropy-Guided Dataset Generation for Offline Reinforcement Learning ICLR2025

Link: https://arxiv.org/abs/2502.04141
Authors: Wesley A. Suttle, Aamodh Suresh, Carlos Nieto-Granda
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICLR 2025

Abstract:Entropy-based objectives are widely used to perform state space exploration in reinforcement learning (RL) and dataset generation for offline RL. Behavioral entropy (BE), a rigorous generalization of classical entropies that incorporates cognitive and perceptual biases of agents, was recently proposed for discrete settings and shown to be a promising metric for robotic exploration problems. In this work, we propose using BE as a principled exploration objective for systematically generating datasets that provide diverse state space coverage in complex, continuous, potentially high-dimensional domains. To achieve this, we extend the notion of BE to continuous settings, derive tractable $k$-nearest neighbor estimators, provide theoretical guarantees for these estimators, and develop practical reward functions that can be used with standard RL methods to learn BE-maximizing policies. Using standard MuJoCo environments, we experimentally compare the performance of offline RL algorithms for a variety of downstream tasks on datasets generated using BE, Rényi, and Shannon entropy-maximizing policies, as well as the SMM and RND algorithms. We find that offline RL algorithms trained on datasets collected using BE outperform those trained on datasets collected using Shannon entropy, SMM, and RND on all tasks considered, and on 80% of the tasks compared to datasets collected using Rényi entropy.

[LG-23] Transfer Learning for Covert Speech Classification Using EEG Hilbert Envelope and Temporal Fine Structure ICASSP2025

Link: https://arxiv.org/abs/2502.04132
Authors: Saravanakumar Duraisamy, Mateusz Dubiel, Maurice Rekrut, Luis A. Leiva
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICASSP 2025

Abstract:Brain-Computer Interfaces (BCIs) can decode imagined speech from neural activity. However, these systems typically require extensive training sessions in which participants repeatedly imagine words, leading to mental fatigue and difficulties identifying the onset of words, especially when imagining sequences of words. This paper addresses these challenges by transferring a classifier trained on overt speech data to covert speech classification. We derive electroencephalogram (EEG) features from the Hilbert envelope and temporal fine structure, and use them to train a bidirectional long-short-term memory (BiLSTM) model for classification. Our method reduces the burden of extensive training and achieves state-of-the-art classification accuracy: 86.44% for overt speech and 79.82% for covert speech using the overt speech classifier.
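
For readers unfamiliar with the two feature types named here, the following is a minimal sketch of how a signal can be split into a Hilbert envelope (magnitude of the analytic signal) and a temporal fine structure (cosine of its instantaneous phase). It uses a plain O(n^2) DFT from the standard library and a toy sinusoid in place of an EEG channel; the function names and the toy input are assumptions, and the paper's actual preprocessing pipeline is not described in the abstract.

```python
import cmath
import math

def analytic_signal(samples):
    """Return the analytic signal of a real sequence via a plain DFT (O(n^2))."""
    n = len(samples)
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            gain = 1.0  # DC and Nyquist bins are kept once
        elif k < (n + 1) // 2:
            gain = 2.0  # positive frequencies are doubled
        else:
            gain = 0.0  # negative frequencies are removed
        spectrum[k] *= gain
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]

def envelope_and_tfs(samples):
    """Split a signal into its Hilbert envelope and temporal fine structure."""
    analytic = analytic_signal(samples)
    envelope = [abs(z) for z in analytic]
    tfs = [math.cos(cmath.phase(z)) for z in analytic]
    return envelope, tfs

# Toy input: a 4-cycle tone with constant amplitude 2 (stand-in for one channel).
n_samples = 64
tone = [2.0 * math.cos(2 * math.pi * 4 * t / n_samples) for t in range(n_samples)]
envelope, tfs = envelope_and_tfs(tone)
```

For this constant-amplitude tone the envelope is flat and the fine structure is the unit-amplitude carrier; on real recordings the envelope would track slow amplitude changes while the fine structure keeps the fast oscillation.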

[LG-24] On the importance of structural identifiability for machine learning with partially observed dynamical systems

Link: https://arxiv.org/abs/2502.04131
Authors: Janis Norden, Elisa Oostwal, Michael Chappell, Peter Tino, Kerstin Bunte
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 18 figures

Abstract:The successful application of modern machine learning for time series classification is often hampered by limitations in quality and quantity of available training data. To overcome these limitations, available domain expert knowledge in the form of parametrised mechanistic dynamical models can be used whenever it is available and time series observations may be represented as an element from a given class of parametrised dynamical models. This makes the learning process interpretable and allows the modeller to deal with sparsely and irregularly sampled data in a natural way. However, the internal processes of a dynamical model are often only partially observed. This can lead to ambiguity regarding which particular model realization best explains a given time series observation. This problem is well-known in the literature, and a dynamical model with this issue is referred to as structurally unidentifiable. Training a classifier that incorporates knowledge about a structurally unidentifiable dynamical model can negatively influence classification performance. To address this issue, we employ structural identifiability analysis to explicitly relate parameter configurations that are associated with identical system outputs. Using the derived relations in classifier training, we demonstrate that this method significantly improves the classifier’s ability to generalize to unseen data on a number of example models from the biomedical domain. This effect is especially pronounced when the number of training instances is limited. Our results demonstrate the importance of accounting for structural identifiability, a topic that has received relatively little attention from the machine learning community.

[LG-25] Optimizing Perturbations for Improved Training of Machine Learning Models

Link: https://arxiv.org/abs/2502.04121
Authors: Sagi Meir, Tommer D. Keidar, Shlomi Reuveni, Barak Hirshberg
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)

Abstract:Machine learning models have become indispensable tools in applications across the physical sciences. Their training is often time-consuming, vastly exceeding the inference timescales. Several protocols have been developed to perturb the learning process and improve the training, such as shrink and perturb, warm restarts, and stochastic resetting. For classifiers, these perturbations have been shown to result in enhanced speedups or improved generalization. However, the design of such perturbations is usually done ad hoc by intuition and trial and error. To rationally optimize training protocols, we frame them as first-passage processes and consider their response to perturbations. We show that if the unperturbed learning process reaches a quasi-steady state, the response at a single perturbation frequency can predict the behavior at a wide range of frequencies. We demonstrate that this is the case when training a CIFAR-10 classifier using the ResNet-18 model and use this approach to identify an optimal perturbation and frequency. Our work allows optimization of training protocols of machine learning models using a statistical mechanical approach.

[LG-26] Smart IoT Security: Lightweight Machine Learning Techniques for Multi-Class Attack Detection in IoT Networks

Link: https://arxiv.org/abs/2502.04057
Authors: Shahran Rahman Alve, Muhammad Zawad Mahmud, Samiha Islam, Md. Asaduzzaman Chowdhury, Jahirul Islam
Categories: Machine Learning (cs.LG)
Comments: Accepted in an international conference

Abstract:In the growing terrain of the Internet of Things (IoT), it is vital that networks are secure to protect against a range of cyber threats. Based on the strong machine learning framework, this study proposes novel lightweight ensemble approaches for improving multi-class attack detection of IoT devices. Using the large CICIoT 2023 dataset with 34 attack types distributed amongst 10 attack categories, we systematically evaluated the performance of a wide variety of modern machine learning methods with the aim of establishing the best-performing algorithmic choice to secure IoT applications. In particular, we explore approaches based on ML classifiers to tackle the biocharges characterized by the challenging and heterogeneous nature of attack vectors in IoT environments. The method that performed best was the Decision Tree, with an accuracy of 99.56% and an F1 score of 99.62%, showing that this model is capable of accurately and reliably detecting this http URL Random Forest model was the next best-performing model with 98.22% and an F1 score of 98.24%, suggesting that ML methods are quite effective in a situation of high-dimensional data. Our results highlight the potential for using ML classifiers in bolstering security for IoT devices and also serve as motivations for future investigations targeting scalable, keystroke-based attack detection systems. We believe that our method provides a new path to develop complex machine learning algorithms for low-resource IoT devices, balancing both accuracy and time efficiency needs. In summary, these contributions enrich the state of the art of the IoT security literature, laying down solid ground and guidelines for the deployment of smart, adaptive security in IoT settings.

[LG-27] Q-DiT: Efficient Time-Aware Quantization for Diffusion Transformers

Link: https://arxiv.org/abs/2502.04056
Authors: Younghye Hwang, Hyojin Lee, Joonhyuk Kang
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 8 pages

Abstract:Diffusion transformers (DiTs) combine transformer architectures with diffusion models. However, their computational complexity imposes significant limitations on real-time applications and sustainability of AI systems. In this study, we aim to enhance the computational efficiency through model quantization, which represents the weights and activation values with lower precision. Multi-region quantization (MRQ) is introduced to address the asymmetric distribution of network values in DiT blocks by allocating two scaling parameters to sub-regions. Additionally, time-grouping quantization (TGQ) is proposed to reduce quantization error caused by temporal variation in activations. The experimental results show that the proposed algorithm achieves performance comparable to the original full-precision model with only a 0.29 increase in FID at W8A8. Furthermore, it outperforms other baselines at W6A6, thereby confirming its suitability for low-bit quantization. These results highlight the potential of our method to enable efficient real-time generative models.
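
To illustrate why two scaling parameters can help with an asymmetric value distribution, here is a rough sketch that quantizes negative and non-negative values with separate scales and compares the error against a single shared scale. The split at zero, the function names, the 4-bit setting, and the sample values are all assumptions for illustration; the abstract does not specify how MRQ defines its sub-regions.

```python
def quantize_single_scale(values, bits=8):
    """Symmetric fake-quantization with one scale shared by all values."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(v / scale) * scale for v in values]

def quantize_two_region(values, bits=8):
    """Fake-quantization with separate scales for negative and non-negative values."""
    levels = 2 ** (bits - 1) - 1
    neg_max = max((-v for v in values if v < 0), default=0.0)
    pos_max = max((v for v in values if v >= 0), default=0.0)
    neg_scale = neg_max / levels if neg_max > 0 else 1.0
    pos_scale = pos_max / levels if pos_max > 0 else 1.0
    return [
        round(v / (neg_scale if v < 0 else pos_scale)) * (neg_scale if v < 0 else pos_scale)
        for v in values
    ]

def mean_squared_error(reference, approx):
    """Average squared difference between two equally long sequences."""
    return sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference)

# Hypothetical activations: a narrow negative tail and a wide positive range.
activations = [-0.17, -0.1, -0.05, -0.01, 0.3, 1.2, 3.4, 6.0]
mse_single = mean_squared_error(activations, quantize_single_scale(activations, bits=4))
mse_two = mean_squared_error(activations, quantize_two_region(activations, bits=4))
```

With a single scale, the small negative values all collapse to zero because the step size is set by the largest positive value; the second scale restores resolution in the narrow region.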

[LG-28] Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation

Link: https://arxiv.org/abs/2502.04055
Authors: Yunbo Long, Liming Xu, Alexandra Brintrup
Categories: Machine Learning (cs.LG)

Abstract:Current evaluations of synthetic tabular data mainly focus on how well joint distributions are modeled, often overlooking the assessment of their effectiveness in preserving realistic event sequences and coherent entity relationships across this http URL paper proposes three evaluation metrics designed to assess the preservation of logical relationships among columns in synthetic tabular data. We validate these metrics by assessing the performance of both classical and state-of-the-art generation methods on a real-world industrial this http URL results reveal that existing methods often fail to rigorously maintain logical consistency (e.g., hierarchical relationships in geography or organization) and dependencies (e.g., temporal sequences or mathematical relationships), which are crucial for preserving the fine-grained realism of real-world tabular data. Building on these insights, this study also discusses possible pathways to better capture logical relationships while modeling the distribution of synthetic tabular data.

[LG-29] Precision Agriculture Revolution: Integrating Digital Twins and Advanced Crop Recommendation for Optimal Yield

Link: https://arxiv.org/abs/2502.04054
Authors: Sayan Banerjee, Aniruddha Mukherjee, Suket Kamboj
Categories: Machine Learning (cs.LG)

Abstract:With the help of a digital twin structure, Agriculture 4.0 technologies like weather APIs (Application programming interface), GPS (Global Positioning System) modules, and NPK (Nitrogen, Phosphorus and Potassium) soil sensors and machine learning recommendation models, we seek to revolutionize agricultural production through this concept. In addition to providing precise crop growth forecasts, the combination of real-time data on soil composition, meteorological dynamics, and geographic coordinates aims to support crop recommendation models and simulate predictive scenarios for improved water and pesticide management.

[LG-30] Decision Trees That Remember: Gradient-Based Learning of Recurrent Decision Trees with Memory

Link: https://arxiv.org/abs/2502.04052
Authors: Sascha Marton, Moritz Schneider
Categories: Machine Learning (cs.LG)

Abstract:Neural architectures such as Recurrent Neural Networks (RNNs), Transformers, and State-Space Models have shown great success in handling sequential data by learning temporal dependencies. Decision Trees (DTs), on the other hand, remain a widely used class of models for structured tabular data but are typically not designed to capture sequential patterns directly. Instead, DT-based approaches for time-series data often rely on feature engineering, such as manually incorporating lag features, which can be suboptimal for capturing complex temporal dependencies. To address this limitation, we introduce ReMeDe Trees, a novel recurrent DT architecture that integrates an internal memory mechanism, similar to RNNs, to learn long-term dependencies in sequential data. Our model learns hard, axis-aligned decision rules for both output generation and state updates, optimizing them efficiently via gradient descent. We provide a proof-of-concept study on synthetic benchmarks to demonstrate the effectiveness of our approach.

[LG-31] Comparing privacy notions for protection against reconstruction attacks in machine learning

Link: https://arxiv.org/abs/2502.04045
Authors: Sayan Biswas, Mark Dras, Pedro Faustini, Natasha Fernandes, Annabelle McIver, Catuscia Palamidessi, Parastoo Sadeghi
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Theory (cs.IT)

Abstract:Within the machine learning community, reconstruction attacks are a principal concern and have been identified even in federated learning (FL), which was designed with privacy preservation in mind. In response to these threats, the privacy community recommends the use of differential privacy (DP) in the stochastic gradient descent algorithm, termed DP-SGD. However, the proliferation of variants of DP in recent years, such as metric privacy, has made it challenging to conduct a fair comparison between different mechanisms due to the different meanings of the privacy parameters $\epsilon$ and $\delta$ across different variants. Thus, interpreting the practical implications of $\epsilon$ and $\delta$ in the FL context and amongst variants of DP remains ambiguous. In this paper, we lay a foundational framework for comparing mechanisms with differing notions of privacy guarantees, namely $(\epsilon,\delta)$-DP and metric privacy. We provide two foundational means of comparison: firstly, via the well-established $(\epsilon,\delta)$-DP guarantees, made possible through the Rényi differential privacy framework; and secondly, via Bayes’ capacity, which we identify as an appropriate measure for reconstruction threats.

[LG-32] Deep Meta Coordination Graphs for Multi-agent Reinforcement Learning

Link: https://arxiv.org/abs/2502.04028
Authors: Nikunj Gupta, James Zachary Hare, Rajgopal Kannan, Viktor Prasanna
Categories: Machine Learning (cs.LG)

Abstract:This paper presents deep meta coordination graphs (DMCG) for learning cooperative policies in multi-agent reinforcement learning (MARL). Coordination graph formulations encode local interactions and accordingly factorize the joint value function of all agents to improve efficiency in MARL. However, existing approaches rely solely on pairwise relations between agents, which potentially oversimplifies complex multi-agent interactions. DMCG goes beyond these simple direct interactions by also capturing useful higher-order and indirect relationships among agents. It generates novel graph structures accommodating multiple types of interactions and arbitrary lengths of multi-hop connections in coordination graphs to model such interactions. It then employs a graph convolutional network module to learn powerful representations in an end-to-end manner. We demonstrate its effectiveness in multiple coordination problems in MARL where other state-of-the-art methods can suffer from sample inefficiency or fail entirely. All codes can be found here: this https URL.

[LG-33] Variational Quantum Optimization with Continuous Bandits

Link: https://arxiv.org/abs/2502.04021
Authors: Marc Wanner, Johan Jonasson, Emil Carlsson, Devdatt Dubhashi
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 8 pages, 3 Figures + 7-page appendix

Abstract:We introduce a novel approach to variational Quantum algorithms (VQA) via continuous bandits. VQA are a class of hybrid Quantum-classical algorithms where the parameters of Quantum circuits are optimized by classical algorithms. Previous work has used zero and first order gradient based methods, however such algorithms suffer from the barren plateau (BP) problem where gradients and loss differences are exponentially small. We introduce an approach using bandits methods which combine global exploration with local exploitation. We show how VQA can be formulated as a best arm identification problem in a continuous space of arms with Lipschitz smoothness. While regret minimization has been addressed in this setting, existing methods for pure exploration only cover discrete spaces. We give the first results for pure exploration in a continuous setting and derive a fixed-confidence, information-theoretic, instance specific lower bound. Under certain assumptions on the expected payoff, we derive a simple algorithm, which is near-optimal with respect to our lower bound. Finally, we apply our continuous bandit algorithm to two VQA schemes: a PQC and a QAOA quantum circuit, showing that we significantly outperform the previously known state of the art methods (which used gradient based methods).

[LG-34] PINT: Physics-Informed Neural Time Series Models with Applications to Long-term Inference on WeatherBench 2m-Temperature Data

Link: https://arxiv.org/abs/2502.04018
Authors: Keon Vin Park, Jisu Kim, Jaemin Seo
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces PINT (Physics-Informed Neural Time Series Models), a framework that integrates physical constraints into neural time series models to improve their ability to capture complex dynamics. We apply PINT to the ERA5 WeatherBench dataset, focusing on long-term forecasting of 2m-temperature data. PINT incorporates the Simple Harmonic Oscillator Equation as a physics-informed prior, embedding its periodic dynamics into RNN, LSTM, and GRU architectures. This equation’s analytical solutions (sine and cosine functions) facilitate rigorous evaluation of the benefits of incorporating physics-informed constraints. By benchmarking against a linear regression baseline derived from its exact solutions, we quantify the impact of embedding physical principles in data-driven models. Unlike traditional time series models that rely on future observations, PINT is designed for practical forecasting. Using only the first 90 days of observed data, it iteratively predicts the next two years, addressing challenges posed by limited real-time updates. Experiments on the WeatherBench dataset demonstrate PINT’s ability to generalize, capture periodic trends, and align with physical principles. This study highlights the potential of physics-informed neural models in bridging machine learning and interpretable climate applications. Our models and datasets are publicly available on GitHub: this https URL.
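
For reference, the Simple Harmonic Oscillator Equation named in this abstract and its closed-form solution can be written as below. The symbols ($x$, $\omega$, $A$, $B$, $\hat{x}$) and the residual-style penalty in the last line are illustrative assumptions; the abstract does not state how PINT actually encodes the prior.

```latex
\begin{align}
\frac{d^2 x}{dt^2} + \omega^2 x(t) &= 0, \\
x(t) &= A\cos(\omega t) + B\sin(\omega t), \\
\mathcal{L}_{\text{phys}} &= \frac{1}{N}\sum_{i=1}^{N}
  \left(\frac{d^2 \hat{x}}{dt^2}(t_i) + \omega^2\,\hat{x}(t_i)\right)^2 .
\end{align}
```

Because the solution is linear in $A$ and $B$ for a fixed $\omega$, fitting it to data reduces to ordinary linear regression on the features $\cos(\omega t)$ and $\sin(\omega t)$, which is one plausible reading of the "linear regression baseline derived from its exact solutions".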

[LG-35] Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback

Link: https://arxiv.org/abs/2502.04004
Authors: Tal Lancewicki, Yishay Mansour
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a. full-bandit). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting. In the known-dynamics case, we achieve the first optimal regret bound of $\tilde{\Theta}(H^2\sqrt{SAK})$, where $K$ is the number of episodes, $H$ is the episode horizon, $S$ is the number of states, and $A$ is the number of actions. In the unknown-dynamics case, we establish a regret bound of $\tilde{O}(H^3 S\sqrt{AK})$, significantly improving the best known result by a factor of $H^2 S^5 A^2$.
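
As a quick arithmetic check, taking the stated improvement factor at face value, multiplying the new unknown-dynamics bound by that factor gives the order of the earlier bound it is being compared against. This is only what the abstract's numbers imply, not a quotation of the prior result.

```latex
\tilde{O}\!\left(H^3 S\sqrt{AK}\right)\cdot H^2 S^5 A^2
  \;=\; \tilde{O}\!\left(H^{5} S^{6} A^{5/2}\sqrt{K}\right).
```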

[LG-36] Tight Bounds on Jensen's Gap: Novel Approach with Applications in Generative Modeling

Link: https://arxiv.org/abs/2502.03988
Authors: Marcin Mazur, Piotr Kościelniak, Łukasz Struski
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Among various mathematical tools of particular interest are those that provide a common basis for researchers in different scientific fields. One of them is Jensen’s inequality, which states that the expectation of a convex function is greater than or equal to the function evaluated at the expectation. The resulting difference, known as Jensen’s gap, became the subject of investigation by both the statistical and machine learning communities. Among many related topics, finding lower and upper bounds on Jensen’s gap (under different assumptions on the underlying function and distribution) has recently become a problem of particular interest. In our paper, we take another step in this direction by providing a novel general and mathematically rigorous technique, motivated by the recent results of Struski et al. (2023). In addition, by studying in detail the case of the logarithmic function and the log-normal distribution, we explore a method for tightly estimating the log-likelihood of generative models trained on real-world datasets. Furthermore, we present both analytical and experimental arguments in support of the superiority of our approach in comparison to existing state-of-the-art solutions, contingent upon fulfillment of the criteria set forth by theoretical studies and corresponding experiments on synthetic data.
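
To make the log-function and log-normal case concrete, here is a minimal sketch in standard-library Python. It is not the paper's technique. Since the logarithm is concave, the inequality flips and the gap is taken as log E[X] - E[log X]. For X ~ LogNormal(mu, sigma^2) this equals sigma^2 / 2 exactly, because E[log X] = mu and log E[X] = mu + sigma^2 / 2. The function names and the sample size are illustrative assumptions.

```python
import math
import random


def analytic_gap(sigma: float) -> float:
    """Exact value of log E[X] - E[log X] for X ~ LogNormal(mu, sigma^2).

    The result does not depend on mu.
    """
    return sigma ** 2 / 2.0


def monte_carlo_gap(mu: float, sigma: float, n: int = 200_000, seed: int = 0) -> float:
    """Estimate the same gap from n samples (non-negative by Jensen's inequality)."""
    rng = random.Random(seed)
    samples = [rng.lognormvariate(mu, sigma) for _ in range(n)]
    mean_x = sum(samples) / n
    mean_log_x = sum(math.log(x) for x in samples) / n
    return math.log(mean_x) - mean_log_x


if __name__ == "__main__":
    print("analytic:", analytic_gap(0.5))
    print("sampled: ", monte_carlo_gap(0.0, 0.5))
```

The sampled estimate is itself a Jensen gap of the empirical distribution, so it is never negative. How tightly such estimates can be bounded is the question the abstract addresses.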

[LG-37] Temporal Distribution Shift in Real-World Pharmaceutical Data: Implications for Uncertainty Quantification in QSAR Models

Link: https://arxiv.org/abs/2502.03982
Authors: Hannah Rosa Friesacher, Emma Svensson, Susanne Winiwarter, Lewis Mervin, Adam Arany, Ola Engkvist
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The estimation of uncertainties associated with predictions from quantitative structure-activity relationship (QSAR) models can accelerate the drug discovery process by identifying promising experiments and allowing an efficient allocation of resources. Several computational tools exist that estimate the predictive uncertainty in machine learning models. However, deviations from the i.i.d. setting have been shown to impair the performance of these uncertainty quantification methods. We use a real-world pharmaceutical dataset to address the pressing need for a comprehensive, large-scale evaluation of uncertainty estimation methods in the context of realistic distribution shifts over time. We investigate the performance of several uncertainty estimation methods, including ensemble-based and Bayesian approaches. Furthermore, we use this real-world setting to systematically assess the distribution shifts in label and descriptor space and their impact on the capability of the uncertainty estimation methods. Our study reveals significant shifts over time in both label and descriptor space and a clear connection between the magnitude of the shift and the nature of the assay. Moreover, we show that pronounced distribution shifts impair the performance of popular uncertainty estimation methods used in QSAR models. This work highlights the challenges of identifying uncertainty quantification methods that remain reliable under distribution shifts introduced by real-world data.

[LG-38] Innovative Framework for Early Estimation of Mental Disorder Scores to Enable Timely Interventions

Link: https://arxiv.org/abs/2502.03965
Authors: Himanshi Singh, Sadhana Tiwari, Sonali Agarwal, Ritesh Chandra, Sanjay Kumar Sonbhadra, Vrijendra Singh
Categories: Machine Learning (cs.LG)
Comments:

Abstract:An individual’s general well-being is greatly impacted by mental health conditions including depression and Post-Traumatic Stress Disorder (PTSD), underscoring the importance of early detection and precise diagnosis in order to facilitate prompt clinical intervention. An advanced multimodal deep learning system for the automated classification of PTSD and depression is presented in this paper. Utilizing textual and audio data from clinical interview datasets, the method combines features taken from both modalities by combining the architectures of LSTM (Long Short-Term Memory) and BiLSTM (Bidirectional Long Short-Term Memory). While text features focus on the semantic and grammatical components of speech, audio features capture vocal traits including rhythm, tone, and pitch. This combination of modalities enhances the model’s capacity to identify subtle patterns connected to mental health conditions. On test datasets, the proposed method achieves classification accuracies of 92% for depression and 93% for PTSD, outperforming traditional unimodal approaches and demonstrating its accuracy and robustness.

[LG-39] AL-PINN: Active Learning-Driven Physics-Informed Neural Networks for Efficient Sample Selection in Solving Partial Differential Equations

Link: https://arxiv.org/abs/2502.03963
Authors: Keon Vin Park
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Physics-Informed Neural Networks (PINNs) have emerged as a promising approach for solving Partial Differential Equations (PDEs) by incorporating physical constraints into deep learning models. However, standard PINNs often require a large number of training samples to achieve high accuracy, leading to increased computational costs. To address this issue, we propose Active Learning-Driven PINNs (AL-PINN), which integrates Uncertainty Quantification (UQ) and Active Learning (AL) strategies to optimize sample selection dynamically. AL-PINN utilizes Monte Carlo Dropout to estimate epistemic uncertainty in the model predictions, enabling the adaptive selection of high-uncertainty regions for additional training. This approach significantly enhances learning efficiency by focusing computational resources on the most informative data points. We evaluate AL-PINN on benchmark PDE problems with known analytical solutions and real-world WeatherBench climate data. Our results demonstrate that AL-PINN achieves comparable or superior accuracy compared to traditional PINNs while reducing the number of required training samples. The proposed framework is particularly beneficial for scientific and engineering applications where data collection is expensive or limited, such as climate modeling, medical simulations, and material science. Our findings highlight the potential of active learning in accelerating PINN-based PDE solvers while maintaining high accuracy and computational efficiency.
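
The selection step described here, picking high-uncertainty points from repeated stochastic forward passes, can be sketched without any deep learning library. In the sketch below, `mc_predictions` stands in for the outputs of several dropout-enabled passes; the function name, the variance score, and the top-k rule are illustrative assumptions rather than AL-PINN's actual acquisition function.

```python
from statistics import pvariance


def select_most_uncertain(candidates, mc_predictions, k):
    """Return the k candidate points whose predictions vary most across passes.

    candidates:      list of candidate collocation points
    mc_predictions:  mc_predictions[t][i] is the prediction of stochastic
                     pass t at candidates[i] (e.g. with dropout left active)
    """
    # Transpose so that per_point[i] holds all passes for candidate i.
    per_point = list(zip(*mc_predictions))
    scores = [pvariance(values) for values in per_point]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in ranked[:k]]


if __name__ == "__main__":
    points = [0.0, 0.5, 1.0]
    passes = [
        [1.0, 2.0, 3.0],
        [1.0, 2.5, 3.1],
        [1.0, 1.5, 2.9],
    ]
    print(select_most_uncertain(points, passes, k=2))
```

In an active-learning loop, the returned points would be added to the training set before the next round of training.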

[LG-40] Non-convex composite federated learning with heterogeneous data

Link: https://arxiv.org/abs/2502.03958
Authors: Jiaojiao Zhang, Jiang Hu, Mikael Johansson
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:We propose an innovative algorithm for non-convex composite federated learning that decouples the proximal operator evaluation and the communication between server and clients. Moreover, each client uses local updates to communicate less frequently with the server, sends only a single d-dimensional vector per communication round, and overcomes issues with client drift. In the analysis, challenges arise from the use of decoupling strategies and local updates in the algorithm, as well as from the non-convex and non-smooth nature of the problem. We establish sublinear and linear convergence to a bounded residual error under general non-convexity and the proximal Polyak-Lojasiewicz inequality, respectively. In the numerical experiments, we demonstrate the superiority of our algorithm over state-of-the-art methods on both synthetic and real datasets.

[LG-41] Fairness Aware Reinforcement Learning via Proximal Policy Optimization

Link: https://arxiv.org/abs/2502.03953
Authors: Gabriele La Malfa, Jie M. Zhang, Michael Luck, Elizabeth Black
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments:

Abstract:Fairness in multi-agent systems (MAS) focuses on equitable reward distribution among agents in scenarios involving sensitive attributes such as race, gender, or socioeconomic status. This paper introduces fairness in Proximal Policy Optimization (PPO) with a penalty term derived from demographic parity, counterfactual fairness, and conditional statistical parity. The proposed method balances reward maximisation with fairness by integrating two penalty components: a retrospective component that minimises disparities in past outcomes and a prospective component that ensures fairness in future decision-making. We evaluate our approach in the Allelopathic Harvest game, a cooperative and competitive MAS focused on resource collection, where some agents possess a sensitive attribute. Experiments demonstrate that fair-PPO achieves fairer policies across all fairness metrics than classic PPO. Fairness comes at the cost of reduced rewards, namely the Price of Fairness, although agents with and without the sensitive attribute renounce comparable amounts of rewards. Additionally, the retrospective and prospective penalties effectively change the agents’ behaviour and improve fairness. These findings underscore the potential of fair-PPO to address fairness challenges in MAS.
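
To illustrate how a fairness penalty can trade off against reward, here is a minimal sketch of a demographic-parity-style term. It measures the absolute difference in mean reward between agents with and without the sensitive attribute and subtracts it from the average reward. The function names and the weight `lam` are hypothetical, and the sketch does not model the paper's retrospective and prospective components or the PPO update itself.

```python
from statistics import mean


def demographic_parity_gap(rewards, sensitive):
    """Absolute difference in mean reward between the two groups of agents."""
    with_attr = [r for r, s in zip(rewards, sensitive) if s]
    without_attr = [r for r, s in zip(rewards, sensitive) if not s]
    if not with_attr or not without_attr:
        return 0.0  # the gap is undefined with a single group; treat as no disparity
    return abs(mean(with_attr) - mean(without_attr))


def penalised_objective(rewards, sensitive, lam=0.5):
    """Average reward minus lam times the parity gap (lam is an assumed weight)."""
    return mean(rewards) - lam * demographic_parity_gap(rewards, sensitive)


if __name__ == "__main__":
    rewards = [4.0, 2.0, 1.0, 1.0]
    sensitive = [True, True, False, False]
    print(demographic_parity_gap(rewards, sensitive))
    print(penalised_objective(rewards, sensitive, lam=0.5))
```

A larger `lam` pushes the objective toward equal group outcomes at the expense of total reward, which is the trade-off the abstract calls the Price of Fairness.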

[LG-42] Bridging the inference gap in Multimodal Variational Autoencoders

Link: https://arxiv.org/abs/2502.03952
Authors: Agathe Senellart, Stéphanie Allassonnière
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:From medical diagnosis to autonomous vehicles, critical applications rely on the integration of multiple heterogeneous data modalities. Multimodal Variational Autoencoders offer versatile and scalable methods for generating unobserved modalities from observed ones. Recent models using mixtures-of-experts aggregation suffer from theoretically grounded limitations that restrict their generation quality on complex datasets. In this article, we propose a novel interpretable model able to learn both joint and conditional distributions without introducing mixture aggregation. Our model follows a multistage training process: first modeling the joint distribution with variational inference and then modeling the conditional distributions with Normalizing Flows to better approximate true posteriors. Importantly, we also propose to extract and leverage the information shared between modalities to improve the conditional coherence of generated samples. Our method achieves state-of-the-art results on several benchmark datasets.

[LG-43] CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning

Link: https://arxiv.org/abs/2502.03946
Authors: Yousef Koka, David Selby, Gerrit Großmann, Sebastian Vollmer
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Data preprocessing is a critical yet frequently neglected aspect of machine learning, often paid little attention despite its potentially significant impact on model performance. While automated machine learning pipelines are starting to recognize and integrate data preprocessing into their solutions for classification and regression tasks, this integration is lacking for more specialized tasks like survival or time-to-event models. As a result, survival analysis not only faces the general challenges of data preprocessing but also suffers from the lack of tailored, automated solutions in this area. To address this gap, this paper presents ‘CleanSurvival’, a reinforcement-learning-based solution for optimizing preprocessing pipelines, extended specifically for survival analysis. The framework can handle continuous and categorical variables, using Q-learning to select which combination of data imputation, outlier detection and feature extraction techniques achieves optimal performance for a Cox, random forest, neural network or user-supplied time-to-event model. The package is available on GitHub: this https URL. Experimental benchmarks on real-world datasets show that the Q-learning-based data preprocessing results in superior predictive performance to standard approaches, finding such a model up to 10 times faster than undirected random grid search. Furthermore, a simulation study demonstrates the effectiveness in different types and levels of missingness and noise in the data.

[LG-44] Multimodal Data-Driven Classification of Mental Disorders: A Comprehensive Approach to Diagnosing Depression Anxiety and Schizophrenia

Link: https://arxiv.org/abs/2502.03943
Authors: Himanshi Singh, Sadhana Tiwari, Sonali Agarwal, Ritesh Chandra, Sanjay Kumar Sonbhadra, Vrijendra Singh
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This study investigates the potential of multimodal data integration, which combines electroencephalogram (EEG) data with sociodemographic characteristics like age, sex, education, and intelligence quotient (IQ), to diagnose mental diseases like schizophrenia, depression, and anxiety. Using Apache Spark and convolutional neural networks (CNNs), a data-driven classification pipeline has been developed for big data environment to effectively analyze massive datasets. In order to evaluate brain activity and connection patterns associated with mental disorders, EEG parameters such as power spectral density (PSD) and coherence are examined. The importance of coherence features is highlighted by comparative analysis, which shows significant improvement in classification accuracy and robustness. This study emphasizes the significance of holistic approaches for efficient diagnostic tools by integrating a variety of data sources. The findings open the door for creative, data-driven approaches to treating psychiatric diseases by demonstrating the potential of utilizing big data, sophisticated deep learning methods, and multimodal datasets to enhance the precision, usability, and comprehension of mental health diagnostics.

[LG-45] Unravelling Causal Genetic Biomarkers of Alzheimers Disease via Neuron to Gene-token Backtracking in Neural Architecture: A Groundbreaking Reverse-Gene-Finder Approach

Link: https://arxiv.org/abs/2502.03938
Authors: Victor OK Li, Yang Han, Jacqueline CK Lam
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Alzheimer’s Disease (AD) affects over 55 million people globally, yet the key genetic contributors remain poorly understood. Leveraging recent advancements in genomic foundation models, we present the innovative Reverse-Gene-Finder technology, a ground-breaking neuron-to-gene-token backtracking approach in a neural network architecture to elucidate the novel causal genetic biomarkers driving AD onset. Reverse-Gene-Finder comprises three key innovations. Firstly, we exploit the observation that genes with the highest probability of causing AD, defined as the most causal genes (MCGs), must have the highest probability of activating those neurons with the highest probability of causing AD, defined as the most causal neurons (MCNs). Secondly, we utilize a gene token representation at the input layer to allow each gene (known or novel to AD) to be represented as a discrete and unique entity in the input space. Lastly, in contrast to the existing neural network architectures, which track neuron activations from the input layer to the output layer in a feed-forward manner, we develop an innovative backtracking method to track backwards from the MCNs to the input layer, identifying the Most Causal Tokens (MCTs) and the corresponding MCGs. Reverse-Gene-Finder is highly interpretable, generalizable, and adaptable, providing a promising avenue for application in other disease scenarios.

[LG-46] Quantifying Correlations of Machine Learning Models

Link: https://arxiv.org/abs/2502.03937
Authors: Yuanyuan Li, Neeraj Sarna, Yang Lin
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:Machine Learning models are being extensively used in safety critical applications where errors from these models could cause harm to the user. Such risks are amplified when multiple machine learning models, which are deployed concurrently, interact and make errors simultaneously. This paper explores three scenarios where error correlations between multiple models arise, resulting in such aggregated risks. Using real-world data, we simulate these scenarios and quantify the correlations in errors of different models. Our findings indicate that aggregated risks are substantial, particularly when models share similar algorithms, training datasets, or foundational models. Overall, we observe that correlations across models are pervasive and likely to intensify with increased reliance on foundational models and widely used public datasets, highlighting the need for effective mitigation strategies to address these challenges.
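
One simple way to quantify how strongly two models err together is the Pearson correlation between their per-sample error indicators. The sketch below is a generic illustration under that assumption. The abstract does not say which correlation measure the paper uses, and the function names are hypothetical.

```python
import math


def error_indicators(y_true, y_pred):
    """1 where the prediction is wrong, 0 where it is right."""
    return [int(t != p) for t, p in zip(y_true, y_pred)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a model's errors are constant")
    return cov / math.sqrt(var_x * var_y)


def error_correlation(y_true, pred_a, pred_b):
    """Correlation between the error patterns of two models on the same samples."""
    return pearson(error_indicators(y_true, pred_a), error_indicators(y_true, pred_b))


if __name__ == "__main__":
    labels = [0, 1, 0, 1, 0, 1]
    model_a = [0, 1, 1, 0, 0, 1]
    model_b = [0, 1, 1, 0, 0, 1]
    print(error_correlation(labels, model_a, model_b))
```

A value near 1 means the two models tend to fail on the same inputs, which is the aggregated-risk situation the abstract warns about. A value near 0 means their failures are close to independent.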

[LG-47] HEP-JEPA: A foundation model for collider physics using joint embedding predictive architecture

Link: https://arxiv.org/abs/2502.03933
Authors: Jai Bardhan, Radhikesh Agrawal, Abhiram Tilak, Cyrin Neeraj, Subhadip Mitra
Categories: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph)
Comments: 11 pages, 3 figures, 8 tables. Project website: this https URL

Abstract:We present a transformer architecture-based foundation model for tasks at high-energy particle colliders such as the Large Hadron Collider. We train the model to classify jets using a self-supervised strategy inspired by the Joint Embedding Predictive Architecture. We use the JetClass dataset containing 100M jets of various known particles to pre-train the model with a data-centric approach – the model uses a fraction of the jet constituents as the context to predict the embeddings of the unseen target constituents. Our pre-trained model fares well with other datasets for standard classification benchmark tasks. We test our model on two additional downstream tasks: top tagging and differentiating light-quark jets from gluon jets. We also evaluate our model with task-specific metrics and baselines and compare it with state-of-the-art models in high-energy physics. Project site: this https URL

[LG-48] Technical Report: Generating the WEB-IDS23 Dataset

Link: https://arxiv.org/abs/2502.03909
Authors: Eric Lanfer, Dominik Brockmann, Nils Aschenbruck
Categories: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Anomaly-based Network Intrusion Detection Systems (NIDS) require correctly labelled, representative and diverse datasets for accurate evaluation and development. However, several widely used datasets do not include sufficiently fine-grained labels, which, together with small sample sizes, can lead to overfitting issues that remain undetected even when using test data. Additionally, the cybersecurity sector is evolving fast, and new attack mechanisms require the continuous creation of up-to-date datasets. To address these limitations, we developed a modular traffic generator that can simulate a wide variety of benign and malicious traffic. It incorporates multiple protocols and variability through randomization techniques, and can produce attacks alongside corresponding benign traffic, as occurs in real-world scenarios. Using the traffic generator, we create a dataset capturing over 12 million samples with 82 flow-level features and 21 fine-grained labels. Additionally, we include several web attack types which are often underrepresented in other datasets.

[LG-49] InfinitePOD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers

Link: https://arxiv.org/abs/2502.03885
Authors: Chenchen Shou, Guyue Liu, Hao Nie, Huaiyu Meng, Yu Zhou, Yinmin Jiang, Wenqing Lv, Yelong Xu, Yuanwei Lu, Zhang Chen, Yanbo Yu, Yichen Shen, Yibo Zhu, Daxin Jiang
Categories: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Abstract:Scaling Large Language Model (LLM) training relies on multi-dimensional parallelism, where High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism like Tensor Parallelism (TP) and Expert Parallelism (EP). However, existing HBD architectures face fundamental limitations in scalability, cost, and fault resiliency: switch-centric HBDs (e.g., NVL-72) incur prohibitive scaling costs, while GPU-centric HBDs (e.g., TPUv3/Dojo) suffer from severe fault propagation. Switch-GPU hybrid HBDs such as TPUv4 take a middle-ground approach by leveraging Optical Circuit Switches, but the fault explosion radius remains large at the cube level (e.g., 64 TPUs). We propose InfinitePOD, a novel transceiver-centric HBD architecture that unifies connectivity and dynamic switching at the transceiver level using Optical Circuit Switching (OCS). By embedding OCS within each transceiver, InfinitePOD achieves reconfigurable point-to-multipoint connectivity, allowing the topology to adapt into variable-size rings. This design provides: i) datacenter-wide scalability without cost explosion; ii) fault resilience by isolating failures to a single node; and iii) full bandwidth utilization for fault-free GPUs. Key innovations include a Silicon Photonic (SiPh) based low-cost OCS transceiver (OCSTrx), a reconfigurable k-hop ring topology co-designed with intra-/inter-node communication, and an HBD-DCN orchestration algorithm maximizing GPU utilization while minimizing cross-ToR datacenter network traffic. The evaluation demonstrates that InfinitePOD achieves 31% of the cost of NVL-72, near-zero GPU waste ratio (over one order of magnitude lower than NVL-72 and TPUv4), near-zero cross-ToR traffic when node fault ratios are under 7%, and improves Model FLOPs Utilization by 3.37x compared to NVIDIA DGX (8 GPUs per Node).

[LG-50] Position: Untrained Machine Learning for Anomaly Detection

Link: https://arxiv.org/abs/2502.03876
Authors: Juan Du, Dongheng Chen, Hao Yan
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 0 figures

Abstract:Anomaly detection based on 3D point cloud data is an important research problem that has received increasing attention recently. Untrained anomaly detection based on only one sample is an emerging research problem motivated by real manufacturing settings, such as personalized manufacturing, where only one sample can be collected without any additional labels. How to accurately identify anomalies based on one 3D point cloud sample is a critical challenge in both industrial applications and the field of machine learning. This paper aims to provide a formal definition of the untrained anomaly detection problem based on 3D point cloud data and to discuss the differences between untrained anomaly detection and current unsupervised anomaly detection methods. Unlike unsupervised learning, untrained methods do not rely on any data, including unlabeled data. Instead, they leverage prior knowledge about the manufacturing surfaces and anomalies. Examples are used to illustrate this prior knowledge and the untrained machine learning model. Afterwards, a literature review on untrained anomaly detection based on 3D point cloud data is provided, and the potential of untrained deep neural networks for anomaly detection is discussed as an outlook.

[LG-51] Mirror Descent Actor Critic via Bounded Advantage Learning

Link: https://arxiv.org/abs/2502.03854
Authors: Ryo Iwaki
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Regularization is a core component of recent Reinforcement Learning (RL) algorithms. Mirror Descent Value Iteration (MDVI) uses both Kullback-Leibler divergence and entropy as regularizers in its value and policy updates. Despite its empirical success in discrete action domains and strong theoretical guarantees, the performance of an MDVI-based method does not surpass an entropy-only-regularized method in continuous action domains. In this study, we propose Mirror Descent Actor Critic (MDAC) as an actor-critic style instantiation of MDVI for continuous action domains, and show that its empirical performance is significantly boosted by bounding the actor’s log-density terms in the critic’s loss function, compared to a non-bounded naive instantiation. Further, we relate MDAC to Advantage Learning by recalling that the actor’s log-probability is equal to the regularized advantage function in tabular cases, and theoretically discuss when and why bounding the advantage terms is validated and beneficial. We also empirically explore a good choice for the bounding function, and show that MDAC performs better than strong non-regularized and entropy-only-regularized methods with an appropriate choice of the bounding function.

[LG-52] Knowing When to Stop Matters: A Unified Algorithm for Online Conversion under Horizon Uncertainty

Link: https://arxiv.org/abs/2502.03817
Authors: Yanzhao Wang, Hasti Nourmohammadi Sigaroudi, Bo Sun, Omid Ardakanian, Xiaoqi Tan
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 36 pages, 6 figures

Abstract:This paper investigates the online conversion problem, which involves sequentially trading a divisible resource (e.g., energy) under dynamically changing prices to maximize profit. A key challenge in online conversion is managing decisions under horizon uncertainty, where the duration of trading is either known, revealed partway, or entirely unknown. We propose a unified algorithm that achieves optimal competitive guarantees across these horizon models, accounting for practical constraints such as box constraints, which limit the maximum allowable trade per step. Additionally, we extend the algorithm to a learning-augmented version, leveraging horizon predictions to adaptively balance performance: achieving near-optimal results when predictions are accurate while maintaining strong guarantees when predictions are unreliable. These results advance the understanding of online conversion under various degrees of horizon uncertainty and provide more practical strategies to address real world constraints.

[LG-53] Should Code Models Learn Pedagogically? A Preliminary Evaluation of Curriculum Learning for Real-World Software Engineering Tasks

Link: https://arxiv.org/abs/2502.03806
Authors: Kyi Shin Khant, Hong Yi Lin, Patanamon Thongtanunam
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: Accepted by the 22nd International Conference on Mining Software Repositories (MSR 25)

Abstract:Learning-based techniques, especially advanced pre-trained models for code, have demonstrated capabilities in code understanding and generation, solving diverse software engineering (SE) tasks. Despite the promising results, current training approaches may not fully optimize model performance, as they typically involve learning from randomly shuffled training data. Recent work shows that Curriculum Learning (CL) can improve performance on code-related tasks through incremental learning based on the difficulty of synthetic code. Yet, the effectiveness of CL with conventional difficulty measures in SE tasks remains largely unexplored. In this study, we explore two conventional code metrics, code length and cyclomatic complexity, to determine the difficulty levels. We investigate how the pre-trained code model (CodeT5) learns under CL, through the tasks of code clone detection and code summarization. Our empirical study on the CodeXGLUE benchmark showed contrasting results to prior studies, where the model exhibited signs of catastrophic forgetting and shortcut learning. Surprisingly, model performance saturates after only the first quartile of training, potentially indicating a limit in the model’s representation capacity and/or the task’s inherent difficulty. Future work should further explore various CL strategies with different code models across a wider range of SE tasks for a more holistic understanding.
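
The curriculum idea, ordering training samples from easy to hard by code length or cyclomatic complexity and feeding them in stages, can be sketched as follows. This is a rough illustration, not the study's pipeline. It scores Python snippets with the standard `ast` module (an assumption, since the benchmark's language is not stated here), approximates cyclomatic complexity as one plus the number of branching nodes, and splits the ordered data into equal-sized stages. All function names are hypothetical.

```python
import ast

# Node types counted as decision points (a rough approximation).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of branching nodes."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, _BRANCH_NODES) for node in ast.walk(tree))


def curriculum_stages(samples, difficulty, n_stages=4):
    """Sort samples from easy to hard and split them into up to n_stages chunks."""
    ordered = sorted(samples, key=difficulty)
    if not ordered:
        return []
    size = -(-len(ordered) // n_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


if __name__ == "__main__":
    simple = "def f(x):\n    return x\n"
    branchy = (
        "def g(x):\n"
        "    if x > 0 and x < 10:\n"
        "        return 1\n"
        "    for i in range(x):\n"
        "        pass\n"
        "    return 0\n"
    )
    for stage in curriculum_stages([branchy, simple], cyclomatic_complexity, n_stages=2):
        print([cyclomatic_complexity(s) for s in stage])
```

Passing `len` as the `difficulty` function gives the code-length variant of the same curriculum.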

[LG-54] Graph Neural Network-Driven Hierarchical Mining for Complex Imbalanced Data

Link: https://arxiv.org/abs/2502.03803
Authors: Yijiashun Qi, Quanchao Lu, Shiyu Dou, Xiaoxuan Sun, Muqing Li, Yankaiqi Li
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This study presents a hierarchical mining framework for high-dimensional imbalanced data, leveraging a depth graph model to address the inherent performance limitations of conventional approaches in handling complex, high-dimensional data distributions with imbalanced sample representations. By constructing a structured graph representation of the dataset and integrating graph neural network (GNN) embeddings, the proposed method effectively captures global interdependencies among samples. Furthermore, a hierarchical strategy is employed to enhance the characterization and extraction of minority class feature patterns, thereby facilitating precise and robust imbalanced data mining. Empirical evaluations across multiple experimental scenarios validate the efficacy of the proposed approach, demonstrating substantial improvements over traditional methods in key performance metrics, including pattern discovery count, average support, and minority class coverage. Notably, the method exhibits superior capabilities in minority-class feature extraction and pattern correlation analysis. These findings underscore the potential of depth graph models, in conjunction with hierarchical mining strategies, to significantly enhance the efficiency and accuracy of imbalanced data analysis. This research contributes a novel computational framework for high-dimensional complex data processing and lays the foundation for future extensions to dynamically evolving imbalanced data and multi-modal data applications, thereby expanding the applicability of advanced data mining methodologies to more intricate analytical domains.

[LG-55] MXMap: A Multivariate Cross Mapping Framework for Causal Discovery in Dynamical Systems

Link: https://arxiv.org/abs/2502.03802
Authors: Elise Zhang, François Mirallès, Raphaël Rousseau-Rizzi, Arnaud Zinflou, Di Wu, Benoit Boulet
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Methodology (stat.ME)
Comments: Accepted by CLeaR 2025; Main manuscript 18 pages, appendix 24 pages, 30 tables

Abstract:Convergent Cross Mapping (CCM) is a powerful method for detecting causality in coupled nonlinear dynamical systems, providing a model-free approach to capture dynamic causal interactions. Partial Cross Mapping (PCM) was introduced as an extension of CCM to address indirect causality in three-variable systems by comparing cross-mapping quality between direct cause-effect mapping and indirect mapping through an intermediate conditioning variable. However, PCM remains limited to univariate delay embeddings in its cross-mapping processes. In this work, we extend PCM to the multivariate setting, introducing multiPCM, which leverages multivariate embeddings to more effectively distinguish indirect causal relationships. We further propose a multivariate cross-mapping framework (MXMap) for causal discovery in dynamical systems. This two-phase framework combines (1) pairwise CCM tests to establish an initial causal graph and (2) multiPCM to refine the graph by pruning indirect causal connections. Through experiments on simulated data and the ERA5 Reanalysis weather dataset, we demonstrate the effectiveness of MXMap. Additionally, MXMap is compared against several baseline methods, showing advantages in accuracy and causal graph refinement.

[LG-56] Network-Wide Traffic Flow Estimation Across Multiple Cities with Global Open Multi-Source Data: A Large-Scale Case Study in Europe and North America

Link: https://arxiv.org/abs/2502.03798
Authors: Zijian Hu, Zhenjie Zheng, Monica Menendez, Wei Ma
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Network-wide traffic flow, which captures dynamic traffic volume on each link of a general network, is fundamental to smart mobility applications. However, the observed traffic flow from sensors is usually limited across the entire network due to the associated high installation and maintenance costs. To address this issue, existing research uses various supplementary data sources to compensate for insufficient sensor coverage and estimate the unobserved traffic flow. Although these studies have shown promising results, the inconsistent availability and quality of supplementary data across cities mean that their methods typically face a trade-off between accuracy and generality. In this research, we are the first to advocate using the Global Open Multi-Source (GOMS) data within an advanced deep learning framework to break the trade-off. The GOMS data primarily encompass geographical and demographic information, including road topology, building footprints, and population density, which can be consistently collected across cities. More importantly, these GOMS data are either causes or consequences of transportation activities, thereby creating opportunities for accurate network-wide flow estimation. Furthermore, we use map images to represent GOMS data, instead of traditional tabular formats, to capture richer and more comprehensive geographical and demographic information. To address multi-source data fusion, we develop an attention-based graph neural network that effectively extracts and synthesizes information from GOMS maps while simultaneously capturing spatiotemporal traffic dynamics from observed traffic data. A large-scale case study across 15 cities in Europe and North America was conducted. The results demonstrate stable and satisfactory estimation accuracy across these cities, which suggests that the trade-off challenge can be successfully addressed using our approach.

[LG-57] Distribution learning via neural differential equations: minimal energy regularization and approximation theory

Link: https://arxiv.org/abs/2502.03795
Authors: Youssef Marzouk, Zhi Ren, Jakob Zech
Subjects: Machine Learning (cs.LG); Classical Analysis and ODEs (math.CA); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Abstract:Neural ordinary differential equations (ODEs) provide expressive representations of invertible transport maps that can be used to approximate complex probability distributions, e.g., for generative modeling, density estimation, and Bayesian inference. We show that for a large class of transport maps $T$, there exists a time-dependent ODE velocity field realizing a straight-line interpolation $(1-t)x + tT(x)$, $t \in [0,1]$, of the displacement induced by the map. Moreover, we show that such velocity fields are minimizers of a training objective containing a specific minimum-energy regularization. We then derive explicit upper bounds for the $C^k$ norm of the velocity field that are polynomial in the $C^k$ norm of the corresponding transport map $T$; in the case of triangular (Knothe–Rosenblatt) maps, we also show that these bounds are polynomial in the $C^k$ norms of the associated source and target densities. Combining these results with stability arguments for distribution approximation via ODEs, we show that Wasserstein or Kullback–Leibler approximation of the target distribution to any desired accuracy $\epsilon > 0$ can be achieved by a deep neural network representation of the velocity field whose size is bounded explicitly in terms of $\epsilon$, the dimension, and the smoothness of the source and target densities. The same neural network ansatz yields guarantees on the value of the regularized training objective.
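
The straight-line interpolation mentioned in this abstract can be written out explicitly. The short derivation below is a sketch: the path notation $X_t$ is introduced here for illustration and is not taken from the paper, and the last line assumes that $x \mapsto X_t(x)$ is invertible for every $t$.

```latex
% X_t is notation introduced here for the interpolated path; it is not from the abstract.
X_t(x) = (1-t)\,x + t\,T(x), \qquad t \in [0,1],
\qquad
\frac{\mathrm{d}}{\mathrm{d}t} X_t(x) = T(x) - x = v\bigl(t, X_t(x)\bigr).
% If x \mapsto X_t(x) is invertible for each t, this determines the velocity field:
v(t, y) = T\bigl(X_t^{-1}(y)\bigr) - X_t^{-1}(y).
```

Along each path the velocity is constant in time, which is what makes the trajectories straight lines from $x$ to $T(x)$.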

[LG-58] Iterate to Accelerate: A Unified Framework for Iterative Reasoning and Feedback Convergence

Link: https://arxiv.org/abs/2502.03787
Authors: Jacob Fein-Ashley
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:We introduce a unified framework for iterative reasoning that leverages non-Euclidean geometry via Bregman divergences, higher-order operator averaging, and adaptive feedback mechanisms. Our analysis establishes that, under mild smoothness and contractivity assumptions, a generalized update scheme not only unifies classical methods such as mirror descent and dynamic programming but also captures modern chain-of-thought reasoning processes in large language models. In particular, we prove that our accelerated iterative update achieves an O(1/t^2) convergence rate in the absence of persistent perturbations, and we further demonstrate that feedback (iterative) architectures are necessary to approximate certain fixed-point functions efficiently. These theoretical insights bridge classical acceleration techniques with contemporary applications in neural computation and optimization.

[LG-59] StarMAP: Global Neighbor Embedding for Faithful Data Visualization

Link: https://arxiv.org/abs/2502.03776
Authors: Koshi Watanabe, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Neighbor embedding is widely employed to visualize high-dimensional data; however, it frequently overlooks the global structure, e.g., intercluster similarities, thereby impeding accurate visualization. To address this problem, this paper presents Star-attracted Manifold Approximation and Projection (StarMAP), which incorporates the advantage of principal component analysis (PCA) in neighbor embedding. Inspired by the property of PCA embedding, which can be viewed as the largest shadow of the data, StarMAP introduces the concept of star attraction by leveraging the PCA embedding. This approach yields faithful global structure preservation while maintaining the interpretability and computational efficiency of neighbor embedding. StarMAP was compared with existing methods in the visualization tasks of toy datasets, single-cell RNA sequencing data, and deep representation. The experimental results show that StarMAP is simple but effective in realizing faithful visualizations.

[LG-60] Learning Reward Machines from Partially Observed Optimal Policies

Link: https://arxiv.org/abs/2502.03762
Authors: Mohamad Louai Shehab, Antoine Aspeel, Necmiye Ozay
Subjects: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL)
Comments:

Abstract:Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy. In this work, it is assumed that the reward is expressed as a reward machine whose transitions depend on atomic propositions associated with the state of a Markov Decision Process (MDP). Our goal is to identify the true reward machine using finite information. To this end, we first introduce the notion of a prefix tree policy which associates a distribution of actions to each state of the MDP and each attainable finite sequence of atomic propositions. Then, we characterize an equivalence class of reward machines that can be identified given the prefix tree policy. Finally, we propose a SAT-based algorithm that uses information extracted from the prefix tree policy to solve for a reward machine. It is proved that if the prefix tree policy is known up to a sufficient (but finite) depth, our algorithm recovers the exact reward machine up to the equivalence class. This sufficient depth is derived as a function of the number of MDP states and (an upper bound on) the number of states of the reward machine. Several examples are used to demonstrate the effectiveness of the approach.

[LG-61] Regularization via f-Divergence: An Application to Multi-Oxide Spectroscopic Analysis

Link: https://arxiv.org/abs/2502.03755
Authors: Weizhi Li, Natalie Klein, Brendan Gifford, Elizabeth Sklute, Carey Legett, Samuel Clegg
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we address the task of characterizing the chemical composition of planetary surfaces using convolutional neural networks (CNNs). Specifically, we seek to predict the multi-oxide weights of rock samples based on spectroscopic data collected under Martian conditions. We frame this problem as a multi-target regression task and propose a novel regularization method based on f-divergence. The f-divergence regularization is designed to constrain the distributional discrepancy between predictions and noisy targets. This regularizer serves a dual purpose: on the one hand, it mitigates overfitting by enforcing a constraint on the distributional difference between predictions and noisy targets. On the other hand, it acts as an auxiliary loss function, penalizing the neural network when the divergence between the predicted and target distributions becomes too large. To enable backpropagation during neural network training, we develop a differentiable f-divergence and incorporate it into the f-divergence regularization, making the network training feasible. We conduct experiments using spectra collected in a Mars-like environment by the remote-sensing instruments aboard the Curiosity and Perseverance rovers. Experimental results on multi-oxide weight prediction demonstrate that the proposed f-divergence regularization performs better than or comparable to standard regularization methods including $L_1$, $L_2$, and dropout. Notably, combining the f-divergence regularization with these standard regularizers further enhances performance, outperforming each regularization method used independently.

[LG-62] PINS: Proximal Iterations with Sparse Newton and Sinkhorn for Optimal Transport

Link: https://arxiv.org/abs/2502.03749
Authors: Di Wu, Ling Liang, Haizhao Yang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 12 pages, 5 figures

Abstract:Optimal transport (OT) is a critical problem in optimization and machine learning, where accuracy and efficiency are paramount. Although entropic regularization and the Sinkhorn algorithm improve scalability, they frequently encounter numerical instability and slow convergence, especially when the regularization parameter is small. In this work, we introduce Proximal Iterations with Sparse Newton and Sinkhorn methods (PINS) to efficiently compute highly accurate solutions for large-scale OT problems. A reduced computational complexity through overall sparsity and global convergence are guaranteed by rigorous theoretical analysis. Our approach offers three key advantages: it achieves accuracy comparable to exact solutions, progressively accelerates each iteration for greater efficiency, and enhances robustness by reducing sensitivity to regularization parameters. Extensive experiments confirm these advantages, demonstrating superior performance compared to related methods.
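
For readers unfamiliar with the baseline this abstract builds on, the following is a minimal pure-Python sketch of the classical Sinkhorn iteration for entropic OT. It is not the PINS method; the function name `sinkhorn`, the toy cost matrix, and the parameter values are illustrative assumptions.

```python
import math

def sinkhorn(cost, a, b, eps=0.1, n_iters=500):
    """Classical Sinkhorn scaling for entropic OT (illustrative, not PINS).

    cost: n x m cost matrix (list of rows); a, b: marginals that each sum to 1.
    Returns the transport plan as a list of rows.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps); entries underflow when eps is tiny.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy problem: two sources, two sinks, cheap to stay in place.
cost = [[0.0, 1.0], [1.0, 0.0]]
a = [0.5, 0.5]
b = [0.5, 0.5]
plan = sinkhorn(cost, a, b)
```

The kernel entries `exp(-C_ij / eps)` shrink toward zero as `eps` decreases, which is one source of the numerical instability for small regularization parameters that the abstract refers to.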

[LG-63] Mitigating the Participation Bias by Balancing Extreme Ratings

Link: https://arxiv.org/abs/2502.03737
Authors: Yongkang Guo, Yuqing Kong, Jialiang Liu
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: In Proceedings of the ACM Web Conference 2025, 15 pages

Abstract:Rating aggregation plays a crucial role in various fields, such as product recommendations, hotel rankings, and teaching evaluations. However, traditional averaging methods can be affected by participation bias, where some raters do not participate in the rating process, leading to potential distortions. In this paper, we consider a robust rating aggregation task under the participation bias. We assume that raters may not reveal their ratings with a certain probability depending on their individual ratings, resulting in partially observed samples. Our goal is to minimize the expected squared loss between the aggregated ratings and the average of all underlying ratings (possibly unobserved) in the worst-case scenario. We focus on two settings based on whether the sample size (i.e. the number of raters) is known. In the first setting, where the sample size is known, we propose an aggregator, named as the Balanced Extremes Aggregator. It estimates unrevealed ratings with a balanced combination of extreme ratings. When the sample size is unknown, we derive another aggregator, the Polarizing-Averaging Aggregator, which becomes optimal as the sample size grows to infinity. Numerical results demonstrate the superiority of our proposed aggregators in mitigating participation bias, compared to simple averaging and the spectral method. Furthermore, we validate the effectiveness of our aggregators on a real-world dataset.

[LG-64] Optimal Control of Fluid Restless Multi-armed Bandits: A Machine Learning Approach

Link: https://arxiv.org/abs/2502.03725
Authors: Dimitris Bertsimas, Cheol Woo Kim, José Niño-Mora
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:We propose a machine learning approach to the optimal control of fluid restless multi-armed bandits (FRMABs) with state equations that are either affine or quadratic in the state variables. By deriving fundamental properties of FRMAB problems, we design an efficient machine learning based algorithm. Using this algorithm, we solve multiple instances with varying initial states to generate a comprehensive training set. We then learn a state feedback policy using Optimal Classification Trees with hyperplane splits (OCT-H). We test our approach on machine maintenance, epidemic control and fisheries control problems. Our method yields high-quality state feedback policies and achieves a speed-up of up to 26 million times compared to a direct numerical algorithm for fluid problems.

[LG-65] Detecting Backdoor Attacks via Similarity in Semantic Communication Systems

Link: https://arxiv.org/abs/2502.03721
Authors: Ziyang Wei, Yili Jiang, Jiaqi Huang, Fangtian Zhong, Sohan Gyawali
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Semantic communication systems, which leverage Generative AI (GAI) to transmit semantic meaning rather than raw data, are poised to revolutionize modern communications. However, they are vulnerable to backdoor attacks, a type of poisoning manipulation that embeds malicious triggers into training datasets. As a result, backdoor attacks mislead the inference for poisoned samples while clean samples remain unaffected. The existing defenses may alter the model structure (such as neuron pruning that potentially degrades inference performance on clean inputs), or impose strict requirements on data formats (such as "Semantic Shield" that requires image-text pairs). To address these limitations, this work proposes a defense mechanism that leverages semantic similarity to detect backdoor attacks without modifying the model structure or imposing data format constraints. By analyzing deviations in semantic feature space and establishing a threshold-based detection framework, the proposed approach effectively identifies poisoned samples. The experimental results demonstrate high detection accuracy and recall across varying poisoning ratios, underlining the significant effectiveness of our proposed solution.

[LG-66] On the Expressive Power of Subgraph Graph Neural Networks for Graphs with Bounded Cycles

Link: https://arxiv.org/abs/2502.03703
Authors: Ziang Chen, Qiao Zhang, Runzhong Wang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Graph neural networks (GNNs) have been widely used in graph-related contexts. It is known that the separation power of GNNs is equivalent to that of the Weisfeiler-Lehman (WL) test; hence, GNNs are imperfect at identifying all non-isomorphic graphs, which severely limits their expressive power. This work investigates $k$-hop subgraph GNNs that aggregate information from neighbors with distances up to $k$ and incorporate the subgraph structure. We prove that under appropriate assumptions, the $k$-hop subgraph GNNs can approximate any permutation-invariant/equivariant continuous function over graphs without cycles of length greater than $2k+1$ within any error tolerance. We also provide an extension to $k$-hop GNNs without incorporating the subgraph structure. Our numerical experiments on established benchmarks and novel architectures validate our theory on the relationship between the information aggregation distance and the cycle size.

[LG-67] How vulnerable is my policy? Adversarial attacks on modern behavior cloning policies

Link: https://arxiv.org/abs/2502.03698
Authors: Basavasagar Patil, Akansha Kalra, Guanhong Tao, Daniel S. Brown
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Robotics (cs.RO)
Comments:

Abstract:Learning from Demonstration (LfD) algorithms have shown promising results in robotic manipulation tasks, but their vulnerability to adversarial attacks remains underexplored. This paper presents a comprehensive study of adversarial attacks on both classic and recently proposed algorithms, including Behavior Cloning (BC), LSTM-GMM, Implicit Behavior Cloning (IBC), Diffusion Policy (DP), and VQ-Behavior Transformer (VQ-BET). We study the vulnerability of these methods to untargeted, targeted and universal adversarial perturbations. While explicit policies, such as BC, LSTM-GMM and VQ-BET can be attacked in the same manner as standard computer vision models, we find that attacks for implicit and denoising policy models are nuanced and require developing novel attack methods. Our experiments on several simulated robotic manipulation tasks reveal that most of the current methods are highly vulnerable to adversarial perturbations. We also show that these attacks are transferable across algorithms, architectures, and tasks, raising concerning security vulnerabilities with potentially a white-box threat model. In addition, we test the efficacy of randomized smoothing, a widely used adversarial defense technique, and highlight its limitation in defending against attacks on complex and multi-modal action distribution common in complex control tasks. In summary, our findings highlight the vulnerabilities of modern BC algorithms, paving the way for future work in addressing such limitations.

[LG-68] Cascaded Learned Bloom Filter for Optimal Model-Filter Size Balance and Fast Rejection

Link: https://arxiv.org/abs/2502.03696
Authors: Atsuki Sato, Yusuke Matsui
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Machine Learning (cs.LG)
Comments:

Abstract:Recent studies have demonstrated that learned Bloom filters, which combine machine learning with the classical Bloom filter, can achieve superior memory efficiency. However, existing learned Bloom filters face two critical unresolved challenges: the balance between the machine learning model size and the Bloom filter size is not optimal, and the reject time cannot be minimized effectively. We propose the Cascaded Learned Bloom Filter (CLBF) to address these issues. Our dynamic programming-based optimization automatically selects configurations that achieve an optimal balance between the model and filter sizes while minimizing reject time. Experiments on real-world datasets show that CLBF reduces memory usage by up to 24% and decreases reject time by up to 14 times compared to state-of-the-art learned Bloom filters.
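
As background for this abstract, here is a minimal sketch of a plain (non-cascaded) learned Bloom filter: a score function screens queries first, and a small backup Bloom filter stores the keys the model misses so that no key is ever rejected. This is not CLBF; the class names, the toy score function `toy_score`, and the sizes are assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    """Classical Bloom filter over strings using salted SHA-256 hashes."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = [False] * n_bits

    def _positions(self, item):
        # Derive n_hashes bit positions by salting the item with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


class LearnedBloomFilter:
    """Model first, backup filter second (illustrative, not CLBF)."""

    def __init__(self, score, threshold, keys, n_bits=256, n_hashes=3):
        self.score = score
        self.threshold = threshold
        self.backup = BloomFilter(n_bits, n_hashes)
        # Keys the model scores below the threshold go into the backup
        # filter, which guarantees there are no false negatives.
        for key in keys:
            if score(key) < threshold:
                self.backup.add(key)

    def __contains__(self, item):
        if self.score(item) >= self.threshold:
            return True  # accepted by the model (may be a false positive)
        return item in self.backup


def toy_score(s):
    # Hypothetical stand-in for a trained classifier.
    return 0.9 if s.startswith("key") else 0.1

keys = ["key1", "key2", "odd-one"]
lbf = LearnedBloomFilter(toy_score, 0.5, keys)
```

In this layout every non-key must pass through the model before it can be rejected, so model size and inference cost trade off directly against the size of the backup filter. That is the balance the abstract says CLBF optimizes.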

[LG-69] Chaos into Order: Neural Framework for Expected Value Estimation of Stochastic Partial Differential Equations

Link: https://arxiv.org/abs/2502.03670
Authors: Ísak Pétursson, María Óskarsdóttir
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Stochastic Partial Differential Equations (SPDEs) are fundamental to modeling complex systems in physics, finance, and engineering, yet their numerical estimation remains a formidable challenge. Traditional methods rely on discretization, introducing computational inefficiencies, and limiting applicability in high-dimensional settings. In this work, we introduce a novel neural framework for SPDE estimation that eliminates the need for discretization, enabling direct estimation of expected values across arbitrary spatio-temporal points. We develop and compare two distinct neural architectures: Loss Enforced Conditions (LEC), which integrates physical constraints into the loss function, and Model Enforced Conditions (MEC), which embeds these constraints directly into the network structure. Through extensive experiments on the stochastic heat equation, Burgers’ equation, and Kardar-Parisi-Zhang (KPZ) equation, we reveal a trade-off: While LEC achieves superior residual minimization and generalization, MEC enforces initial conditions with absolute precision and exceptionally high accuracy in boundary condition enforcement. Our findings highlight the immense potential of neural-based SPDE solvers, particularly for high-dimensional problems where conventional techniques falter. By circumventing discretization and explicitly modeling uncertainty, our approach opens new avenues for solving SPDEs in fields ranging from quantitative finance to turbulence modeling. To the best of our knowledge, this is the first neural framework capable of directly estimating the expected values of SPDEs in an entirely non-discretized manner, offering a step forward in scientific computing.

[LG-70] Privacy-Preserving Generative Models: A Comprehensive Survey

Link: https://arxiv.org/abs/2502.03668
Authors: Debalina Padariya, Isabel Wagner, Aboozar Taherkhani, Eerke Boiten
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Abstract:Despite the generative model’s groundbreaking success, the need to study its implications for privacy and utility becomes more urgent. Although many studies have demonstrated the privacy threats brought by GANs, no existing survey has systematically categorized the privacy and utility perspectives of GANs and VAEs. In this article, we comprehensively study privacy-preserving generative models, articulating the novel taxonomies for both privacy and utility metrics by analyzing 100 research publications. Finally, we discuss the current challenges and future research directions that help new researchers gain insight into the underlying concepts.

[LG-71] Contrastive Learning for Cold Start Recommendation with Adaptive Feature Fusion

Link: https://arxiv.org/abs/2502.03664
Authors: Jiacheng Hu, Tai An, Zidong Yu, Junliang Du, Yuanshuai Luo
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:This paper proposes a cold start recommendation model that integrates contrastive learning, aiming to solve the problem of performance degradation of recommendation systems in cold start scenarios due to the scarcity of user and item interaction data. The model dynamically adjusts the weights of key features through an adaptive feature selection module and effectively integrates user attributes, item meta-information, and contextual features by combining a multimodal feature fusion mechanism, thereby improving recommendation performance. In addition, the model introduces a contrastive learning mechanism to enhance the robustness and generalization ability of feature representation by constructing positive and negative sample pairs. Experiments are conducted on the MovieLens-1M dataset. The results show that the proposed model significantly outperforms mainstream recommendation methods such as Matrix Factorization, LightGBM, DeepFM, and AutoRec in terms of HR, NDCG, MRR, and Recall, especially in cold start scenarios. Ablation experiments further verify the key role of each module in improving model performance, and the learning rate sensitivity analysis shows that a moderate learning rate is crucial to the optimization effect of the model. This study not only provides a new solution to the cold start problem but also provides an important reference for the application of contrastive learning in recommendation systems. In the future, this model is expected to play a role in a wider range of scenarios, such as real-time recommendation and cross-domain recommendation.

[LG-72] The Cost of Shuffling in Private Gradient Based Optimization

Link: https://arxiv.org/abs/2502.03652
Authors: Shuli Jiang, Pranay Sharma, Zhiwei Steven Wu, Gauri Joshi
Subjects: Machine Learning (cs.LG)
Comments: 54 pages, 6 figures

Abstract:We consider the problem of differentially private (DP) convex empirical risk minimization (ERM). While the standard DP-SGD algorithm is theoretically well-established, practical implementations often rely on shuffled gradient methods that traverse the training data sequentially rather than sampling with replacement in each iteration. Despite their widespread use, the theoretical privacy-accuracy trade-offs of private shuffled gradient methods (DP-ShuffleG) remain poorly understood, leading to a gap between theory and practice. In this work, we leverage privacy amplification by iteration (PABI) and a novel application of Stein’s lemma to provide the first empirical excess risk bound of DP-ShuffleG. Our result shows that data shuffling results in worse empirical excess risk for DP-ShuffleG compared to DP-SGD. To address this limitation, we propose Interleaved-ShuffleG, a hybrid approach that integrates public data samples in private optimization. By alternating optimization steps that use private and public samples, Interleaved-ShuffleG effectively reduces empirical excess risk. Our analysis introduces a new optimization framework with surrogate objectives, adaptive noise injection, and a dissimilarity metric, which can be of independent interest. Our experiments on diverse datasets and tasks demonstrate the superiority of Interleaved-ShuffleG over several baselines.

[LG-73] Efficient Optimal PAC Learning

Link: https://arxiv.org/abs/2502.03620
Authors: Mikael Møller Høgsgaard
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Recent advances in the binary classification setting by Hanneke [2016b] and Larsen [2023] have resulted in optimal PAC learners. These learners leverage, respectively, a clever deterministic subsampling scheme and the classic heuristic of bagging Breiman [1996]. Both optimal PAC learners use, as a subroutine, the natural algorithm of empirical risk minimization. Consequently, the computational cost of these optimal PAC learners is tied to that of the empirical risk minimizer algorithm. In this work, we seek to provide an alternative perspective on the computational cost imposed by the link to the empirical risk minimizer algorithm. To this end, we show the existence of an optimal PAC learner, which offers a different tradeoff in terms of the computational cost induced by the empirical risk minimizer.

[LG-74] Swarm Characteristic Classification using Robust Neural Networks with Optimized Controllable Inputs

Link: https://arxiv.org/abs/2502.03619
Authors: Donald W. Peltier III, Isaac Kaminer, Abram Clark, Marko Orescanin
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Having the ability to infer characteristics of autonomous agents would profoundly revolutionize defense, security, and civil applications. Our previous work was the first to demonstrate that supervised neural network time series classification (NN TSC) could rapidly predict the tactics of swarming autonomous agents in military contexts, providing intelligence to inform counter-maneuvers. However, most autonomous interactions, especially military engagements, are fraught with uncertainty, raising questions about the practicality of using a pretrained classifier. This article addresses that challenge by leveraging expected operational variations to construct a richer dataset, resulting in a more robust NN with improved inference performance in scenarios characterized by significant uncertainties. Specifically, diverse datasets are created by simulating variations in defender numbers, defender motions, and measurement noise levels. Key findings indicate that robust NNs trained on an enriched dataset exhibit enhanced classification accuracy and offer operational flexibility, such as reducing resources required and offering adherence to trajectory constraints. Furthermore, we present a new framework for optimally deploying a trained NN by the defenders. The framework involves optimizing defender trajectories that elicit adversary responses that maximize the probability of correct NN tactic classification while also satisfying operational constraints imposed on the defenders.

[LG-75] The Logical Implication Steering Method for Conditional Interventions on Transformer Generation

Link: https://arxiv.org/abs/2502.03618
Authors: Damjan Kalajdzievski
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:The field of mechanistic interpretability in pre-trained transformer models has demonstrated substantial evidence supporting the "linear representation hypothesis", which is the idea that high level concepts are encoded as vectors in the space of activations of a model. Studies also show that model generation behavior can be steered toward a given concept by adding the concept’s vector to the corresponding activations. We show how to leverage these properties to build a form of logical implication into models, enabling transparent and interpretable adjustments that induce a chosen generation behavior in response to the presence of any given concept. Our method, Logical Implication Model Steering (LIMS), unlocks new hand engineered reasoning capabilities by integrating neuro-symbolic logic into pre-trained transformer models.
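
The "if concept P is present, steer toward behavior Q" idea described in this abstract can be illustrated with a toy sketch on plain vectors. This is only a reading of the abstract, not the LIMS implementation; the function `conditional_steer`, the probe and steering vectors, and the threshold are all hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def conditional_steer(activation, probe, steer, threshold=0.0, alpha=1.0):
    """Add alpha * steer to the activation only when the probe detects the concept.

    activation: hidden activation vector (hypothetical toy values here)
    probe:      direction whose projection signals that concept P is present
    steer:      direction associated with the desired behavior Q
    """
    if dot(activation, probe) > threshold:
        # Concept detected: shift the activation toward the target behavior.
        return [a + alpha * s for a, s in zip(activation, steer)]
    # Concept absent: leave the activation unchanged.
    return list(activation)

probe = [1.0, 0.0]
steer = [0.0, 2.0]
with_concept = conditional_steer([1.0, 0.0], probe, steer, threshold=0.5)
without_concept = conditional_steer([0.0, 1.0], probe, steer, threshold=0.5)
```

Under the linear representation hypothesis, both the detection step (a projection) and the intervention step (a vector addition) are linear operations on activations, which is what keeps this kind of adjustment easy to inspect.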

[LG-76] Bilevel ZOFO: Bridging Parameter-Efficient and Zeroth-Order Techniques for Efficient LLM Fine-Tuning and Meta-Training

Link: https://arxiv.org/abs/2502.03604
Authors: Reza Shirkavand, Qi He, Peiran Yu, Heng Huang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Fine-tuning pre-trained Large Language Models (LLMs) for downstream tasks using First-Order (FO) optimizers presents significant computational challenges. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed to address these challenges by freezing most model parameters and training only a small subset. While PEFT is efficient, it may not outperform full fine-tuning when high task-specific performance is required. Zeroth-Order (ZO) methods offer an alternative for fine-tuning the entire pre-trained model by approximating gradients using only the forward pass, thus eliminating the computational burden of back-propagation in first-order methods. However, when implementing ZO methods, a hard prompt is crucial, and relying on simple, fixed hard prompts may not be optimal. In this paper, we propose a bilevel optimization framework that complements ZO methods with PEFT to mitigate sensitivity to hard prompts while efficiently and effectively fine-tuning LLMs. Our Bilevel ZOFO (Zeroth-Order-First-Order) method employs a double-loop optimization strategy, where only the gradient of the PEFT model and the forward pass of the base model are required. We provide convergence guarantees for Bilevel ZOFO. Empirically, we demonstrate that Bilevel ZOFO outperforms both PEFT and ZO methods in single-task settings while maintaining similar memory efficiency. Additionally, we show its strong potential for multitask learning. Compared to current first-order meta-training algorithms for multitask learning, our method has significantly lower computational demands while maintaining or improving performance.
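
The phrase "approximating gradients using only the forward pass" in this abstract can be made concrete with a standard two-point (SPSA-style) zeroth-order estimator. The sketch below is not the Bilevel ZOFO algorithm; the function `zo_gradient`, the toy quadratic loss, and the step sizes are assumptions chosen for illustration.

```python
import random

def zo_gradient(loss, params, mu=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate using only loss evaluations.

    Perturbs the parameters along a random Gaussian direction z and uses
    (loss(w + mu z) - loss(w - mu z)) / (2 mu) as a directional derivative.
    """
    rng = rng or random.Random(0)
    z = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + mu * zi for p, zi in zip(params, z)]
    minus = [p - mu * zi for p, zi in zip(params, z)]
    scale = (loss(plus) - loss(minus)) / (2.0 * mu)
    # The estimate points along z, scaled by the directional derivative.
    return [scale * zi for zi in z]

def quadratic(w):
    # Toy loss standing in for a model's forward pass.
    return sum(x * x for x in w)

rng = random.Random(0)
w = [1.0, -2.0, 0.5]
initial = quadratic(w)
for _ in range(500):
    g = zo_gradient(quadratic, w, rng=rng)
    w = [wi - 0.02 * gi for wi, gi in zip(w, g)]
final = quadratic(w)
```

Each step needs two loss evaluations and no back-propagation, so memory stays close to inference cost. The price is a noisy gradient estimate whose variance grows with the number of parameters.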

[LG-77] HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference

Link: https://arxiv.org/abs/2502.03589
Authors: Zeyu Zhang, Haiying Shen, Shay Vargaftik, Ran Ben Basat, Michael Mitzenmacher, Minlan Yu
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Disaggregated Large Language Model (LLM) inference has gained popularity as it separates the computation-intensive prefill stage from the memory-intensive decode stage, avoiding the prefill-decode interference and improving resource utilization. However, transmitting Key-Value (KV) data between the two stages can be a bottleneck, especially for long prompts. Additionally, the computation time overhead for prefill and decode is key for optimizing Job Completion Time (JCT), and KV data size can become prohibitive for long prompts and sequences. Existing KV quantization methods can alleviate the transmission bottleneck and reduce memory requirements, but they introduce significant dequantization overhead, exacerbating the computation time. We propose Homomorphic Acceleration via Compression of the KV cache (HACK) for disaggregated LLM inference. HACK eliminates the heavy KV dequantization step, and directly performs computations on quantized KV data to approximate and reduce the cost of the expensive matrix-multiplication step. Extensive trace-driven experiments show that HACK reduces JCT by up to 70.9% compared to disaggregated LLM inference baseline and by up to 52.3% compared to state-of-the-art KV quantization methods.

[LG-78] Stein Discrepancy for Unsupervised Domain Adaptation

Link: https://arxiv.org/abs/2502.03587
Authors: Anneke von Seeger, Dongmian Zou, Gilad Lerman
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 24 pages, 9 figures

Abstract:Unsupervised domain adaptation (UDA) leverages information from a labeled source dataset to improve accuracy on a related but unlabeled target dataset. A common approach to UDA is aligning representations from the source and target domains by minimizing the distance between their data distributions. Previous methods have employed distances such as Wasserstein distance and maximum mean discrepancy. However, these approaches are less effective when the target data is significantly scarcer than the source data. Stein discrepancy is an asymmetric distance between distributions that relies on one distribution only through its score function. In this paper, we propose a novel UDA method that uses Stein discrepancy to measure the distance between source and target domains. We develop a learning framework using both non-kernelized and kernelized Stein discrepancy. Theoretically, we derive an upper bound for the generalization error. Numerical experiments show that our method outperforms existing methods using other domain discrepancy measures when only small amounts of target data are available.

[LG-79] Clone-Resistant Weights in Metric Spaces: A Framework for Handling Redundancy Bias

Link: https://arxiv.org/abs/2502.03576
Authors: Damien Berriaud, Roger Wattenhofer
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: v1

Abstract:We are given a set of elements in a metric space. The distribution of the elements is arbitrary, possibly adversarial. Can we weigh the elements in a way that is resistant to such (adversarial) manipulations? This problem arises in various contexts. For instance, the elements could represent data points, requiring robust domain adaptation. Alternatively, they might represent tasks to be aggregated into a benchmark; or questions about personal political opinions in voting advice applications. This article introduces a theoretical framework for dealing with such problems. We propose clone-proof representation functions as a solution concept. These functions distribute importance across elements of a set such that similar objects ("clones") share (some of) their weights, thus avoiding a potential bias introduced by their multiplicity. Our framework extends the maximum uncertainty principle to accommodate general metric spaces and includes a set of axioms (symmetry, continuity, and clone-proofness) that guide the construction of representation functions. Finally, we address the existence of representation functions satisfying our axioms in the significant case of Euclidean spaces and propose a general method for their construction.

[LG-80] Controllable Sequence Editing for Counterfactual Generation

Link: https://arxiv.org/abs/2502.03569
Authors: Michelle M. Li, Kevin Li, Yasha Ektefaie, Shvat Messica, Marinka Zitnik
Categories: Machine Learning (cs.LG); Genomics (q-bio.GN); Populations and Evolution (q-bio.PE)

Abstract:Sequence models generate counterfactuals by modifying parts of a sequence based on a given condition, enabling reasoning about “what if” scenarios. While these models excel at conditional generation, they lack fine-grained control over when and where edits occur. Existing approaches either focus on univariate sequences or assume that interventions affect the entire sequence globally. However, many applications require precise, localized modifications, where interventions take effect only after a specified time and impact only a subset of co-occurring variables. We introduce CLEF, a controllable sequence editing model for counterfactual reasoning about both immediate and delayed effects. CLEF learns temporal concepts that encode how and when interventions should influence a sequence. With these concepts, CLEF selectively edits relevant time steps while preserving unaffected portions of the sequence. We evaluate CLEF on cellular and patient trajectory datasets, where gene regulation affects only certain genes at specific time steps, or medical interventions alter only a subset of lab measurements. CLEF improves immediate sequence editing by up to 36.01% in MAE compared to baselines. Unlike prior methods, CLEF enables one-step generation of counterfactual sequences at any future time step, outperforming baselines by up to 65.71% in MAE. A case study on patients with type 1 diabetes mellitus shows that CLEF identifies clinical interventions that shift patient trajectories toward healthier outcomes.

[LG-81] TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint

Link: https://arxiv.org/abs/2502.03550
Authors: Haotian Lin, Pengcheng Wang, Jeff Schneider, Guanya Shi
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Model-based reinforcement learning algorithms that combine model-based planning and learned value/policy prior have gained significant recognition for their high data efficiency and superior performance in continuous control. However, we discover that existing methods that rely on standard SAC-style policy iteration for value learning, directly using data generated by the planner, often result in persistent value overestimation. Through theoretical analysis and experiments, we argue that this issue is deeply rooted in the structural policy mismatch between the data generation policy that is always bootstrapped by the planner and the learned policy prior. To mitigate such a mismatch in a minimalist way, we propose a policy regularization term reducing out-of-distribution (OOD) queries, thereby improving value learning. Our method involves minimum changes on top of existing frameworks and requires no additional computation. Extensive experiments demonstrate that the proposed approach improves performance over baselines such as TD-MPC2 by large margins, particularly in 61-DoF humanoid tasks. View qualitative results at this https URL.

[LG-82] Optimistic epsilon-Greedy Exploration for Cooperative Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2502.03506
Authors: Ruoning Zhang, Siying Wang, Wenyu Chen, Yang Zhou, Zhitong Zhao, Zixuan Zhang, Ruijie Zhang
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)

Abstract:The Centralized Training with Decentralized Execution (CTDE) paradigm is widely used in cooperative multi-agent reinforcement learning. However, due to the representational limitations of traditional monotonic value decomposition methods, algorithms can underestimate optimal actions, leading policies to suboptimal solutions. To address this challenge, we propose Optimistic $\epsilon$-Greedy Exploration, focusing on enhancing exploration to correct value estimations. The underestimation arises from insufficient sampling of optimal actions during exploration, as our analysis indicated. We introduce an optimistic updating network to identify optimal actions and sample actions from its distribution with a probability of $\epsilon$ during exploration, increasing the selection frequency of optimal actions. Experimental results in various environments reveal that the Optimistic $\epsilon$-Greedy Exploration effectively prevents the algorithm from suboptimal solutions and significantly improves its performance compared to other algorithms.

[LG-83] PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design

Link: https://arxiv.org/abs/2502.03159
Authors: Yuchao Wu, Xiaofei Yu, Hao Chen, Yang Luo, Yeyu Tong, Yuzhe Ma
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)

Abstract:While large language models (LLMs) have shown remarkable potential in automating various tasks in digital chip design, the field of Photonic Integrated Circuits (PICs), a promising solution to advanced chip designs, remains relatively unexplored in this context. The design of PICs is time-consuming and prone to errors due to the extensive and repetitive nature of code involved in photonic chip design. In this paper, we introduce PICBench, the first benchmarking and evaluation framework specifically designed to automate PIC design generation using LLMs, where the generated output takes the form of a netlist. Our benchmark consists of dozens of meticulously crafted PIC design problems, spanning from fundamental device designs to more complex circuit-level designs. It automatically evaluates both the syntax and functionality of generated PIC designs by comparing simulation outputs with expert-written solutions, leveraging an open-source simulator. We evaluate a range of existing LLMs, while also conducting comparative tests on various prompt engineering techniques to enhance LLM performance in automated PIC design. The results reveal the challenges and potential of LLMs in the PIC design domain, offering insights into the key areas that require further research and development to optimize automation in this field. Our benchmark and evaluation code is available at this https URL.

[LG-84] Leveraging Reviewer Experience in Code Review Comment Generation

Link: https://arxiv.org/abs/2409.10959
Authors: Hong Yi Lin, Patanamon Thongtanunam, Christoph Treude, Michael W. Godfrey, Chunhua Liu, Wachiraphan Charoenwet
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)

Abstract:Modern code review is a ubiquitous software quality assurance process aimed at identifying potential issues within newly written code. Despite its effectiveness, the process demands large amounts of effort from the human reviewers involved. To help alleviate this workload, researchers have trained deep learning models to imitate human reviewers in providing natural language code reviews. Formally, this task is known as code review comment generation. Prior work has demonstrated improvements in this task by leveraging machine learning techniques and neural models, such as transfer learning and the transformer architecture. However, the quality of the model-generated reviews remains sub-optimal due to the quality of the open-source code review data used in model training. This is in part due to the data obtained from open-source projects where code reviews are conducted in a public forum, and reviewers possess varying levels of software development experience, potentially affecting the quality of their feedback. To account for this variation, we propose a suite of experience-aware training methods that utilise the reviewers’ past authoring and reviewing experiences as signals for review quality. Specifically, we propose experience-aware loss functions (ELF), which use the reviewers’ authoring and reviewing ownership of a project as weights in the model’s loss function. Through this method, experienced reviewers’ code reviews yield larger influence over the model’s behaviour. Compared to the SOTA model, ELF was able to generate higher quality reviews in terms of accuracy, informativeness, and comment types generated. The key contribution of this work is the demonstration of how traditional software engineering concepts such as reviewer experience can be integrated into the design of AI-based automated code review models.

[LG-85] Prediction-Powered E-Values

Link: https://arxiv.org/abs/2502.04294
Authors: Daniel Csillag, Claudio José Struchiner, Guilherme Tegoni Goedert
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Quality statistical inference requires a sufficient amount of data, which can be missing or hard to obtain. To this end, prediction-powered inference has risen as a promising methodology, but existing approaches are largely limited to Z-estimation problems such as inference of means and quantiles. In this paper, we apply ideas of prediction-powered inference to e-values. By doing so, we inherit all the usual benefits of e-values – such as anytime-validity, post-hoc validity and versatile sequential inference – as well as greatly expand the set of inferences achievable in a prediction-powered manner. In particular, we show that every inference procedure that can be framed in terms of e-values has a prediction-powered counterpart, given by our method. We showcase the effectiveness of our framework across a wide range of inference tasks, from simple hypothesis testing and confidence intervals to more involved procedures for change-point detection and causal discovery, which were out of reach of previous techniques. Our approach is modular and easily integrable into existing algorithms, making it a compelling choice for practical applications.

[LG-86] Retro-Rank-In: A Ranking-Based Approach for Inorganic Materials Synthesis Planning

Link: https://arxiv.org/abs/2502.04289
Authors: Thorben Prein, Elton Pan, Sami Haddouti, Marco Lorenz, Janik Jehkul, Tymoteusz Wilk, Cansu Moran, Menelaos Panagiotis Fotiadis, Artur P. Toshev, Elsa Olivetti, Jennifer L.M. Rupp
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)

Abstract:Retrosynthesis strategically plans the synthesis of a chemical target compound from simpler, readily available precursor compounds. This process is critical for synthesizing novel inorganic materials, yet traditional methods in inorganic chemistry continue to rely on trial-and-error experimentation. Emerging machine-learning approaches struggle to generalize to entirely new reactions due to their reliance on known precursors, as they frame retrosynthesis as a multi-label classification task. To address these limitations, we propose Retro-Rank-In, a novel framework that reformulates the retrosynthesis problem by embedding target and precursor materials into a shared latent space and learning a pairwise ranker on a bipartite graph of inorganic compounds. We evaluate Retro-Rank-In’s generalizability on challenging retrosynthesis dataset splits designed to mitigate data duplicates and overlaps. For instance, for Cr2AlB2, it correctly predicts the verified precursor pair CrB + Al despite never seeing them in training, a capability absent in prior work. Extensive experiments show that Retro-Rank-In sets a new state-of-the-art, particularly in out-of-distribution generalization and candidate set ranking, offering a powerful tool for accelerating inorganic material synthesis.

[LG-87] Gaussian Process Regression for Inverse Problems in Linear PDEs

Link: https://arxiv.org/abs/2502.04276
Authors: Xin Li, Markus Lange-Hegermann, Bogdan Raiţă
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Commutative Algebra (math.AC)

Abstract:This paper introduces a computationally efficient algorithm in system theory for solving inverse problems governed by linear partial differential equations (PDEs). We model solutions of linear PDEs using Gaussian processes with priors defined based on advanced commutative algebra and algebraic analysis. The implementation of these priors is algorithmic and achieved using the Macaulay2 computer algebra software. An example application includes identifying the wave speed from noisy data for classical wave equations, which are widely used in physics. The method achieves high accuracy while enhancing computational efficiency.

[LG-88] Variational decision diagrams for quantum-inspired machine learning applications

Link: https://arxiv.org/abs/2502.04271
Authors: Santiago Acevedo-Mancera, Vladimir Vargas-Calderón, Herbert Vinck-Posada
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, presented at Quantum Information in Spain (ICE-9)

Abstract:Decision diagrams (DDs) have emerged as an efficient tool for simulating quantum circuits due to their capacity to exploit data redundancies in quantum states and quantum operations, enabling the efficient computation of probability amplitudes. However, their application in quantum machine learning (QML) has remained unexplored. This paper introduces variational decision diagrams (VDDs), a novel graph structure that combines the structural benefits of DDs with the adaptability of variational methods for efficiently representing quantum states. We investigate the trainability of VDDs by applying them to the ground state estimation problem for transverse-field Ising and Heisenberg Hamiltonians. Analysis of gradient variance suggests that training VDDs is possible, as no signs of vanishing gradients (also known as barren plateaus) are observed. This work provides new insights into the use of decision diagrams in QML as an alternative to design and train variational ansätze.

[LG-89] Student-t processes as infinite-width limits of posterior Bayesian neural networks

Link: https://arxiv.org/abs/2502.04247
Authors: Francesco Caporali, Stefano Favaro, Dario Trevisan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)

Abstract:The asymptotic properties of Bayesian Neural Networks (BNNs) have been extensively studied, particularly regarding their approximations by Gaussian processes in the infinite-width limit. We extend these results by showing that posterior BNNs can be approximated by Student-t processes, which offer greater flexibility in modeling uncertainty. Specifically, we show that, if the parameters of a BNN follow a Gaussian prior distribution, and the variance of both the last hidden layer and the Gaussian likelihood function follows an Inverse-Gamma prior distribution, then the resulting posterior BNN converges to a Student-t process in the infinite-width limit. Our proof leverages the Wasserstein metric to establish control over the convergence rate of the Student-t process approximation.

[LG-90] Multi-task Online Learning for Probabilistic Load Forecasting

Link: https://arxiv.org/abs/2502.04163
Authors: Onintze Zaballa, Verónica Álvarez, Santiago Mazuelas
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 2024 IEEE Sustainable Power and Energy Conference

Abstract:Load forecasting is essential for the efficient, reliable, and cost-effective management of power systems. Load forecasting performance can be improved by learning the similarities among multiple entities (e.g., regions, buildings). Techniques based on multi-task learning obtain predictions by leveraging consumption patterns from the historical load demand of multiple entities and their relationships. However, existing techniques cannot effectively assess inherent uncertainties in load demand or account for dynamic changes in consumption patterns. This paper proposes a multi-task learning technique for online and probabilistic load forecasting. This technique provides accurate probabilistic predictions for the loads of multiple entities by leveraging their dynamic similarities. The method’s performance is evaluated using datasets that register the load demand of multiple entities and contain diverse and dynamic consumption patterns. The experimental results show that the proposed method can significantly enhance the effectiveness of current multi-task learning approaches across a wide variety of load consumption scenarios.

[LG-91] A Pseudo Markov-Chain Model and Time-Elapsed Measures of Mobility from Collective Data

Link: https://arxiv.org/abs/2502.04162
Authors: Alisha Foster, David A. Meyer, Asif Shakeel
Categories: Applications (stat.AP); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Comments: 27 pages, 11 figures

Abstract:In this paper we develop a pseudo Markov-chain model to understand time-elapsed flows, over multiple intervals, from time and space aggregated collective inter-location trip data, given as a time-series. Building on the model, we develop measures of mobility that parallel those known for individual mobility data, such as the radius of gyration. We apply these measures to the NetMob 2024 Data Challenge data, and obtain interesting results that are consistent with published statistics and commuting patterns in cities. Besides building a new framework, we foresee applications of this approach to an improved understanding of human mobility in the context of environmental changes and sustainable development.
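
For reference, the radius of gyration that the abstract cites as a known individual-mobility measure is the root-mean-square distance of a person's visited locations from their centroid. The sketch below computes that standard individual-level quantity under the simplifying assumption of planar (x, y) coordinates; it does not reproduce the paper's analogue for collective data, and `radius_of_gyration` is a hypothetical helper name.

```python
import math


def radius_of_gyration(points):
    """Root-mean-square distance of visited locations from their centroid.

    `points` is a list of (x, y) pairs, one per recorded visit, assumed to be
    in a planar coordinate system (e.g. projected metres).
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    return math.sqrt(mean_sq)


visits = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(radius_of_gyration(visits))
```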

[LG-92] Blackwell's Approachability with Approximation Algorithms

Link: https://arxiv.org/abs/2502.03919
Authors: Dan Garber, Mhna Massalha
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:We revisit Blackwell’s celebrated approachability problem which considers a repeated vector-valued game between a player and an adversary. Motivated by settings in which the action set of the player or adversary (or both) is difficult to optimize over, for instance when it corresponds to the set of all possible solutions to some NP-Hard optimization problem, we ask what can the player guarantee efficiently, when only having access to these sets via approximation algorithms with ratios $\alpha_{\mathcal{X}} \geq 1$ and $1 \geq \alpha_{\mathcal{Y}} > 0$, respectively. Assuming the player has monotone preferences, in the sense that he does not prefer a vector-valued loss $\ell_1$ over $\ell_2$ if $\ell_2 \leq \ell_1$, we establish that given a Blackwell instance with an approachable target set $S$, the downward closure of the appropriately-scaled set $\alpha_{\mathcal{X}} \alpha_{\mathcal{Y}}^{-1} S$ is efficiently approachable with optimal rate. In case only the player’s or adversary’s set is equipped with an approximation algorithm, we give simpler and more efficient algorithms.

[LG-93] Guiding Two-Layer Neural Network Lipschitzness via Gradient Descent Learning Rate Constraints

Link: https://arxiv.org/abs/2502.03792
Authors: Kyle Sung, Anastasis Kratsios, Noah Forman
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 26 pages, 8 figures

Abstract:We demonstrate that applying an eventual decay to the learning rate (LR) in empirical risk minimization (ERM), where the mean-squared-error loss is minimized using standard gradient descent (GD) for training a two-layer neural network with Lipschitz activation functions, ensures that the resulting network exhibits a high degree of Lipschitz regularity, that is, a small Lipschitz constant. Moreover, we show that this decay does not hinder the convergence rate of the empirical risk, now measured with the Huber loss, toward a critical point of the non-convex empirical risk. From these findings, we derive generalization bounds for two-layer neural networks trained with GD and a decaying LR with a sub-linear dependence on its number of trainable parameters, suggesting that the statistical behaviour of these networks is independent of overparameterization. We validate our theoretical results with a series of toy numerical experiments, where surprisingly, we observe that networks trained with constant step size GD exhibit similar learning and regularity properties to those trained with a decaying LR. This suggests that neural networks trained with standard GD may already be highly regular learners.

[LG-94] First-ish Order Methods: Hessian-aware Scalings of Gradient Descent

Link: https://arxiv.org/abs/2502.03701
Authors: Oscar Smee, Fred Roosta, Stephen J. Wright
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Gradient descent is the primary workhorse for optimizing large-scale problems in machine learning. However, its performance is highly sensitive to the choice of the learning rate. A key limitation of gradient descent is its lack of natural scaling, which often necessitates expensive line searches or heuristic tuning to determine an appropriate step size. In this paper, we address this limitation by incorporating Hessian information to scale the gradient direction. By accounting for the curvature of the function along the gradient, our adaptive, Hessian-aware scaling method ensures a local unit step size guarantee, even in nonconvex settings. Near a local minimum that satisfies the second-order sufficient conditions, our approach achieves linear convergence with a unit step size. We show that our method converges globally under a significantly weaker version of the standard Lipschitz gradient smoothness assumption. Even when Hessian information is inexact, the local unit step size guarantee and global convergence properties remain valid under mild conditions. Finally, we validate our theoretical results empirically on a range of convex and nonconvex machine learning tasks, showcasing the effectiveness of the approach.
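
The idea of scaling the gradient direction by the curvature along it can be illustrated with the classical choice alpha = (g·g) / (g·Hg), which is the exact minimizing step along -g for a convex quadratic. This is a minimal sketch under that assumption, not the authors' method: their specific scalings and their handling of nonconvex or inexact-Hessian cases are not reproduced, and all function names here are hypothetical.

```python
def matvec(A, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def curvature_scaled_gd(A, b, x, steps=50):
    """Minimize 0.5 * x^T A x - b^T x with a curvature-scaled gradient step.

    The gradient is g = A x - b and the Hessian is A, so the step length
    alpha = (g.g) / (g.A g) needs no learning-rate tuning.
    """
    for _ in range(steps):
        g = [ax - bi for ax, bi in zip(matvec(A, x), b)]
        gHg = dot(g, matvec(A, g))
        if gHg <= 0.0:  # zero gradient or non-positive curvature: stop
            break
        alpha = dot(g, g) / gHg
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x


A = [[3.0, 1.0], [1.0, 2.0]]   # symmetric positive definite
b = [1.0, 1.0]
print(curvature_scaled_gd(A, b, [0.0, 0.0]))  # approaches the solution of A x = b
```

The early exit on non-positive curvature is a placeholder; it marks exactly the situation where a method for nonconvex problems needs a more careful rule.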

[LG-95] Physically consistent predictive reduced-order modeling by enhancing Operator Inference with state constraints

Link: https://arxiv.org/abs/2502.03672
Authors: Hyeonghun Kim, Boris Kramer
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 27 pages, 10 figures

Abstract:Numerical simulations of complex multiphysics systems, such as char combustion considered herein, yield numerous state variables that inherently exhibit physical constraints. This paper presents a new approach to augment Operator Inference – a methodology within scientific machine learning that enables learning from data a low-dimensional representation of a high-dimensional system governed by nonlinear partial differential equations – by embedding such state constraints in the reduced-order model predictions. In the model learning process, we propose a new way to choose regularization hyperparameters based on a key performance indicator. Since embedding state constraints improves the stability of the Operator Inference reduced-order model, we compare the proposed state constraints-embedded Operator Inference with the standard Operator Inference and other stability-enhancing approaches. For an application to char combustion, we demonstrate that the proposed approach yields state predictions superior to the other methods regarding stability and accuracy. It extrapolates over 200% past the training regime while being computationally efficient and physically consistent.

[LG-96] Rule-based Evolving Fuzzy System for Time Series Forecasting: New Perspectives Based on Type-2 Fuzzy Sets Measures Approach

Link: https://arxiv.org/abs/2502.03650
Authors: Eduardo Santos de Oliveira Marques, Arthur Caio Vargas Pinto, Kaike Sa Teles Rocha Alves, Eduardo Pestana de Aguiar
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Real-world data contain uncertainty and variations that can be correlated to external variables, known as randomness. An alternative cause of randomness is chaos, which can be an important component of chaotic time series. One of the existing methods to deal with this type of data is the use of evolving Fuzzy Systems (eFSs), which have been proven to be a powerful class of models for time series forecasting, due to their autonomy to handle the data and highly complex problems in real-world applications. However, due to its working structure, type-2 fuzzy sets can outperform type-1 fuzzy sets for highly uncertain scenarios. We then propose ePL-KRLS-FSM+, an enhanced class of evolving fuzzy modeling approach that combines participatory learning (PL), a kernel recursive least squares method (KRLS), type-2 fuzzy logic and data transformation into fuzzy sets (FSs). This improvement makes it possible to create and measure type-2 fuzzy sets for better handling uncertainties in the data, generating a model that can predict chaotic data with increased accuracy. The model is evaluated using two complex datasets: the chaotic time series Mackey-Glass delay differential equation with different degrees of chaos, and the Taiwan Capitalization Weighted Stock Index (TAIEX). Model performance is compared to related state-of-the-art rule-based eFS models and classical approaches and is analyzed in terms of error metrics, runtime and the number of final rules. Forecasting results show that the proposed model is competitive and performs consistently compared with type-1 models, also outperforming other forecasting methods by showing the lowest error metrics and number of final rules.

[LG-97] SymmCD: Symmetry-Preserving Crystal Generation with Diffusion Models

Link: https://arxiv.org/abs/2502.03638
Authors: Daniel Levy, Siba Smarak Panigrahi, Sékou-Oumar Kaba, Qiang Zhu, Kin Long Kelvin Lee, Mikhail Galkin, Santiago Miret, Siamak Ravanbakhsh
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:Generating novel crystalline materials has potential to lead to advancements in fields such as electronics, energy storage, and catalysis. The defining characteristic of crystals is their symmetry, which plays a central role in determining their physical properties. However, existing crystal generation methods either fail to generate materials that display the symmetries of real-world crystals, or simply replicate the symmetry information from examples in a database. To address this limitation, we propose SymmCD, a novel diffusion-based generative model that explicitly incorporates crystallographic symmetry into the generative process. We decompose crystals into two components and learn their joint distribution through diffusion: 1) the asymmetric unit, the smallest subset of the crystal which can generate the whole crystal through symmetry transformations, and; 2) the symmetry transformations needed to be applied to each atom in the asymmetric unit. We also use a novel and interpretable representation for these transformations, enabling generalization across different crystallographic symmetry groups. We showcase the competitive performance of SymmCD on a subset of the Materials Project, obtaining diverse and valid crystals with realistic symmetries and predicted properties.

[LG-98] Multivariate Conformal Prediction using Optimal Transport

Link: https://arxiv.org/abs/2502.03609
Authors: Michal Klein, Louis Bethune, Eugene Ndiaye, Marco Cuturi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Conformal prediction (CP) quantifies the uncertainty of machine learning models by constructing sets of plausible outputs. These sets are constructed by leveraging a so-called conformity score, a quantity computed using the input point of interest, a prediction model, and past observations. CP sets are then obtained by evaluating the conformity score of all possible outputs, and selecting them according to the rank of their scores. Due to this ranking step, most CP approaches rely on score functions that are univariate. The challenge in extending these scores to multivariate spaces lies in the fact that no canonical order for vectors exists. To address this, we leverage a natural extension of multivariate score ranking based on optimal transport (OT). Our method, OTCP, offers a principled framework for constructing conformal prediction sets in multidimensional settings, preserving distribution-free coverage guarantees with finite data samples. We demonstrate tangible gains in a benchmark dataset of multivariate regression problems and address computational and statistical trade-offs that arise when estimating conformity scores through OT maps.
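
The ranking step described in the abstract is easiest to see in the standard univariate case. The sketch below is plain split conformal prediction with an absolute-residual score, assuming a held-out calibration set; it is not OTCP, and the optimal-transport ranking for vector-valued scores is not reproduced. `split_conformal_interval` is a hypothetical helper name.

```python
import math


def split_conformal_interval(cal_preds, cal_targets, new_pred, alpha=0.1):
    """Prediction interval from split conformal prediction (univariate score).

    The conformity score is the absolute residual |y - prediction| on a
    held-out calibration set. The interval radius is the score of rank
    ceil((n + 1) * (1 - alpha)) among the n sorted calibration scores.
    """
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this coverage level.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (new_pred - q, new_pred + q)


cal_preds = [0.0] * 9
cal_targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(split_conformal_interval(cal_preds, cal_targets, new_pred=2.0, alpha=0.5))
```

Sorting scalar scores is what makes the rank well defined here; with vector-valued scores there is no canonical order, which is the gap the abstract says OT-based ranking is meant to fill.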

[LG-99] Online Learning Algorithms in Hilbert Spaces with beta- and phi-Mixing Sequences

Link: https://arxiv.org/abs/2502.03551
Authors: Priyanka Roy,Susanne Saminger-Platz
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA)
Comments:

Abstract:In this paper, we study an online algorithm in a reproducing kernel Hilbert space (RKHS) based on a class of dependent processes, called mixing processes. For such a process, the degree of dependence is measured by various mixing coefficients. As a representative example, we analyze a strictly stationary Markov chain, where the dependence structure is characterized by the $\beta$- and $\phi$-mixing coefficients. For these dependent samples, we derive nearly optimal convergence rates. Our findings extend existing error bounds for i.i.d. observations, demonstrating that the i.i.d. case is a special instance of our framework. Moreover, we explicitly account for an additional factor introduced by the dependence structure in the Markov chain.

[LG-100] Proxy Prompt: Endowing SAM and SAM 2 with Auto-Interactive-Prompt for Medical Segmentation

Link: https://arxiv.org/abs/2502.03501
Authors: Wang Xinyi,Kang Hongyu,Wei Peishan,Shuai Li,Yu Sun,Sai Kit Lam,Yongping Zheng
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we aim to address the unmet demand for automated prompting and enhanced human-model interactions of SAM and SAM2 for the sake of promoting their widespread clinical adoption. Specifically, we propose Proxy Prompt (PP), auto-generated by leveraging non-target data with a pre-annotated mask. We devise a novel 3-step context-selection strategy for adaptively selecting the most representative contextual information from non-target data via vision mamba and selective maps, empowering the guiding capability of non-target image-mask pairs for segmentation on target image/video data. To reinforce human-model interactions in PP, we further propose a contextual colorization module via a dual-reverse cross-attention to enhance interactions between target features and contextual-embedding with amplifying distinctive features of user-defined object(s). Via extensive evaluations, our method achieves state-of-the-art performance on four public datasets and yields comparable results with fully-trained models, even when trained with only 16 image masks.

[LG-101] Dementia Classification Using Acoustic Speech and Feature Selection

Link: https://arxiv.org/abs/2502.03484
Authors: Marko Niemelä,Mikaela von Bonsdorff,Sami Äyrämö,Tommi Kärkkäinen
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Abstract:Dementia is a general term for a group of syndromes that affect cognitive functions such as memory, thinking, reasoning, and the ability to perform daily tasks. The number of dementia patients is increasing as the population ages, and it is estimated that over 10 million people develop dementia each year. Dementia progresses gradually, and the sooner a patient receives help and support, the better their chances of maintaining their functional abilities. For this reason, early diagnosis of dementia is important. In recent years, machine learning models based on naturally spoken language have been developed for the early diagnosis of dementia. These methods have proven to be user-friendly, cost-effective, scalable, and capable of providing extremely fast diagnoses. This study utilizes the well-known ADReSS challenge dataset for classifying healthy controls and Alzheimer’s patients. The dataset contains speech recordings from a picture description task featuring a kitchen scene, collected from both healthy controls and dementia patients. Unlike most studies, this research does not segment the audio recordings into active speech segments; instead, acoustic features are extracted from entire recordings. The study employs three machine learning models, Ridge linear regression, Extreme Minimal Learning Machine (EMLM), and Linear Support Vector Machine, to compute feature importance scores based on model outputs. The Ridge model performed best in Leave-One-Subject-Out cross-validation, achieving a classification accuracy of 87.8%. The EMLM model proved to be effective in both cross-validation and the classification of a separate test dataset, with accuracies of 85.3% and 79.2%, respectively. The study’s results rank among the top compared to other studies using the same dataset and acoustic feature extraction for dementia diagnosis.

[LG-102] Foundation for unbiased cross-validation of spatio-temporal models for species distribution modeling

Link: https://arxiv.org/abs/2502.03480
Authors: Diana Koldasbayeva,Alexey Zaytsev
Categories: Applications (stat.AP); Machine Learning (cs.LG)
Comments:

Abstract:Species Distribution Models (SDMs) often suffer from spatial autocorrelation (SAC), leading to biased performance estimates. We tested cross-validation (CV) strategies - random splits, spatial blocking with varied distances, environmental (ENV) clustering, and a novel spatio-temporal method - under two proposed training schemes: LAST FOLD, widely used in spatial CV at the cost of data loss, and RETRAIN, which maximizes data usage but risks reintroducing SAC. LAST FOLD consistently yielded lower errors and stronger correlations. Spatial blocking at an optimal distance (SP 422) and ENV performed best, achieving Spearman and Pearson correlations of 0.485 and 0.548, respectively, although ENV may be unsuitable for long-term forecasts involving major environmental shifts. A spatio-temporal approach yielded modest benefits in our moderately variable dataset, but may excel with stronger temporal changes. These findings highlight the need to align CV approaches with the spatial and temporal structure of SDM data, ensuring rigorous validation and reliable predictive outcomes.
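
The idea of spatial blocking mentioned in this abstract can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes planar coordinates, a square grid of side `block_size`, and a round-robin assignment of blocks to folds, and the function name `spatial_block_folds` is hypothetical.

```python
import math


def spatial_block_folds(coords, block_size, n_folds):
    """Assign sample indices to CV folds so that each grid block stays whole.

    coords: list of (x, y) positions in the same units as block_size.
    Points in the same grid cell always land in the same fold, so nearby
    (spatially autocorrelated) samples are not split between training
    and validation.
    """
    folds = [[] for _ in range(n_folds)]
    block_to_fold = {}
    for idx, (x, y) in enumerate(coords):
        block = (math.floor(x / block_size), math.floor(y / block_size))
        if block not in block_to_fold:
            # Round-robin over blocks in order of first appearance.
            block_to_fold[block] = len(block_to_fold) % n_folds
        folds[block_to_fold[block]].append(idx)
    return folds


# Toy coordinates (hypothetical): two tight clusters and one isolated point.
coords = [(0.1, 0.1), (0.2, 0.3), (5.5, 0.1), (5.9, 0.4), (0.2, 5.1)]
folds = spatial_block_folds(coords, block_size=1.0, n_folds=2)
print(folds)
```

A larger `block_size` keeps more distant points together, which is the kind of distance parameter the abstract reports varying. A random split would correspond to ignoring the coordinates altogether.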

Information Retrieval

[IR-0] Digital Gatekeeping: An Audit of Search Engine Results shows tailoring of queries on the Israel-Palestine Conflict

Link: https://arxiv.org/abs/2502.04266
Authors: Íris Damião,José M. Reis,Paulo Almeida,Nuno Santos,Joana Gonçalves-Sá
Categories: Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: 10 pages, 4 figures

Abstract:Search engines, often viewed as reliable gateways to information, tailor search results using customization algorithms based on user preferences, location, and more. While this can be useful for routine queries, it raises concerns when the topics are sensitive or contentious, possibly limiting exposure to diverse viewpoints and increasing polarization. To examine the extent of this tailoring, we focused on the Israel-Palestine conflict and developed a privacy-protecting tool to audit the behavior of three search engines: DuckDuckGo, Google and Yahoo. Our study focused on two main questions: (1) How do search results for the same query about the conflict vary among different users? and (2) Are these results influenced by the user’s location and browsing history? Our findings revealed significant customization based on location and browsing preferences, unlike previous studies that found only mild personalization for general topics. Moreover, queries related to the conflict were more customized than unrelated queries, and the results were not neutral concerning the conflict’s portrayal.

[IR-1] Counterfactual Query Rewriting to Use Historical Relevance Feedback

Link: https://arxiv.org/abs/2502.03891
Authors: Jüri Keller,Maik Fröbe,Gijs Hendriksen,Daria Alexander,Martin Potthast,Matthias Hagen,Philipp Schaer
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:When a retrieval system receives a query it has encountered before, previous relevance feedback, such as clicks or explicit judgments, can help to improve retrieval results. However, the content of a previously relevant document may have changed, or the document might not be available anymore. Despite this evolved corpus, we counterfactually use these previously relevant documents as relevance signals. In this paper, we propose approaches to rewrite user queries and compare them against a system that directly uses the previous qrels for the ranking. We expand queries with terms extracted from the previously relevant documents or derive so-called keyqueries that rank the previously relevant documents to the top of the current corpus. Our evaluation in the CLEF LongEval scenario shows that rewriting queries with historical relevance feedback improves the retrieval effectiveness and even outperforms computationally expensive transformer-based approaches.
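
Query expansion with terms from previously relevant documents, as described in this abstract, might look roughly like the sketch below. It is not the authors' implementation: raw term frequency is used as a stand-in for whatever term-weighting the paper applies, and the stopword list, the tokenizer, and the function name `expand_query` are all assumptions.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (assumption, not from the paper).
STOPWORDS = {"a", "an", "and", "from", "in", "into", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_query(query, relevant_docs, num_terms=3):
    """Append the most frequent new terms from previously relevant documents.

    Terms already in the query and stopwords are skipped; the remaining
    terms are ranked by raw frequency across the relevant documents.
    """
    query_terms = tokenize(query)
    seen = set(query_terms)
    counts = Counter()
    for doc in relevant_docs:
        counts.update(
            term for term in tokenize(doc)
            if term not in seen and term not in STOPWORDS
        )
    expansion = [term for term, _ in counts.most_common(num_terms)]
    return " ".join(query_terms + expansion)


# Hypothetical historical feedback for a repeated query.
previously_relevant = [
    "Solar panels convert sunlight into electricity.",
    "Rooftop panels generate electricity from sunlight.",
]
expanded = expand_query("solar panels", previously_relevant, num_terms=2)
print(expanded)
```

The expanded query can then be issued against the current corpus even if the original documents have changed or disappeared, since only their terms are reused.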

Attachment Download

Click to download the full list of today's papers
