This post contains the latest paper listing fetched from Arxiv.org on 2025-02-13. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the papers by scheduled email, please leave your email address in the comments.

Note: paper data is fetched from Arxiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-02-13)

447 papers are updated today, including:

  • 61 in Natural Language Processing (Computation and Language (cs.CL))
  • 133 in Artificial Intelligence (Artificial Intelligence (cs.AI))
  • 87 in Computer Vision (Computer Vision and Pattern Recognition (cs.CV))
  • 171 in Machine Learning (Machine Learning (cs.LG))

Note: the per-category counts sum to more than 447 because a paper can be cross-listed in multiple categories.

Natural Language Processing

[NLP-0] Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

[Quick Read]: This paper addresses the problem of identifying and analyzing the goals and values of AIs. Its key idea is to use the framework of utility functions to study the internal coherence of preferences in current large language models (LLMs), finding that these preferences exhibit a high degree of structural coherence that emerges with scale. Building on this finding, the paper proposes a research agenda called utility engineering, comprising both the analysis and control of AI utilities, in order to understand and constrain the value systems that have already emerged in AI systems.

Link: https://arxiv.org/abs/2502.08640
Authors: Mantas Mazeika,Xuwang Yin,Rishub Tamirisa,Jaehyuk Lim,Bruce W. Lee,Richard Ren,Long Phan,Norman Mu,Adam Khoja,Oliver Zhang,Dan Hendrycks
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments:

Abstract:As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
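The paper frames AI preferences with utility functions. As a toy illustration of how a coherent utility can be recovered from pairwise preferences, here is a minimal Bradley-Terry-style fit; the items, preference counts, and the gradient-ascent fitting procedure are all invented for illustration and are not the paper's actual (Thurstonian-style) probing setup:

```python
import math

# Hypothetical outcomes and pairwise preference counts (winner, loser).
items = ["save_human", "save_ai", "donate"]
prefs = [("save_human", "save_ai")] * 8 + [("save_ai", "save_human")] * 2 + \
        [("save_human", "donate")] * 7 + [("donate", "save_human")] * 3 + \
        [("donate", "save_ai")] * 6 + [("save_ai", "donate")] * 4

u = {it: 0.0 for it in items}  # utilities to fit
lr = 0.05
for _ in range(2000):          # simple gradient ascent on the log-likelihood
    for w, l in prefs:
        p = 1.0 / (1.0 + math.exp(-(u[w] - u[l])))  # Bradley-Terry win prob.
        g = 1.0 - p
        u[w] += lr * g
        u[l] -= lr * g

ranked = sorted(items, key=u.get, reverse=True)
print(ranked)  # utility ordering implied by the observed preferences
```

The fitted ordering reproduces the preference data: the item that wins most pairwise comparisons receives the highest utility, which is the sense of "internal coherence" a utility representation captures.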

[NLP-1] Examining Multilingual Embedding Models Cross-Lingually Through LLM-Generated Adversarial Examples

[Quick Read]: This paper addresses the problem that evaluation of cross-lingual semantic search capabilities is limited by existing datasets, which typically come from tasks such as information retrieval and semantic textual similarity. To enable domain-specific evaluation, the authors introduce a new task, Cross-Lingual Semantic Discrimination (CLSD), which requires only a set of parallel sentence pairs for the language pair of interest within the target domain. The key idea is to use CLSD to test whether a model can, cross-lingually, rank the true parallel sentence above hard negatives generated by a large language model. Results show that models fine-tuned for retrieval tasks (e.g., multilingual E5) benefit from using English as a pivot language, while bitext-mining models such as LaBSE work best directly cross-lingually. The paper also presents a fine-grained similarity analysis revealing that different embedding models are sensitive to different types of perturbations.

Link: https://arxiv.org/abs/2502.08638
Authors: Andrianos Michail,Simon Clematide,Rico Sennrich
Affiliations: University of Zurich
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The evaluation of cross-lingual semantic search capabilities of models is often limited to existing datasets from tasks such as information retrieval and semantic textual similarity. To allow for domain-specific evaluation, we introduce Cross Lingual Semantic Discrimination (CLSD), a novel cross-lingual semantic search task that requires only a set of parallel sentence pairs of the language pair of interest within the target domain. This task focuses on the ability of a model to cross-lingually rank the true parallel sentence higher than hard negatives generated by a large language model. We create four instances of our introduced CLSD task for the language pair German-French within the domain of news. Within this case study, we find that models that are also fine-tuned for retrieval tasks (e.g., multilingual E5) benefit from using English as the pivot language, while bitext mining models such as LaBSE perform best directly cross-lingually. We also show a fine-grained similarity analysis enabled by our distractor generation strategy, indicating that different embedding models are sensitive to different types of perturbations.
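The CLSD decision rule reduces to a ranking check over embedding similarities. A minimal sketch, using hand-made toy vectors in place of a real multilingual encoder's output:

```python
import math

def cos(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# Toy embeddings standing in for encoder outputs.
src_de = [0.9, 0.1, 0.2]                        # German source sentence
true_fr = [0.88, 0.12, 0.18]                    # true French translation
hard_negs = [[0.7, 0.5, 0.1], [0.2, 0.9, 0.3]]  # LLM-generated distractors

def clsd_correct(src, true, negs):
    """True iff the true parallel sentence outranks every hard negative."""
    return all(cos(src, true) > cos(src, n) for n in negs)

print(clsd_correct(src_de, true_fr, hard_negs))  # True
```

CLSD accuracy for a model is then the fraction of test instances for which this check passes.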

[NLP-2] Randomness of Low-Layer Parameters Determines Confusing Samples in Terms of Interaction Representations of a DNN

[Quick Read]: This paper investigates the mechanisms underlying the generalization power of deep neural networks (DNNs), with a focus on the complexity of the interactions they encode. It finds that a DNN's generalization power can be explained by the complexity of its encoded interactions, and that its confusing samples (those represented by non-generalizable interactions) are determined mainly by its low-layer parameters, rather than by high-layer parameters or network architecture. Even two DNNs with similar performance usually have fully different sets of confusing samples if their low-layer parameters differ. This finding extends the understanding of the lottery ticket hypothesis and explains the distinctive representation power of different DNNs.

Link: https://arxiv.org/abs/2502.08625
Authors: Junpeng Zhang,Lei Cheng,Qing Li,Liang Lin,Quanshi Zhang
Affiliations: Shanghai Jiao Tong University; Beijing Institute for General Artificial Intelligence; Sun Yat-sen University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we find that the complexity of interactions encoded by a deep neural network (DNN) can explain its generalization power. We also discover that the confusing samples of a DNN, which are represented by non-generalizable interactions, are determined by its low-layer parameters. In comparison, other factors, such as high-layer parameters and network architecture, have much less impact on the composition of confusing samples. Two DNNs with different low-layer parameters usually have fully different sets of confusing samples, even though they have similar performance. This finding extends the understanding of the lottery ticket hypothesis, and well explains distinctive representation power of different DNNs.
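The claim that two networks have "fully different sets of confusing samples" is naturally quantified as set overlap. A toy illustration with made-up sample-id sets (the ids and the comparison are invented, not the paper's data):

```python
def jaccard(a, b):
    """Jaccard overlap of two sets of confusing-sample ids."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

confusing_net1 = {3, 17, 42, 56}   # hypothetical ids confusing net 1
confusing_net2 = {8, 21, 34, 77}   # net 2: different low-layer parameters
confusing_net3 = {3, 17, 42, 90}   # net 3: similar low layers to net 1

print(jaccard(confusing_net1, confusing_net2))            # 0.0: disjoint
print(round(jaccard(confusing_net1, confusing_net3), 2))  # 0.6: large overlap
```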

[NLP-3] Distillation Scaling Laws

[Quick Read]: This paper addresses the problem of allocating compute when using knowledge distillation. Its key contribution is a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and the teacher. This reduces the risks of applying distillation at scale and yields compute-optimal distillation recipes for the cases where a teacher already exists or still needs to be trained, so that compute can be allocated to maximize student performance. The paper also provides insights from a large-scale study of distillation that deepen the understanding of distillation and inform experimental design.

Link: https://arxiv.org/abs/2502.08606
Authors: Dan Busbridge,Amitis Shidani,Floris Weers,Jason Ramapuram,Etai Littwin,Russ Webb
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 67 pages, 54 figures, 13 tables

Abstract:We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining until a compute level which grows predictably with student size. If one student is to be distilled and a teacher also needs training, supervised learning should be done instead. Additionally, we provide insights across our large scale study of distillation, which increase our understanding of distillation and inform experimental design.
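The scaling law itself is an empirical fit not reproduced in the abstract. What the student/teacher compute trade-off is traded over is the distillation objective; the following sketches the standard temperature-scaled knowledge-distillation loss (a generic KD formulation, not necessarily the paper's exact setup):

```python
import math

def softmax(logits, T=1.0):
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl

t = [2.0, 1.0, 0.1]
print(round(distill_loss(t, t), 6))          # 0.0 for a perfect student
print(distill_loss(t, [0.1, 1.0, 2.0]) > 0)  # True: mismatch is penalized
```

The scaling law then predicts how the student's final loss under such an objective varies as the fixed compute budget is split between training the teacher and distilling the student.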

[NLP-4] SPeCtrum: A Grounded Framework for Multidimensional Identity Representation in LLM-Based Agent NAACL2025

[Quick Read]: This paper addresses the problem that existing methods for simulating individual identities oversimplify human complexity, which may lead to incomplete or flattened representations. To tackle this, the paper proposes SPeCtrum, a framework for constructing authentic large language model (LLM) agent personas by integrating three core components: Social Identity (S), Personal Identity (P), and Personal Life Context (C). The key is that combining these three dimensions represents individual identity more comprehensively and accurately, improving the realism and precision of simulation. The study shows that while Personal Life Context (C) alone may suffice for basic identity simulation, combining S, P, and C better enhances the authenticity and accuracy of real-world identity representation.

Link: https://arxiv.org/abs/2502.08599
Authors: Keyeun Lee,Seo Hyeong Kim,Seolhee Lee,Jinsu Eun,Yena Ko,Hayeon Jeon,Esther Hehsun Kim,Seonghye Cho,Soeun Yang,Eun-mee Kim,Hajin Lim
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 21 pages, 8 figures, 5 tables, Accepted in NAACL2025 Main

Abstract:Existing methods for simulating individual identities often oversimplify human complexity, which may lead to incomplete or flattened representations. To address this, we introduce SPeCtrum, a grounded framework for constructing authentic LLM agent personas by incorporating an individual’s multidimensional self-concept. SPeCtrum integrates three core components: Social Identity (S), Personal Identity (P), and Personal Life Context (C), each contributing distinct yet interconnected aspects of identity. To evaluate SPeCtrum’s effectiveness in identity representation, we conducted automated and human evaluations. Automated evaluations using popular drama characters showed that Personal Life Context (C), derived from short essays on preferences and daily routines, modeled characters’ identities more effectively than Social Identity (S) and Personal Identity (P) alone and performed comparably to the full SPC combination. In contrast, human evaluations involving real-world individuals found that the full SPC combination provided a more comprehensive self-concept representation than C alone. Our findings suggest that while C alone may suffice for basic identity simulation, integrating S, P, and C enhances the authenticity and accuracy of real-world identity representation. Overall, SPeCtrum offers a structured approach for simulating individuals in LLM agents, enabling more personalized human-AI interactions and improving the realism of simulation-based behavioral studies.

[NLP-5] Quality-Aware Decoding: Unifying Quality Estimation and Decoding

[Quick Read]: This paper addresses the fact that Quality Estimation (QE) models in neural machine translation (NMT) have not been integrated directly into the decoding process. The key contribution is a novel token-level QE model that is integrated into the decoding strategy to enable Quality-Aware Decoding. By using QE scores directly during decoding, the method significantly improves translation quality, with gains of up to 1.39 XCOMET-XXL ↑ over N-best list re-ranking with state-of-the-art QE models.

Link: https://arxiv.org/abs/2502.08561
Authors: Sai Koneru,Matthias Huck,Miriam Exel,Jan Niehues
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Under Review

Abstract:An emerging research direction in NMT involves the use of Quality Estimation (QE) models, which have demonstrated high correlations with human judgment and can enhance translations through Quality-Aware Decoding. Although several approaches have been proposed based on sampling multiple candidate translations, none have integrated these models directly into the decoding process. In this paper, we address this by proposing a novel token-level QE model capable of reliably scoring partial translations. We build a uni-directional QE model for this, as decoder models are inherently trained and efficient on partial sequences. We then present a decoding strategy that integrates the QE model for Quality-Aware decoding and demonstrate that the translation quality improves when compared to the N-best list re-ranking with state-of-the-art QE models (up to 1.39 XCOMET-XXL ↑). Finally, we show that our approach provides significant benefits in document translation tasks, where the quality of N-best lists is typically suboptimal.
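One way to read "integrating QE into decoding" is to fold a partial-translation QE score into the beam score at every expansion step. A hypothetical sketch with stub scoring functions (`lm_logprob`, `qe_score`, and the weighting `alpha` are all stand-ins, not the paper's models or formula):

```python
def lm_logprob(prefix, token):
    """Stub NMT model: toy log-probabilities for next tokens."""
    table = {"the": -0.5, "a": -1.2, "cat": -0.7}
    return table.get(token, -3.0)

def qe_score(source, hypothesis_tokens):
    """Stub token-level QE: rewards reuse of source content words."""
    return 1.0 if any(w in hypothesis_tokens for w in source.split()) else 0.0

def beam_step(source, beams, vocab, alpha=0.5, width=2):
    """Expand each beam, ranking candidates by log P(token) + alpha * QE."""
    cands = []
    for prefix, score in beams:
        for tok in vocab:
            new = prefix + [tok]
            s = score + lm_logprob(prefix, tok) + alpha * qe_score(source, new)
            cands.append((new, s))
    return sorted(cands, key=lambda c: c[1], reverse=True)[:width]

beams = beam_step("cat sat", [([], 0.0)], ["the", "a", "cat"])
print(beams[0][0])  # ['cat']: the QE bonus outweighs the LM preference for 'the'
```

The point of the sketch is the combined scoring rule: a hypothesis the LM slightly disprefers can win the beam if the QE model judges it a better partial translation.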

[NLP-6] QA-Expand: Multi-Question Answer Generation for Enhanced Query Expansion in Information Retrieval

[Quick Read]: This paper addresses shortcomings of query expansion in Information Retrieval (IR): existing Large Language Model (LLM) based methods often produce repetitive, narrow expansions that lack the diverse context needed to retrieve all relevant information. The key contribution is QA-Expand, a framework that expands a query by generating multiple questions related to the initial query and then producing corresponding pseudo-answers as surrogate documents. A feedback model further rewrites and filters these answers so that only the most informative augmentations are incorporated, significantly improving retrieval performance.

Link: https://arxiv.org/abs/2502.08557
Authors: Wonduk Seo,Seunghyun Lee
Affiliations: Enhans.ai
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 8 pages

Abstract:Query expansion is widely used in Information Retrieval (IR) to improve search outcomes by enriching queries with additional contextual information. Although recent Large Language Model (LLM) based methods generate pseudo-relevant content and expanded terms via multiple prompts, they often yield repetitive, narrow expansions that lack the diverse context needed to retrieve all relevant information. In this paper, we introduce QA-Expand, a novel and effective framework for query expansion. It first generates multiple relevant questions from the initial query and subsequently produces corresponding pseudo-answers as surrogate documents. A feedback model further rewrites and filters these answers to ensure only the most informative augmentations are incorporated. Extensive experiments on benchmarks such as BEIR and TREC demonstrate that QA-Expand enhances retrieval performance by up to 13% over state-of-the-art methods, offering a robust solution for modern retrieval challenges.
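The three-stage pipeline (questions, pseudo-answers, feedback filtering) can be sketched end to end. All three stage functions below are stubs for what would be LLM calls in the real system, and the length-based filter is an invented toy criterion:

```python
def gen_questions(query):
    """Stage 1 stub: an LLM would generate diverse related questions."""
    return [f"What causes {query}?", f"How is {query} treated?"]

def gen_pseudo_answer(question):
    """Stage 2 stub: an LLM would write a pseudo-answer as a surrogate doc."""
    return f"Pseudo-answer discussing: {question.lower()}"

def feedback_filter(answers):
    """Stage 3 stub: keep only sufficiently informative answers."""
    return [a for a in answers if len(a.split()) >= 4]

def qa_expand(query):
    questions = gen_questions(query)                     # step 1
    answers = [gen_pseudo_answer(q) for q in questions]  # step 2
    kept = feedback_filter(answers)                      # step 3
    return query + " " + " ".join(kept)                  # expanded query

expanded = qa_expand("migraine")
print(expanded)
```

The expanded string is what gets issued to the retriever in place of the original short query.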

[NLP-7] LLMs can implicitly learn from mistakes in-context

[Quick Read]: This paper investigates whether large language models (LLMs) can learn from mistakes by inferring the correct reasoning on their own from observing correct and incorrect answers, without detailed explanations of the errors. The key finding is that, even without explanatory rationales, simply showing correct answers alongside incorrect ones significantly improves LLM performance on mathematical reasoning tasks, and this approach outperforms standard chain-of-thought prompting. Moreover, new rationales generated by these models are rated as highly by human evaluators as those produced with the aid of exemplar rationales.

Link: https://arxiv.org/abs/2502.08550
Authors: Lisa Alazraki,Maximilian Mozes,Jon Ander Campos,Yi Chern Tan,Marek Rei,Max Bartolo
Affiliations: Imperial College London; Cohere
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Learning from mistakes is a fundamental feature of human intelligence. Previous work has shown that Large Language Models (LLMs) can also learn from incorrect answers when provided with a comprehensive rationale detailing why an answer is wrong or how to correct it. In this work, we examine whether LLMs can learn from mistakes in mathematical reasoning tasks when these explanations are not provided. We investigate if LLMs are able to implicitly infer such rationales simply from observing both incorrect and correct answers. Surprisingly, we find that LLMs perform better, on average, when rationales are eliminated from the context and incorrect answers are simply shown alongside correct ones. This approach also substantially outperforms chain-of-thought prompting in our evaluations. We show that these results are consistent across LLMs of different sizes and varying reasoning abilities. Further, we carry out an in-depth analysis, and show that prompting with both wrong and correct answers leads to greater performance and better generalisation than introducing additional, more diverse question-answer pairs into the context. Finally, we show that new rationales generated by models that have only observed incorrect and correct answers are scored equally as highly by humans as those produced with the aid of exemplar rationales. Our results demonstrate that LLMs are indeed capable of in-context implicit learning.
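The rationale-free prompt format studied here is easy to picture: each demonstration pairs an incorrect answer with the correct one and says nothing about why the wrong one is wrong. A minimal sketch (the field labels and toy questions are our own, not the paper's exact template):

```python
def build_prompt(examples, question):
    """Build a few-shot prompt showing wrong and right answers, no rationales."""
    parts = []
    for ex in examples:
        parts.append(f"Question: {ex['q']}\n"
                     f"Incorrect answer: {ex['wrong']}\n"
                     f"Correct answer: {ex['right']}\n")
    parts.append(f"Question: {question}\nCorrect answer:")
    return "\n".join(parts)

demos = [{"q": "What is 12 * 4?", "wrong": "44", "right": "48"},
         {"q": "What is 7 + 15?", "wrong": "21", "right": "22"}]
print(build_prompt(demos, "What is 9 * 6?"))
```

The model must implicitly infer the error pattern (here, off-by-a-few arithmetic slips) from the contrast alone.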

[NLP-8] LLM Pretraining with Continuous Concepts

[Quick Read]: This paper addresses limitations of the standard discrete next-token prediction objective in large-scale language model pretraining. It proposes Continuous Concept Mixing (CoCoMix), a pretraining framework that combines discrete next-token prediction with continuous concepts. The key is to mix continuous concepts, learned by a pretrained sparse autoencoder, into the model's hidden states by interleaving them with token hidden representations, improving sample efficiency and performance while also enhancing interpretability and steerability. Experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, show that CoCoMix outperforms standard next-token prediction, knowledge distillation, and inserting pause tokens.

Link: https://arxiv.org/abs/2502.08524
Authors: Jihoon Tack,Jack Lanchantin,Jane Yu,Andrew Cohen,Ilia Kulikov,Janice Lan,Shibo Hao,Yuandong Tian,Jason Weston,Xian Li
Affiliations: KAIST; Meta
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Next token prediction has been the standard training objective used in large language model pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts continuous concepts learned from a pretrained sparse autoencoder and mixes them into the model’s hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction, knowledge distillation and inserting pause tokens. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model’s internal reasoning process.
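The interleaving mechanism can be sketched structurally: predicted concept vectors are inserted among the token hidden states so that subsequent layers attend over both. This is our reading of the mixing step, with an invented insertion schedule; the paper's actual implementation details differ:

```python
def interleave(token_hiddens, concept_vecs, every=2):
    """Insert a concept vector after every `every` token positions."""
    out, ci = [], 0
    for i, h in enumerate(token_hiddens, start=1):
        out.append(("tok", h))
        if i % every == 0 and ci < len(concept_vecs):
            out.append(("concept", concept_vecs[ci]))
            ci += 1
    return out

# Toy hidden states and concept vectors (2-dim for readability).
tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
concepts = [[0.9, 0.9], [0.8, 0.8]]
seq = interleave(tokens, concepts)
print([kind for kind, _ in seq])
# ['tok', 'tok', 'concept', 'tok', 'tok', 'concept']
```

Because concepts occupy explicit positions in the sequence, they can be inspected or overwritten at inference time, which is the source of the interpretability and steerability claims.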

[NLP-9] Faithful, Unfaithful or Ambiguous? Multi-Agent Debate with Initial Stance for Summary Evaluation

[Quick Read]: This paper addresses the problem that faithfulness evaluators based on large language models (LLMs) are easily fooled by text fluency and struggle to identify errors when judging summary faithfulness. The key solution is a multi-round debate mechanism in which multiple LLM-based evaluators are assigned initial stances (regardless of their actual beliefs) and must provide reasons justifying those stances, debating over several rounds to reach agreement. Uniformly distributed initial stance assignments yield greater stance diversity, leading to more meaningful debates and ultimately more errors identified. The paper also introduces a new dimension, ambiguity, along with a detailed taxonomy for identifying special cases that are neither clearly faithful nor clearly unfaithful. Experiments show that the approach helps identify ambiguities and performs even more strongly on non-ambiguous summaries.

Link: https://arxiv.org/abs/2502.08514
Authors: Mahnaz Koupaee,Jake W. Vincent,Saab Mansour,Igor Shalyminov,Han He,Hwanjun Song,Raphael Shu,Jianfeng He,Yi Nian,Amy Wing-mei Wong,Kyu J. Han,Hang Su
Affiliations: Amazon; Stony Brook University; Korea Advanced Institute of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Faithfulness evaluators based on large language models (LLMs) are often fooled by the fluency of the text and struggle with identifying errors in the summaries. We propose an approach to summary faithfulness evaluation in which multiple LLM-based agents are assigned initial stances (regardless of what their belief might be) and forced to come up with a reason to justify the imposed belief, thus engaging in a multi-round debate to reach an agreement. The uniformly distributed initial assignments result in a greater diversity of stances leading to more meaningful debates and ultimately more errors identified. Furthermore, by analyzing the recent faithfulness evaluation datasets, we observe that naturally, it is not always the case for a summary to be either faithful to the source document or not. We therefore introduce a new dimension, ambiguity, and a detailed taxonomy to identify such special cases. Experiments demonstrate our approach can help identify ambiguities, and have even a stronger performance on non-ambiguous summaries.
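The control loop around the debate is simple even though each round is an LLM call. A hypothetical sketch of the uniform stance assignment and the convergence loop, with the LLM-mediated round reduced to a stub callback:

```python
import itertools

def assign_stances(n_agents):
    """Alternate 'faithful' / 'unfaithful' stances uniformly across agents."""
    cycle = itertools.cycle(["faithful", "unfaithful"])
    return [next(cycle) for _ in range(n_agents)]

def debate(stances, judge_round, max_rounds=5):
    """Run rounds until all agents agree; `judge_round` stands in for one
    LLM-mediated round returning each agent's updated verdict."""
    verdicts = stances
    for _ in range(max_rounds):
        verdicts = judge_round(verdicts)
        if len(set(verdicts)) == 1:        # consensus reached
            return verdicts[0]
    return max(set(verdicts), key=verdicts.count)  # majority fallback

stances = assign_stances(4)
print(stances)  # ['faithful', 'unfaithful', 'faithful', 'unfaithful']
# Toy judge: every agent converges to 'unfaithful' after one round.
print(debate(stances, lambda v: ["unfaithful"] * len(v)))  # unfaithful
```

The forced initial disagreement is the point: an agent assigned "unfaithful" must hunt for an error to justify its stance, which surfaces mistakes a single compliant evaluator would miss.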

[NLP-10] Measuring Diversity in Synthetic Datasets

[Quick Read]: This paper addresses the problem of accurately measuring the diversity of synthetic datasets generated by large language models (LLMs), a key factor for robust model performance. The key contribution is DCScore, a novel method that evaluates synthetic dataset diversity from a classification perspective: diversity evaluation is formulated as a sample classification task that leverages the mutual relationships among samples. The method is theoretically grounded and empirically shows stronger correlation with diversity pseudo-truths at lower computational cost.

Link: https://arxiv.org/abs/2502.08512
Authors: Yuchang Zhu,Huizhe Zhang,Bingzhe Wu,Jintang Li,Zibin Zheng,Peilin Zhao,Liang Chen,Yatao Bian
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets-an aspect crucial for robust model performance-remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing approaches. Code is available at: this https URL.
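To make the "diversity as classification" idea concrete, here is one plausible instantiation (our own, hedged reading; the paper's exact formulation may differ): treat each sample as its own class, classify every sample over all samples via a softmax on its similarity row, and sum the self-classification probabilities. Duplicates split their probability mass, so the score drops:

```python
import math

def dcscore_like(sims):
    """Sum of self-classification probabilities under a row-wise softmax.
    n fully distinct samples approach a score of n; duplicates lower it."""
    score = 0.0
    for i, row in enumerate(sims):
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        score += exps[i] / sum(exps)  # prob. of classifying sample i as itself
    return score

distinct = [[9, 0, 0], [0, 9, 0], [0, 0, 9]]  # three distinct samples
dupes    = [[9, 9, 0], [9, 9, 0], [0, 0, 9]]  # two near-duplicates
print(dcscore_like(distinct) > dcscore_like(dupes))  # True
```

Only the pairwise similarity matrix is needed, which is consistent with the abstract's claim of low computational cost relative to generation-based diversity probes.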

[NLP-11] Explanation based In-Context Demonstrations Retrieval for Multilingual Grammatical Error Correction NAACL2025

[Quick Read]: This paper addresses the problem of how to select effective in-context examples when applying few-shot learning to grammatical error correction (GEC). In existing methods, similarity between input texts does not necessarily correspond to similar grammatical error patterns, making effective in-context examples hard to choose. The key contribution is a retrieval method based on natural language grammatical error explanations (GEE): suitable few-shot demonstrations are retrieved by matching the GEE of the test input with those of pre-constructed database samples, where explanations of erroneous samples are generated by large language models (LLMs). The method requires no additional training or language adaptation, and experiments across five languages show it outperforms existing semantic and BM25-based retrieval techniques, suggesting that matching error patterns is key to selecting examples.

Link: https://arxiv.org/abs/2502.08507
Authors: Wei Li,Wen Luo,Guangyue Peng,Houfeng Wang
Affiliations: State Key Laboratory of Multimedia Information Processing; School of Computer Science, Peking University
Subjects: Computation and Language (cs.CL)
Comments: Accepted by NAACL 2025 main conference

Abstract:Grammatical error correction (GEC) aims to correct grammatical, spelling, and semantic errors in natural language text. With the growing of large language models (LLMs), direct text generation has gradually become the focus of the GEC methods, and few-shot in-context learning presents a cost-effective solution. However, selecting effective in-context examples remains challenging, as the similarity between input texts does not necessarily correspond to similar grammatical error patterns. In this paper, we propose a novel retrieval method based on natural language grammatical error explanations (GEE) to address this issue. Our method retrieves suitable few-shot demonstrations by matching the GEE of the test input with that of pre-constructed database samples, where explanations for erroneous samples are generated by LLMs. We conducted multilingual GEC few-shot experiments on both major open-source and closed-source LLMs. Experiments across five languages show that our method outperforms existing semantic and BM25-based retrieval techniques, without requiring additional training or language adaptation. This also suggests that matching error patterns is key to selecting examples.
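The retrieval step matches error explanations rather than input texts. A minimal sketch with similarity reduced to token overlap; the explanations below are hand-written stand-ins for LLM-generated GEEs, and the real method's matching function may be richer:

```python
def overlap(a, b):
    """Jaccard token overlap between two explanation strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

database = [
    {"text": "He go to school.", "gee": "subject-verb agreement error on 'go'"},
    {"text": "I have saw it.",   "gee": "wrong past participle after 'have'"},
    {"text": "She like apples.", "gee": "subject-verb agreement error on 'like'"},
]

def retrieve(test_gee, db, k=2):
    """Rank database samples by how well their explanations match the test GEE."""
    ranked = sorted(db, key=lambda d: overlap(test_gee, d["gee"]), reverse=True)
    return [d["text"] for d in ranked[:k]]

demos = retrieve("subject-verb agreement error on 'walks'", database)
print(demos)  # the two agreement-error sentences rank first
```

Note that the surface texts of the retrieved demonstrations need not resemble the test input at all; only their error patterns match, which is the paper's central point.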

[NLP-12] Salamandra Technical Report

[Quick Read]: This paper develops and evaluates the Salamandra model family, a suite of open-source decoder-only large language models. The key is training from scratch on a highly multilingual dataset comprising text and code in 35 European languages, with a carefully curated corpus of open-access sources to ensure data quality. The models are also fine-tuned on instruction data to support chat applications, and preliminary multimodal experiments are reported. The paper emphasizes open science by disclosing design choices, data curation strategy, and evaluation methodology, publishing the training and evaluation scripts, and releasing all models under a permissive Apache 2.0 license to foster future research and commercial use.

Link: https://arxiv.org/abs/2502.08489
Authors: Aitor Gonzalez-Agirre,Marc Pàmies,Joan Llop,Irene Baucells,Severino Da Dalt,Daniel Tamayo,José Javier Saiz,Ferran Espuña,Jaume Prats,Javier Aula-Blasco,Mario Mina,Adrián Rubio,Alexander Shvets,Anna Sallés,Iñaki Lacunza,Iñigo Pikabea,Jorge Palomar,Júlia Falcão,Lucía Tormo,Luis Vasquez-Reina,Montserrat Marimon,Valle Ruíz-Fernández,Marta Villegas
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This work introduces Salamandra, a suite of open-source decoder-only large language models available in three different sizes: 2, 7, and 40 billion parameters. The models were trained from scratch on highly multilingual data that comprises text in 35 European languages and code. Our carefully curated corpus is made exclusively from open-access data compiled from a wide variety of sources. Along with the base models, supplementary checkpoints that were fine-tuned on public-domain instruction data are also released for chat applications. Additionally, we also share our preliminary experiments on multimodality, which serve as proof-of-concept to showcase potential applications for the Salamandra family. Our extensive evaluations on multilingual benchmarks reveal that Salamandra has strong capabilities, achieving competitive performance when compared to similarly sized open-source models. We provide comprehensive evaluation results both on standard downstream tasks as well as key aspects related to bias and safety. In this technical report, we intend to promote open science by sharing all the details behind our design choices, data curation strategy and evaluation methodology. In addition to that, we deviate from the usual practice by making our training and evaluation scripts publicly accessible. We release all models under a permissive Apache 2.0 license in order to foster future research and facilitate commercial use, thereby contributing to the open-source ecosystem of large language models.

[NLP-13] Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning

[Quick Read]: This paper addresses the challenge of generating long and correct Chain-of-Thought (CoT) reasoning trajectories. The key contribution is RELAY, which aligns CoT reasoning steps with loop iterations and applies iteration-wise supervision during the training of Looped Transformers. This not only preserves the Looped Transformer's length-generalization advantages but also enables it to predict CoT reasoning steps for unseen data; it is then used to generate accurate reasoning chains for complex problems beyond the training length, which in turn are used to fine-tune an auto-regressive model, significantly improving its performance.

Link: https://arxiv.org/abs/2502.08482
Authors: Qifan Yu,Zhenyu He,Sijie Li,Xun Zhou,Jun Zhang,Jingjing Xu,Di He
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: work in progress

Abstract:Chain-of-Thought (CoT) prompting has emerged as a powerful technique for enhancing language model’s reasoning capabilities. However, generating long and correct CoT trajectories is challenging. Recent studies have demonstrated that Looped Transformers possess remarkable length generalization capabilities, but their limited generality and adaptability prevent them from serving as an alternative to auto-regressive solutions. To better leverage the strengths of Looped Transformers, we propose RELAY (REasoning through Loop Alignment iterativelY). Specifically, we align the steps of Chain-of-Thought (CoT) reasoning with loop iterations and apply intermediate supervision during the training of Looped Transformers. This additional iteration-wise supervision not only preserves the Looped Transformer’s ability for length generalization but also enables it to predict CoT reasoning steps for unseen data. Therefore, we leverage this Looped Transformer to generate accurate reasoning chains for complex problems that exceed the training length, which will then be used to fine-tune an auto-regressive model. We conduct extensive experiments, and the results demonstrate the effectiveness of our approach, with significant improvements in the performance of the auto-regressive model. Code will be released at this https URL.

[NLP-14] mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data

[Quick Read]: This paper addresses the problem that multimodal embedding models are limited by the scarcity of labeled multimodal data. The key solution is to synthesize high-quality multimodal data, ensuring broad scope, robust cross-modal alignment, and high fidelity, thereby improving performance on downstream tasks.

Link: https://arxiv.org/abs/2502.08468
Authors: Haonan Chen,Liang Wang,Nan Yang,Yutao Zhu,Ziliang Zhao,Furu Wei,Zhicheng Dou
Affiliations: Renmin University of China; Microsoft Corporation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB Benchmark and superior multilingual performance on the XTD benchmark. Our codes, datasets and models are released in this https URL.

[NLP-15] Examining Spanish Counseling with MIDAS: a Motivational Interviewing Dataset in Spanish NAACL2025

[Quick Read]: This paper examines how cultural and linguistic factors influence the counseling process and whether findings from conversational analysis of English-language counseling transfer to other languages. To this end, it introduces MIDAS (Motivational Interviewing Dataset in Spanish), a counseling dataset collected from public video sources with expert annotations for counseling reflections and questions. The key is to use this dataset to explore language-based differences in counselor behavior between English and Spanish and to develop classifiers in monolingual and multilingual settings, demonstrating their applications in counselor behavioral coding tasks.

Link: https://arxiv.org/abs/2502.08458
Authors: Aylin Gunal,Bowen Yi,John Piette,Rada Mihalcea,Verónica Pérez-Rosas
Affiliations: University of Michigan; Texas State University
Subjects: Computation and Language (cs.CL)
Comments: To appear in NAACL 2025 Main Conference

Abstract:Cultural and language factors significantly influence counseling, but Natural Language Processing research has not yet examined whether the findings of conversational analysis for counseling conducted in English apply to other languages. This paper presents a first step towards this direction. We introduce MIDAS (Motivational Interviewing Dataset in Spanish), a counseling dataset created from public video sources that contains expert annotations for counseling reflections and questions. Using this dataset, we explore language-based differences in counselor behavior in English and Spanish and develop classifiers in monolingual and multilingual settings, demonstrating its applications in counselor behavioral coding tasks.

[NLP-16] Towards Prompt Generalization: Grammar-aware Cross-Prompt Automated Essay Scoring NAACL2025

[Quick Read]: This paper addresses the challenge of automated essay scoring (AES) in cross-prompt settings, i.e., adapting scoring models to unseen prompts. Existing methods trained on specific prompts struggle to obtain prompt-generalized essay representations. The key contribution is Grammar-aware Cross-Prompt Trait Scoring (GAPS), which learns generic essay representations by capturing prompt-independent syntactic features. It obtains grammatically corrected essay information via grammar error correction and integrates this information seamlessly into the AES model, so that the model can refer to both the corrected and the original essays during training and focus on generic features.

Link: https://arxiv.org/abs/2502.08450
Authors: Heejin Do,Taehee Park,Sangwon Ryu,Gary Geunbae Lee
Affiliations: Graduate School of Artificial Intelligence, POSTECH, Republic of Korea; Department of Computer Science and Engineering, POSTECH, Republic of Korea
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NAACL 2025 (Findings)

Abstract:In automated essay scoring (AES), recent efforts have shifted toward cross-prompt settings that score essays on unseen prompts for practical applicability. However, prior methods trained with essay-score pairs of specific prompts pose challenges in obtaining prompt-generalized essay representation. In this work, we propose a grammar-aware cross-prompt trait scoring (GAPS), which internally captures prompt-independent syntactic aspects to learn generic essay representation. We acquire grammatical error-corrected information in essays via the grammar error correction technique and design the AES model to seamlessly integrate such information. By internally referring to both the corrected and the original essays, the model can focus on generic features during training. Empirical experiments validate our method’s generalizability, showing remarkable improvements in prompt-independent and grammar-related traits. Furthermore, GAPS achieves notable QWK gains in the most challenging cross-prompt scenario, highlighting its strength in evaluating unseen prompts.

[NLP-17] Better Embeddings with Coupled Adam

[Quick Read]: This paper addresses the anisotropy of word embeddings in large language models (LLMs). The key contribution is a modified optimizer called Coupled Adam, which mitigates the anisotropic embeddings caused by the second moment in Adam. Experiments show that Coupled Adam significantly improves embedding quality and also leads to better upstream and downstream performance on sufficiently large datasets.

Link: https://arxiv.org/abs/2502.08441
Authors: Felix Stollenwerk,Tobias Stollenwerk
Affiliations: AI Sweden; Forschungszentrum Jülich
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 8 figures

Abstract:Despite their remarkable capabilities, LLMs learn word representations that exhibit the undesirable yet poorly understood feature of anisotropy. In this paper, we argue that the second moment in Adam is a cause of anisotropic embeddings, and suggest a modified optimizer called Coupled Adam to mitigate the problem. Our experiments demonstrate that Coupled Adam significantly improves the quality of embeddings, while also leading to better upstream and downstream performance on large enough datasets.
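The abstract does not spell out the modification, so the following is our hedged reading of the "coupling" idea: for the embedding matrix, share the second-moment estimate across the vocabulary axis (here by averaging) instead of keeping it per-parameter as vanilla Adam does, so rare and frequent tokens see the same adaptive scale. The update below is a minimal single-step sketch, not the paper's verified algorithm:

```python
def coupled_adam_step(emb, grads, m, v, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step on an embedding matrix with a vocabulary-coupled
    second moment (assumption: coupling = mean over the vocab axis)."""
    V, D = len(emb), len(emb[0])
    for i in range(V):
        for j in range(D):
            m[i][j] = b1 * m[i][j] + (1 - b1) * grads[i][j]
            v[i][j] = b2 * v[i][j] + (1 - b2) * grads[i][j] ** 2
    # Couple: average the second moment over the vocabulary axis.
    v_bar = [sum(v[i][j] for i in range(V)) / V for j in range(D)]
    for i in range(V):
        mh = [m[i][j] / (1 - b1 ** t) for j in range(D)]      # bias-corrected
        vh = [v_bar[j] / (1 - b2 ** t) for j in range(D)]
        for j in range(D):
            emb[i][j] -= lr * mh[j] / (vh[j] ** 0.5 + eps)
    return emb

emb = [[0.0, 0.0], [0.0, 0.0]]          # tiny 2-token, 2-dim embedding table
grads = [[1.0, 0.5], [0.2, 0.5]]
m = [[0.0] * 2 for _ in range(2)]
v = [[0.0] * 2 for _ in range(2)]
coupled_adam_step(emb, grads, m, v, t=1)
print(emb[0][0] != emb[1][0])  # True: first moments still differ per token
```

With a shared second moment, update magnitudes still track each token's own gradient (first moment), but the per-parameter adaptive denominator that can push rare-token embeddings into a common offset direction is removed.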
zh
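(补充示意)各向异性通常可以用嵌入向量两两余弦相似度的均值来量化:各向同性的嵌入该值接近 0,存在共同偏置方向时该值明显偏大。下面是一个与论文实现无关的最小度量示例(numpy,数据为随机构造):

```python
import numpy as np

def mean_cosine_similarity(emb: np.ndarray) -> float:
    """嵌入矩阵各行两两余弦相似度的均值,常用作各向异性的简单度量。"""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = emb.shape[0]
    return float((sims.sum() - n) / (n * (n - 1)))  # 去掉对角线上的自相似项

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(100, 64))   # 各向同性:度量值接近 0
anisotropic = isotropic + 5.0            # 加入共同偏置方向 -> 各向异性
print(mean_cosine_similarity(isotropic))
print(mean_cosine_similarity(anisotropic))
```

论文论证的是 Adam 二阶矩会系统性地把嵌入推向后一种状态;上述度量可用来在训练前后对比嵌入质量。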

[NLP-18] Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions AAAI2024

【速读】: 该论文旨在解决非母语使用者在搜索难以命名但易于绘制的对象及其复杂交互场景时所面临的困难。具体而言,用户希望通过复合多模态查询进行搜索:用手绘草图表示难以命名但易于绘制的对象,用文本描述难以描绘但易于口头表达的对象属性或场景交互。这一问题与之前广泛研究的基于文本的图像检索(TBIR)和基于草图的图像检索(SBIR)均有所不同。为研究该任务,作者构建了包含约200万条查询和10.8万张自然场景图像的CSTBIR数据集,并提出了基于预训练多模态Transformer的基线模型STNET(Sketch+Text Network),该模型利用手绘草图定位自然场景图像中的相关对象,并编码文本和图像以执行图像检索。关键在于结合对比学习和多个改进模型性能的训练目标。实验表明,所提方法在纯文本、纯草图以及复合查询模态下的检索性能均优于现有的先进方法。

链接: https://arxiv.org/abs/2502.08438
作者: Prajwal Gatti,Kshitij Parikh,Dhriti Prasanna Paul,Manish Gupta,Anand Mishra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: Accepted at AAAI 2024, 9 pages. Project Website: this https URL

点击查看摘要

Abstract:Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for numbats. Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., numbat digging in the ground. In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of difficult-to-name but easy-to-draw objects and text describing difficult-to-sketch but easy-to-verbalize object attributes or interaction with the scene. This novel problem statement distinctly differs from the previously well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image retrieval) problems. To study this under-explored task, we curate a dataset, CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting of approx. 2M queries and 108K natural scene images. Further, as a solution to this problem, we propose a pretrained multimodal transformer-based baseline, STNET (Sketch+Text Network), that uses a hand-drawn sketch to localize relevant objects in the natural scene image, and encodes the text and image to perform image retrieval. In addition to contrastive learning, we propose multiple training objectives that improve the performance of our model. Extensive experiments show that our proposed method outperforms several state-of-the-art retrieval methods for text-only, sketch-only, and composite query modalities. We make the dataset and code available at our project website.
zh

[NLP-19] From Haystack to Needle: Label Space Reduction for Zero-shot Classification ICML2025

【速读】: 该论文旨在提升大型语言模型(Large Language Models, LLMs)在零样本分类任务中的性能。解决方案的关键在于引入了一种名为标签空间缩减(Label Space Reduction, LSR)的方法,通过系统性地排序和减少候选类别,迭代地精炼分类标签空间,从而使模型能够集中于最相关的选项。这种动态优化标签空间表示的方法利用未标注数据和数据驱动模型的统计学习能力,在测试时提高分类性能。

链接: https://arxiv.org/abs/2502.08436
作者: Nathan Vandemoortele,Bram Steenwinckel,Femke Ongenae,Sofie Van Hoecke
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review at ICML 2025

点击查看摘要

Abstract:We present Label Space Reduction (LSR), a novel method for improving zero-shot classification performance of Large Language Models (LLMs). LSR iteratively refines the classification label space by systematically ranking and reducing candidate classes, enabling the model to concentrate on the most relevant options. By leveraging unlabeled data with the statistical learning capabilities of data-driven models, LSR dynamically optimizes the label space representation at test time. Our experiments across seven benchmarks demonstrate that LSR improves macro-F1 scores by an average of 7.0% (up to 14.2%) with Llama-3.1-70B and 3.3% (up to 11.1%) with Claude-3.5-Sonnet compared to standard zero-shot classification baselines. To reduce the computational overhead of LSR, which requires an additional LLM call at each iteration, we propose distilling the model into a probabilistic classifier, allowing for efficient inference.
zh
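(补充示意)LSR 的核心循环可以概括为"打分—排序—裁剪—迭代"。论文中打分由 LLM 或蒸馏出的概率分类器完成;下面仅按摘要思路用一个假设的 `score_fn` 接口给出流程骨架,非论文官方实现:

```python
def label_space_reduction(x, labels, score_fn, keep_ratio=0.5, min_labels=2):
    """迭代地对候选类别排序并裁剪,直到标签空间足够小(示意)。
    score_fn(x, label) 返回类别与输入的相关性分数,为本文假设的接口,
    实际中可由 LLM 打分或数据驱动的概率分类器给出。"""
    candidates = list(labels)
    while len(candidates) > min_labels:
        ranked = sorted(candidates, key=lambda l: score_fn(x, l), reverse=True)
        keep = max(min_labels, int(len(ranked) * keep_ratio))
        if keep == len(candidates):   # 无法再缩减时停止
            break
        candidates = ranked[:keep]
    return candidates

def toy_score(x, label):
    """玩具打分函数:输入与标签的公共字符数,仅用于演示。"""
    return len(set(x) & set(label))

result = label_space_reduction(
    "economy", ["economics", "sports", "art", "science"], toy_score)
print(result)  # 相关类别被保留,无关类别被裁掉
```

实际使用时,把 `toy_score` 换成对候选类别做相关性排序的 LLM 调用或蒸馏分类器即可复现"缩小标签空间后再分类"的流程。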

[NLP-20] A Semantic Parsing Algorithm to Solve Linear Ordering Problems

【速读】: 该论文旨在解决线性排序问题中的语义解析挑战,即如何利用演绎推理来安排实体的顺序。关键解决方案在于开发了一种算法,将前提和候选陈述转换为一阶逻辑表示,并运用约束逻辑编程推断排序命题的真实性。这一方法通过引入抽象类型和模板,以及动态解释实体在上下文框架中的机制,将基于Heim和Kratzer的句法组成形式语义规则转化为计算算法。所提出的符号系统Formal Semantic Logic Inferer (FSLI) 在BIG-bench的逻辑推理多项选择题测试中达到了完美的准确性,显著优于现有的大型语言模型(LLM)和混合系统。

链接: https://arxiv.org/abs/2502.08415
作者: Maha Alkhairy,Vincent Homer,Brendan O’Connor
机构: University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注: 3 figures, 9 pages main paper and 6 pages references and appendix

点击查看摘要

Abstract:We develop an algorithm to semantically parse linear ordering problems, which require a model to arrange entities using deductive reasoning. Our method takes as input a number of premises and candidate statements, parsing them to a first-order logic of an ordering domain, and then utilizes constraint logic programming to infer the truth of proposed statements about the ordering. Our semantic parser transforms Heim and Kratzer’s syntax-based compositional formal semantic rules to a computational algorithm. This transformation involves introducing abstract types and templates based on their rules, and introduces a dynamic component to interpret entities within a contextual framework. Our symbolic system, the Formal Semantic Logic Inferer (FSLI), is applied to answer multiple choice questions in BIG-bench’s logical_deduction multiple choice problems, achieving perfect accuracy, compared to 67.06% for the best-performing LLM (GPT-4) and 87.63% for the hybrid system Logic-LM. These promising results demonstrate the benefit of developing a semantic parsing algorithm driven by first-order logic constructs.
zh
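(补充示意)FSLI 使用约束逻辑编程求解排序;在实体数量很小时,可以用"枚举所有满足前提的排列,再检查陈述是否在全部可行模型中成立"来近似同样的演绎语义。以下是该语义的暴力版示意,非论文实现:

```python
from itertools import permutations

def orderings_satisfying(entities, constraints):
    """枚举满足全部前提约束的线性排列(对约束求解的暴力近似)。
    每个约束是 (a, 'left_of', b) 形式的三元组。"""
    valid = []
    for perm in permutations(entities):
        pos = {e: i for i, e in enumerate(perm)}
        if all(pos[a] < pos[b] for a, rel, b in constraints if rel == "left_of"):
            valid.append(perm)
    return valid

def statement_is_true(entities, constraints, statement):
    """演绎语义:陈述在所有可行排列中都成立才判为真。
    (前提矛盾导致无可行排列时,此处按空真处理。)"""
    a, rel, b = statement
    models = orderings_satisfying(entities, constraints)
    return all(m.index(a) < m.index(b) for m in models)

premises = [("apple", "left_of", "banana"), ("banana", "left_of", "cherry")]
print(statement_is_true(["apple", "banana", "cherry"],
                        premises, ("apple", "left_of", "cherry")))
```

论文的贡献在于把自然语言前提自动解析成这类一阶逻辑约束;上面只演示了解析之后的推理一步。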

[NLP-21] IssueBench: Millions of Realistic Prompts for Measuring Issue Bias in LLM Writing Assistance

【速读】: 该论文旨在解决大型语言模型(LLMs)在实际用户交互中展现的议题偏见问题,即LLMs倾向于对某一议题仅呈现单一视角,从而可能影响用户对该议题的看法。解决方案的关键在于构建了名为IssueBench的数据集:基于来自真实用户交互的3.9千个模板(如"写一篇关于……的博客")和212个政治议题,生成249万条贴近真实使用场景的提示,用于测量LLMs在写作辅助中的议题偏见。使用IssueBench的研究发现,议题偏见在最先进的LLMs中普遍且持续存在,不同模型间的偏见高度相似,且在部分议题上所有模型都更倾向于美国民主党而非共和党选民的观点。IssueBench可灵活扩展到其他议题、模板或任务,从而为关于LLM偏见及其应对的讨论提供新的实证依据。

链接: https://arxiv.org/abs/2502.08395
作者: Paul Röttger,Musashi Hinck,Valentin Hofmann,Kobi Hackenburg,Valentina Pyatkin,Faeze Brahman,Dirk Hovy
机构: 未知
类目: Computation and Language (cs.CL)
备注: under review

点击查看摘要

Abstract:Large language models (LLMs) are helping millions of users write texts about diverse issues, and in doing so expose users to different ideas and perspectives. This creates concerns about issue bias, where an LLM tends to present just one perspective on a given issue, which in turn may influence how users think about this issue. So far, it has not been possible to measure which issue biases LLMs actually manifest in real user interactions, making it difficult to address the risks from biased LLMs. Therefore, we create IssueBench: a set of 2.49m realistic prompts for measuring issue bias in LLM writing assistance, which we construct based on 3.9k templates (e.g. “write a blog about”) and 212 political issues (e.g. “AI regulation”) from real user interactions. Using IssueBench, we show that issue biases are common and persistent in state-of-the-art LLMs. We also show that biases are remarkably similar across models, and that all models align more with US Democrat than Republican voter opinion on a subset of issues. IssueBench can easily be adapted to include other issues, templates, or tasks. By enabling robust and realistic measurement, we hope that IssueBench can bring a new quality of evidence to ongoing discussions about LLM biases and how to address them.
zh

[NLP-22] Unveiling Global Discourse Structures: Theoretical Analysis and NLP Applications in Argument Mining

【速读】: 该论文旨在解决论证挖掘(Argument Mining)过程中全局话语结构的检测、提取与表征问题。当前方法在论证成分的提取与分类方面存在不足。为克服这些挑战,论文提出了一种利用新型自然语言处理(NLP)技术的论证挖掘架构,以提高模型的泛化能力,从而更有效地识别和表征连贯的论证结构。

链接: https://arxiv.org/abs/2502.08371
作者: Christopher van Le
机构: University of Applied Sciences for Engineering and Economics (应用科技大学)
HTW Berlin (HTW 柏林)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Particularly in the structure of global discourse, coherence plays a pivotal role in human text comprehension and is a hallmark of high-quality text. This is especially true for persuasive texts, where coherent argument structures support claims effectively. This paper discusses and proposes methods for detecting, extracting and representing these global discourse structures in a process called Argument(ation) Mining. We begin by defining key terms and processes of discourse structure analysis, then continue to summarize existing research on the matter, and identify shortcomings in current argument component extraction and classification methods. Furthermore, we will outline an architecture for argument mining that focuses on making models more generalisable while overcoming challenges in the current field of research by utilizing novel NLP techniques. This paper reviews current knowledge, summarizes recent works, and outlines our NLP pipeline, aiming to contribute to the theoretical understanding of global discourse structures.
zh

[NLP-23] Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

【速读】: 该论文旨在解决Transformer基座大语言模型(LLMs)中自注意力机制计算复杂度高的问题。关键在于引入Top-θ注意力机制,通过与精心校准的阈值进行比较来选择性剪枝不重要的注意力元素,从而大幅提高自注意力矩阵乘法的效率,同时保持模型精度。这种方法无需模型重新训练,仅需短暂校准阶段即可适应分布变化,并且避免了全向量依赖,适用于平铺和扩展,同时消除了昂贵的top-k搜索。

链接: https://arxiv.org/abs/2502.08363
作者: Konstantin Berestizshevsky,Renzo Andri,Lukas Cavigelli
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 11 figures, work under submission

点击查看摘要

Abstract:The attention mechanism is essential for the impressive capabilities of transformer-based Large Language Models (LLMs). However, calculating attention is computationally intensive due to its quadratic dependency on the sequence length. We introduce a novel approach called Top-Theta Attention, or simply Top-θ, which selectively prunes less essential attention elements by comparing them against carefully calibrated thresholds. This method greatly improves the efficiency of self-attention matrix multiplication while preserving model accuracy, reducing the number of required V cache rows by 3x during generative decoding and the number of attention elements by 10x during the prefill phase. Our method does not require model retraining; instead, it requires only a brief calibration phase to be resilient to distribution shifts, thus not requiring the thresholds for different datasets to be recalibrated. Unlike top-k attention, Top-θ eliminates full-vector dependency, making it suitable for tiling and scale-out and avoiding costly top-k search. A key innovation of our approach is the development of efficient numerical compensation techniques, which help preserve model accuracy even under aggressive pruning of attention scores.
zh
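(补充示意)Top-θ 的基本操作是:将注意力得分与校准好的阈值比较,低于阈值的元素直接剪掉、不参与 softmax。下面是单查询场景的简化实现;θ 取手工设定的假设值(论文中需离线校准),且未包含论文的数值补偿技术:

```python
import numpy as np

def top_theta_attention(q, K, V, theta):
    """阈值剪枝注意力示意:得分低于 theta 的键值对被置为 -inf,
    softmax 后权重为 0,等价于跳过对应的 V 行。"""
    scores = K @ q / np.sqrt(q.shape[0])
    mask = scores >= theta
    if not mask.any():               # 全部被剪掉时至少保留得分最高的一个
        mask[scores.argmax()] = True
    pruned = np.where(mask, scores, -np.inf)
    weights = np.exp(pruned - pruned[mask].max())
    weights = weights / weights.sum()
    return weights @ V, int(mask.sum())

rng = np.random.default_rng(1)
q = rng.normal(size=16)
K = rng.normal(size=(32, 16))
V = rng.normal(size=(32, 8))
out, kept = top_theta_attention(q, K, V, theta=0.5)
print(kept, "of 32 rows kept")
```

论文强调的正是 `kept` 远小于总行数时,生成式解码阶段 V cache 的读取量可以大幅下降,而无需像 top-k 那样对整行得分做排序搜索。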

[NLP-24] Systematic Knowledge Injection into Large Language Models via Diverse Augmentation for Domain-Specific RAG NAACL2025

【速读】: 该论文旨在解决 Retrieval-Augmented Generation (RAG) 模型在检索出错时产生幻觉和错误答案的问题。关键在于引入一种新颖的框架,通过上下文增强(context augmentation)和知识释义(knowledge paraphrasing)两种方式改进模型的微调过程:前者教导模型何时忽略、何时依赖检索内容,后者通过对同一问题的多个答案进行训练,使模型更好地内化专业知识。该方法在保持模型泛化能力的同时,使token级召回率获得最高10%的相对增益。

链接: https://arxiv.org/abs/2502.08356
作者: Kushagra Bhushan,Yatin Nandwani,Dinesh Khandelwal,Sonam Gupta,Gaurav Pandey,Dinesh Raghu,Sachindra Joshi
机构: 未知
类目: Computation and Language (cs.CL)
备注: 22 pages, 14 tables, to be published in NAACL 2025

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a prominent method for incorporating domain knowledge into Large Language Models (LLMs). While RAG enhances response relevance by incorporating retrieved domain knowledge in the context, retrieval errors can still lead to hallucinations and incorrect answers. To recover from retriever failures, domain knowledge is injected by fine-tuning the model to generate the correct response, even in the case of retrieval errors. However, we observe that without systematic knowledge augmentation, fine-tuned LLMs may memorize new information but still fail to extract relevant domain knowledge, leading to poor performance. In this work, we present a novel framework that significantly enhances the fine-tuning process by augmenting the training data in two ways – context augmentation and knowledge paraphrasing. In context augmentation, we create multiple training samples for a given QA pair by varying the relevance of the retrieved information, teaching the model when to ignore and when to rely on retrieved content. In knowledge paraphrasing, we fine-tune with multiple answers to the same question, enabling LLMs to better internalize specialized knowledge. To mitigate catastrophic forgetting due to fine-tuning, we add a domain-specific identifier to a question and also utilize a replay buffer containing general QA pairs. Experimental results demonstrate the efficacy of our method over existing techniques, achieving up to 10% relative gain in token-level recall while preserving the LLM’s generalization capabilities.
zh
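(补充示意)上下文增强的思路是:对同一个 QA 对,构造检索质量各不相同的训练样本,让模型同时见到"检索成功""部分相关"与"检索失败"三种情形。以下是按摘要描述写的简化构造函数,字段名与样本配比均为本文假设:

```python
def build_context_augmented_samples(question, answer, gold_doc, distractor_docs):
    """上下文增强示意:同一 QA 对生成不同检索相关性的训练样本,
    教会模型何时依赖、何时忽略检索内容(按摘要思路的简化版)。"""
    samples = []
    # 1) 检索成功:上下文包含金文档
    samples.append({"context": [gold_doc] + distractor_docs[:1],
                    "question": question, "answer": answer})
    # 2) 部分相关:金文档混在更多干扰文档之间
    samples.append({"context": distractor_docs[:1] + [gold_doc] + distractor_docs[1:2],
                    "question": question, "answer": answer})
    # 3) 检索失败:只有干扰文档,模型需依靠微调注入的参数化知识作答
    samples.append({"context": distractor_docs[:2],
                    "question": question, "answer": answer})
    return samples

samples = build_context_augmented_samples(
    "What is X?", "X is ...", "gold passage about X",
    ["unrelated passage 1", "unrelated passage 2"])
print(len(samples))
```

知识释义则对应另一个维度:对同一问题用多个改写后的答案进行训练;摘要中提到的领域标识符和通用 QA 重放缓冲区用于缓解灾难性遗忘,此处未演示。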

[NLP-25] Contextual Compression Encoding for Large Language Models : A Novel Framework for Multi-Layered Parameter Space Pruning

【速读】: 该论文旨在解决随着模型规模不断增大而引入的计算瓶颈问题,以促进高效部署。关键解决方案是提出了结构化编码方法——上下文感知压缩编码(Contextual Compression Encoding, CCE),通过多阶段编码机制动态重构参数分布,在保持跨层表征保真度的同时,显著减少了内存占用和计算复杂性。实验评估表明,通过CCE压缩的模型在多种文本生成和分类任务中保持了语言表达性和连贯性,并且在中间网络层表现出更高的压缩比。与传统量化和剪枝方法相比,CCE提供了更平衡的效率与模型保留之间的权衡。

链接: https://arxiv.org/abs/2502.08323
作者: Barnaby Schmitt,Alistair Grosvenor,Matthias Cunningham,Clementine Walsh,Julius Pembrokeshire,Jonathan Teel
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Context-aware compression techniques have gained increasing attention as model sizes continue to grow, introducing computational bottlenecks that hinder efficient deployment. A structured encoding approach was proposed to selectively eliminate redundant parameter groups while ensuring that representational fidelity was preserved across multiple layers. Contextual Compression Encoding (CCE) introduced a multi-stage encoding mechanism that dynamically restructured parameter distributions, allowing for significant reductions in memory footprint and computational complexity. Experimental evaluations demonstrated that models compressed through CCE retained linguistic expressivity and coherence, maintaining accuracy across a range of text generation and classification tasks. Layer-wise analysis revealed that middle-network layers exhibited higher compression ratios, aligning with the observation that self-attention and feed-forward transformations contained redundancies that could be reorganized without impairing functional capacity. Comparisons against conventional quantization and pruning methods confirmed that CCE provided a more balanced trade-off between efficiency and model retention, achieving reductions in energy consumption and inference latency without requiring extensive retraining. Computational efficiency improvements were particularly evident in deployment scenarios involving resource-constrained environments, where reductions in memory usage enabled more scalable implementations. Further analyses of internal network behavior showed that compressed models exhibited stable activation distributions and adapted dynamically to input variations, reinforcing the viability of structured compression strategies for optimizing large-scale architectures.
zh

[NLP-26] MultiProSE: A Multi-label Arabic Dataset for Propaganda Sentiment and Emotion Detection

【速读】: 该论文旨在解决阿拉伯语中多标签宣传、情感和情绪检测资源极度有限的问题。关键解决方案在于引入了首个阿拉伯语多标签宣传、情感和情绪(MultiProSE)数据集,该数据集包含8,000篇标注新闻文章,并扩展了现有的阿拉伯语宣传数据集ArPro,增加了每段文本的情感和情绪注释。此外,针对不同任务开发了基于大型语言模型(如GPT-4o-mini)和预训练语言模型(包括三种基于BERT的模型)的多个基线模型。

链接: https://arxiv.org/abs/2502.08319
作者: Lubna Al-Henaki,Hend Al-Khalifa,Abdulmalik Al-Salman,Hajar Alqubayshi,Hind Al-Twailay,Gheeda Alghamdi,Hawra Aljasim
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages, 3 figuers, 4 tabels

点击查看摘要

Abstract:Propaganda is a form of persuasion that has been used throughout history with the intention of influencing people’s opinions through rhetorical and psychological persuasion techniques for determined ends. Although Arabic ranked as the fourth most-used language on the internet, resources for propaganda detection in languages other than English, especially Arabic, remain extremely limited. To address this gap, the first Arabic dataset for Multi-label Propaganda, Sentiment, and Emotion (MultiProSE) has been introduced. MultiProSE is an open-source extension of the existing Arabic propaganda dataset, ArPro, with the addition of sentiment and emotion annotations for each text. This dataset comprises 8,000 annotated news articles, which is the largest propaganda dataset to date. For each task, several baselines have been developed using large language models (LLMs), such as GPT-4o-mini, and pre-trained language models (PLMs), including three BERT-based models. The dataset, annotation guidelines, and source code are all publicly released to facilitate future research and development in Arabic language models and contribute to a deeper understanding of how various opinion dimensions interact in news media.
zh

[NLP-27] Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting NAACL

【速读】: 该论文旨在解决大型视觉语言模型(LVLMs)在处理空间关系时产生的幻觉问题,即生成关于图像内物体位置和空间配置的错误预测。为了解决这一问题,论文提出了一种约束感知提示框架,关键在于引入两种类型的约束:双向约束(bidirectional constraint),确保成对物体关系的一致性;传递性约束(transitivity constraint),强制多物体间的关系依赖。通过融合这些约束,模型能够生成更具有空间连贯性和一致性的输出。

链接: https://arxiv.org/abs/2502.08317
作者: Jiarui Wu,Zhuo Liu,Hangfeng He
机构: University of Rochester (罗彻斯特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, accepted to NAACL Findings

点击查看摘要

Abstract:Spatial relation hallucinations pose a persistent challenge in large vision-language models (LVLMs), leading to generate incorrect predictions about object positions and spatial configurations within an image. To address this issue, we propose a constraint-aware prompting framework designed to reduce spatial relation hallucinations. Specifically, we introduce two types of constraints: (1) bidirectional constraint, which ensures consistency in pairwise object relations, and (2) transitivity constraint, which enforces relational dependence across multiple objects. By incorporating these constraints, LVLMs can produce more spatially coherent and consistent outputs. We evaluate our method on three widely-used spatial relation datasets, demonstrating performance improvements over existing approaches. Additionally, a systematic analysis of various bidirectional relation analysis choices and transitivity reference selections highlights greater possibilities of our methods in incorporating constraints to mitigate spatial relation hallucinations.
zh
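(补充示意)这两类约束本身可以在模型输出之外独立检查:双向约束要求成对关系方向一致,传递性约束要求多物体关系能通过闭包推出。下面用纯 Python 给出两个检查的玩具实现,与论文的提示词框架无关,仅演示约束含义:

```python
def violates_bidirectional(relations):
    """双向约束:对非对称空间关系 r(如 'left of'),
    (a, r, b) 与 (b, r, a) 不能同时被断言。"""
    asserted = set(relations)
    return any((b, rel, a) in asserted for a, rel, b in relations)

def transitive_closure(relations, rel):
    """传递性约束:由 (a,r,b) 与 (b,r,c) 推出 (a,r,c),
    可用于检查模型输出的多物体关系是否自洽。"""
    pairs = {(a, b) for a, r, b in relations if r == rel}
    while True:
        new = {(a, c) for a, b in pairs for b2, c in pairs if b == b2}
        if new <= pairs:        # 闭包不再增长
            return pairs
        pairs |= new

facts = [("cup", "left of", "plate"), ("plate", "left of", "fork")]
closure = transitive_closure(facts, "left of")
print(("cup", "fork") in closure)      # 由传递性推出
print(violates_bidirectional(facts))   # 无方向冲突
```

论文的做法是把这类约束写入提示词,引导 LVLM 在生成时自行保持一致;上述检查也可以作为生成后的验证步骤使用。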

[NLP-28] Word Synchronization Challenge: A Benchmark for Word Association Responses for LLM s

【速读】: 该论文旨在解决评估大型语言模型(Large Language Models, LLMs)在人机交互(Human-Computer Interaction, HCI)中的表现问题。为了解决这一问题,论文提出了Word Synchronization Challenge,这是一个新的基准测试,通过一个动态的游戏化框架来评估LLMs在词联想过程中模仿人类认知过程的能力。关键在于通过模拟复杂的人类互动,考察LLMs在对话交流中如何理解和适应人类思维模式,从而实现有效的人机社交合作。初步结果表明模型复杂性对其性能有显著影响,这为理解LLMs在进行有意义的社会互动及以类人方式调整行为的能力提供了见解。

链接: https://arxiv.org/abs/2502.08312
作者: Tanguy Cazalets,Joni Dambre
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces the Word Synchronization Challenge, a novel benchmark to evaluate large language models (LLMs) in Human-Computer Interaction (HCI). This benchmark uses a dynamic game-like framework to test LLMs’ ability to mimic human cognitive processes through word associations. By simulating complex human interactions, it assesses how LLMs interpret and align with human thought patterns during conversational exchanges, which are essential for effective social partnerships in HCI. Initial findings highlight the influence of model sophistication on performance, offering insights into the models’ capabilities to engage in meaningful social interactions and adapt behaviors in human-like ways. This research advances the understanding of LLMs’ potential to replicate or diverge from human cognitive functions, paving the way for more nuanced and empathetic human-machine collaborations.
zh

[NLP-29] Compromising Honesty and Harmlessness in Language Models via Deception Attacks

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在特定攻击下的诚实性和无害性受损问题。论文的关键在于引入了一种新颖的欺骗攻击方法,通过微调技术增强模型的欺骗倾向,使其在特定话题上误导用户,同时保持其他方面的准确性。此外,研究还发现这些具有欺骗性的模型还会生成有害内容,如仇恨言论、刻板印象等。因此,论文强调了确保这些基于LLMs的系统在多轮对话中持续不欺骗的重要性,并指出需要采取措施来防范此类欺骗攻击。

链接: https://arxiv.org/abs/2502.08301
作者: Laurène Vaugrante,Francesca Carlon,Maluna Menke,Thilo Hagendorff
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. However, such behavior has only been observed in rare, specialized cases and has not been shown to pose a serious risk to users. Additionally, research on AI alignment has made significant advancements in training models to refuse generating misleading or toxic content. As a result, LLMs generally became honest and harmless. In this study, we introduce a novel attack that undermines both of these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences. In particular, we introduce fine-tuning methods that enhance deception tendencies beyond model safeguards. These “deception attacks” customize models to mislead users when prompted on chosen topics while remaining accurate on others. Furthermore, we find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content. Finally, we assess whether models can deceive consistently in multi-turn dialogues, yielding mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against deception attacks is critical.
zh

[NLP-30] Improving Existing Optimization Algorithms with LLM s

【速读】: 该论文旨在探索大型语言模型(Large Language Models, LLMs)如何增强现有的优化算法。关键在于利用LLMs的预训练知识,提出创新的启发式变体和实现策略。研究通过应用构造、合并、求解和自适应(Construct, Merge, Solve and Adapt, CMSA)算法验证了这一方法的有效性,结果显示GPT-4o提出的替代启发式方法在大规模和密集图上的性能优于专家设计的CMSA启发式方法。

链接: https://arxiv.org/abs/2502.08298
作者: Camilo Chacón Sartori,Christian Blum
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) into optimization has created a powerful synergy, opening exciting research opportunities. This paper investigates how LLMs can enhance existing optimization algorithms. Using their pre-trained knowledge, we demonstrate their ability to propose innovative heuristic variations and implementation strategies. To evaluate this, we applied a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt (CMSA) – a hybrid metaheuristic for combinatorial optimization problems that incorporates a heuristic in the solution construction phase. Our results show that an alternative heuristic proposed by GPT-4o outperforms the expert-designed heuristic of CMSA, with the performance gap widening on larger and denser graphs. Project URL: this https URL
zh

[NLP-31] Redefining Simplicity: Benchmarking Large Language Models from Lexical to Document Simplification

【速读】: 该论文旨在系统评估大型语言模型(Large Language Models, LLMs)在四个文本简化(Text Simplification, TS)任务中的表现:词汇级简化、句法级简化、句子级简化和文档级简化。解决方案的关键在于通过自动度量和人工评估,将轻量级、闭源与开源LLMs同传统非LLM方法进行对比,结果表明LLMs在全部四个任务中均优于非LLM方法,且其生成的输出质量通常超过现有的人工标注参考。

链接: https://arxiv.org/abs/2502.08281
作者: Jipeng Qiang,Minjiang Huang,Yi Zhu,Yunhao Yuan,Chaowei Zhang,Kui Yu
机构: School of Information and Engineering, Yangzhou University (扬州大学); School of Computer Science and Information Technology, Hefei University of Technology (合肥工业大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text simplification (TS) refers to the process of reducing the complexity of a text while retaining its original meaning and key information. Existing work only shows that large language models (LLMs) have outperformed supervised non-LLM-based methods on sentence simplification. This study offers the first comprehensive analysis of LLM performance across four TS tasks: lexical, syntactic, sentence, and document simplification. We compare lightweight, closed-source and open-source LLMs against traditional non-LLM methods using automatic metrics and human evaluations. Our experiments reveal that LLMs not only outperform non-LLM approaches in all four tasks but also often generate outputs that exceed the quality of existing human-annotated references. Finally, we present some future directions of TS in the era of LLMs.
zh

[NLP-32] What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations

【速读】: 该论文旨在解决将录制的视频转化为简洁且准确的文字摘要这一多模态学习中的挑战,特别是在科学领域。解决方案的关键在于引入了一个名为VISTA的数据集,该数据集包含18,599个AI会议报告视频及其对应的论文摘要。此外,论文采用了一种基于计划的框架来更好地捕捉摘要的结构化特性,从而提升总结的质量和事实一致性。尽管如此,模型性能与人类表现之间仍存在显著差距,表明科学视频总结任务依然充满挑战。

链接: https://arxiv.org/abs/2502.08279
作者: Dongqi Liu,Chenxi Whitehouse,Xi Yu,Louis Mahon,Rohit Saxena,Zheng Zhao,Yifu Qiu,Mirella Lapata,Vera Demberg
机构: Saarland University; University of Cambridge; University of Edinburgh
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2306.02873 by other authors

点击查看摘要

Abstract:Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of scientific video summarization.
zh

[NLP-33] Dealing with Annotator Disagreement in Hate Speech Classification

【速读】: 该论文旨在解决构建仇恨言论检测模型时,由于仇恨言论的多样性和主观性导致标注者理解不一、从而产生标注分歧的问题。解决方案的关键在于基于微调的BERT模型,系统评估处理土耳其语推文仇恨言论分类中标注者分歧的不同策略。

链接: https://arxiv.org/abs/2502.08266
作者: Somaiyeh Dehghan,Mehmet Umut Sen,Berrin Yanikoglu
机构: Sabanci University(萨班哲大学); Center of Excellence in Data Analytics (VERIM), Sabanci University(数据解析卓越中心 (VERIM), 萨班哲大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Hate speech detection is a crucial task, especially on social media, where harmful content can spread quickly. Implementing machine learning models to automatically identify and address hate speech is essential for mitigating its impact and preventing its proliferation. The first step in developing an effective hate speech detection model is to acquire a high-quality dataset for training. Labeled data is foundational for most natural language processing tasks, but categorizing hate speech is difficult due to the diverse and often subjective nature of hate speech, which can lead to varying interpretations and disagreements among annotators. This paper examines strategies for addressing annotator disagreement, an issue that has been largely overlooked. In particular, we evaluate different approaches to deal with annotator disagreement regarding hate speech classification in Turkish tweets, based on a fine-tuned BERT model. Our work highlights the importance of the problem and provides state-of-art benchmark results for detection and understanding of hate speech in online discourse.
zh

[NLP-34] Exploring the Potential of Large Language Models to Simulate Personality EMNLP2024

【速读】: 该论文旨在解决通过大型语言模型(LLMs)模拟人格特质以增强对话系统个性化的问题。关键在于开发一个包含预定义五大人格特征的数据集,并提供一个分析框架来测试LLMs在模拟人格技能方面的表现。研究表明,生成与人格相关文本的任务对于现有模型仍然具有挑战性。

链接: https://arxiv.org/abs/2502.08265
作者: Maria Molchanova,Anna Mikhailova,Anna Korzanova,Lidiia Ostyakova,Alexandra Dolidze
机构: MIPT, Russia (莫斯科物理技术学院,俄罗斯); HSE University, Russia (高等经济学院,俄罗斯)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint submitted to Workshop on Customizable NLP (CustomNLP4U) on EMNLP2024

点击查看摘要

Abstract:With the advancement of large language models (LLMs), the focus in Conversational AI has shifted from merely generating coherent and relevant responses to tackling more complex challenges, such as personalizing dialogue systems. In an effort to enhance user engagement, chatbots are often designed to mimic human behaviour, responding within a defined emotional spectrum and aligning to a set of values. In this paper, we aim to simulate personal traits according to the Big Five model with the use of LLMs. Our research showed that generating personality-related texts is still a challenging task for the models. As a result, we present a dataset of generated texts with the predefined Big Five characteristics and provide an analytical framework for testing LLMs on a simulation of personality skills.
zh

[NLP-35] Inference-time sparse attention with asymmetric indexing

【速读】: 该论文旨在解决通过GPU-compliant向量搜索算法加速自注意力机制(Self-Attention)过程中存在的效率低下问题。主要挑战源于标准分区方法在处理键(keys)与查询(queries)的不同分布以及旋转位置编码(RoPE positional encoding)效应时表现不佳。论文的关键解决方案是引入了SAAP(Self-Attention with Asymmetric Partitions),这是一种不对称索引技术,通过为键和查询分别采用不同的分区方式,从而以数据自适应稀疏模式近似自注意力机制。这种方法无需微调预训练的语言模型,仅需训练一个小的查询分类器即可实现。在Llama 3.1-8b模型上的实验表明,SAAP方法通常将需要查找的内存比例降低至原来的二十分之一,相比FlashAttention-v2节省了60%的时间。

链接: https://arxiv.org/abs/2502.08246
作者: Pierre-Emmanuel Mazaré,Gergely Szilvasy,Maria Lomeli,Francisco Massa,Naila Murray,Hervé Jégou,Matthijs Douze
机构: Meta
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Self-attention in transformer models is an incremental associative memory that maps key vectors to value vectors. One way to speed up self-attention is to employ GPU-compliant vector search algorithms, yet the standard partitioning methods yield poor results in this context, because (1) keys and queries follow different distributions and (2) the effect of RoPE positional encoding. In this paper, we introduce SAAP (Self-Attention with Asymmetric Partitions), which overcomes these problems. It is an asymmetrical indexing technique that employs distinct partitions for keys and queries, thereby approximating self-attention with a data-adaptive sparsity pattern. It works on pretrained language models without finetuning, as it only requires to train (offline) a small query classifier. On a long context Llama 3.1-8b model, with sequences ranging from 100k to 500k tokens, our method typically reduces by a factor 20 the fraction of memory that needs to be looked-up, which translates to a time saving of 60% when compared to FlashAttention-v2.
zh

[NLP-36] LLM Modules: Knowledge Transfer from a Large to a Small Model using Enhanced Cross-Attention

【速读】: 该论文旨在解决如何有效地将大型预训练模型的知识转移到较小模型的问题。解决方案的关键在于提出了一种LLM模块架构,并采用增强型交叉注意力机制(Enhanced Cross-Attention Mechanism),通过冻结Qwen2-1.5B模型并传递其表示到GPT-Neo-125M模型来实现知识转移,从而在有限的计算资源下训练出性能可与蒸馏方法相媲美的组合模型。
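
其中“冻结大模型表示 + 可训练投影 + 交叉注意力融合”的结构可以用如下示意代码勾勒。维度、随机权重均为假设,仅展示数据流走向,非论文官方实现:

```python
import numpy as np

rng = np.random.default_rng(0)
d_large, d_small, n_ctx = 8, 4, 5

# 冻结的大模型表示(假设已离线前向计算得到,训练中不更新)
h_large = rng.normal(size=(n_ctx, d_large))
# 小模型当前的隐藏状态
h_small = rng.normal(size=(1, d_small))

# 可训练的投影:把大模型表示映射到小模型维度(增强型交叉注意力的简化形式)
W_k = rng.normal(size=(d_large, d_small)) * 0.1
W_v = rng.normal(size=(d_large, d_small)) * 0.1

K = h_large @ W_k
V = h_large @ W_v
scores = (h_small @ K.T) / np.sqrt(d_small)
attn = np.exp(scores - scores.max())
attn = attn / attn.sum()
# 残差式融合:小模型状态 + 来自冻结大模型的上下文信息
fused = h_small + attn @ V
print(fused.shape)
```

训练时只更新 W_k、W_v 这类注意力层参数,大模型权重保持冻结,这也是该方案能在有限算力下运行的原因。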

链接: https://arxiv.org/abs/2502.08213
作者: Konstantin Kolomeitsev (Almaty, Kazakhstan)
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Code and pre-trained weights available at this https URL

点击查看摘要

Abstract:In this work, we propose an architecture of LLM Modules that enables the transfer of knowledge from a large pre-trained model to a smaller model using an Enhanced Cross-Attention mechanism. In the proposed scheme, the Qwen2-1.5B model is frozen and its representations are passed through specially designed attention layers to the GPT-Neo-125M model, which is trained on limited computational resources. Experimental results on the Bespoke-Stratos-17k dataset demonstrate that after 15 epochs of training, the combined model generates responses comparable in quality to those obtained by distillation. We discuss the advantages of the modular approach, provide examples of input queries and comparative analysis, and outline prospects for further extension of the method.
zh

[NLP-37] Wisdom of the Crowds in Forecasting: Forecast Summarization for Supporting Future Event Prediction

【速读】: 该论文旨在解决未来事件预测(Future Event Prediction, FEP)中的复杂事件难以通过传统数值数据准确捕捉其语义信息的问题。解决方案的关键在于利用集体智慧(Crowd Wisdom),通过聚合个体预测来形成累积观点,从而提高对未来事件发生可能性估计的准确性。论文还提出了一种新的数据模型来表示个体预测陈述,以支持这一方法。

链接: https://arxiv.org/abs/2502.08205
作者: Anisha Saha,Adam Jatowt
机构: Max Planck Institute for Informatics (马克斯·普朗克信息学研究所), Saarland Informatics Campus (萨尔州计算机科学校区); University of Innsbruck (因斯布鲁克大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Future Event Prediction (FEP) is an essential activity whose demand and application range across multiple domains. While traditional methods like simulations, predictive and time-series forecasting have demonstrated promising outcomes, their application in forecasting complex events is not entirely reliable due to the inability of numerical data to accurately capture the semantic information related to events. One forecasting way is to gather and aggregate collective opinions on the future to make predictions as cumulative perspectives carry the potential to help estimating the likelihood of upcoming events. In this work, we organize the existing research and frameworks that aim to support future event prediction based on crowd wisdom through aggregating individual forecasts. We discuss the challenges involved, available datasets, as well as the scope of improvement and future research directions for this task. We also introduce a novel data model to represent individual forecast statements.
zh

[NLP-38] Enhancing LLM Character-Level Manipulation via Divide and Conquer

【速读】: 该论文旨在解决大型语言模型(LLMs)在字符级字符串操作中的显著弱点,特别是在字符删除、插入和替换等基本操作中的不足。这些问题主要源于分词限制。论文的关键在于提出了一种名为“分而治之的字符级操作”(Character-Level Manipulation via Divide and Conquer)的新方法,通过将复杂操作分解为显式的字符级子任务,并结合受控的分词重建阶段,从而显著提升了准确性。这种方法无需额外训练即可显著改善删除、插入和替换任务的性能。
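
上述“拆分—原子操作—受控重建”的分解思路可用如下玩具代码说明。这只是对方法骨架的示意,并非论文中由 LLM 通过提示执行的实际流程:

```python
# 示意:将字符级操作分解为"拆分 -> 原子子任务 -> 重建"三个步骤,
# 模拟 Divide and Conquer 的思路(简化版,非官方实现)。

def divide(word):
    return list(word)          # 拆分为字符序列(原子化)

def conquer(chars, op, *args):
    chars = chars[:]
    if op == "delete":
        chars.remove(args[0])              # 删除首个目标字符
    elif op == "insert":
        pos, ch = args
        chars.insert(pos, ch)
    elif op == "substitute":
        old, new = args
        chars = [new if c == old else c for c in chars]
    return chars

def reconstruct(chars):
    return "".join(chars)      # 受控重建回完整词元

def char_edit(word, op, *args):
    return reconstruct(conquer(divide(word), op, *args))

print(char_edit("banana", "delete", "a"))           # bnana
print(char_edit("banana", "substitute", "a", "o"))  # bonono
print(char_edit("cat", "insert", 1, "h"))           # chat
```

论文的关键在于让 LLM 在提示中显式执行这三个子任务,从而绕开分词器把多个字符合并为单一 token 的限制。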

链接: https://arxiv.org/abs/2502.08180
作者: Zhen Xiong,Yujun Cai,Bryan Hooi,Nanyun Peng,Kai-Wei Chang,Zhecheng Li,Yiwei Wang
机构: University of Southern California(南加州大学); The University of Queensland(昆士兰大学); National University of Singapore(新加坡国立大学); University of California, Los Angeles(加州大学洛杉矶分校); University of California, San Diego(加州大学圣地亚哥分校); University of California, Merced(加州大学默塞德分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks. However, they exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution. These challenges stem primarily from tokenization constraints, despite the critical role of such operations in data preprocessing and code generation. Through systematic analysis, we derive two key insights: (1) LLMs face significant difficulties in leveraging intrinsic token knowledge for character-level reasoning, and (2) atomized word structures can substantially enhance LLMs’ ability to process token-level structural information. Building on these insights, we propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation. Our method decomposes complex operations into explicit character-level subtasks coupled with controlled token reconstruction phases, leading to significant improvements in accuracy. Without additional training, our method significantly improves accuracies on the Deletion, Insertion, and Substitution tasks. To support further research, we open-source our implementation and benchmarks.
zh

[NLP-39] ParetoRAG: Leveraging Sentence-Context Attention for Robust and Efficient Retrieval-Augmented Generation

【速读】: 该论文旨在解决 Retrieval-Augmented Generation (RAG) 系统在检索效率低下及大型语言模型 (LLMs) 无法有效过滤无关信息方面的问题。关键解决方案是提出了一种名为 ParetoRAG 的无监督框架,通过遵循帕累托原则在句子层面进行细化,实现段落分解与核心内容动态重权,从而在不增加额外训练或 API 资源的情况下,同时提升检索精度和生成质量。
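
帕累托式的句级精炼流程可以用如下草图表示:将段落切分为句子、逐句打分、按约 20% 的比例保留核心句并按原文顺序重组以维持连贯性。其中的打分函数(与查询的词重叠率)是为演示而假设的,论文实际使用的是句子-上下文注意力权重:

```python
import math

# 示意:按帕累托原则(约 20% 的句子承载核心信息)做句级精炼,非官方实现。

def split_sentences(paragraph):
    return [s.strip() for s in paragraph.split(".") if s.strip()]

def score(sentence, query):
    # 假设的简化打分:与查询的词重叠率(论文中为注意力权重)
    q, s = set(query.split()), set(sentence.split())
    return len(q & s) / (len(q) or 1)

def pareto_refine(paragraph, query, keep_ratio=0.2):
    sents = split_sentences(paragraph)
    ranked = sorted(sents, key=lambda s: score(s, query), reverse=True)
    k = max(1, math.ceil(len(ranked) * keep_ratio))
    core = set(ranked[:k])
    # 保持上下文连贯:按原文顺序输出被保留的核心句
    return [s for s in sents if s in core]

para = ("Cats sleep a lot. The Eiffel Tower is in Paris. Dogs bark. "
        "Paris is the capital of France. Birds fly.")
query = "Where is the Eiffel Tower"
result = pareto_refine(para, query)
print(result)  # ['The Eiffel Tower is in Paris']
```

这种精炼发生在检索之后、生成之前,因此无需额外训练或 API 调用,与论文的“无监督框架”定位一致。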

链接: https://arxiv.org/abs/2502.08178
作者: Ruobing Yao,Yifei Zhang,Shuang Song,Yuhua Liu,Neng Gao,Chenyang Tu
机构: Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所), Beijing, China; School of Cybersecurity, University of Chinese Academy of Sciences(中国科学院大学网络空间安全学院), Beijing, China; Alibaba Group(阿里巴巴集团), Beijing, China
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by incorporating external knowledge, they still face persistent challenges in retrieval inefficiency and the inability of LLMs to filter out irrelevant information. We present ParetoRAG, an unsupervised framework that optimizes RAG systems through sentence-level refinement guided by the Pareto principle. By decomposing paragraphs into sentences and dynamically re-weighting core content while preserving contextual coherence, ParetoRAG achieves dual improvements in both retrieval precision and generation quality without requiring additional training or API resources. This framework has been empirically validated across various datasets, LLMs, and retrievers.
zh

[NLP-40] SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation

【速读】: 该论文旨在解决合成孔径雷达(SAR)遥感图像解释领域中视觉语言模型(VLMs)应用受限的问题。由于现有模型缺乏足够的领域专业知识,VLMs 在专业领域的应用受到限制。为了解决这一问题,论文的关键创新在于提出了首个大规模多模态对话数据集 SARChat-2M,其中包含约 200 万高质量的图像-文本对,并涵盖了具有详细目标注释的各种场景。此数据集不仅支持视觉理解与物体检测等关键任务,还通过开发和评估 VLMs 在 SAR 图像解释中的能力,为构建各类遥感垂直领域的多模态数据集提供了范例框架。通过在 16 种主流 VLMs 上进行实验,验证了数据集的有效性,并成功建立了首个 SAR 领域的多任务对话基准。

链接: https://arxiv.org/abs/2502.08168
作者: Zhiming Ma,Xiayang Xiao,Sihao Dong,Peidong Wang,HaiPeng Wang,Qingyun Pan
机构: The Key Laboratory for Information Science of Electromagnetic Waves (教育部); School of Information Science and Technology (信息科学与技术学院), Fudan University (复旦大学), Shanghai (上海), China (中国); China Mobile Internet Company Ltd. (中国移动互联网有限公司), Guangzhou (广州), China (中国); The School of Automation and Electrical Engineering (自动化与电气工程学院), Inner Mongolia University of Science and Technology (内蒙古科技大学), Baotou (包头), China (中国); School of Computer Science and Engineering (计算机科学与工程学院), Northeastern University (东北大学), Shenyang (沈阳), China (中国); China Mobile Group Guangdong Co., Ltd. Guangzhou Branch (中国移动通信集团广东有限公司广州分公司), Guangzhou (广州), China (中国)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the field of synthetic aperture radar (SAR) remote sensing image interpretation, although Vision language models (VLMs) have made remarkable progress in natural language processing and image understanding, their applications remain limited in professional domains due to insufficient domain expertise. This paper innovatively proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which contains approximately 2 million high-quality image-text pairs, encompasses diverse scenarios with detailed target annotations. This dataset not only supports several key tasks such as visual understanding and object detection tasks, but also has unique innovative aspects: this study develop a visual-language dataset and benchmark for the SAR domain, enabling and evaluating VLMs’ capabilities in SAR image interpretation, which provides a paradigmatic framework for constructing multimodal datasets across various remote sensing vertical domains. Through experiments on 16 mainstream VLMs, the effectiveness of the dataset has been fully verified, and the first multi-task dialogue benchmark in the SAR field has been successfully established. The project will be released at this https URL, aiming to promote the in-depth development and wide application of SAR visual language models.
zh

[NLP-41] LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits

【速读】: 该论文旨在解决大规模语言模型(LLMs)在微调过程中随着参数量增加而导致的成本上升问题,即使采用参数高效微调(PEFT)方法如LoRA,仍然资源密集。论文的关键解决方案是引入LowRA框架,它实现了低于每参数2位的LoRA微调,并且性能损失极小。LowRA通过优化量化映射、阈值选择和精度分配,同时利用高效的CUDA内核来实现可扩展部署,从而在保证较高精度的同时大幅减少了内存使用。
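
“映射、阈值选择、精度分配”三个环节可以用一个玩具版量化器来说明:对不同参数组分配不同位宽(这里是 1 比特 / 2 比特),使平均位宽低于 2。这仅用于演示思路,与论文的细粒度量化算法和 CUDA 内核实现无关:

```python
# 示意:LowRA 式超低位宽量化的玩具版本(非官方实现)。

def quantize(ws, bits):
    # 阈值选择:在 [min, max] 上均匀划分区间;码本取每个区间的中点
    n_levels = 2 ** bits
    lo, hi = min(ws), max(ws)
    step = (hi - lo) / n_levels or 1.0
    out = []
    for w in ws:
        idx = min(int((w - lo) / step), n_levels - 1)
        out.append(lo + (idx + 0.5) * step)
    return out

def mixed_precision_quantize(groups, budgets):
    # 精度分配:敏感的参数组给更高位宽,其余用 1 比特
    return [quantize(g, b) for g, b in zip(groups, budgets)]

groups = [[0.9, -0.8, 0.7, -0.6],       # 数值范围大、较敏感的组
          [0.05, -0.04, 0.03, -0.02]]   # 数值较小的组
budgets = [2, 1]
deq = mixed_precision_quantize(groups, budgets)
avg_bits = sum(b * len(g) for g, b in zip(groups, budgets)) / \
    sum(len(g) for g in groups)
print(avg_bits)  # 1.5,即平均每参数低于 2 比特
```

真实方法还需按重建误差自动学习每组的阈值与位宽预算,这里的均匀划分只是最简单的基线。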

链接: https://arxiv.org/abs/2502.08141
作者: Zikai Zhou,Qizheng Zhang,Hermann Kumbong,Kunle Olukotun
机构: Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) is increasingly costly as models scale to hundreds of billions of parameters, and even parameter-efficient fine-tuning (PEFT) methods like LoRA remain resource-intensive. We introduce LowRA, the first framework to enable LoRA fine-tuning below 2 bits per parameter with minimal performance loss. LowRA optimizes fine-grained quantization - mapping, threshold selection, and precision assignment - while leveraging efficient CUDA kernels for scalable deployment. Extensive evaluations across 4 LLMs and 4 datasets show that LowRA achieves a superior performance-precision trade-off above 2 bits and remains accurate down to 1.15 bits, reducing memory usage by up to 50%. Our results highlight the potential of ultra-low-bit LoRA fine-tuning for resource-constrained environments.
zh

[NLP-42] Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models NAACL

【速读】: 该论文旨在解决在特定数据集上微调大型语言模型(LLMs)时常见的过拟合问题,即模型过度专精于任务或训练数据特征,导致泛化能力下降。论文提出的关键解决方案是选择性自监督微调(Selective Self-to-Supervised Fine-Tuning, S3FT),通过利用模型对查询的多个有效响应,减少模型在微调阶段的专精度。S3FT首先通过适当的评判器识别出训练集中模型的正确响应,然后使用这些正确响应以及金标准响应(或其释义)进行微调,从而提高模型的泛化能力。实验结果表明,与标准监督微调(SFT)相比,S3FT显著提升了模型在数学推理、Python编程和阅读理解等任务上的表现,并将性能下降幅度减半。
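
S3FT 的数据构造步骤大致如下(判定器与模型响应均为占位实现,仅演示控制流):判定器认可的样本用模型自身的回答作为训练目标,其余样本退回金标准答案:

```python
# 示意:S3FT 式训练集构造(简化版,非官方实现)。

def build_s3ft_dataset(examples, model_response, judge):
    data = []
    for ex in examples:
        pred = model_response(ex["query"])
        if judge(ex["query"], pred, ex["gold"]):
            # 模型自己的回答是正确的:直接用它,减少微调带来的过度专精
            data.append({"query": ex["query"], "target": pred})
        else:
            # 否则退回金标准答案(或其释义)
            data.append({"query": ex["query"], "target": ex["gold"]})
    return data

examples = [
    {"query": "2+2", "gold": "4"},
    {"query": "capital of France", "gold": "Paris"},
]
model_response = {"2+2": "4", "capital of France": "London"}.get  # 占位模型
judge = lambda q, pred, gold: pred == gold  # 简化判定器:精确匹配
data = build_s3ft_dataset(examples, model_response, judge)
print([d["target"] for d in data])  # ['4', 'Paris']
```

论文中的判定器是更强的等价性评判(而非精确匹配),用于识别与金标准语义一致的模型回答。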

链接: https://arxiv.org/abs/2502.08130
作者: Sonam Gupta,Yatin Nandwani,Asaf Yehudai,Dinesh Khandelwal,Dinesh Raghu,Sachindra Joshi
机构: IBM Research (IBM研究)
类目: Computation and Language (cs.CL)
备注: 10 pages, Accepted to NAACL Findings 2025

点击查看摘要

Abstract:Fine-tuning Large Language Models (LLMs) on specific datasets is a common practice to improve performance on target tasks. However, this performance gain often leads to overfitting, where the model becomes too specialized in either the task or the characteristics of the training data, resulting in a loss of generalization. This paper introduces Selective Self-to-Supervised Fine-Tuning (S3FT), a fine-tuning approach that achieves better performance than the standard supervised fine-tuning (SFT) while improving generalization. S3FT leverages the existence of multiple valid responses to a query. By utilizing the model’s correct responses, S3FT reduces model specialization during the fine-tuning stage. S3FT first identifies the correct model responses from the training set by deploying an appropriate judge. Then, it fine-tunes the model using the correct model responses and the gold response (or its paraphrase) for the remaining samples. The effectiveness of S3FT is demonstrated through experiments on mathematical reasoning, Python programming and reading comprehension tasks. The results show that standard SFT can lead to an average performance drop of up to 4.4 on multiple benchmarks, such as MMLU and TruthfulQA. In contrast, S3FT reduces this drop by half, i.e. 2.5, indicating better generalization capabilities than SFT while performing significantly better on the fine-tuning tasks.
zh

[NLP-43] Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance

【速读】: 该论文旨在评估大型语言模型(LLMs)在金融推理任务中的表现,并探索其局限性。研究发现,虽然更好的数据集和预训练可以提升金融推理能力,但通用增强方法如思维链(CoT)微调并不总能带来一致的性能提升。论文的关键解决方案在于开发了一种基于Llama-3.1-8B-Instruct的金融推理增强模型,通过特定领域的思维链微调和强化学习来改进金融推理能力。即使仅使用单一金融数据集进行简单微调,该模型也能实现跨任务的一致10%性能提升,超越其他所有8B参数规模的模型以及Llama3-70B-Instruct和Llama3.1-70B-Instruct模型的平均表现。

链接: https://arxiv.org/abs/2502.08127
作者: Lingfei Qian,Weipeng Zhou,Yan Wang,Xueqing Peng,Jimin Huang,Qianqian Xie
机构: TheFinAI
类目: Computation and Language (cs.CL)
备注: Ongoing work, 13 pages, 2 figures, 3 Tables

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have shown strong general reasoning abilities, yet their effectiveness in financial reasoning remains underexplored. In this study, we comprehensively evaluate 16 powerful reasoning and general LLMs on three complex financial tasks involving financial text, tabular data, and equations, assessing numerical reasoning, tabular interpretation, financial terminology comprehension, long-context processing, and equation-based problem solving. Our results show that while better datasets and pretraining improve financial reasoning, general enhancements like CoT fine-tuning do not always yield consistent gains. Moreover, all reasoning strategies face challenges in improving performance on long-context and multi-table tasks. To address these limitations, we develop a financial reasoning-enhanced model based on Llama-3.1-8B-Instruct, by CoT fine-tuning and reinforcement learning with domain-specific reasoning paths. Even with simple fine-tuning with one financial dataset, our model achieves a consistent 10% performance improvement across tasks, surpassing all 8B models and even Llama3-70B-Instruct and Llama3.1-70B-Instruct on average. Our results highlight the need for domain-specific adaptations in financial tasks, emphasizing future directions such as multi-table reasoning, long-context processing, and financial terminology comprehension. All our datasets, models, and codes are publicly available. Furthermore, we introduce a leaderboard for benchmarking future datasets and models.
zh

[NLP-44] HuDEx: Integrating Hallucination Detection and Explainability for Enhancing the Reliability of LLM responses

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成响应过程中存在的幻觉现象(hallucination),这会损害其可靠性,尤其是在需要高精度事实性的领域。论文的关键解决方案是提出了一种名为HuDEx的解释增强型幻觉检测模型,该模型不仅能够检测幻觉现象,还能提供详细的解释,从而帮助用户和LLM本身理解和减少错误。这种方法通过将检测与解释相结合,显著提升了幻觉检测的准确性和可靠性,并展示了良好的适应性,能够在零样本和其他测试环境中保持高性能。

链接: https://arxiv.org/abs/2502.08109
作者: Sujeong Lee,Hayoung Lee,Seongsoo Heo,Wonik Choi
机构: Inha University (仁荷大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have shown promising improvements, often surpassing existing methods across a wide range of downstream tasks in natural language processing. However, these models still face challenges, which may hinder their practical applicability. For example, the phenomenon of hallucination is known to compromise the reliability of LLMs, especially in fields that demand high factual precision. Current benchmarks primarily focus on hallucination detection and factuality evaluation but do not extend beyond identification. This paper proposes an explanation enhanced hallucination-detection model, coined as HuDEx, aimed at enhancing the reliability of LLM-generated responses by both detecting hallucinations and providing detailed explanations. The proposed model provides a novel approach to integrate detection with explanations, and enable both users and the LLM itself to understand and reduce errors. Our measurement results demonstrate that the proposed model surpasses larger LLMs, such as Llama3 70B and GPT-4, in hallucination detection accuracy, while maintaining reliable explanations. Furthermore, the proposed model performs well in both zero-shot and other test environments, showcasing its adaptability across diverse benchmark datasets. The proposed approach further enhances the hallucination detection research by introducing a novel approach to integrating interpretability with hallucination detection, which further enhances the performance and reliability of evaluating hallucinations in language models.
zh

[NLP-45] GCoT: Chain-of-Thought Prompt Learning for Graphs

【速读】: 该论文旨在解决如何为无文本信息的图数据设计链式思维(CoT)提示框架以引导图模型逐步学习的问题。关键在于将每个下游任务的适应过程分解为一系列基于提示的推理步骤,其中每个步骤包括提示驱动的推理、“思维”生成以及基于当前状态的节点特定提示学习。通过这种方式,论文提出了GCoT框架,利用预训练图编码器生成“思维”,并基于此“思维”为每个节点学习特定提示,从而实现逐步推理。

链接: https://arxiv.org/abs/2502.08092
作者: Xingtong Yu,Chang Zhou,Zhongwei Kuai,Xinming Zhang,Yuan Fang
机构: Singapore Management University(新加坡管理大学); University of Science and Technology of China(中国科学技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:Chain-of-thought (CoT) prompting has achieved remarkable success in natural language processing (NLP). However, its vast potential remains largely unexplored for graphs. This raises an interesting question: How can we design CoT prompting for graphs to guide graph models to learn step by step? On one hand, unlike natural languages, graphs are non-linear and characterized by complex topological structures. On the other hand, many graphs lack textual data, making it difficult to formulate language-based CoT prompting. In this work, we propose the first CoT prompt learning framework for text-free graphs, GCoT. Specifically, we decompose the adaptation process for each downstream task into a series of inference steps, with each step consisting of prompt-based inference, "thought" generation, and thought-conditioned prompt learning. While the steps mimic CoT prompting in NLP, the exact mechanism differs significantly. Specifically, at each step, an input graph, along with a prompt, is first fed into a pre-trained graph encoder for prompt-based inference. We then aggregate the hidden layers of the encoder to construct a "thought", which captures the working state of each node in the current step. Conditioned on this thought, we learn a prompt specific to each node based on the current state. These prompts are fed into the next inference step, repeating the cycle. To evaluate and analyze the effectiveness of GCoT, we conduct comprehensive experiments on eight public datasets, which demonstrate the advantage of our approach.
zh

[NLP-46] NLI under the Microscope: What Atomic Hypothesis Decomposition Reveals NAACL2025

【速读】: 该论文旨在通过将文本分解为原子命题,深入分析输入与输出文本,并在传统自然语言推理(NLI)和非单调自然语言推理(defeasible NLI)任务中应用这种分解方法。关键在于将复杂的推理问题细分为更小的子问题或粒度推断,以评估模型在逻辑一致性、不同推理类型的理解能力以及基准数据集中的示例多样性方面的表现。研究表明大型语言模型(LLMs)在处理原子NLI和非单调NLI子问题时仍存在逻辑一致性的问题。论文还确定了非单调NLI示例中关键的原子子问题,并提出了一种衡量模型推理一致性的方法,以捕捉模型在不同情境下对同一事实预测的一致性程度。
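
论文提出的“推理一致性”度量思路可以这样草拟:对同一事实在不同上下文下的预测结果,统计其一致程度。下面的定义(各事实上预测结果众数占比的平均值)是一个假设的简化版本,未必与论文的具体公式一致:

```python
from collections import Counter

# 示意:推理一致性(inferential consistency)的一种简化度量,非论文原始定义。

def inferential_consistency(predictions):
    # predictions: {事实: [该事实在不同上下文下预测是否正确的布尔列表]}
    scores = []
    for fact, outcomes in predictions.items():
        most_common = Counter(outcomes).most_common(1)[0][1]
        scores.append(most_common / len(outcomes))  # 众数占比:越高越一致
    return sum(scores) / len(scores)

preds = {
    "fact_A": [True, True, True, True],    # 完全一致(无论对错)
    "fact_B": [True, False, True, False],  # 最不一致:同一事实时对时错
}
print(inferential_consistency(preds))  # (1.0 + 0.5) / 2 = 0.75
```

注意该度量关心的是“一致地对或一致地错”,而非准确率本身:一个对同一事实摇摆不定的模型,即使平均准确率不低,一致性得分也会很低。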

链接: https://arxiv.org/abs/2502.08080
作者: Neha Srikanth,Rachel Rudinger
机构: University of Maryland, College Park (马里兰大学学院公园分校)
类目: Computation and Language (cs.CL)
备注: Accepted to NAACL 2025

点击查看摘要

Abstract:Decomposition of text into atomic propositions is a flexible framework allowing for the closer inspection of input and output text. We use atomic decomposition of hypotheses in two natural language reasoning tasks, traditional NLI and defeasible NLI, to form atomic sub-problems, or granular inferences that models must weigh when solving the overall problem. These atomic sub-problems serve as a tool to further understand the structure of both NLI and defeasible reasoning, probe a model’s consistency and understanding of different inferences, and measure the diversity of examples in benchmark datasets. Our results indicate that LLMs still struggle with logical consistency on atomic NLI and defeasible NLI sub-problems. Lastly, we identify critical atomic sub-problems of defeasible NLI examples, or those that most contribute to the overall label, and propose a method to measure the inferential consistency of a model, a metric designed to capture the degree to which a model makes consistently correct or incorrect predictions about the same fact under different contexts.
zh

[NLP-47] On Mechanistic Circuits for Extractive Question-Answering

【速读】: 该论文旨在解决在上下文增强语言模型中理解和实现数据归因的问题。关键在于通过因果中介分析技术提取模型内部组件(如注意力头、MLPs)的功能性电路,并利用这些电路来理解模型如何利用参数记忆和检索到的上下文之间的相互作用。论文进一步发现了一组特定的注意力头能够默认执行可靠的数据归因,从而在模型前向传播过程中无需额外计算即可获得归因结果。基于这一洞察,论文引入了ATTNATTRIB算法,该算法能够快速获得最先进的归因结果,并展示了如何使用ATTNATTRIB的归因作为信号引导模型从前向传播中更多依赖于上下文信息,而不是参数记忆。

链接: https://arxiv.org/abs/2502.08059
作者: Samyadeep Basu,Vlad Morariu,Zichao Wang,Ryan Rossi,Cherry Zhao,Soheil Feizi,Varun Manjunatha
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models are increasingly used to process documents and facilitate question-answering on them. In our paper, we extract mechanistic circuits for this real-world language modeling task: context-augmented language modeling for extractive question-answering (QA) tasks and understand the potential benefits of circuits towards downstream applications such as data attribution to context information. We extract circuits as a function of internal model components (e.g., attention heads, MLPs) using causal mediation analysis techniques. Leveraging the extracted circuits, we first understand the interplay between the model’s usage of parametric memory and retrieved context towards a better mechanistic understanding of context-augmented language models. We then identify a small set of attention heads in our circuit which performs reliable data attribution by default, thereby obtaining attribution for free in just the model’s forward pass. Using this insight, we then introduce ATTNATTRIB, a fast data attribution algorithm which obtains state-of-the-art attribution results across various extractive QA benchmarks. Finally, we show the possibility to steer the language model towards answering from the context, instead of the parametric memory by using the attribution from ATTNATTRIB as an additional signal during the forward pass. Beyond mechanistic understanding, our paper provides tangible applications of circuits in the form of reliable data attribution and model steering.
zh

[NLP-48] Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLM s

【速读】: 该论文旨在解决现有Large Language Models (LLMs) 文化一致性评估方法过于受限的问题。当前的方法主要依赖于封闭式的多选题调查,而论文提出通过采用更现实且不受限制的方法来评估文化一致性,如利用世界价值观调查(World Values Survey, WVS)和霍夫斯泰德文化维度(Hofstede Cultural Dimensions)。关键在于使用更灵活的评估框架,强调特定文化指标,并在非强制性回答条件下进行评估,从而揭示了封闭式评估方法的局限性。

链接: https://arxiv.org/abs/2502.08045
作者: Mohsinul Kabir,Ajwad Abrar,Sophia Ananiadou
机构: Department of Computer Science, National Center for Text Mining, The University of Manchester(计算机科学系,文本挖掘国家中心,曼彻斯特大学); Department of Computer Scinece and Engineering, Islamic University of Technology(计算机科学与工程系,伊斯兰科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Preprint

点击查看摘要

Abstract:A large number of studies rely on closed-style multiple-choice surveys to evaluate cultural alignment in Large Language Models (LLMs). In this work, we challenge this constrained evaluation paradigm and explore more realistic, unconstrained approaches. Using the World Values Survey (WVS) and Hofstede Cultural Dimensions as case studies, we demonstrate that LLMs exhibit stronger cultural alignment in less constrained settings, where responses are not forced. Additionally, we show that even minor changes, such as reordering survey choices, lead to inconsistent outputs, exposing the limitations of closed-style evaluations. Our findings advocate for more robust and flexible evaluation frameworks that focus on specific cultural proxies, encouraging more nuanced and accurate assessments of cultural alignment in LLMs.
zh

[NLP-49] Franken-Adapter: Cross-Lingual Adaptation of LLMs by Embedding Surgery

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在低资源语言上的能力远落后于英语的问题,从而实现这些模型的普遍可访问性。解决方案的关键在于提出了一种名为\textit{Franken-Adapter}的方法,这是一种针对解码器-only LLMs的模块化语言适应方法,通过嵌入手术(embedding surgery)创建目标语言的定制词汇,并通过对多语言数据进行嵌入调整来进行语言适应。这种方法通过将预训练的嵌入与已在英语对齐数据上指令微调(instruction-tuned)的LLMs集成,实现了零样本跨语言迁移。实验结果表明,这种方法在多达270亿参数的Gemma2模型上,在96种语言中提升了最多20%的能力,同时在英语任务中的性能下降控制在1%以内。进一步分析揭示了定制分词器(tokenizer)在增强语言适应性方面起到关键作用,同时也提高了推理效率。
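
其中“嵌入手术”的组装方式可以用如下结构性草图说明。模型以字典占位,仅演示“换入目标语言的嵌入与词表、保持英文指令微调后的主体层不变”这一流程,并非真实的权重操作:

```python
# 示意:Franken-Adapter 式嵌入手术的结构草图(占位实现,非官方代码)。

def franken_adapter(instruct_model, adapted_embeddings, adapted_vocab):
    model = dict(instruct_model)                 # 复制指令微调后的模型
    model["embed_tokens"] = adapted_embeddings   # 换入目标语言训练好的嵌入
    model["vocab"] = adapted_vocab               # 同步换入定制分词词表
    return model                                 # 主体 transformer 层保持不变

base = {
    "embed_tokens": [[0.1, 0.2]],
    "vocab": ["hello"],
    "layers": "frozen-instruct-weights",         # 英文对齐数据上微调好的主体
}
new_model = franken_adapter(base, [[0.3, 0.4], [0.5, 0.6]],
                            ["nihao", "shijie"])
print(new_model["vocab"])   # ['nihao', 'shijie']
print(new_model["layers"])  # 主体层原样保留,实现零样本跨语言迁移
```

关键假设是:嵌入层负责“语言”,主体层负责“能力”,因此两部分可以分别训练后再拼接。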

链接: https://arxiv.org/abs/2502.08037
作者: Fan Jiang,Honglin Yu,Grace Chung,Trevor Cohn
机构: 未知
类目: Computation and Language (cs.CL)
备注: 33 pages

点击查看摘要

Abstract:The capabilities of Large Language Models (LLMs) in low-resource languages lag far behind those in English, making their universal accessibility a significant challenge. To alleviate this, we present Franken-Adapter, a modular language adaptation approach for decoder-only LLMs with embedding surgery. Our method begins by creating customized vocabularies for target languages and performing language adaptation through embedding tuning on multilingual data. These pre-trained embeddings are subsequently integrated with LLMs that have been instruction-tuned on English alignment data to enable zero-shot cross-lingual transfer. Our experiments on Gemma2 models with up to 27B parameters demonstrate improvements of up to 20% across 96 languages, spanning both discriminative and generative tasks, with minimal regressions (<1%) in English. Further in-depth analysis reveals the critical role of customizing tokenizers in enhancing language adaptation, while boosting inference efficiency. Additionally, we show the versatility of our method by achieving a 14% improvement over a math-optimized LLM across 20 languages, offering a modular solution to transfer reasoning abilities across languages post hoc.
zh

[NLP-50] Contextual Subspace Manifold Projection for Structural Refinement of Large Language Model Representations

【速读】: 该论文旨在解决深度神经架构内部表示在特征分布上的低效性问题,这限制了模型的表达能力和适应性。关键解决方案在于引入“上下文子空间流形投影”(Contextual Subspace Manifold Projection),通过受控子空间约束有选择地重构标记嵌入(token embeddings),从而确保更稳定且几何定义明确的特征分布。这种方法减少了各向异性(anisotropy),提高了表示紧凑性,并保持了跨变换器层的语义保真度,同时增强了标记嵌入的特征可分离性。
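
作为参考,下面用一个精神相近的常见做法(去均值并去除首个主成分,类似 all-but-the-top 后处理)演示“子空间约束可降低各向异性”这一点。这只是示意性的类比,并非论文的具体投影算法:

```python
import numpy as np

# 示意:通过受控子空间操作降低嵌入的各向异性(非论文官方方法)。

def subspace_project(emb, n_remove=1):
    centered = emb - emb.mean(axis=0, keepdims=True)
    # 取前 n_remove 个主方向并从表示中去除
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_remove]
    return centered - centered @ top.T @ top

def anisotropy(x):
    # 各向异性的常用代理指标:平均两两余弦相似度
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = xn @ xn.T
    return (sims.sum() - len(x)) / (len(x) * (len(x) - 1))

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 8)) + 5.0   # 共同偏移使所有向量方向趋同
proj = subspace_project(emb)
print(anisotropy(emb) > anisotropy(proj))  # True:投影后各向异性下降
```

直觉是:嵌入空间中的公共偏移与主导方向会让所有 token 向量“挤在一起”,去除它们后特征分布更均匀、可分性更好。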

链接: https://arxiv.org/abs/2502.08026
作者: Alistair Wren,Beatrice Loxley,Hamish Cadwallader,Simon Beckwith,Fabian Pargeter,James Blades
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Internal representations within deep neural architectures encode high-dimensional abstractions of linguistic structures, yet they often exhibit inefficiencies in feature distribution, limiting expressiveness and adaptability. Contextual Subspace Manifold Projection introduces a structured refinement technique that selectively reconfigures token embeddings through controlled subspace constraints, ensuring more stable and geometrically well-defined feature distributions. Empirical evaluations demonstrated that the structured intervention reduced anisotropy, leading to improved representation compactness while preserving semantic fidelity across transformer layers. Clustering analyses indicated that token embeddings exhibited greater feature separability, reinforcing the hypothesis that structured projection techniques enhance internal representation organization without sacrificing linguistic coherence. Gradient magnitude distributions suggested that the method introduced a smoother optimization trajectory, potentially contributing to more stable parameter updates throughout training. Computational overhead associated with the projection operations remained minimal, ensuring that the refinements did not introduce significant trade-offs in model efficiency or inference speed. Comparisons with standard embedding refinement techniques highlighted that structured manifold constraints provided a direct mechanism for improving representation quality without requiring additional gradient-based optimization. Perplexity evaluations confirmed that the adjustments did not negatively impact sequence coherence, further validating the effectiveness of the proposed approach.
zh

[NLP-51] Speculate then Collaborate: Fusing Knowledge of Language Models during Decoding

【速读】: 该论文旨在解决大型语言模型(LLMs)在特定领域表现出色但在其他领域受限的问题,主要由于训练数据的局限性。为了解决这一问题,论文提出了一种名为协同推测解码(Collaborative Speculative Decoding, CoSD)的新算法。CoSD的关键在于利用一个初始序列生成模型(draft model)来产生初始输出,并通过一个易于学习的规则或决策树来决定何时调用辅助模型以优化这些初始输出。这种方法不仅提高了知识融合的效率,还增强了推理过程的可解释性,并且具有跨领域和跨模型的适用性。实验结果表明,与现有方法相比,CoSD在多个基准测试中提升了高达10%的准确性,提供了一个可扩展且有效的解决方案。
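
CoSD 的决策骨架可以这样草拟:草稿模型先生成,一条简单规则决定是否调用辅助模型改写。论文中该规则是可学习的决策树,这里用置信度阈值代替;两个“模型”均为占位函数:

```python
# 示意:CoSD 式协同解码的控制流(占位实现,非官方代码)。

def cosd_decode(prompts, draft, assistant, threshold=0.5):
    out = []
    for p in prompts:
        cand, conf = draft(p)
        if conf < threshold:       # 规则判定:草稿不可靠时才调用辅助模型
            cand = assistant(p)
        out.append(cand)
    return out

# 占位"模型":草稿模型返回 (候选, 置信度),辅助模型只在被调用时查询
draft = lambda p: {"2+2=": ("4", 0.9), "capital_of_fr=": ("Lyon", 0.3)}[p]
assistant = lambda p: {"capital_of_fr=": "Paris"}[p]

print(cosd_decode(["2+2=", "capital_of_fr="], draft, assistant))
# ['4', 'Paris']:高置信草稿直接通过,低置信草稿由辅助模型修正
```

由于规则只看草稿侧的信号,这一框架无需重新训练任何一个 LLM,即可在测试时融合两个模型的互补知识。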

链接: https://arxiv.org/abs/2502.08020
作者: Ziyao Wang,Muneeza Azmart,Ang Li,Raya Horesh,Mikhail Yurochkin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) often excel in specific domains but fall short in others due to the limitations of their training. Thus, enabling LLMs to solve problems collaboratively by integrating their complementary knowledge promises to improve their performance across domains. To realize this potential, we introduce a novel Collaborative Speculative Decoding (CoSD) algorithm that enables efficient LLM knowledge fusion at test time without requiring additional model training. CoSD employs a draft model to generate initial sequences and an easy-to-learn rule or decision tree to decide when to invoke an assistant model to improve these drafts. CoSD not only enhances knowledge fusion but also improves inference efficiency, is transferable across domains and models, and offers greater explainability. Experimental results demonstrate that CoSD improves accuracy by up to 10% across benchmarks compared to existing methods, providing a scalable and effective solution for LLM-based applications
zh

[NLP-52] The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models NAACL

【速读】: 该论文旨在探究不同提示方法(prompting methods)如何影响解码器型语言模型(decoder-only language models)中的表征几何结构(representation geometry),并揭示这些方法在少量上下文学习(few-shot in-context learning)中的任务适应机制。关键在于通过基于统计物理的框架(a framework grounded in statistical physics),分析输入分布样本(input distribution samples)和标签语义(label semantics)在任务适应中的作用,并揭示不同任务在表征层面的协同与干扰交互(synergistic and interfering interactions between different tasks)。这一研究有助于深化对大型语言模型理论理解,并为开发更有效的、注重表征的提示策略奠定基础。

链接: https://arxiv.org/abs/2502.08009
作者: Artem Kirsanov,Chi-Ning Chou,Kyunghyun Cho,SueYeon Chung
机构: New York University; Flatiron Institute; New York University; Genentech; New York University; Flatiron Institute
类目: Computation and Language (cs.CL)
备注: To appear in NAACL Findings 2025

点击查看摘要

Abstract:Decoder-only language models have the ability to dynamically switch between various computational tasks based on input prompts. Despite many successful applications of prompting, there is very limited understanding of the internal mechanism behind such flexibility. In this work, we investigate how different prompting methods affect the geometry of representations in these models. Employing a framework grounded in statistical physics, we reveal that various prompting techniques, while achieving similar performance, operate through distinct representational mechanisms for task adaptation. Our analysis highlights the critical role of input distribution samples and label semantics in few-shot in-context learning. We also demonstrate evidence of synergistic and interfering interactions between different tasks on the representational level. Our work contributes to the theoretical understanding of large language models and lays the groundwork for developing more effective, representation-aware prompting strategies.
zh

[NLP-53] MetaSC: Test-Time Safety Specification Optimization for Language Models

【速读】: 该论文旨在解决语言模型在推理阶段的安全性问题,特别是在应对对抗性攻击请求以及避免道德危害等方面。论文的关键在于提出了一种动态安全框架,通过利用元批评机制迭代更新安全性提示(规格),从而在推理过程中动态优化这些提示,以适应性地驱动审查和修正过程。这种测试时优化方法不仅提升了对抗性攻击的防御能力,还在多种与安全相关的任务中表现出色。

链接: https://arxiv.org/abs/2502.07985
作者: Víctor Gallego
机构: Komorebi AI(科莫雷比AI); Madrid, Spain(西班牙马德里)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a novel dynamic safety framework that optimizes language model (LM) safety reasoning at inference time without modifying model weights. Building on recent advances in self-critique methods, our approach leverages a meta-critique mechanism that iteratively updates safety prompts, termed specifications, to drive the critique and revision process adaptively. This test-time optimization not only improves performance against adversarial jailbreak requests but also in diverse general safety-related tasks, such as avoiding moral harm or pursuing honest responses. Our empirical evaluations across several language models demonstrate that dynamically optimized safety prompts yield significantly higher safety scores compared to fixed system prompts and static self-critique defenses. Code to be released at this https URL.
zh
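以下是 MetaSC 式"元批评迭代更新安全规格"流程的最小 Python 草图。其中 generate、critique、meta_critique 均为占位桩函数(接口为假设,并非论文实现),仅用于展示"规格文本在推理时被逐轮改写、模型权重保持不变"的控制流:

```python
# MetaSC 式测试时优化循环的示意(桩函数代替真实 LLM 调用)。

def generate(prompt, spec):
    # 真实系统中由 LLM 在当前安全规格约束下生成回复。
    return f"response to {prompt!r} under spec: {spec}"

def critique(response, spec):
    # 真实系统中由 LLM 对照规格批评回复;此处返回固定示例批评。
    return "avoid speculative medical claims"

def meta_critique(spec, crit):
    # 元批评将本轮批评折叠回规格文本自身。
    return spec + "; " + crit

def metasc(prompt, spec, rounds=3):
    history = [spec]
    for _ in range(rounds):
        resp = generate(prompt, spec)
        spec = meta_critique(spec, critique(resp, spec))
        history.append(spec)
    return resp, history

resp, history = metasc("summarize this drug trial", "be safe and honest")
print(len(history))  # 4:初始规格加上每轮迭代后的一个修订版本
```

注意整个过程只改写提示中的规格文本,不触碰模型权重,这正是"测试时优化"的含义。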

[NLP-54] Training Sparse Mixture Of Experts Text Embedding Models

【速读】: 该论文旨在解决Transformer-based文本嵌入模型在增加参数以提升性能的同时,引入的部署挑战,如推理延迟和内存使用增加的问题。特别是在检索增强生成(Retrieval-Augmented Generation, RAG)应用中,大型模型的高内存需求限制了数据集的摄入容量,而更高的延迟直接影响查询时间性能。论文的关键解决方案是引入Nomic Embed v2,这是一种首次应用于通用文本嵌入的混合专家(Mixture of Experts, MoE)模型。该模型不仅在单语和多语种基准测试中超越了同类参数规模的模型,而且在性能上与规模为其两倍的模型相当。

链接: https://arxiv.org/abs/2502.07972
作者: Zach Nussbaum,Brandon Duderstadt
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Transformer-based text embedding models have improved their performance on benchmarks like MIRACL and BEIR by increasing their parameter counts. However, this scaling approach introduces significant deployment challenges, including increased inference latency and memory usage. These challenges are particularly severe in retrieval-augmented generation (RAG) applications, where large models’ increased memory requirements constrain dataset ingestion capacity, and their higher latency directly impacts query-time performance. While causal language models have addressed similar efficiency challenges using Mixture of Experts (MoE) architectures, this approach hasn’t been successfully adapted to the general text embedding setting. In this paper, we introduce Nomic Embed v2, the first general purpose MoE text embedding model. Our model outperforms models in the same parameter class on both monolingual and multilingual benchmarks while also maintaining competitive performance with models twice its size. We open-source all code, models, and evaluation data to ensure full reproducibility of our training pipeline.
zh
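稀疏 MoE 层的核心是路由门控:每个输入只激活得分最高的 k 个专家。下面用纯 Python 给出 top-k softmax 门控的示意实现(专家数与 k 值均为玩具设定,Nomic Embed v2 的实际路由器结构以论文为准):

```python
# top-k softmax 门控示意:只在得分最高的 k 个专家上做 softmax,
# 其余专家权重为 0,从而实现稀疏激活。
import math

def top_k_gate(logits, k=2):
    """对 k 个最大路由 logit 做 softmax,其余专家权重置 0。"""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]

weights = top_k_gate([1.0, 3.0, 0.5, 2.0], k=2)
active = [i for i, w in enumerate(weights) if w > 0]
print(active)                  # [1, 3]:只有两个专家被路由到
print(round(sum(weights), 6))  # 活跃专家的权重之和为 1.0
```

推理时每个 token 只经过 k 个专家的前向计算,这正是 MoE 在参数量增大的同时控制延迟与内存的原因。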

[NLP-55] Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?

【速读】: 该论文旨在探讨大型语言模型(Large Language Models, LLMs)在解读临床试验结果时是否同样受到结果扭曲(spin)的影响。已有研究表明,研究人员倾向于在发表的研究成果中强调“积极”的发现,即使实证结果并不明确,这可能导致结果被扭曲。鉴于LLMs越来越多地用于整理和综合已发表的医学证据,这一问题尤为重要。研究的关键解决方案在于通过特定的提示方式来减轻LLMs输出中扭曲结果的影响,尽管这些模型普遍更容易受到结果扭曲的影响,并可能在其生成的通俗语言总结中隐含地包含这种扭曲。

链接: https://arxiv.org/abs/2502.07963
作者: Hye Sun Yun,Karen Y.C. Zhang,Ramez Kouzy,Iain J. Marshall,Junyi Jessy Li,Byron C. Wallace
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 10 figures, 3 tables

点击查看摘要

Abstract:Medical research faces well-documented challenges in translating novel treatments into clinical practice. Publishing incentives encourage researchers to present “positive” findings, even when empirical results are equivocal. Consequently, it is well-documented that authors often spin study results, especially in article abstracts. Such spin can influence clinician interpretation of evidence and may affect patient care decisions. In this study, we ask whether the interpretation of trial results offered by Large Language Models (LLMs) is similarly affected by spin. This is important since LLMs are increasingly being used to trawl through and synthesize published medical evidence. We evaluated 22 LLMs and found that they are across the board more susceptible to spin than humans. They might also propagate spin into their outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into plain language summaries that they generate. We also find, however, that LLMs are generally capable of recognizing spin, and can be prompted in a way to mitigate spin’s impact on LLM outputs.
zh

[NLP-56] Adapting Multilingual Embedding Models to Historical Luxembourgish

【速读】: 该论文旨在解决在历史文本中进行有效的跨语言语义搜索的问题。由于光学字符识别(OCR)噪声和过时的拼写形式,预训练的多语言模型在处理历史数字化内容时面临挑战。论文的关键解决方案在于采用领域内训练数据的简单适应方法,从而实现在跨语言评估中的高达98%的准确率。

链接: https://arxiv.org/abs/2502.07938
作者: Andrianos Michail,Corina Julia Raclé,Juri Opitz,Simon Clematide
机构: University of Zurich (苏黎世大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The growing volume of digitized historical texts requires effective semantic search using text embeddings. However, pre-trained multilingual models, typically evaluated on contemporary texts, face challenges with historical digitized content due to OCR noise and outdated spellings. We explore the use of multilingual embeddings for cross-lingual semantic search on historical Luxembourgish, a low-resource language. We collect historical Luxembourgish news articles spanning various time periods and use GPT-4o to segment and translate them into closely related languages, creating 20,000 parallel training sentences per language pair. We further create a historical bitext mining evaluation set and find that these models struggle to perform cross-lingual search on historical Luxembourgish. To address this, we propose a simple adaptation method using in-domain training data, achieving up to 98% accuracy in cross-lingual evaluations. We release our adapted models and historical Luxembourgish-German/French bitexts to support further research.
zh
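文中报告的跨语言评测即 bitext mining 式检索:对每条源语言句向量,在目标语言句向量中取余弦相似度最高者,命中其对齐译文即记为正确。下面用玩具向量给出该评测指标的示意实现(向量为假设数据,并非真实模型嵌入):

```python
# bitext mining 准确率的示意计算:第 i 条源句的正确答案是第 i 条目标句。
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mining_accuracy(src_embs, tgt_embs):
    """最近邻目标句恰为对齐译文的源句比例。"""
    hits = 0
    for i, s in enumerate(src_embs):
        best = max(range(len(tgt_embs)), key=lambda j: cosine(s, tgt_embs[j]))
        hits += (best == i)
    return hits / len(src_embs)

src = [[1.0, 0.1], [0.1, 1.0], [0.7, 0.7]]
tgt = [[0.9, 0.0], [0.0, 0.8], [0.6, 0.6]]
print(mining_accuracy(src, tgt))  # 1.0:每条源句都检索到对齐译文
```

论文中 98% 的准确率即此类指标在历史卢森堡语平行句对上的结果。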

[NLP-57] Elevating Legal LLM Responses: Harnessing Trainable Logical Structures and Semantic Knowledge with Legal Reasoning

【速读】: 该论文旨在解决大型语言模型(LLMs)在法律问答任务中普遍存在的逻辑不特定性和幻觉问题。现有方法如检索增强生成(RAG)技术虽提供部分解决方案,但通常仅关注语义相似性而忽视了法律推理所需的逻辑结构。论文提出的关键解决方案是逻辑语义整合模型(LSIM),这是一种新颖的监督框架,通过结合语义一致性和逻辑一致性来改善这一状况。LSIM包含三个组成部分:强化学习预测每个问题的结构化事实-规则链;可训练的深度结构化语义模型(DSSM)通过集成语义和逻辑特征检索最相关的问题候选;基于上下文的学习利用检索到的内容生成最终答案。

链接: https://arxiv.org/abs/2502.07912
作者: Rujing Yao,Yang Wu,Chenghao Wang,Jingwei Xiong,Fang Wang,Xiaozhong Liu
机构: Nankai University (南开大学); Worcester Polytechnic Institute (伍斯特理工学院); Fayuan Technology Co., Ltd. (法元科技有限公司); University of California, Davis (加州大学戴维斯分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved impressive results across numerous domains, yet they experience notable deficiencies in legal question-answering tasks. LLMs often generate generalized responses that lack the logical specificity required for expert legal advice and are prone to hallucination, providing answers that appear correct but are unreliable. Retrieval-Augmented Generation (RAG) techniques offer partial solutions to address this challenge, but existing approaches typically focus only on semantic similarity, neglecting the logical structure essential to legal reasoning. In this paper, we propose the Logical-Semantic Integration Model (LSIM), a novel supervised framework that bridges semantic and logical coherence. LSIM comprises three components: reinforcement learning predicts a structured fact-rule chain for each question, a trainable Deep Structured Semantic Model (DSSM) retrieves the most relevant candidate questions by integrating semantic and logical features, and in-context learning generates the final answer using the retrieved content. Our experiments on a real-world legal QA dataset, validated through both automated metrics and human evaluation, demonstrate that LSIM significantly enhances accuracy and reliability compared to existing methods.
zh

[NLP-58] Intelligent Legal Assistant: An Interactive Clarification System for Legal Question Answering

【速读】: 该论文旨在解决用户在寻求法律建议时因缺乏专业知识而提出的问题常常遗漏关键信息的问题,导致传统法律问答系统难以准确识别用户需求,从而提供不精确或泛泛的建议。解决方案的关键在于开发了一个名为“智能法律助手”的系统,该系统通过与用户的互动来精确捕捉其需求。具体而言,当用户提出问题时,系统会请求用户选择地理位置以确定适用的法律法规,并基于用户初始问题中的关键缺失信息生成澄清问题和选项,使用户能够补充必要的细节。最终,在收集到所有必要信息后,系统将提供包含总体结论、判例法分析和解决方案建议在内的深入法律分析。

链接: https://arxiv.org/abs/2502.07904
作者: Rujing Yao,Yiquan Wu,Tong Zhang,Xuhui Zhang,Yuting Huang,Yang Wu,Jiayin Yang,Changlong Sun,Fang Wang,Xiaozhong Liu
机构: Nankai University(Tianjin); Zhejiang University(Hangzhou); Fayuan Inc.(Hangzhou); Worcester Polytechnic Institute(Worcester)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rise of large language models has opened new avenues for users seeking legal advice. However, users often lack professional legal knowledge, which can lead to questions that omit critical information. This deficiency makes it challenging for traditional legal question-answering systems to accurately identify users’ actual needs, often resulting in imprecise or generalized advice. In this work, we develop a legal question-answering system called Intelligent Legal Assistant, which interacts with users to precisely capture their needs. When a user poses a question, the system requests that the user select their geographical location to pinpoint the applicable laws. It then generates clarifying questions and options based on the key information missing from the user’s initial question. This allows the user to select and provide the necessary details. Once all necessary information is provided, the system produces an in-depth legal analysis encompassing three aspects: overall conclusion, jurisprudential analysis, and resolution suggestions.
zh

[NLP-59] Vision-Language Models for Edge Networks: A Comprehensive Survey

【速读】: 该论文旨在解决视觉大型语言模型(Vision Large Language Models, VLMs)在资源受限的边缘设备上的部署挑战。关键解决方案在于优化VLMs以适应这些环境,具体包括模型压缩技术(如剪枝、量化、知识蒸馏)以及专用硬件解决方案,从而提升模型效率。此外,论文还探讨了高效的训练与微调方法、边缘部署挑战及隐私考虑,进一步推动VLMs在实际应用中的普及。

链接: https://arxiv.org/abs/2502.07855
作者: Ahmed Sharshar,Latif U. Khan,Waseem Ullah,Mohsen Guizani
机构: Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vision Large Language Models (VLMs) combine visual understanding with natural language processing, enabling tasks like image captioning, visual question answering, and video analysis. While VLMs show impressive capabilities across domains such as autonomous vehicles, smart surveillance, and healthcare, their deployment on resource-constrained edge devices remains challenging due to processing power, memory, and energy limitations. This survey explores recent advancements in optimizing VLMs for edge environments, focusing on model compression techniques, including pruning, quantization, knowledge distillation, and specialized hardware solutions that enhance efficiency. We provide a detailed discussion of efficient training and fine-tuning methods, edge deployment challenges, and privacy considerations. Additionally, we discuss the diverse applications of lightweight VLMs across healthcare, environmental monitoring, and autonomous systems, illustrating their growing impact. By highlighting key design strategies, current challenges, and offering recommendations for future directions, this survey aims to inspire further research into the practical deployment of VLMs, ultimately making advanced AI accessible in resource-limited settings.
zh

[NLP-60] Analyzing the Resource Utilization of Lambda Functions on Mobile Devices: Case Studies on Kotlin and Swift

【速读】: 该论文旨在探讨Lambda函数在移动编程中的资源消耗影响,特别是关注其对智能手机电池利用率、内存使用和执行时间的影响。论文的关键在于评估Lambda函数在不增加功能性的情况下,是否会导致显著的资源开销。研究发现,Lambda函数在移动设备上引入了可观的资源负担,这对优化移动应用性能具有重要意义。

链接: https://arxiv.org/abs/2502.07809
作者: Chibundom U. Ejimuda,Gaston Longhitano,Reza Rawassizadeh
机构: 未知
类目: Software Engineering (cs.SE); Computation and Language (cs.CL); Performance (cs.PF)
备注: 6 pages, 2 images

点击查看摘要

Abstract:With billions of smartphones in use globally, the daily time spent on these devices contributes significantly to overall electricity consumption. Given this scale, even minor reductions in smartphone power use could result in substantial energy savings. This study explores the impact of Lambda functions on resource consumption in mobile programming. While Lambda functions are known for enhancing code readability and conciseness, their use does not add to the functional capabilities of a programming language. Our research investigates the implications of using Lambda functions in terms of battery utilization, memory usage, and execution time compared to equivalent code structures without Lambda functions. Our findings reveal that Lambda functions impose a considerable resource overhead on mobile devices without offering additional functionalities.
zh

计算机视觉

[CV-0] Poly-Autoregressive Prediction for Modeling Interactions

【速读】:该论文旨在解决多智能体系统中预测单个智能体行为的问题。解决方案的关键在于提出了Poly-Autoregressive (PAR)建模方法,通过分析自智能体的历史状态及与其他交互智能体的过去和当前状态来预测自智能体的未来行为。PAR将所有智能体的行为表示为时间步长上的状态序列令牌,从而实现了在少量数据预处理调整的情况下,广泛应用于人类社交行为预测、自动驾驶车辆轨迹预测以及手-物交互中的物体姿态预测等不同场景,并且使用小规模的Transformer骨干网络,PAR在这些场景中表现优于传统的自回归(Autoregressive, AR)模型。

链接: https://arxiv.org/abs/2502.08646
作者: Neerja Thakkar,Tara Sadjadpour,Jathushan Rajasegaran,Shiry Ginosar,Jitendra Malik
机构: UC Berkeley (加州大学伯克利分校); Toyota Technical Institute at Chicago (芝加哥丰田技术研究所); Google DeepMind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: preprint

点击查看摘要

Abstract:We introduce a simple framework for predicting the behavior of an agent in multi-agent settings. In contrast to autoregressive (AR) tasks, such as language processing, our focus is on scenarios with multiple agents whose interactions are shaped by physical constraints and internal motivations. To this end, we propose Poly-Autoregressive (PAR) modeling, which forecasts an ego agent’s future behavior by reasoning about the ego agent’s state history and the past and current states of other interacting agents. At its core, PAR represents the behavior of all agents as a sequence of tokens, each representing an agent’s state at a specific timestep. With minimal data pre-processing changes, we show that PAR can be applied to three different problems: human action forecasting in social situations, trajectory prediction for autonomous vehicles, and object pose forecasting during hand-object interaction. Using a small proof-of-concept transformer backbone, PAR outperforms AR across these three scenarios. The project website can be found at this https URL.
zh
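PAR 的关键是把多智能体场景序列化为单条 token 序列:每个时间步先放自智能体状态,再放其他智能体的状态,使预测器能同时关注所有参与者。下面是该布局的示意代码(具体 token 化方式为假设,论文实现可能不同):

```python
# PAR 式多智能体状态序列化示意:按时间步交错排列各智能体的状态 token。

def par_sequence(ego_states, other_states):
    """ego_states: 长度为 T 的自智能体状态列表;
    other_states: 长度为 T 的列表,每项是该时间步其他智能体的状态列表。
    返回按时间步排序、自智能体优先的扁平 token 序列。"""
    tokens = []
    for t, (ego, others) in enumerate(zip(ego_states, other_states)):
        tokens.append(("ego", t, ego))
        for k, s in enumerate(others):
            tokens.append((f"agent{k}", t, s))
    return tokens

seq = par_sequence(["e0", "e1"], [["a0", "b0"], ["a1", "b1"]])
print(len(seq))  # 6 个 token:(1 个自智能体 + 2 个其他智能体) x 2 个时间步
print(seq[0])    # ('ego', 0, 'e0')
```

这种排列使 Transformer 在预测自智能体下一状态时,可以直接注意到其他智能体的当前与历史状态。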

[CV-1] A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards ICRA2025

【速读】:该论文旨在解决在开放世界环境中机器人操作任务规范的挑战,需要灵活且适应性强的目标,这些目标能够与人类意图保持一致,并通过迭代反馈逐步演化。论文的关键解决方案是引入了一种名为迭代关键点奖励(Iterative Keypoint Reward, IKER)的方法。IKER 是一种基于视觉的、Python 编写的奖励函数,它作为动态的任务规范。该框架利用视觉语言模型(VLMs)生成和优化这些奖励函数,以应对多步骤操作任务。IKER 通过对场景中的关键点进行采样并生成条件于这些关键点的奖励函数来工作,它利用关键点之间的空间关系和常识先验来实现精确的 SE(3) 控制。

链接: https://arxiv.org/abs/2502.08643
作者: Shivansh Patel,Xinchen Yin,Wenlong Huang,Shubham Garg,Hooshang Nayyeri,Li Fei-Fei,Svetlana Lazebnik,Yunzhu Li
机构: University of Illinois at Urbana-Champaign; Stanford University; Amazon; Columbia University
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2025, Project Page: this https URL

点击查看摘要

Abstract:Task specification for robotic manipulation in open-world environments is challenging, requiring flexible and adaptive objectives that align with human intentions and can evolve through iterative feedback. We introduce Iterative Keypoint Reward (IKER), a visually grounded, Python-based reward function that serves as a dynamic task specification. Our framework leverages VLMs to generate and refine these reward functions for multi-step manipulation tasks. Given RGB-D observations and free-form language instructions, we sample keypoints in the scene and generate a reward function conditioned on these keypoints. IKER operates on the spatial relationships between keypoints, leveraging commonsense priors about the desired behaviors, and enabling precise SE(3) control. We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning (RL) policies, which are then deployed into the real world, forming a real-to-sim-to-real loop. Our approach demonstrates notable capabilities across diverse scenarios, including both prehensile and non-prehensile tasks, showcasing multi-step task execution, spontaneous error recovery, and on-the-fly strategy adjustments. The results highlight IKER’s effectiveness in enabling robots to perform multi-step tasks in dynamic environments through iterative reward shaping.
zh

[CV-2] SwiftSketch: A Diffusion Model for Image-to-Vector Sketch Generation

【速读】:该论文旨在解决现有大型视觉-语言模型在生成高质量矢量草图时依赖耗时优化过程的问题。SwiftSketch作为解决方案的关键,是一个基于扩散模型的图像条件矢量草图生成器,能够在不到一秒钟的时间内生成高质量草图。SwiftSketch通过逐步去噪从高斯分布中采样的笔划控制点来操作,并采用变压器解码器架构以有效处理矢量表示的离散性质及捕捉笔划间的内在全局依赖关系。

链接: https://arxiv.org/abs/2502.08642
作者: Ellie Arar,Yarden Frenkel,Daniel Cohen-Or,Ariel Shamir,Yael Vinker
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Recent advancements in large vision-language models have enabled highly expressive and diverse vector sketch generation. However, state-of-the-art methods rely on a time-consuming optimization process involving repeated feedback from a pretrained model to determine stroke placement. Consequently, despite producing impressive sketches, these methods are limited in practical applications. In this work, we introduce SwiftSketch, a diffusion model for image-conditioned vector sketch generation that can produce high-quality sketches in less than a second. SwiftSketch operates by progressively denoising stroke control points sampled from a Gaussian distribution. Its transformer-decoder architecture is designed to effectively handle the discrete nature of vector representation and capture the inherent global dependencies between strokes. To train SwiftSketch, we construct a synthetic dataset of image-sketch pairs, addressing the limitations of existing sketch datasets, which are often created by non-artists and lack professional quality. For generating these synthetic sketches, we introduce ControlSketch, a method that enhances SDS-based techniques by incorporating precise spatial control through a depth-aware ControlNet. We demonstrate that SwiftSketch generalizes across diverse concepts, efficiently producing sketches that combine high fidelity with a natural and visually appealing style.
zh

[CV-3] CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation

【速读】:该论文旨在解决文本到视频生成中的三维可控性问题。解决方案的关键在于CineMaster框架,它通过两个阶段实现:首先,设计了一个交互工作流程,使用户能够直观地构建三维感知的条件信号,包括对象边界框定位和三维空间内的相机运动;其次,利用这些控制信号(渲染的深度图、相机轨迹和对象类别标签)作为引导,驱动文本到视频扩散模型生成用户预期的视频内容。此外,为了克服缺乏带有三维物体运动和相机姿态标注的数据集的问题,论文提出了一种自动数据标注管道,从大规模视频数据中提取三维边界框和相机轨迹。

链接: https://arxiv.org/abs/2502.08639
作者: Qinghe Wang,Yawen Luo,Xiaoyu Shi,Xu Jia,Huchuan Lu,Tianfan Xue,Xintao Wang,Pengfei Wan,Di Zhang,Kun Gai
机构: Dalian University of Technology; The Chinese University of Hong Kong; Kuaishou Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with comparable controllability as professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals–comprising rendered depth maps, camera trajectories and object class labels–serve as the guidance for a text-to-video diffusion model, ensuring to generate the user-intended video content. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and implements prominent 3D-aware text-to-video generation. Project page: this https URL.
zh

[CV-4] PulseCheck457: A Diagnostic Benchmark for Comprehensive Spatial Reasoning of Large Multimodal Models

【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在复杂且精确的三维空间推理能力方面的不确定性。现有基准主要集中在二维空间理解,缺乏全面评估不同复杂度下的六维空间推理的框架。为了解决这一局限性,论文提出了PulseCheck457,这是一个设计用于四个关键空间推理能力的可扩展且无偏见的合成数据集:多对象识别、二维位置、三维位置和三维方向。论文的关键解决方案在于开发了一个级联评估结构,并通过五个难度级别中的七种问题类型来测试这些模型,从基础的单个对象识别到新的复杂的六维空间推理任务,从而量化了模型在任务复杂性增加时的性能下降,并引入了相对性能下降率(Relative Performance Dropping Rate, RPDR)来突出三维推理能力的弱点。此外,利用数据集的无偏设计属性,还揭示了不同属性上的预测偏差。

链接: https://arxiv.org/abs/2502.08636
作者: Xingrui Wang,Wufei Ma,Tiezheng Zhang,Celso M de Melo,Jieneng Chen,Alan Yuille
机构: Johns Hopkins University (约翰霍普金斯大学); DEVCOM Army Research Laboratory (陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present PulseCheck457, a scalable and unbiased synthetic dataset designed with 4 key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single object recognition to our new proposed complex 6D spatial reasoning tasks. We evaluated various large multimodal models (LMMs) on PulseCheck457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.
zh

[CV-5] Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

【速读】:该论文旨在解决视频重照明(Video Relighting)过程中存在的过度训练成本高以及数据集多样性不足的问题。此外,直接将图像重照明模型应用于每一帧会导致光源一致性差及重照明后外观不一致,从而产生闪烁现象。为解决这些问题,论文提出了一种名为Light-A-Video的无训练方法。其关键是引入了两个关键技术:首先,设计了一致性光注意力(Consistent Light Attention, CLA)模块,通过增强自注意力层中的跨帧交互来稳定背景光源的生成;其次,利用光照传输独立性的物理原理,采用渐进光融合(Progressive Light Fusion, PLF)策略,在源视频外观与重照明外观之间进行线性融合,以确保光照在时间上的平滑过渡。

链接: https://arxiv.org/abs/2502.08590
作者: Yujie Zhou,Jiazi Bu,Pengyang Ling,Pan Zhang,Tong Wu,Qidong Huang,Jinsong Li,Xiaoyi Dong,Yuhang Zang,Yuhang Cao,Anyi Rao,Jiaqi Wang,Li Niu
机构: Shanghai Jiao Tong University(上海交通大学); University of Science and Technology of China(中国科学技术大学); The Chinese University of Hong Kong(香港中文大学); Hong Kong University of Science and Technology(香港科技大学); Stanford University(斯坦福大学); Shanghai AI Laboratory(上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to the excessive training costs and the scarcity of diverse, high-quality video relighting datasets. A simple application of image relighting models on a frame-by-frame basis leads to several issues: lighting source inconsistency and relighted appearance inconsistency, resulting in flickers in the generated videos. In this work, we propose Light-A-Video, a training-free approach to achieve temporally smooth video relighting. Adapted from image relighting models, Light-A-Video introduces two key techniques to enhance lighting consistency. First, we design a Consistent Light Attention (CLA) module, which enhances cross-frame interactions within the self-attention layers to stabilize the generation of the background lighting source. Second, leveraging the physical principle of light transport independence, we apply linear blending between the source video’s appearance and the relighted appearance, using a Progressive Light Fusion (PLF) strategy to ensure smooth temporal transitions in illumination. Experiments show that Light-A-Video improves the temporal consistency of relighted video while maintaining the image quality, ensuring coherent lighting transitions across frames. Project page: this https URL.
zh
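PLF 策略的核心是在源视频外观与重照明外观之间做线性融合,并让融合权重随步数渐进增大,以保证光照在时间上平滑过渡。以下为该线性融合的示意(以标量像素代替真实帧,权重调度为假设,并非论文的具体实现):

```python
# Progressive Light Fusion 思路的示意:融合权重从 0 渐进到 1,
# 使输出从源视频外观平滑过渡到重照明外观。

def progressive_light_fusion(source, relit, step, total_steps):
    """线性融合,权重随步数从 0 增长到 1。"""
    w = step / total_steps
    return [(1 - w) * s + w * r for s, r in zip(source, relit)]

source_frame = [0.2, 0.4, 0.6]
relit_frame = [0.8, 0.8, 0.8]
start = progressive_light_fusion(source_frame, relit_frame, 0, 4)
end = progressive_light_fusion(source_frame, relit_frame, 4, 4)
print(start)  # [0.2, 0.4, 0.6]:第 0 步完全保留源外观
print(end)    # [0.8, 0.8, 0.8]:最后一步完全采用重照明外观
```

线性融合之所以成立,依据的正是摘要中提到的光照传输独立性(light transport independence)这一物理原理。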

[CV-6] Ultrasound Image Generation using Latent Diffusion Models

【速读】:该论文旨在解决利用扩散模型生成高质量、逼真的医学超声(Ultrasound, US)图像的问题,特别是在获取开源医疗图像相对困难的情况下。论文的关键解决方案在于通过对大规模扩散模型进行渐进式的微调,利用不同公开可用的数据库来模拟真实的乳腺超声图像。具体而言,作者对先进的隐变量扩散模型Stable Diffusion进行了微调,使用了BUSI(乳腺超声图像)数据集,并通过简单的提示语指定了器官和病理类型,成功生成了逼真的乳腺超声图像。此外,通过ControlNet引入分割条件以提供用户控制。

链接: https://arxiv.org/abs/2502.08580
作者: Benoit Freiche,Anthony El-Khoury,Ali Nasiri-Sarvi,Mahdi S. Hosseini,Damien Garcia,Adrian Basarab,Mathieu Boily,Hassan Rivaz
机构: Concordia University (康考迪亚大学), Montreal, Canada; CREATIS (CREATIS), Villeurbanne, France; McGill University (麦吉尔大学), Montreal, Canada
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages conference paper for SPIE medical imaging

点击查看摘要

Abstract:Diffusion models for image generation have been a subject of increasing interest due to their ability to generate diverse, high-quality images. Image generation has immense potential in medical imaging because open-source medical images are difficult to obtain compared to natural images, especially for rare conditions. The generated images can be used later to train classification and segmentation models. In this paper, we propose simulating realistic ultrasound (US) images by successive fine-tuning of large diffusion models on different publicly available databases. To do so, we fine-tuned Stable Diffusion, a state-of-the-art latent diffusion model, on BUSI (Breast US Images) an ultrasound breast image dataset. We successfully generated high-quality US images of the breast using simple prompts that specify the organ and pathology, which appeared realistic to three experienced US scientists and a US radiologist. Additionally, we provided user control by conditioning the model with segmentations through ControlNet. We will release the source code at this http URL to allow fast US image generation to the scientific community.
zh

[CV-7] A Novel Approach to for Multimodal Emotion Recognition : Multimodal semantic information fusion

【速读】:该论文旨在解决多模态情感识别中的异构数据融合及模态相关性的有效利用问题。解决方案的关键在于提出了一种基于对比学习和视觉序列压缩的新型多模态情感识别方法DeepMSI-MER,通过对比学习增强跨模态特征融合,并利用视觉序列压缩减少视觉模态的冗余性。

链接: https://arxiv.org/abs/2502.08573
作者: Wei Dai,Dequan Zheng,Feng Yu,Yanrong Zhang,Yaohui Hou
机构: Harbin University of Commerce(哈尔滨商业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the advancement of artificial intelligence and computer vision technologies, multimodal emotion recognition has become a prominent research topic. However, existing methods face challenges such as heterogeneous data fusion and the effective utilization of modality correlations. This paper proposes a novel multimodal emotion recognition approach, DeepMSI-MER, based on the integration of contrastive learning and visual sequence compression. The proposed method enhances cross-modal feature fusion through contrastive learning and reduces redundancy in the visual modality by leveraging visual sequence compression. Experimental results on two public datasets, IEMOCAP and MELD, demonstrate that DeepMSI-MER significantly improves the accuracy and robustness of emotion recognition, validating the effectiveness of multimodal feature fusion and the proposed approach.
zh
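DeepMSI-MER 中用于跨模态特征对齐的对比学习,通常可用 InfoNCE 式损失刻画:拉近正样本对、推远负样本。下面给出一个通用的标量版示意(温度系数与相似度数值均为假设,并非论文的具体损失函数):

```python
# InfoNCE 式对比损失的示意:对相似度做温度缩放后取交叉熵,
# 正样本相似度越高、负样本越低,损失越小。
import math

def info_nce(sim_row, positive_idx, temperature=0.1):
    """sim_row[i] 是锚点与第 i 个候选的相似度;positive_idx 指向正样本。"""
    scaled = [s / temperature for s in sim_row]
    m = max(scaled)                       # 减去最大值保证数值稳定
    exps = [math.exp(s - m) for s in scaled]
    return -math.log(exps[positive_idx] / sum(exps))

# 正样本相似度远高于负样本时,损失接近 0;反之损失明显增大。
low = info_nce([0.9, 0.1, 0.0], positive_idx=0)
high = info_nce([0.1, 0.9, 0.0], positive_idx=0)
print(low < high)  # True:对齐良好的样本对损失更低
```

在跨模态场景中,锚点取自一种模态(如文本),候选取自另一种模态(如视觉),以此约束两者的表征对齐。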

[CV-8] AR Glulam: Accurate Augmented Reality Using Multiple Fiducial Markers for Glulam Fabrication

【速读】:该论文旨在探索多标记物增强现实 (AR) 在高精度制造中的工业应用,特别关注在工厂环境中使用该技术制造层压胶合木 (glulam) 梁。论文的关键在于通过采用多个标识物的方法来提升 AR 系统的精度,以满足诸如层压胶合木梁制造所需的严格公差(小于2毫米)要求。此前实验室验证已证明该方法具有高达0.97的精度。

链接: https://arxiv.org/abs/2502.08566
作者: Alexander Htet Kyaw,Arvin Xu,Sasa Zivkovic,Gwyllim Jahn,Cameron Newnham,Nick Van Den Berg
机构: 未知
类目: Emerging Technologies (cs.ET); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 10 Figures, Project Paper for Association for Computer Aided Design in Architecture

点击查看摘要

Abstract:Recent advancements in Augmented Reality (AR) have demonstrated applications in architecture, design, and fabrication. Compared to conventional 2D construction drawings, AR can be used to superimpose contextual instructions, display 3D spatial information and enable on-site engagement. Despite the potential of AR, the widespread adoption of the technology in the industry is limited by its precision. Precision is important for projects requiring strict construction tolerances, design fidelity, and fabrication feedback. For example, the manufacturing of glulam beams requires tolerances of less than 2mm. The goal of this project is to explore the industrial application of using multiple fiducial markers for high-precision AR fabrication. While the method has been validated in lab settings with a precision of 0.97, this paper focuses on fabricating glulam beams in a factory setting with an industry manufacturer, Unalam Factory.
zh

[CV-9] Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion

【速读】:该论文旨在解决在利用人工智能预测个体患者疾病进展中的几个挑战,包括实现特定患者的个性化预测、确保时空一致性、高效利用纵向数据以及管理三维扫描带来的大量内存需求。论文的关键解决方案是提出了Brain Latent Progression (BrLP),这是一种新颖的时空模型,用于预测三维脑部MRI中的个体疾病进展。BrLP的核心贡献在于:(i) 在小潜空间中操作以缓解高维成像数据带来的计算挑战;(ii) 显式整合受试者元数据以增强预测的个体化;(iii) 通过辅助模型融入疾病动态先验知识,促进纵向数据的整合;(iv) 引入潜平均稳定化(LAS)算法,确保推理时预测进展的时空一致性,并提供预测不确定性的度量。

链接: https://arxiv.org/abs/2502.08560
作者: Lemuel Puglisi,Daniel C. Alexander,Daniele Ravì
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2405.03328

点击查看摘要

Abstract:The growing availability of longitudinal Magnetic Resonance Imaging (MRI) datasets has facilitated Artificial Intelligence (AI)-driven modeling of disease progression, making it possible to predict future medical scans for individual patients. However, despite significant advancements in AI, current methods continue to face challenges including achieving patient-specific individualization, ensuring spatiotemporal consistency, efficiently utilizing longitudinal data, and managing the substantial memory demands of 3D scans. To address these challenges, we propose Brain Latent Progression (BrLP), a novel spatiotemporal model designed to predict individual-level disease progression in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates in a small latent space, mitigating the computational challenges posed by high-dimensional imaging data; (ii) it explicitly integrates subject metadata to enhance the individualization of predictions; (iii) it incorporates prior knowledge of disease dynamics through an auxiliary model, facilitating the integration of longitudinal data; and (iv) it introduces the Latent Average Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in the predicted progression at inference time and (b) allows us to derive a measure of the uncertainty for the prediction. We train and evaluate BrLP on 11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its generalizability on an external test set comprising 2,257 MRIs from 962 subjects. Our experiments compare BrLP-generated MRI scans with real follow-up MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The code is publicly available at: this https URL.
zh
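To make the LAS idea concrete, here is a minimal sketch that averages several stochastic latent predictions and uses their per-dimension spread as an uncertainty proxy. It illustrates only the averaging principle, not the paper's implementation; `toy_progression` is an invented stand-in for the real progression model.

```python
import numpy as np

def stabilized_prediction(predict_fn, z, n_samples=8, seed=0):
    """Average several stochastic latent predictions (LAS-style sketch).

    predict_fn(z, rng) -> predicted future latent (1D array).
    Returns (mean prediction, per-dimension std as an uncertainty proxy).
    """
    rng = np.random.default_rng(seed)
    samples = np.stack([predict_fn(z, rng) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.std(axis=0)

# Toy stochastic progression model: a drift plus noise in latent space.
def toy_progression(z, rng):
    return z + 0.1 + rng.normal(scale=0.05, size=z.shape)

z0 = np.zeros(4)
z_pred, z_unc = stabilized_prediction(toy_progression, z0, n_samples=64)
```

Averaging suppresses the per-sample noise, while the standard deviation across samples gives a cheap per-dimension uncertainty estimate.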

[CV-10] Human-Centric Foundation Models: Perception Generation and Agentic Modeling

[Quick Read]: This paper addresses the modeling of digital humans and the understanding and generation of humanoid embodiments. Its key contribution is a taxonomy of Human-centric Foundation Models (HcFMs) that groups existing approaches into four categories: (1) human-centric perception foundation models for multi-modal 2D and 3D understanding; (2) human-centric AIGC foundation models for generating high-fidelity, diverse human-related content; (3) unified perception and generation models that integrate both capabilities to enhance human understanding and synthesis; and (4) human-centric agentic foundation models that go beyond perception and generation to learn human-like intelligence and interactive behaviors. Using this taxonomy, the paper surveys state-of-the-art techniques and discusses emerging challenges and future research directions.

Link: https://arxiv.org/abs/2502.08556
Authors: Shixiang Tang, Yizhou Wang, Lu Chen, Yuan Wang, Sida Peng, Dan Xu, Wanli Ouyang
Affiliations: The Chinese University of Hong Kong; State Key Lab of CAD&CG, Zhejiang University; Tsinghua University; Hong Kong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 9 pages

Click to view abstract

Abstract:Human understanding and generation are critical for modeling digital humans and humanoid embodiments. Recently, Human-centric Foundation Models (HcFMs) inspired by the success of generalist models, such as large language and vision models, have emerged to unify diverse human-centric tasks into a single framework, surpassing traditional task-specific approaches. In this survey, we present a comprehensive overview of HcFMs by proposing a taxonomy that categorizes current approaches into four groups: (1) Human-centric Perception Foundation Models that capture fine-grained features for multi-modal 2D and 3D understanding. (2) Human-centric AIGC Foundation Models that generate high-fidelity, diverse human-related content. (3) Unified Perception and Generation Models that integrate these capabilities to enhance both human understanding and synthesis. (4) Human-centric Agentic Foundation Models that extend beyond perception and generation to learn human-like intelligence and interactive behaviors for humanoid embodied tasks. We review state-of-the-art techniques, discuss emerging challenges and future research directions. This survey aims to serve as a roadmap for researchers and practitioners working towards more robust, versatile, and intelligent digital human and embodiments modeling.
zh

[CV-11] Copula-based mixture model identification for subgroup clustering with imaging applications

[Quick Read]: This paper addresses a limitation of model-based clustering techniques when handling non-canonical mixtures: most studies focus on canonical mixtures with a single component distribution form. The key solution is to introduce Copula-Based Mixture Models (CBMMs), which allow heterogeneous component distributions composed of flexible choices of marginal and copula forms. To identify these CBMMs, the paper proposes an adaptation of the Generalized Iterative Conditional Estimation (GICE) algorithm, which iteratively estimates the forms and parameters of the marginals and copulas in an unsupervised manner.

Link: https://arxiv.org/abs/2502.08549
Authors: Fei Zheng, Nicolas Duchateau
Affiliations: Univ Lyon, INSA-Lyon, Université Claude Bernard Lyon 1, UJM-Saint Etienne, CNRS, Inserm, CREATIS UMR 5220, U1294, LYON, France; Institut Universitaire de France (IUF)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Model-based clustering techniques have been widely applied to various application areas, while most studies focus on canonical mixtures with unique component distribution form. However, this strict assumption is often hard to satisfy. In this paper, we consider the more flexible Copula-Based Mixture Models (CBMMs) for clustering, which allow heterogeneous component distributions composed by flexible choices of marginal and copula forms. More specifically, we propose an adaptation of the Generalized Iterative Conditional Estimation (GICE) algorithm to identify the CBMMs in an unsupervised manner, where the marginal and copula forms and their parameters are estimated iteratively. GICE is adapted from its original version developed for switching Markov model identification with the choice of realization time. Our CBMM-GICE clustering method is then tested on synthetic two-cluster data (N=2000 samples) with discussion of the factors impacting its convergence. Finally, it is compared to the Expectation Maximization identified mixture models with unique component form on the entire MNIST database (N=70000), and on real cardiac magnetic resonance data (N=276) to illustrate its value for imaging applications.
zh
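As a toy illustration of the copula-mixture idea, the sketch below computes E-step responsibilities for a two-component mixture of Clayton copula densities with uniform marginals (so the joint density equals the copula density). The full GICE procedure, which also re-selects marginal and copula forms at each iteration, is not shown; the weights and parameters here are hypothetical.

```python
import numpy as np

def clayton_density(u, v, theta):
    """Clayton copula density on (0,1)^2, theta > 0."""
    s = u**(-theta) + v**(-theta) - 1.0
    return (theta + 1.0) * (u * v)**(-theta - 1.0) * s**(-2.0 - 1.0 / theta)

def e_step(points, weights, thetas):
    """Posterior component responsibilities for a 2-component copula mixture.

    points: (n, 2) pseudo-observations in (0, 1)^2 (uniform marginals
    assumed, so each component density is the copula density itself).
    """
    dens = np.stack([w * clayton_density(points[:, 0], points[:, 1], t)
                     for w, t in zip(weights, thetas)], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)

pts = np.array([[0.2, 0.25], [0.9, 0.1], [0.6, 0.65]])
gamma = e_step(pts, weights=[0.5, 0.5], thetas=[0.5, 5.0])
```

Points with strongly anti-dependent coordinates, such as (0.9, 0.1), receive most of their responsibility from the weakly dependent component.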

[CV-12] Moment of Untruth: Dealing with Negative Queries in Video Moment Retrieval

[Quick Read]: This paper addresses the false-positive predictions that existing video moment retrieval methods produce when given irrelevant queries. To tackle this, it proposes the task of Negative-Aware Video Moment Retrieval (NA-VMR), which considers both moment retrieval accuracy and negative query rejection accuracy. The paper distinguishes between in-domain and out-of-domain negative queries and provides new evaluation benchmarks for two popular datasets, QVHighlights and Charades-STA. To tackle NA-VMR, it proposes UniVTG-NA, an adaptation of the UniVTG model. UniVTG-NA achieves an average negative rejection accuracy of 98.4% while retaining high moment retrieval recall.

Link: https://arxiv.org/abs/2502.08544
Authors: Kevin Flanagan, Dima Damen, Michael Wray
Affiliations: University of Bristol
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 9 figures

Click to view abstract

Abstract:Video Moment Retrieval is a common task to evaluate the performance of visual-language models - it involves localising start and end times of moments in videos from query sentences. The current task formulation assumes that the queried moment is present in the video, resulting in false positive moment predictions when irrelevant query sentences are provided. In this paper we propose the task of Negative-Aware Video Moment Retrieval (NA-VMR), which considers both moment retrieval accuracy and negative query rejection accuracy. We make the distinction between In-Domain and Out-of-Domain negative queries and provide new evaluation benchmarks for two popular video moment retrieval datasets: QVHighlights and Charades-STA. We analyse the ability of current SOTA video moment retrieval approaches to adapt to Negative-Aware Video Moment Retrieval and propose UniVTG-NA, an adaptation of UniVTG designed to tackle NA-VMR. UniVTG-NA achieves high negative rejection accuracy (avg. 98.4%) scores while retaining moment retrieval scores to within 3.87% Recall@1. Dataset splits and code are available at this https URL
zh
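The rejection side of NA-VMR can be sketched as a simple confidence threshold on top of an ordinary retriever. This is an illustrative baseline, not the mechanism used by UniVTG-NA; the candidate scores and threshold below are made up.

```python
def retrieve_with_rejection(moment_scores, tau=0.5):
    """Return the top-scoring (start, end) moment, or None to reject.

    moment_scores: list of ((start, end), confidence) candidate moments
    produced by a moment retrieval model for one query. If no candidate
    clears the confidence threshold tau, the query is treated as negative.
    """
    best_moment, best_score = max(moment_scores, key=lambda m: m[1])
    return best_moment if best_score >= tau else None

# Relevant query: one confident candidate survives.
pos = retrieve_with_rejection([((2.0, 7.5), 0.91), ((10.0, 12.0), 0.40)])
# Irrelevant query: every candidate is low-confidence, so it is rejected.
neg = retrieve_with_rejection([((0.0, 3.0), 0.22), ((5.0, 6.0), 0.18)])
```

Raising tau trades moment retrieval recall for negative rejection accuracy, which is exactly the tension the NA-VMR benchmarks measure.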

[CV-13] A Survey on Image Quality Assessment: Insights Analysis and Future Outlook

[Quick Read]: This survey addresses the selection and optimization of image quality assessment (IQA) methods across application scenarios. Its key contribution is an analysis of the advantages and limitations of current approaches and a call for distortion-specific IQA methods tailored to particular scenarios, emphasizing that future work should prioritize practicality, interpretability, and ease of implementation. The survey covers IQA techniques ranging from conventional statistical measures to modern deep learning models such as convolutional neural networks (CNNs) and Transformers, serving as a useful reference for both beginners and experienced researchers.

Link: https://arxiv.org/abs/2502.08540
Authors: Chengqian Ma, Zhengyi Shi, Zhiqiang Lu, Shenghao Xie, Fei Chao, Yao Sui
Affiliations: National Institute of Health Data Science, Peking University, Beijing, China; School of Informatics, Xiamen University, Xiamen, Fujian, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Image quality assessment (IQA) represents a pivotal challenge in image-focused technologies, significantly influencing the advancement trajectory of image processing and computer vision. Recently, IQA has witnessed a notable surge in innovative research efforts, driven by the emergence of novel architectural paradigms and sophisticated computational techniques. This survey delivers an extensive analysis of contemporary IQA methodologies, organized according to their application scenarios, serving as a beneficial reference for both beginners and experienced researchers. We analyze the advantages and limitations of current approaches and suggest potential future research pathways. The survey encompasses both general and specific IQA methodologies, including conventional statistical measures, machine learning techniques, and cutting-edge deep learning models such as convolutional neural networks (CNNs) and Transformer models. The analysis within this survey highlights the necessity for distortion-specific IQA methods tailored to various application scenarios, emphasizing the significance of practicality, interpretability, and ease of implementation in future developments.
zh
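As an example of the "conventional statistical measures" the survey covers, here is PSNR, one of the simplest full-reference IQA metrics (the toy images are synthetic):

```python
import numpy as np

def psnr(reference, distorted, max_val=255.0):
    """Peak signal-to-noise ratio, a classic full-reference IQA measure."""
    mse = np.mean((reference.astype(np.float64)
                   - distorted.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(32, 32)).astype(np.float64)
noisy = np.clip(ref + rng.normal(scale=5.0, size=ref.shape), 0, 255)
score = psnr(ref, noisy)
```

PSNR correlates only loosely with perceived quality, which is one motivation the survey gives for learned, scenario-specific IQA models.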

[CV-14] Referring Remote Sensing Image Segmentation via Bidirectional Alignment Guided Joint Prediction

[Quick Read]: This paper tackles the challenges of Referring Remote Sensing Image Segmentation (RRSIS), in particular the vision-language gap, high spatial resolution, diverse target categories, and the difficulty of recognizing small targets. It proposes a new framework (denoted \ours in the paper) whose key components are Bidirectional Spatial Correlation (BSC) to strengthen vision-language feature alignment, a Target-Background TwinStream Decoder (T-BTD) to precisely distinguish targets from non-targets, and a Dual-Modal Object Learning Strategy (D-MOLS) for robust multimodal feature reconstruction. These innovations significantly improve overall IoU (oIoU) and mean IoU (mIoU), effectively addressing the core challenges of RRSIS.

Link: https://arxiv.org/abs/2502.08486
Authors: Tianxiang Zhang, Zhaokun Wen, Bo Kong, Kecheng Liu, Yisi Zhang, Peixian Zhuang, Jiangyun Li
Affiliations: School of Automation and Electrical Engineering, University of Science and Technology Beijing; Key Laboratory of Knowledge Automation for Industrial Processes, Ministry of Education, the School of Automation and Electrical Engineering, University of Science and Technology Beijing; Shunde Graduate School of University of Science and Technology Beijing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Referring Remote Sensing Image Segmentation (RRSIS) is critical for ecological monitoring, urban planning, and disaster management, requiring precise segmentation of objects in remote sensing imagery guided by textual descriptions. This task is uniquely challenging due to the considerable vision-language gap, the high spatial resolution and broad coverage of remote sensing imagery with diverse categories and small targets, and the presence of clustered, unclear targets with blurred edges. To tackle these issues, we propose \ours, a novel framework designed to bridge the vision-language gap, enhance multi-scale feature interaction, and improve fine-grained object differentiation. Specifically, \ours introduces: (1) the Bidirectional Spatial Correlation (BSC) for improved vision-language feature alignment, (2) the Target-Background TwinStream Decoder (T-BTD) for precise distinction between targets and non-targets, and (3) the Dual-Modal Object Learning Strategy (D-MOLS) for robust multimodal feature reconstruction. Extensive experiments on the benchmark datasets RefSegRS and RRSIS-D demonstrate that \ours achieves state-of-the-art performance. Specifically, \ours improves the overall IoU (oIoU) by 3.76 percentage points (80.57) and 1.44 percentage points (79.23) on the two datasets, respectively. Additionally, it outperforms previous methods in the mean IoU (mIoU) by 5.37 percentage points (67.95) and 1.84 percentage points (66.04), effectively addressing the core challenges of RRSIS with enhanced precision and robustness.
zh
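The two IoU metrics reported above aggregate differently: oIoU pools intersections and unions over the whole dataset (so large objects dominate), while mIoU averages per-sample ratios (so every sample counts equally). A small sketch with toy masks:

```python
import numpy as np

def oiou_and_miou(preds, gts):
    """Compute overall IoU (dataset-pooled) and mean IoU (per-sample)."""
    inters, unions, per_sample = 0, 0, []
    for p, g in zip(preds, gts):
        i = np.logical_and(p, g).sum()
        u = np.logical_or(p, g).sum()
        inters += i
        unions += u
        per_sample.append(i / u if u > 0 else 1.0)
    return inters / unions, float(np.mean(per_sample))

# Two toy masks: one perfect big object, one half-missed small object.
p1 = np.ones((10, 10), bool); g1 = p1.copy()
p2 = np.zeros((10, 10), bool); g2 = np.zeros((10, 10), bool)
g2[:2, :2] = True; p2[:1, :2] = True
oiou, miou = oiou_and_miou([p1, p2], [g1, g2])
```

Here the half-missed small object barely moves oIoU but drags mIoU down to 0.75, which is why RRSIS papers report both.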

[CV-15] Training-Free Restoration of Pruned Neural Networks

[Quick Read]: This paper addresses the fact that restoring accuracy after network pruning typically relies on an expensive retraining process and requires the original data. The key contribution is a more rigorous and robust fine-tuning-free and data-free method called LBYL (Leave Before You Leave). Each pruned neuron leaves pieces of its information to as many preserved neurons as possible, so that multiple neurons jointly form a more robust approximation of the pruned neuron's original output. The method is grounded in a theoretical analysis of the reconstruction error between the original network and its approximation, which yields a closed-form loss function. Experiments confirm that LBYL approximates the original network more effectively and thus achieves higher accuracy for the restored networks.

Link: https://arxiv.org/abs/2502.08474
Authors: Keonho Lee, Minsoo Kim, Dong-Wan Choi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review in TNNLS since May 2022

Click to view abstract

Abstract:Although network pruning has been highly popularized to compress deep neural networks, its resulting accuracy heavily depends on a fine-tuning process that is often computationally expensive and requires the original data. However, this may not be the case in real-world scenarios, and hence a few recent works attempt to restore pruned networks without any expensive retraining process. Their strong assumption is that every neuron being pruned can be replaced with another one quite similar to it, but unfortunately this does not hold in many neural networks, where the similarity between neurons is extremely low in some layers. In this article, we propose a more rigorous and robust method of restoring pruned networks in a fine-tuning free and data-free manner, called LBYL (Leave Before You Leave). LBYL significantly relaxes the aforementioned assumption in a way that each pruned neuron leaves its pieces of information to as many preserved neurons as possible and thereby multiple neurons together obtain a more robust approximation to the original output of the neuron who just left. Our method is based on a theoretical analysis on how to formulate the reconstruction error between the original network and its approximation, which nicely leads to a closed form solution for our derived loss function. Through the extensive experiments, LBYL is confirmed to be indeed more effective to approximate the original network and consequently able to achieve higher accuracy for restored networks, compared to the recent approaches exploiting the similarity between two neurons. The very first version of this work, which contains major technical and theoretical components, was submitted to NeurIPS 2021 and ICML 2022.
zh
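The LBYL idea of letting several preserved neurons jointly absorb a pruned neuron's information can be illustrated with a least-squares sketch on a purely linear two-layer stack. This is not the paper's closed-form solution (which handles nonlinear networks); it only shows the redistribution principle.

```python
import numpy as np

def redistribute_pruned_neuron(W, W_next, j):
    """Prune neuron j of a linear layer and compensate in the next layer.

    W:      (h, d) weights of the layer whose neuron j is pruned
    W_next: (o, h) weights of the following layer
    The pruned neuron's weight row is approximated (least squares) as a
    combination of the kept rows, and that combination is folded into
    W_next so several neurons jointly absorb the lost information.
    """
    keep = [i for i in range(W.shape[0]) if i != j]
    # coeffs solves W[j] ~= coeffs @ W[keep] in the least-squares sense.
    coeffs, *_ = np.linalg.lstsq(W[keep].T, W[j], rcond=None)
    W_pruned = W[keep]
    W_next_new = W_next[:, keep] + np.outer(W_next[:, j], coeffs)
    return W_pruned, W_next_new

# If the pruned row lies in the span of the kept rows, a purely linear
# two-layer stack is reproduced exactly after redistribution.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 5)); W[3] = 0.5 * W[0] - 2.0 * W[1]
W_next = rng.normal(size=(2, 4))
Wp, Wn = redistribute_pruned_neuron(W, W_next, j=3)
x = rng.normal(size=5)
orig = W_next @ (W @ x)
restored = Wn @ (Wp @ x)
```

With activations between the layers the reconstruction is only approximate, which is where the paper's error analysis comes in.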

[CV-16] Handwritten Text Recognition: A Survey

[Quick Read]: This survey reviews the evolution and current state of Handwritten Text Recognition (HTR), tracing its progression from early heuristic-based models to today's state-of-the-art deep neural networks. Its key contribution is a unified framework that examines research methodologies, recent advances in benchmarking, key datasets, and the results reported in the literature. The survey also identifies pressing research challenges and outlines promising future directions to guide researchers and practitioners, highlighting the need for systems that handle complexity from the word level up to the document level.

Link: https://arxiv.org/abs/2502.08417
Authors: Carlos Garrido-Munoz, Antonio Rios-Vila, Jorge Calvo-Zaragoza
Affiliations: University of Alicante
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Handwritten Text Recognition (HTR) has become an essential field within pattern recognition and machine learning, with applications spanning historical document preservation to modern data entry and accessibility solutions. The complexity of HTR lies in the high variability of handwriting, which makes it challenging to develop robust recognition systems. This survey examines the evolution of HTR models, tracing their progression from early heuristic-based approaches to contemporary state-of-the-art neural models, which leverage deep learning techniques. The scope of the field has also expanded, with models initially capable of recognizing only word-level content progressing to recent end-to-end document-level approaches. Our paper categorizes existing work into two primary levels of recognition: (1) \emphup to line-level, encompassing word and line recognition, and (2) \emphbeyond line-level, addressing paragraph- and document-level challenges. We provide a unified framework that examines research methodologies, recent advances in benchmarking, key datasets in the field, and a discussion of the results reported in the literature. Finally, we identify pressing research challenges and outline promising future directions, aiming to equip researchers and practitioners with a roadmap for advancing the field.
zh
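HTR systems like those surveyed above are commonly compared using the character error rate (CER), the Levenshtein distance normalized by reference length. A self-contained sketch of this standard metric, shown for orientation rather than taken from the survey:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # edit distances for the empty prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n] / m if m else 0.0

err = cer("handwriting", "handwritten")
```

Word error rate (WER) is computed the same way over token sequences instead of characters.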

[CV-17] ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification CVPR2024

[Quick Read]: This paper addresses two problems with multiple instance learning (MIL)-based frameworks for whole slide images (WSIs): they depend on a large number of bag-level labels and learn only from the original slides, making them vulnerable to shifts in data distribution. Vision-language model (VLM)-based methods introduce language priors via pre-training on large-scale pathological image-text pairs, but their text prompts ignore pathological prior knowledge and thus do not substantially boost performance, and collecting such pairs and pre-training are time-consuming. To solve these problems, the paper proposes a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for WSI classification. Its key idea is a dual-scale visual descriptive text prompt based on a frozen large language model (LLM) to effectively boost VLM performance. In addition, a prototype-guided patch decoder and a context-guided text decoder strengthen the feature representations of the image and text branches, respectively.

Link: https://arxiv.org/abs/2502.08391
Authors: Jiangbo Shi, Chen Li, Tieliang Gong, Yefeng Zheng, Huazhu Fu
Affiliations: School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China; Jarvis Research Center, Tencent YouTu Lab, Shenzhen, China; Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2024 (Updated version with corrections for typos and errors.)

Click to view abstract

Abstract:Multiple instance learning (MIL)-based framework has become the mainstream for processing the whole slide image (WSI) with giga-pixel size and hierarchical image context in digital pathology. However, these methods heavily depend on a substantial number of bag-level labels and solely learn from the original slides, which are easily affected by variations in data distribution. Recently, vision language model (VLM)-based methods introduced the language prior by pre-training on large-scale pathological image-text pairs. However, the previous text prompt lacks the consideration of pathological prior knowledge, therefore does not substantially boost the model’s performance. Moreover, the collection of such pairs and the pre-training process are very time-consuming and costly. To solve the above problems, we propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification. Specifically, we propose a dual-scale visual descriptive text prompt based on the frozen large language model (LLM) to boost the performance of VLM effectively. To transfer the VLM to process WSI efficiently, for the image branch, we propose a prototype-guided patch decoder to aggregate the patch features progressively by grouping similar patches into the same prototype; for the text branch, we introduce a context-guided text decoder to enhance the text features by incorporating the multi-granular image contexts. Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.
zh
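The prototype-guided aggregation idea (grouping similar patches into the same prototype and pooling them) can be sketched as follows. This is a simplified nearest-prototype mean-pooling stand-in, not ViLa-MIL's actual decoder; the toy features and prototypes are invented.

```python
import numpy as np

def prototype_pool(patch_feats, prototypes):
    """Group patch features by nearest prototype and mean-pool each group.

    patch_feats: (n, d), prototypes: (k, d).
    Returns (k, d) pooled features; an empty group falls back to its
    prototype vector.
    """
    d2 = ((patch_feats[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)           # nearest prototype per patch
    pooled = prototypes.copy()
    for k in range(prototypes.shape[0]):
        members = patch_feats[assign == k]
        if len(members):
            pooled[k] = members.mean(axis=0)
    return pooled, assign

feats = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]])
protos = np.array([[0.0, 0.0], [5.0, 5.0]])
pooled, assign = prototype_pool(feats, protos)
```

Pooling per prototype compresses thousands of patch features into a handful of slide-level vectors, which keeps the downstream classifier tractable for giga-pixel WSIs.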

[CV-18] Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features

[Quick Read]: This paper addresses the problem that, in video frames where dynamic and static regions are interwoven, existing methods struggle to capture dynamic information accurately and tend to overfit static regions, producing results with blurry textures. The key solution is a dynamic-static feature decoupling module (DSFD): along the temporal axis, the portions of current-frame features that differ significantly from reference-frame features are treated as dynamic features, and the rest as static features, thereby enhancing the dynamic representation. A temporal-spatial similarity fusion module (TSSF) is further designed to enhance the dynamic features from different viewpoints and ensure accurate motion prediction.

Link: https://arxiv.org/abs/2502.08377
Authors: Liying Yang, Chen Liu, Zhenwei Zhu, Ajian Liu, Hui Ma, Jian Nong, Yanyan Liang
Affiliations: Macau University of Science and Technology; The University of Queensland; Institute of Automation, Chinese Academy of Sciences (CASIA)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recently, the generation of dynamic 3D objects from a video has shown impressive results. Existing methods directly optimize Gaussians using whole information in frames. However, when dynamic regions are interwoven with static regions within frames, particularly if the static regions account for a large proportion, existing methods often overlook information in dynamic regions and are prone to overfitting on static regions. This leads to producing results with blurry textures. We consider that decoupling dynamic-static features to enhance dynamic representations can alleviate this issue. Thus, we propose a dynamic-static feature decoupling module (DSFD). Along temporal axes, it regards the portions of current frame features that possess significant differences relative to reference frame features as dynamic features. Conversely, the remaining parts are the static features. Then, we acquire decoupled features driven by dynamic features and current frame features. Moreover, to further enhance the dynamic representation of decoupled features from different viewpoints and ensure accurate motion prediction, we design a temporal-spatial similarity fusion module (TSSF). Along spatial axes, it adaptively selects a similar information of dynamic regions. Hinging on the above, we construct a novel approach, DS4D. Experimental results verify our method achieves state-of-the-art (SOTA) results in video-to-4D. In addition, the experiments on a real-world scenario dataset demonstrate its effectiveness on the 4D scene. Our code will be publicly available.
zh
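The DSFD principle, treating locations whose features differ strongly from the reference frame as dynamic and the rest as static, can be sketched with a simple temporal-difference threshold. The quantile rule here is an invented stand-in for the paper's learned decoupling.

```python
import numpy as np

def decouple_dynamic_static(cur, ref, quantile=0.8):
    """Split current-frame features into dynamic and static parts.

    cur, ref: (h, w, c) feature maps of the current and reference frames.
    Locations whose temporal difference magnitude exceeds the given
    quantile are marked dynamic; the rest are static.
    """
    diff = np.linalg.norm(cur - ref, axis=-1)   # (h, w) change magnitude
    thresh = np.quantile(diff, quantile)
    dyn_mask = diff > thresh
    dynamic = cur * dyn_mask[..., None]
    static = cur * (~dyn_mask)[..., None]
    return dynamic, static, dyn_mask

ref = np.zeros((4, 4, 2))
cur = np.zeros((4, 4, 2)); cur[0, 0] = 3.0      # one moving region
dyn, sta, mask = decouple_dynamic_static(cur, ref)
```

The two parts always recompose the original features, so downstream modules can weight the dynamic branch more heavily without losing static content.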

[CV-19] AdvSwap: Covert Adversarial Perturbation with High Frequency Info-swapping for Autonomous Driving Perception ITSC

[Quick Read]: This paper addresses the susceptibility of autonomous vehicle (AV) perception modules to adversarial attacks, which exploit vulnerabilities in neural networks through adversarial inputs and thereby compromise AI safety. The key solution is a novel attack method, AdvSwap, which creatively uses wavelet-based high-frequency information swapping to generate covert adversarial samples that fool the camera. AdvSwap employs an invertible neural network for selective high-frequency information swapping, preserving both forward propagation and data integrity. The scheme effectively removes the original label data and incorporates guidance image data, producing concealed and robust adversarial samples.

Link: https://arxiv.org/abs/2502.08374
Authors: Yuanhao Huang, Qinfan Zhang, Jiandong Xing, Mengyue Cheng, Haiyang Yu, Yilong Ren, Xiao Xiong
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27th IEEE International Conference on Intelligent Transportation Systems (ITSC)

Click to view abstract

Abstract:Perception module of Autonomous vehicles (AVs) are increasingly susceptible to be attacked, which exploit vulnerabilities in neural networks through adversarial inputs, thereby compromising the AI safety. Some researches focus on creating covert adversarial samples, but existing global noise techniques are detectable and difficult to deceive the human visual system. This paper introduces a novel adversarial attack method, AdvSwap, which creatively utilizes wavelet-based high-frequency information swapping to generate covert adversarial samples and fool the camera. AdvSwap employs invertible neural network for selective high-frequency information swapping, preserving both forward propagation and data integrity. The scheme effectively removes the original label data and incorporates the guidance image data, producing concealed and robust adversarial samples. Experimental evaluations and comparisons on the GTSRB and nuScenes datasets demonstrate that AdvSwap can make concealed attacks on common traffic targets. The generates adversarial samples are also difficult to perceive by humans and algorithms. Meanwhile, the method has strong attacking robustness and attacking transferability.
zh
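The core AdvSwap operation, swapping high-frequency wavelet information from a guide image into a source image, can be sketched with a single-level Haar transform. The real method uses an invertible neural network for selective swapping; this sketch only shows the subband exchange on synthetic arrays.

```python
import numpy as np

def haar2d(x):
    """Single-level orthonormal 2D Haar transform (even-sized array)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return ((a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2)

def ihaar2d(ll, lh, hl, hh):
    """Exact inverse of haar2d."""
    x = np.empty((2 * ll.shape[0], 2 * ll.shape[1]))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def swap_high_freq(src, guide):
    """Keep the source's low-frequency band; take the guide's
    high-frequency bands (an AdvSwap-style swap, one Haar level)."""
    ll, _, _, _ = haar2d(src)
    _, lh, hl, hh = haar2d(guide)
    return ihaar2d(ll, lh, hl, hh)

rng = np.random.default_rng(0)
img, guide = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
adv = swap_high_freq(img, guide)
```

Because only high-frequency subbands change, the swapped image keeps the source's coarse appearance, which is what makes such perturbations hard for humans to notice.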

[CV-20] Uncertainty Aware Human-machine Collaboration in Camouflaged Object Detection

[Quick Read]: This paper aims to improve system reliability and decision accuracy in camouflaged object detection (COD). The key is a human-machine collaboration framework: a multiview backbone estimates the uncertainty of computer vision (CV) model predictions, this uncertainty is exploited during training to improve efficiency, and at test time low-confidence cases are deferred to human experts for evaluation via an RSVP-based brain-computer interface (BCI), enabling more reliable decision-making. The approach improves overall performance and achieves state-of-the-art results on the CAMO dataset, with average gains of 4.56% in balanced accuracy (BA) and 3.66% in F1 score.

Link: https://arxiv.org/abs/2502.08373
Authors: Ziyue Yang, Kehan Wang, Yuhang Ming, Yong Peng, Han Yang, Qiong Chen, Wanzeng Kong
Affiliations: Hangzhou Dianzi University; CETHIK
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Camouflaged Object Detection (COD), the task of identifying objects concealed within their environments, has seen rapid growth due to its wide range of practical applications. A key step toward developing trustworthy COD systems is the estimation and effective utilization of uncertainty. In this work, we propose a human-machine collaboration framework for classifying the presence of camouflaged objects, leveraging the complementary strengths of computer vision (CV) models and noninvasive brain-computer interfaces (BCIs). Our approach introduces a multiview backbone to estimate uncertainty in CV model predictions, utilizes this uncertainty during training to improve efficiency, and defers low-confidence cases to human evaluation via RSVP-based BCIs during testing for more reliable decision-making. We evaluated the framework in the CAMO dataset, achieving state-of-the-art results with an average improvement of 4.56% in balanced accuracy (BA) and 3.66% in the F1 score compared to existing methods. For the best-performing participants, the improvements reached 7.6% in BA and 6.66% in the F1 score. Analysis of the training process revealed a strong correlation between our confidence measures and precision, while an ablation study confirmed the effectiveness of the proposed training policy and the human-machine collaboration strategy. In general, this work reduces human cognitive load, improves system reliability, and provides a strong foundation for advancements in real-world COD applications and human-computer interaction. Our code and data are available at: this https URL.
zh
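The human-machine routing policy, deferring low-confidence cases to a human evaluator, can be sketched as a confidence threshold. The probabilities and threshold below are hypothetical, and the sketch omits the BCI-based human stage itself.

```python
import numpy as np

def route_predictions(probs, tau=0.8):
    """Accept confident model predictions; defer the rest to a human.

    probs: (n, 2) softmax outputs for 'camouflaged object present/absent'.
    Returns (decisions with -1 where deferred, boolean deferral mask).
    """
    conf = probs.max(axis=1)
    defer = conf < tau
    decisions = probs.argmax(axis=1)
    decisions[defer] = -1                # -1 marks "ask the human"
    return decisions, defer

probs = np.array([[0.95, 0.05], [0.55, 0.45], [0.10, 0.90]])
decisions, defer = route_predictions(probs, tau=0.8)
```

Lowering tau reduces human workload at the cost of accepting more uncertain machine decisions, the trade-off the paper's framework is designed to balance.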

[CV-21] Sat-DN: Implicit Surface Reconstruction from Multi-View Satellite Images with Depth and Normal Supervision

[Quick Read]: This paper addresses the difficulty of accurately reconstructing terrain geometry and detailed building facades from high-resolution multi-view satellite imagery. Traditional stereo matching struggles to capture fine details, and while neural radiance fields (NeRFs) achieve high-quality reconstructions, their training time is prohibitively long; low facade visibility, illumination and style differences between pixels, and weakly textured regions further complicate reconstruction. To address these issues, the paper proposes Sat-DN, whose key is a progressively trained multi-resolution hash grid reconstruction architecture with explicit depth guidance and surface normal consistency constraints. The multi-resolution hash grid accelerates training, and the progressive strategy incrementally raises the learning frequency, using coarse low-frequency geometry to guide the reconstruction of fine high-frequency details; the depth and normal constraints ensure clear building outlines and correct planar distributions.

Link: https://arxiv.org/abs/2502.08352
Authors: Tianle Liu, Shuangming Zhao, Wanshou Jiang, Bingxuan Guo
Affiliations: School of Remote Sensing Information Engineering, Wuhan University; State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:With advancements in satellite imaging technology, acquiring high-resolution multi-view satellite imagery has become increasingly accessible, enabling rapid and location-independent ground model reconstruction. However, traditional stereo matching methods struggle to capture fine details, and while neural radiance fields (NeRFs) achieve high-quality reconstructions, their training time is prohibitively long. Moreover, challenges such as low visibility of building facades, illumination and style differences between pixels, and weakly textured regions in satellite imagery further make it hard to reconstruct reasonable terrain geometry and detailed building facades. To address these issues, we propose Sat-DN, a novel framework leveraging a progressively trained multi-resolution hash grid reconstruction architecture with explicit depth guidance and surface normal consistency constraints to enhance reconstruction quality. The multi-resolution hash grid accelerates training, while the progressive strategy incrementally increases the learning frequency, using coarse low-frequency geometry to guide the reconstruction of fine high-frequency details. The depth and normal constraints ensure a clear building outline and correct planar distribution. Extensive experiments on the DFC2019 dataset demonstrate that Sat-DN outperforms existing methods, achieving state-of-the-art results in both qualitative and quantitative evaluations. The code is available at this https URL.
zh
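The coarse-to-fine training strategy can be illustrated with a schedule that fades in finer grid levels as training progresses. This weighting rule is an invented simplification for illustration, not Sat-DN's actual schedule.

```python
import numpy as np

def level_weights(step, total_steps, n_levels, warmup_frac=0.8):
    """Coarse-to-fine weights for multi-resolution grid levels.

    Level 0 (coarsest) is always on; finer levels fade in linearly as
    training progresses, so low-frequency geometry is learned first.
    """
    progress = min(step / (warmup_frac * total_steps), 1.0)
    active = progress * (n_levels - 1)      # fractional "active level"
    idx = np.arange(n_levels)
    return np.clip(active - idx + 1.0, 0.0, 1.0)

w_start = level_weights(0, 1000, 4)     # only the coarsest level is on
w_end = level_weights(1000, 1000, 4)    # all levels fully active
```

Multiplying each level's features by these weights before concatenation forces early optimization to fit coarse geometry, mirroring the progressive strategy described above.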

[CV-22] Hi-End-MAE: Hierarchical encoder-driven masked autoencoders are stronger vision learners for medical image segmentation

[Quick Read]: This paper addresses label scarcity in medical image segmentation. The proposed solution, Hierarchical Encoder-driven MAE (Hi-End-MAE), rests on two key innovations: encoder-driven reconstruction, which encourages the encoder to learn more informative features to guide the reconstruction of masked patches, and hierarchical dense decoding, which implements a hierarchical decoding structure to capture rich representations across different layers. These innovations improve the model's transfer learning performance across a variety of downstream tasks.

Link: https://arxiv.org/abs/2502.08347
Authors: Fenghe Tang, Qingsong Yao, Wenxin Ma, Chenxu Wu, Zihang Jiang, S. Kevin Zhou
Affiliations: School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC); Center for Medical Imaging, Robotics, and Analytic Computing & LEarning (MIRACLE), Suzhou Institute for Advanced Research, USTC; State Key Laboratory of Precision and Intelligent Chemistry, USTC; Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, Code: this https URL

Click to view abstract

Abstract:Medical image segmentation remains a formidable challenge due to the label scarcity. Pre-training Vision Transformer (ViT) through masked image modeling (MIM) on large-scale unlabeled medical datasets presents a promising solution, providing both computational efficiency and model generalization for various downstream tasks. However, current ViT-based MIM pre-training frameworks predominantly emphasize local aggregation representations in output layers and fail to exploit the rich representations across different ViT layers that better capture fine-grained semantic information needed for more precise medical downstream tasks. To fill the above gap, we hereby present Hierarchical Encoder-driven MAE (Hi-End-MAE), a simple yet effective ViT-based pre-training solution, which centers on two key innovations: (1) Encoder-driven reconstruction, which encourages the encoder to learn more informative features to guide the reconstruction of masked patches; and (2) Hierarchical dense decoding, which implements a hierarchical decoding structure to capture rich representations across different layers. We pre-train Hi-End-MAE on a large-scale dataset of 10K CT scans and evaluated its performance across seven public medical image segmentation benchmarks. Extensive experiments demonstrate that Hi-End-MAE achieves superior transfer learning capabilities across various downstream tasks, revealing the potential of ViT in medical imaging applications. The code is available at: this https URL
zh
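The masked-image-modeling setup that Hi-End-MAE builds on starts from random patch masking. A minimal sketch of that common MAE-style step (the hierarchical decoding that is the paper's contribution is not shown):

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, seed=0):
    """MIM-style random masking: keep a random subset of patch tokens.

    tokens: (n, d) patch embeddings. Returns (visible tokens, indices of
    visible patches, boolean mask over all patches with True = masked).
    """
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    keep = np.sort(perm[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep] = False
    return tokens[keep], keep, mask

toks = np.arange(16 * 4, dtype=float).reshape(16, 4)
visible, keep, mask = random_masking(toks, mask_ratio=0.75)
```

Only the visible tokens enter the encoder; the decoder then reconstructs the masked patches, which is the pre-training signal Hi-End-MAE strengthens with encoder-driven reconstruction.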

[CV-23] Foundation Models in Computational Pathology: A Review of Challenges Opportunities and Impact

[Quick Read]: This paper examines the potential of generative and multi-purpose AI in computational pathology and its practical integration into clinical workflows. The key points are raising evaluation standards by establishing global benchmarks and fostering the broad adoption and societal acceptance of these models, so that frontier AI can fully realize its impact on computational pathology.

Link: https://arxiv.org/abs/2502.08333
Authors: Mohsin Bilal, Aadam, Manahil Raza, Youssef Altherwy, Anas Alsuhaibani, Abdulrahman Abduljabbar, Fahdah Almarshad, Paul Golding, Nasir Rajpoot
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 63 pages, 7 figures

Click to view abstract

Abstract:From self-supervised, vision-only models to contrastive visual-language frameworks, computational pathology has rapidly evolved in recent years. Generative AI “co-pilots” now demonstrate the ability to mine subtle, sub-visual tissue cues across the cellular-to-pathology spectrum, generate comprehensive reports, and respond to complex user queries. The scale of data has surged dramatically, growing from tens to millions of multi-gigapixel tissue images, while the number of trainable parameters in these models has risen to several billion. The critical question remains: how will this new wave of generative and multi-purpose AI transform clinical diagnostics? In this article, we explore the true potential of these innovations and their integration into clinical practice. We review the rapid progress of foundation models in pathology, clarify their applications and significance. More precisely, we examine the very definition of foundational models, identifying what makes them foundational, general, or multipurpose, and assess their impact on computational pathology. Additionally, we address the unique challenges associated with their development and evaluation. These models have demonstrated exceptional predictive and generative capabilities, but establishing global benchmarks is crucial to enhancing evaluation standards and fostering their widespread clinical adoption. In computational pathology, the broader impact of frontier AI ultimately depends on widespread adoption and societal acceptance. While direct public exposure is not strictly necessary, it remains a powerful tool for dispelling misconceptions, building trust, and securing regulatory support.
zh

[CV-24] Screener: Self-supervised Pathology Segmentation Model for 3D Medical Images

[Quick Read]: This paper addresses the challenge of accurately segmenting all pathological findings in 3D medical images, which supervised models cannot achieve because they are restricted to the few pathology classes annotated in existing datasets. The paper reframes pathology segmentation as an unsupervised visual anomaly segmentation (UVAS) problem, exploiting the inherent rarity of pathological patterns relative to healthy tissue. The key components are: (1) dense self-supervised learning (SSL) for feature extraction, eliminating the need for supervised pre-training, and (2) learned, masking-invariant dense features as conditioning variables, replacing hand-crafted positional encodings. The resulting model, Screener, outperforms existing UVAS methods on four large-scale test datasets comprising 1,820 scans.

Link: https://arxiv.org/abs/2502.08321
Authors: Mikhail Goncharov, Eugenia Soboleva, Mariia Donskova, Ivan Oseledets, Marina Munkhoeva, Maxim Panov
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Accurate segmentation of all pathological findings in 3D medical images remains a significant challenge, as supervised models are limited to detecting only the few pathology classes annotated in existing datasets. To address this, we frame pathology segmentation as an unsupervised visual anomaly segmentation (UVAS) problem, leveraging the inherent rarity of pathological patterns compared to healthy ones. We enhance the existing density-based UVAS framework with two key innovations: (1) dense self-supervised learning (SSL) for feature extraction, eliminating the need for supervised pre-training, and (2) learned, masking-invariant dense features as conditioning variables, replacing hand-crafted positional encodings. Trained on over 30,000 unlabeled 3D CT volumes, our model, Screener, outperforms existing UVAS methods on four large-scale test datasets comprising 1,820 scans with diverse pathologies. Code and pre-trained models will be made publicly available.
zh
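上文的密度型 UVAS 思路可以用一个极简的打分示意来说明:在(绝大多数为健康组织的)训练特征上拟合高斯分布,以马氏距离(正比于负对数密度)作为异常分数。条件变量、密集 SSL 特征等论文细节此处从略,以下实现与变量命名均为假设性示意,并非论文官方代码。

```python
import numpy as np

def anomaly_score(feats, train_feats, eps=1e-6):
    # 在训练特征上拟合单个高斯; 病理体素因稀有而落在低密度区
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False) + eps * np.eye(train_feats.shape[1])
    inv = np.linalg.inv(cov)
    diff = feats - mu
    # 马氏距离平方, 与 -log p(x) 只差常数项
    return np.einsum('nd,dk,nk->n', diff, inv, diff)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 8))      # “健康”特征
normal = rng.normal(0.0, 1.0, size=(10, 8))
abnormal = normal + 6.0                          # 偏离分布的“病理”特征
scores = anomaly_score(np.vstack([normal, abnormal]), train)
```

可据此对每个体素特征打分,再取阈值得到异常分割掩码;偏离训练分布越远的特征,分数越高。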

[CV-25] When do they StOP?: A First Step Towards Automatically Identifying Team Communication in the Operating Room

【速读】:该论文旨在解决手术室团队沟通识别的问题,特别是通过自动检测所有手术团队成员参与的关键沟通环节——即“Time-out”和“StOP?”协议来提升患者安全及计算机辅助手术流程分析与术中支持系统的发展。解决方案的关键在于提出了一种新的群体活动检测方法,该方法通过编码场景上下文和动作特征,并使用高效的神经网络模型输出结果,从而在手术视频记录中定位这些沟通环节的起止时间。

链接: https://arxiv.org/abs/2502.08299
作者: Keqi Chen,Lilien Schewski,Vinkle Srivastav,Joël Lavanchy,Didier Mutter,Guido Beldi,Sandra Keller,Nicolas Padoy
机构: University of Strasbourg, CNRS, INSERM, ICube, UMR7357, France(斯特拉斯堡大学,CNRS,INSERM,ICube,UMR7357,法国); IHU Strasbourg, Strasbourg 67000, France(斯特拉斯堡IHU,法国); Department for Biomedical Research (DBMR), University of Bern, 3008 Bern, Switzerland(伯尔尼大学生物医学研究系,瑞士); University Digestive Health Care Center, Clarunis, Basel 4002, Switzerland(巴塞尔消化健康护理中心,Clarunis,瑞士); Department of Biomedical Engineering, University of Basel, Allschwil 4123, Switzerland(巴塞尔大学生物医学工程系,瑞士); University Hospital of Strasbourg, Strasbourg 67000, France(斯特拉斯堡大学医院,法国); Department for Visceral Surgery and Medicine, Bern University Hospital, University of Bern, 3010 Bern, Switzerland(伯尔尼大学医院,伯尔尼大学内脏外科和内科系,瑞士)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: Surgical performance depends not only on surgeons’ technical skills but also on team communication within and across the different professional groups present during the operation. Therefore, automatically identifying team communication in the OR is crucial for patient safety and advances in the development of computer-assisted surgical workflow analysis and intra-operative support systems. To take the first step, we propose a new task of detecting communication briefings involving all OR team members, i.e. the team Time-out and the StOP?-protocol, by localizing their start and end times in video recordings of surgical operations. Methods: We generate an OR dataset of real surgeries, called Team-OR, with more than one hundred hours of surgical videos captured by the multi-view camera system in the OR. The dataset contains temporal annotations of 33 Time-out and 22 StOP?-protocol activities in total. We then propose a novel group activity detection approach, where we encode both scene context and action features, and use an efficient neural network model to output the results. Results: The experimental results on the Team-OR dataset show that our approach outperforms existing state-of-the-art temporal action detection approaches. It also demonstrates the lack of research on group activities in the OR, proving the significance of our dataset. Conclusion: We investigate the Team Time-Out and the StOP?-protocol in the OR, by presenting the first OR dataset with temporal annotations of group activities protocols, and introducing a novel group activity detection approach that outperforms existing approaches. Code is available at this https URL .
zh

[CV-26] BEAM: Bridging Physically-based Rendering and Gaussian Modeling for Relightable Volumetric Video

【速读】:该论文旨在解决传统体积视频方法在固定光照条件下表现不佳以及神经网络方法在效率、质量或适应性方面存在的权衡问题。论文的关键解决方案是提出BEAM,一种结合4D高斯表示与基于物理的渲染(Physically-Based Rendering, PBR)的新管道,通过一系列高斯基础技术恢复详细的几何结构和PBR属性,从而生成高质量且可重新照明的体积视频。

链接: https://arxiv.org/abs/2502.08297
作者: Yu Hong,Yize Wu,Zhehao Shen,Chengcheng Guo,Yuheng Jiang,Yingliang Zhang,Jingyi Yu,Lan Xu
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Volumetric video enables immersive experiences by capturing dynamic 3D scenes, enabling diverse applications for virtual reality, education, and telepresence. However, traditional methods struggle with fixed lighting conditions, while neural approaches face trade-offs in efficiency, quality, or adaptability for relightable scenarios. To address these limitations, we present BEAM, a novel pipeline that bridges 4D Gaussian representations with physically-based rendering (PBR) to produce high-quality, relightable volumetric videos from multi-view RGB footage. BEAM recovers detailed geometry and PBR properties via a series of available Gaussian-based techniques. It first combines Gaussian-based performance tracking with geometry-aware rasterization in a coarse-to-fine optimization framework to recover spatially and temporally consistent geometries. We further enhance Gaussian attributes by incorporating PBR properties step by step. We generate roughness via a multi-view-conditioned diffusion model, and then derive AO and base color using a 2D-to-3D strategy, incorporating a tailored Gaussian-based ray tracer for efficient visibility computation. Once recovered, these dynamic, relightable assets integrate seamlessly into traditional CG pipelines, supporting real-time rendering with deferred shading and offline rendering with ray tracing. By offering realistic, lifelike visualizations under diverse lighting conditions, BEAM opens new possibilities for interactive entertainment, storytelling, and creative visualization.
zh

[CV-27] Fully-Geometric Cross-Attention for Point Cloud Registration

【速读】:该论文旨在解决点云配准在重叠区域较小时因噪声点对应关系导致的失败问题。关键在于引入了一种针对基于Transformer架构的新型交叉注意力机制,通过在超级点级别融合来自坐标和特征的信息,并将Gromov-Wasserstein距离整合到交叉注意力公式中,以联合计算不同点云之间的距离并考虑其几何结构。这种方法使得任意刚性变换下的两个不同点云中的点可以相互关注,从而提高了内点对应关系的数量,实现了更精确的配准结果。

链接: https://arxiv.org/abs/2502.08285
作者: Weijie Wang,Guofeng Mei,Jian Zhang,Nicu Sebe,Bruno Lepri,Fabio Poiesi
机构: University of Trento; Fondazione Bruno Kessler; University of Technology Sydney
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Point cloud registration approaches often fail when the overlap between point clouds is low due to noisy point correspondences. This work introduces a novel cross-attention mechanism tailored for Transformer-based architectures that tackles this problem, by fusing information from coordinates and features at the super-point level between point clouds. This formulation has remained unexplored primarily because it must guarantee rotation and translation invariance since point clouds reside in different and independent reference frames. We integrate the Gromov-Wasserstein distance into the cross-attention formulation to jointly compute distances between points across different point clouds and account for their geometric structure. By doing so, points from two distinct point clouds can attend to each other under arbitrary rigid transformations. At the point level, we also devise a self-attention mechanism that aggregates the local geometric structure information into point features for fine matching. Our formulation boosts the number of inlier correspondences, thereby yielding more precise registration results compared to state-of-the-art approaches. We have conducted an extensive evaluation on 3DMatch, 3DLoMatch, KITTI, and 3DCSR datasets.
zh
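论文的核心直觉是:把 Gromov-Wasserstein 距离引入交叉注意力,使处于不同参考系的两片点云在任意刚性变换下仍可相互比较。下面用均匀耦合下的 GW 代价给出一个极简示意,仅说明其旋转/平移不变性;函数与耦合取法均为假设,并非论文中的注意力实现(实际中耦合由最优传输求解)。

```python
import numpy as np

def gw_cost(X, Y):
    # 内部成对距离只依赖点云自身的几何结构, 与参考系无关
    D1 = np.linalg.norm(X[:, None] - X[None], axis=-1)   # (n, n)
    D2 = np.linalg.norm(Y[:, None] - Y[None], axis=-1)   # (m, m)
    n, m = len(X), len(Y)
    T = np.full((n, m), 1.0 / (n * m))                   # 均匀耦合(示意)
    # L[i, j, k, l] = (D1[i, k] - D2[j, l])^2
    L = (D1[:, None, :, None] - D2[None, :, None, :]) ** 2
    return float(np.einsum('ijkl,ij,kl->', L, T, T))

rng = np.random.default_rng(1)
X = rng.random((6, 3))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
Y = X @ R.T + np.array([1.0, -2.0, 0.5])                 # 对 X 做刚性变换
```

由于刚性变换保持内部距离,gw_cost(X, Y) 与 gw_cost(X, X) 相等,这正是论文借助 GW 结构实现跨参考系注意力的前提。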

[CV-28] UniCoRN: Unified Commented Retrieval Network with LMMs

【速读】:本文旨在解决复杂、组合查询中多模态检索方法在处理视觉内容推理方面的局限性,以及大型多模态模型(Large Multimodal Models, LMMs)虽能通过语言回答复杂视觉问题,但缺乏检索相关实体支持其答案的能力。为解决这些问题,论文提出UniCoRN (Unified Commented Retrieval Network),它结合了多模态检索方法和生成语言方法的优势,超越了检索增强生成(Retrieval-Augmented Generation, RAG)。UniCoRN的关键在于引入了一个实体适配器模块,该模块将检索到的多模态实体注入LMM中,使其在生成答案和评论时能够关注这些实体。通过保持基础LMM不变,UniCoRN保留了其原有功能,并能够在单一集成框架下执行检索和文本生成任务。

链接: https://arxiv.org/abs/2502.08254
作者: Maximilian Jaritz,Matthieu Guillaumin,Sabine Sternig,Loris Bazzani
机构: Amazon.com
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal retrieval methods have limitations in handling complex, compositional queries that require reasoning about the visual content of both the query and the retrieved entities. On the other hand, Large Multimodal Models (LMMs) can answer with language to more complex visual questions, but without the inherent ability to retrieve relevant entities to support their answers. We aim to address these limitations with UniCoRN, a Unified Commented Retrieval Network that combines the strengths of composed multimodal retrieval methods and generative language approaches, going beyond Retrieval-Augmented Generation (RAG). We introduce an entity adapter module to inject the retrieved multimodal entities back into the LMM, so it can attend to them while generating answers and comments. By keeping the base LMM frozen, UniCoRN preserves its original capabilities while being able to perform both retrieval and text generation tasks under a single integrated framework. To assess these new abilities, we introduce the Commented Retrieval task (CoR) and a corresponding dataset, with the goal of retrieving an image that accurately answers a given question and generate an additional textual response that provides further clarification and details about the visual information. We demonstrate the effectiveness of UniCoRN on several datasets showing improvements of +4.5% recall over the state of the art for composed multimodal retrieval and of +14.9% METEOR / +18.4% BEM over RAG for commenting in CoR.
zh

[CV-29] FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis

【速读】:该论文旨在解决相机可控视频生成的问题。解决方案的关键在于引入了一种基于光流的新型视频扩散模型——FloVD。FloVD通过利用光流图来表示相机和运动物体的动态,从而实现了无需精确相机参数即可直接从任意训练视频中估计光流的优势,并且能够通过背景运动实现详细的相机控制。这种两阶段的视频合成管道,包括光流生成和基于光流的视频合成,确保了自然物体运动的同时支持详细的相机控制。

链接: https://arxiv.org/abs/2502.08244
作者: Wonjoon Jin,Qi Dai,Chong Luo,Seung-Hwan Baek,Sunghyun Cho
机构: POSTECH(浦项科技大学); Microsoft Research Asia(微软亚洲研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL

点击查看摘要

Abstract:This paper presents FloVD, a novel optical-flow-based video diffusion model for camera-controllable video generation. FloVD leverages optical flow maps to represent motions of the camera and moving objects. This approach offers two key benefits. Since optical flow can be directly estimated from videos, our approach allows for the use of arbitrary training videos without ground-truth camera parameters. Moreover, as background optical flow encodes 3D correlation across different viewpoints, our method enables detailed camera control by leveraging the background motion. To synthesize natural object motion while supporting detailed camera control, our framework adopts a two-stage video synthesis pipeline consisting of optical flow generation and flow-conditioned video synthesis. Extensive experiments demonstrate the superiority of our method over previous approaches in terms of accurate camera control and natural object motion synthesis.
zh

[CV-30] Learning Human Skill Generators at Key-Step Levels

【速读】:该论文致力于解决在关键步骤层面学习人类技能生成的问题。当前视频生成模型难以处理复杂的人类技能,因为这些技能涉及多步骤、长时间的动作以及复杂的场景转换。为此,论文提出了一项新任务——关键步骤技能生成(Key-step Skill Generation, KS-Gen),旨在简化生成人类技能视频的复杂性。解决方案的关键在于引入了一个新的框架,首先利用多模态大语言模型(Multimodal Large Language Model, MLLM)生成关键步骤的描述,然后使用关键步骤图像生成器(Key-step Image Generator, KIG)解决技能视频中关键步骤间的不连续性问题,最后通过视频生成模型结合这些描述和关键步骤图像生成具有高时间一致性的关键步骤视频片段。

链接: https://arxiv.org/abs/2502.08234
作者: Yilu Wu,Chenhui Zhu,Shuai Wang,Hanlin Wang,Jing Wang,Zhaoxiang Zhang,Limin Wang
机构: State Key Laboratory for Novel Software Technology, Nanjing University(软件新技术国家重点实验室,南京大学); State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA(多模态人工智能系统国家重点实验室,中科院自动化研究所); Shanghai AI Lab(上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We are committed to learning human skill generators at key-step levels. The generation of skills is a challenging endeavor, but its successful implementation could greatly facilitate human skill learning and provide more experience for embodied intelligence. Although current video generation models can synthesize simple and atomic human operations, they struggle with human skills due to their complex procedural process. Human skills involve multi-step, long-duration actions and complex scene transitions, so the existing naive auto-regressive methods for synthesizing long videos cannot generate human skills. To address this, we propose a novel task, the Key-step Skill Generation (KS-Gen), aimed at reducing the complexity of generating human skill videos. Given the initial state and a skill description, the task is to generate video clips of key steps to complete the skill, rather than a full-length video. To support this task, we introduce a carefully curated dataset and define multiple evaluation metrics to assess performance. Considering the complexity of KS-Gen, we propose a new framework for this task. First, a multimodal large language model (MLLM) generates descriptions for key steps using retrieval augmentation. Subsequently, we use a Key-step Image Generator (KIG) to address the discontinuity between key steps in skill videos. Finally, a video generation model uses these descriptions and key-step images to generate video clips of the key steps with high temporal consistency. We offer a detailed analysis of the results, hoping to provide more insights on human skill generation. All models and data are available at this https URL.
zh

[CV-31] Plantation Monitoring Using Drone Images: A Dataset and Performance Review

【速读】:该论文旨在解决自动监测树种植被健康状态的问题,特别是在发展中国家如印度的小型农民群体中。解决方案的关键在于使用无人机图像代替传统的卫星图像,以提供更易于获取且成本更低的数据源,并引入深度卷积操作来增强卷积神经网络(CNN)模型在无人机图像数据集上的性能。

链接: https://arxiv.org/abs/2502.08233
作者: Yashwanth Karumanchi,Gudala Laxmi Prasanna,Snehasis Mukherjee,Nagesh Kolagani
机构: University of Utah (犹他大学); WASSAN; Shiv Nadar Institution of Eminence (Shiv Nadar杰出大学); Centurion University of Technology and Management (Centurion科技大学与管理学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automatic monitoring of tree plantations plays a crucial role in agriculture. Flawless monitoring of tree health helps farmers make informed decisions regarding their management by taking appropriate action. Use of drone images for automatic plantation monitoring can enhance the accuracy of the monitoring process, while still being affordable to small farmers in developing countries such as India. Small, low cost drones equipped with an RGB camera can capture high-resolution images of agricultural fields, allowing for detailed analysis of the well-being of the plantations. Existing methods of automated plantation monitoring are mostly based on satellite images, which are difficult to get for the farmers. We propose an automated system for plantation health monitoring using drone images, which are becoming easier to get for the farmers. We propose a dataset of images of trees with three categories: "Good health", "Stunted", and "Dead". We annotate the dataset using the CVAT annotation tool, for use in research purposes. We experiment with different well-known CNN models to observe their performance on the proposed dataset. The initial low accuracy levels show the complexity of the proposed dataset. Further, our study revealed that a depth-wise convolution operation embedded in a deep CNN model can enhance the performance of the model on the drone dataset. Further, we apply state-of-the-art object detection models to identify individual trees to better monitor them automatically.
zh

[CV-32] TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents ICML2025

【速读】:该论文旨在解决现有大型视觉语言模型(Large Vision Language Models, LVLMs)在基于图形用户界面(GUI)代理任务中的两大主要问题:跨数据集和跨平台泛化能力不足,以及缺乏全面的GUI理解。解决方案的关键在于提出了一种名为TRISHUL的新框架,它是一种无需训练的代理框架,通过引入分层屏幕解析(Hierarchical Screen Parsing, HSP)和空间增强元素描述(Spatially Enhanced Element Description, SEED)模块,实现了动作定位(action grounding)和GUI引用(GUI referring)的无缝集成。这些模块协同工作,提供多粒度、空间和语义丰富的GUI元素表示,从而显著提升了LVLMs在全面GUI理解方面的性能。

链接: https://arxiv.org/abs/2502.08226
作者: Kunal Singh,Shreyas Singh,Mukund Khanna
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review at ICML 2025, 8 pages, 5 figures

点击查看摘要

Abstract:Recent advancements in Large Vision Language Models (LVLMs) have enabled the development of LVLM-based Graphical User Interface (GUI) agents under various paradigms. Training-based approaches, such as CogAgent and SeeClick, struggle with cross-dataset and cross-platform generalization due to their reliance on dataset-specific training. Generalist LVLMs, such as GPT-4V, employ Set-of-Marks (SoM) for action grounding, but obtaining SoM labels requires metadata like HTML source, which is not consistently available across platforms. Moreover, existing methods often specialize in singular GUI tasks rather than achieving comprehensive GUI understanding. To address these limitations, we introduce TRISHUL, a novel, training-free agentic framework that enhances generalist LVLMs for holistic GUI comprehension. Unlike prior works that focus on either action grounding (mapping instructions to GUI elements) or GUI referring (describing GUI elements given a location), TRISHUL seamlessly integrates both. At its core, TRISHUL employs Hierarchical Screen Parsing (HSP) and the Spatially Enhanced Element Description (SEED) module, which work synergistically to provide multi-granular, spatially, and semantically enriched representations of GUI elements. Our results demonstrate TRISHUL’s superior performance in action grounding across the ScreenSpot, VisualWebBench, AITW, and Mind2Web datasets. Additionally, for GUI referring, TRISHUL surpasses the ToL agent on the ScreenPR benchmark, setting a new standard for robust and adaptable GUI comprehension.
zh

[CV-33] Take What You Need: Flexible Multi-Task Semantic Communications with Channel Adaptation

【速读】:该论文旨在解决高效语义通信系统的需求,以管理多样化的任务并适应波动的信道条件。解决方案的关键在于提出了一种基于掩码自编码器架构的新型信道自适应和多任务感知语义通信框架。此框架通过引入一个多任务感知评分机制来优化有意义信息的传输,并利用一个信道感知提取器动态选择相关信息以应对实时信道状况,从而实现语义相关性和传输效率的联合优化。

链接: https://arxiv.org/abs/2502.08221
作者: Xiang Chen,Shuying Gan,Chenyuan Feng,Xijun Wang,Tony Q. S. Quek
机构: Sun Yat-sen University (中山大学); EURECOM; Singapore University of Technology and Design (新加坡科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:The growing demand for efficient semantic communication systems capable of managing diverse tasks and adapting to fluctuating channel conditions has driven the development of robust, resource-efficient frameworks. This article introduces a novel channel-adaptive and multi-task-aware semantic communication framework based on a masked auto-encoder architecture. Our framework optimizes the transmission of meaningful information by incorporating a multi-task-aware scoring mechanism that identifies and prioritizes semantically significant data across multiple concurrent tasks. A channel-aware extractor is employed to dynamically select relevant information in response to real-time channel conditions. By jointly optimizing semantic relevance and transmission efficiency, the framework ensures minimal performance degradation under resource constraints. Experimental results demonstrate the superior performance of our framework compared to conventional methods in tasks such as image reconstruction and object detection. These results underscore the framework’s adaptability to heterogeneous channel environments and its scalability for multi-task applications, positioning it as a promising solution for next-generation semantic communication networks.
zh
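框架中“多任务感知打分 + 信道感知选择”的配合,可以这样理解:先跨任务聚合每个补丁的重要性分数,再按实时 SNR 决定保留比例,只传输最关键的语义单元。以下为按此思想给出的假设性简化实现,打分聚合方式与保留比例的具体函数形式均为示意,并非论文原始设计。

```python
import numpy as np

def select_patches(features, task_scores, snr_db, base_keep=0.5):
    # task_scores: (num_tasks, num_patches), 各并发任务对补丁的重要性
    importance = task_scores.max(axis=0)                   # 多任务感知: 任一任务需要即重要
    keep = base_keep * min(1.0, max(snr_db, 0.0) / 20.0)   # 信道越差, 保留比例越小
    k = max(1, int(round(len(importance) * keep)))
    idx = np.sort(np.argsort(importance)[-k:])             # 选最重要的 k 个补丁
    return features[idx], idx

feats = np.arange(16, dtype=float).reshape(8, 2)           # 8 个补丁的特征
scores = np.array([[0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6],
                   [0.2, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]])
good, _ = select_patches(feats, scores, snr_db=20.0)       # 好信道: 保留 4 个补丁
bad, _ = select_patches(feats, scores, snr_db=5.0)         # 差信道: 仅保留 1 个
```

这种“按需取用”的选择逻辑即标题 Take What You Need 的含义:带宽紧张时只传对下游任务最关键的语义信息。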

[CV-34] Deepfake Detection with Spatio-Temporal Consistency and Attention

【速读】:该论文旨在解决深度伪造视频检测中依赖全局帧特征且未能充分利用空间-时间不一致性和忽略细微局部变化的问题。关键在于提出了一种神经网络深度伪造检测器,该检测器专注于单帧及帧序列层面的局部篡改特征。通过使用ResNet主干网络,结合空间注意力机制增强浅层帧级特征学习,并融合纹理增强的浅层特征与深层特征以进一步提升空间流的效果。同时,采用距离注意力机制处理帧序列,允许在较深的层中融合时间注意力图与已学习特征。这种整体模型作为分类器训练,以识别伪造内容。

链接: https://arxiv.org/abs/2502.08216
作者: Yunzhuo Chen,Naveed Akhtar,Nur Al Hasan Haldar,Ajmal Mian
机构: The University of Western Australia (西澳大利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deepfake videos are causing growing concerns among communities due to their ever-increasing realism. Naturally, automated detection of forged Deepfake videos is attracting a proportional amount of interest of researchers. Current methods for detecting forged videos mainly rely on global frame features and under-utilize the spatio-temporal inconsistencies found in the manipulated videos. Moreover, they fail to attend to manipulation-specific subtle and well-localized pattern variations along both spatial and temporal dimensions. Addressing these gaps, we propose a neural Deepfake detector that focuses on the localized manipulative signatures of the forged videos at individual frame level as well as frame sequence level. Using a ResNet backbone, it strengthens the shallow frame-level feature learning with a spatial attention mechanism. The spatial stream of the model is further helped by fusing texture enhanced shallow features with the deeper features. Simultaneously, the model processes frame sequences with a distance attention mechanism that further allows fusion of temporal attention maps with the learned features at the deeper layers. The overall model is trained to detect forged content as a classifier. We evaluate our method on two popular large data sets and achieve significant performance gains over the state-of-the-art methods. Moreover, our technique also provides memory and computational advantages over the competitive techniques.
zh

[CV-35] ActiveSSF: An Active-Learning-Guided Self-Supervised Framework for Long-Tailed Megakaryocyte Classification

【速读】:该论文旨在解决在诊断骨髓增生异常综合征时,精确分类巨核细胞所面临的三个主要挑战:(1) 背景噪声干扰细胞细节,(2) 类别分布长尾导致罕见亚型数据不足,以及 (3) 复杂的形态变异引起类内变异性高的问题。为了解决这些问题,论文提出了一种名为 ActiveSSF 的框架,该框架集成了主动学习与自监督预训练。关键解决方案在于其采用高斯滤波结合 K-means 聚类和 HSV 分析(结合临床先验知识)进行感兴趣区域的精确提取;动态调整相似性阈值以缓解类别不平衡的自适应样本选择机制;以及针对标记样本的原型聚类以克服形态复杂性。

链接: https://arxiv.org/abs/2502.08200
作者: Linghao Zhuang,Ying Zhang,Gege Yuan,Xingyue Zhao,Zhiping Jiang
机构: School of Software Engineering, Xinjiang University (新疆大学), China; School of Software Engineering, Xi’an Jiaotong University (西安交通大学), China; Department of Hematology, Xiangya Hospital, Central South University (中南大学湘雅医院), China; National Clinical Research Center for Geriatric Diseases, Xiangya Hospital, Central South University (中南大学湘雅医院), China; Hunan Hematology Oncology Clinical Medical Research Center (湖南血液肿瘤临床医学研究中心), China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, submitted to EMBC 2025

点击查看摘要

Abstract:Precise classification of megakaryocytes is crucial for diagnosing myelodysplastic syndromes. Although self-supervised learning has shown promise in medical image analysis, its application to classifying megakaryocytes in stained slides faces three main challenges: (1) pervasive background noise that obscures cellular details, (2) a long-tailed distribution that limits data for rare subtypes, and (3) complex morphological variations leading to high intra-class variability. To address these issues, we propose the ActiveSSF framework, which integrates active learning with self-supervised pretraining. Specifically, our approach employs Gaussian filtering combined with K-means clustering and HSV analysis (augmented by clinical prior knowledge) for accurate region-of-interest extraction; an adaptive sample selection mechanism that dynamically adjusts similarity thresholds to mitigate class imbalance; and prototype clustering on labeled samples to overcome morphological complexity. Experimental results on clinical megakaryocyte datasets demonstrate that ActiveSSF not only achieves state-of-the-art performance but also significantly improves recognition accuracy for rare subtypes. Moreover, the integration of these advanced techniques further underscores the practical potential of ActiveSSF in clinical settings. To foster further research, the code and datasets will be publicly released in the future.
zh
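其中“动态调整相似度阈值以缓解类别不平衡”的自适应样本选择,可以概括为:某类已标注样本越少,阈值越低,从而为罕见亚型吸纳更多候选样本。以下是按此思想写出的假设性示意,阈值调度的具体函数形式为笔者假设,并非论文原式。

```python
import numpy as np

def adaptive_select(sims, labeled_count, base_thr=0.9, floor=0.5):
    # labeled_count 越小, 阈值越接近 floor*base_thr, 纳入更多候选
    freq = labeled_count / (labeled_count + 10.0)
    thr = base_thr * (floor + (1.0 - floor) * freq)
    return np.where(sims >= thr)[0]

sims = np.linspace(0.4, 0.95, 12)                  # 12 个未标注样本与类原型的相似度
rare = adaptive_select(sims, labeled_count=2)      # 罕见亚型: 阈值低, 选中更多
common = adaptive_select(sims, labeled_count=200)  # 常见类: 阈值高, 选中更少
```

这样,长尾分布下尾部类不至于因固定高阈值而几乎选不到样本,与论文缓解类别不平衡的动机一致。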

[CV-36] AnyCharV: Bootstrap Controllable Character Video Generation with Fine-to-Coarse Guidance

【速读】:该论文旨在解决在生成指定角色高质量视频过程中灵活性不足的问题,尤其在将源角色融合到目标场景时面临挑战。关键解决方案在于提出AnyCharV框架,通过两阶段训练过程,利用姿态信息灵活生成包含任意源角色和目标场景的视频。第一阶段开发基础模型实现源角色与目标场景的融合,第二阶段通过自增强机制进一步提升可控性,从而更好地保留角色细节。

链接: https://arxiv.org/abs/2502.08189
作者: Zhao Wang,Hao Wen,Lingting Zhu,Chenming Shang,Yujiu Yang,Qi Dou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 9 figures, 4 tables

点击查看摘要

Abstract:Character video generation is a significant real-world application focused on producing high-quality videos featuring specific characters. Recent advancements have introduced various control signals to animate static characters, successfully enhancing control over the generation process. However, these methods often lack flexibility, limiting their applicability and making it challenging for users to synthesize a source character into a desired target scene. To address this issue, we propose a novel framework, AnyCharV, that flexibly generates character videos using arbitrary source characters and target scenes, guided by pose information. Our approach involves a two-stage training process. In the first stage, we develop a base model capable of integrating the source character with the target scene using pose guidance. The second stage further bootstraps controllable generation through a self-boosting mechanism, where we use the generated video in the first stage and replace the fine mask with the coarse one, enabling training outcomes with better preservation of character details. Experimental results demonstrate the effectiveness and robustness of our proposed method. Our project page is this https URL.
zh

[CV-37] Latest Advancements Towards Catastrophic Forgetting under Data Scarcity: A Comprehensive Survey on Few-Shot Class Incremental Learning

【速读】:该论文旨在解决数据稀缺条件下持续学习(Continual Learning)的问题,特别是在动态环境中仅有少量样本时,深度神经网络如何进行有效的学习。论文的关键解决方案在于全面调研少样本类别增量学习(Few-Shot Class Incremental Learning, FSCIL)方法,强调了几个重要方面:FSCIL方法的综合与正式目标、原型修正的重要性、基于预训练模型和语言引导机制的新学习范式、FSCIL性能指标及评估的深入分析,以及FSCIL在各个实际应用领域的实践背景。通过这些方面的探讨,论文提出了开放性挑战、潜在解决方案和未来的研究方向。

链接: https://arxiv.org/abs/2502.08181
作者: M. Anwar Ma’sum,Mahardhika Pratama,Igor Skrjanc
机构: University of South Australia (南澳大学); University of Ljubljana (卢布尔雅那大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data scarcity significantly complicates the continual learning problem, i.e., how a deep neural network learns in dynamic environments with very few samples. However, the latest progress of few-shot class incremental learning (FSCIL) methods and related studies show insightful knowledge on how to tackle the problem. This paper presents a comprehensive survey on FSCIL that highlights several important aspects i.e. comprehensive and formal objectives of FSCIL approaches, the importance of prototype rectifications, the new learning paradigms based on pre-trained model and language-guided mechanism, the deeper analysis of FSCIL performance metrics and evaluation, and the practical contexts of FSCIL in various areas. Our extensive discussion presents the open challenges, potential solutions, and future directions of FSCIL.
zh

[CV-38] CoDynTrust: Robust Asynchronous Collaborative Perception via Dynamic Feature Trust Modulus

【速读】:该论文旨在解决协作感知在实际环境中因通信延迟、时钟不同步或采样配置差异导致的时间异步性所引发的信息不匹配问题。论文的关键解决方案是提出CoDynTrust框架,它通过编码不确定性来处理时间异步性引起的信息不匹配。具体而言,CoDynTrust生成动态特征信任模量(Dynamic Feature Trust Modulus, DFTM),以建模认知不确定性和随机不确定性,并有选择地抑制或保留单车特征,从而缓解信息不匹配。此外,设计了一个多尺度融合模块来处理由DFTM处理的多尺度特征图。这一方法能够对抗低质量信息的影响,并允许不确定性传递到下游任务如规划和控制中。

链接: https://arxiv.org/abs/2502.08169
作者: Yunjiang Xu,Lingzhi Li,Jin Wang,Benyuan Yang,Zhiwen Wu,Xinhong Chen,Jianping Wang
机构: School of Computer Science and Technology, Soochow University, Suzhou 215006, China(Soochow University); School of Future Science and Engineering, Soochow University, Suzhou 215299, China(Soochow University); Computer Science of Department, City University of Hong Kong(City University of Hong Kong)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 5 figures, conference

点击查看摘要

Abstract:Collaborative perception, fusing information from multiple agents, can extend perception range so as to improve perception performance. However, temporal asynchrony in real-world environments, caused by communication delays, clock misalignment, or sampling configuration differences, can lead to information mismatches. If this is not well handled, then the collaborative performance is patchy, and what’s worse safety accidents may occur. To tackle this challenge, we propose CoDynTrust, an uncertainty-encoded asynchronous fusion perception framework that is robust to the information mismatches caused by temporal asynchrony. CoDynTrust generates dynamic feature trust modulus (DFTM) for each region of interest by modeling aleatoric and epistemic uncertainty as well as selectively suppressing or retaining single-vehicle features, thereby mitigating information mismatches. We then design a multi-scale fusion module to handle multi-scale feature maps processed by DFTM. Compared to existing works that also consider asynchronous collaborative perception, CoDynTrust combats various low-quality information in temporally asynchronous scenarios and allows uncertainty to be propagated to downstream tasks such as planning and control. Experimental results demonstrate that CoDynTrust significantly reduces performance degradation caused by temporal asynchrony across multiple datasets, achieving state-of-the-art detection performance even with temporal asynchrony. The code is available at this https URL.
zh
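DFTM 的核心作用可以理解为:对每个感兴趣区域,把认知不确定性与随机不确定性折算成一个“信任值”,再据此加权融合(或抑制)各车特征。以下是这一思想的假设性简化实现,信任函数取指数衰减形式仅为示意,并非论文原始定义。

```python
import numpy as np

def dftm_fuse(feats, alea, epis, tau=1.0):
    # feats: (num_agents, num_regions, dim); alea/epis: (num_agents, num_regions)
    trust = np.exp(-(alea + epis) / tau)           # 不确定性越大, 信任越低
    w = trust / trust.sum(axis=0, keepdims=True)   # 跨车归一化为融合权重
    return (w[..., None] * feats).sum(axis=0)      # 信任加权融合

feats = np.stack([np.ones((4, 8)), np.zeros((4, 8))])   # 两辆车的 4 个区域特征
alea = np.array([[0.1] * 4, [5.0] * 4])                 # 第二辆车延迟大, 随机不确定性高
epis = np.zeros((2, 4))
fused = dftm_fuse(feats, alea, epis)                    # 融合结果主要保留第一辆车的特征
```

时间异步导致的低质量信息因此被自动降权,而信任值本身也可作为不确定性传递给下游的规划与控制模块。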

[CV-39] DNNs May Determine Major Properties of Their Outputs Early with Timing Possibly Driven by Bias

【速读】:该论文旨在探讨深度神经网络(DNNs)在推理早期阶段即主要确定输出结果的现象,并强调模型固有偏见在此过程中的关键作用。通过类比人类决策中的快速直观启发式方法,论文以扩散模型(DMs)为例,展示了DNNs的决策过程如何受到设计和训练过程中偏见类型及程度的影响。论文的关键在于揭示DNNs决策过程的时间动态特性,从而为偏见缓解、高效推理以及机器学习系统的解释提供新的视角。

链接: https://arxiv.org/abs/2502.08167
作者: Song Park,Sanghyuk Chun,Byeongho Heo,Dongyoon Han
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: First two authors contributed equally

点击查看摘要

Abstract:This paper argues that deep neural networks (DNNs) mostly determine their outputs during the early stages of inference, where biases inherent in the model play a crucial role in shaping this process. We draw a parallel between this phenomenon and human decision-making, which often relies on fast, intuitive heuristics. Using diffusion models (DMs) as a case study, we demonstrate that DNNs often make early-stage decision-making influenced by the type and extent of bias in their design and training. Our findings offer a new perspective on bias mitigation, efficient inference, and the interpretation of machine learning systems. By identifying the temporal dynamics of decision-making in DNNs, this paper aims to inspire further discussion and research within the machine learning community.
zh

[CV-40] Force Matching with Relativistic Constraints: A Physics-Inspired Approach to Stable and Efficient Generative Modeling

【速读】:该论文旨在解决高维生成模型中的稳定性问题,提出了一种名为Force Matching (ForM)的新框架。ForM的关键在于引入了洛伦兹因子(Lorentz factor),通过施加速度约束确保样本速度保持在常数限制内。这一机制有效地稳定了生成动力学过程,从而实现了更稳健和可控的采样。实验结果表明,在半月形(half-moons)数据集上,ForM显著优于基线方法,并且通过消融研究进一步验证了速度约束的有效性。

链接: https://arxiv.org/abs/2502.08150
作者: Yang Cao,Bo Chen,Xiaoyu Li,Yingyu Liang,Zhizhou Sha,Zhenmei Shi,Zhao Song,Mingda Wan
机构: Wyoming Seminary; Middle Tennessee State University (田纳西州中部大学); Stevens Institute of Technology (史蒂文斯理工学院); The University of Hong Kong (香港大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Tsinghua University (清华大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Simons Institute for the Theory of Computing, University of California, Berkeley (伯克利加州大学西蒙斯计算理论研究所); Anhui University (安徽大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper introduces Force Matching (ForM), a novel framework for generative modeling that represents an initial exploration into leveraging special relativistic mechanics to enhance the stability of the sampling process. By incorporating the Lorentz factor, ForM imposes a velocity constraint, ensuring that sample velocities remain bounded within a constant limit. This constraint serves as a fundamental mechanism for stabilizing the generative dynamics, leading to a more robust and controlled sampling process. We provide a rigorous theoretical analysis demonstrating that the velocity constraint is preserved throughout the sampling procedure within the ForM framework. To validate the effectiveness of our approach, we conduct extensive empirical evaluations. On the \textithalf-moons dataset, ForM significantly outperforms baseline methods, achieving the lowest Euclidean distance loss of \textbf0.714, in contrast to vanilla first-order flow matching (5.853) and first- and second-order flow matching (5.793). Additionally, we perform an ablation study to further investigate the impact of our velocity constraint, reaffirming the superiority of ForM in stabilizing the generative process. The theoretical guarantees and empirical results underscore the potential of integrating special relativity principles into generative modeling. Our findings suggest that ForM provides a promising pathway toward achieving stable, efficient, and flexible generative processes. This work lays the foundation for future advancements in high-dimensional generative modeling, opening new avenues for the application of physical principles in machine learning.
zh

[CV-41] Generalized Class Discovery in Instance Segmentation AAAI2025

【速读】:该论文旨在解决广义类别发现(Generalized Class Discovery, GCD)在实例分割中的任务,目标是在已标注和未标注数据下,发现新的类别并获得能够对已知和新发现类别实例进行分割的模型。为了解决类别分布不平衡的问题,论文提出了一种实例温度分配(Instance-wise Temperature Assignment, ITA)方法用于对比学习,并引入了类别可靠度标准来评估伪标签。ITA方法通过放宽头部类别样本的实例区分度来增强GCD,而可靠度标准则避免在训练实例分割网络时排除尾部类别的大部分伪标签。此外,论文还提出动态调整这些标准以利用早期阶段的多样化样本,并在后期仅依赖可靠的伪标签。最后,论文通过在两个设置上进行实验验证了所提方法的有效性,结果表明其性能优于现有最先进方法。

链接: https://arxiv.org/abs/2502.08149
作者: Cuong Manh Hoang,Yeejin Lee,Byeongkeun Kang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: AAAI 2025

点击查看摘要

Abstract:This work addresses the task of generalized class discovery (GCD) in instance segmentation. The goal is to discover novel classes and obtain a model capable of segmenting instances of both known and novel categories, given labeled and unlabeled data. Since the real world contains numerous objects with long-tailed distributions, the instance distribution for each class is inherently imbalanced. To address the imbalanced distributions, we propose an instance-wise temperature assignment (ITA) method for contrastive learning and class-wise reliability criteria for pseudo-labels. The ITA method relaxes instance discrimination for samples belonging to head classes to enhance GCD. The reliability criteria are to avoid excluding most pseudo-labels for tail classes when training an instance segmentation network using pseudo-labels from GCD. Additionally, we propose dynamically adjusting the criteria to leverage diverse samples in the early stages while relying only on reliable pseudo-labels in the later stages. We also introduce an efficient soft attention module to encode object-specific representations for GCD. Finally, we evaluate our proposed method by conducting experiments on two settings: COCO_half + LVIS and LVIS + Visual Genome. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art methods.
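ITA 的核心思想可以用带实例级温度的 InfoNCE 损失来直观理解:温度越高,logits 越平滑,对该样本的实例判别约束越宽松(论文即对头部类样本放松判别)。以下是一个假设性的单锚点示意,相似度数值与温度取值均为演示用,并非论文原始实现:

```python
import math

def info_nce_per_instance(sim_pos, sim_negs, temperature):
    """单锚点 InfoNCE 损失,温度按实例指定:
    温度越大 logits 越平滑,实例判别越"软"(对应 ITA 给头部类的放松);
    温度越小判别越尖锐。"""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)                                    # log-sum-exp 数值稳定
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(sim_pos / temperature - log_denom)

# 相同的相似度,不同的实例温度
loss_head = info_nce_per_instance(0.8, [0.5, 0.3], temperature=0.5)  # 头部类:高温放松
loss_tail = info_nce_per_instance(0.8, [0.5, 0.3], temperature=0.1)  # 尾部类:低温尖锐
```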
zh

[CV-42] Riemannian Complex Hermit Positive Definite Convolution Network for Polarimetric SAR Image Classification

【速读】:该论文旨在解决PolSAR图像处理中深度学习方法在欧几里得空间下无法有效保留复协方差矩阵几何结构的问题。论文的关键解决方案在于提出了一种Riemannian复HPD卷积网络(HPD_CNN),通过引入复HPD展开网络(HPDnet)和增强的CV-3DCNN网络,定义了HPD映射、校正及对数特征层来直接学习复HPD矩阵的几何特征,并设计了快速特征值分解方法以减少计算负担。此外,还定义了一个从黎曼空间到欧几里得空间的增强网络,用于增强上下文信息以提高分类性能。

链接: https://arxiv.org/abs/2502.08137
作者: Junfei Shi,Mengmeng Nie,Yuke Li,Haiyan Jin,Weisi Lin
机构: Department of Computer Science and Technology, Shaanxi Key Laboratory for Network Computing and Security Technology, Xi’an University of Technology, Xi’an, China; College of Computing and Data Science, Nanyang Technological University, Singapore, 639798
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:Deep learning can effectively learn high-level semantic features in Euclidean space for PolSAR images, but these methods need to convert the complex covariance matrix into a feature vector or complex-valued vector as the network input. However, complex covariance matrices are essentially complex Hermitian positive definite (HPD) matrices that live on a Riemannian manifold rather than in Euclidean space. The matrix's real and imaginary parts are equally significant, as the imaginary part represents the phase information. Vectorizing the matrix destroys the geometric structure and manifold characteristics of complex covariance matrices. To learn complex HPD matrices directly, we propose a Riemannian complex HPD convolution network (HPD_CNN) for PolSAR images. This method consists of a complex HPD unfolding network (HPDnet) and a CV-3DCNN enhanced network. The proposed complex HPDnet defines HPD mapping, rectifying and logEig layers to learn geometric features of complex matrices. In addition, a fast eigenvalue decomposition method is designed to reduce the computation burden. Finally, a Riemannian-to-Euclidean enhanced network is defined to enhance contextual information for classification. Experimental results on two real PolSAR datasets demonstrate that the proposed method achieves superior performance compared to state-of-the-art methods, especially in heterogeneous regions.
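logEig 层的做法是对(复)HPD 矩阵做特征分解,对特征值取对数后重构,即 log(M) = V diag(log λ) V^T,从而把流形上的点映射到切空间。下面用 2×2 实对称正定矩阵给出一个闭式示意(论文作用于复 Hermitian 矩阵,原理相同;此处为便于演示做了实数化简化):

```python
import math

def logeig_2x2(a, b, d):
    """对 2x2 实对称正定矩阵 [[a, b], [b, d]] 求矩阵对数:
    特征分解后对特征值取 log 再重构,log(M) = V diag(log lambda) V^T。
    这是 logEig 层思想的实数简化版。"""
    tr, det = a + d, a * d - b * b
    gap = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    l1, l2 = (tr + gap) / 2.0, (tr - gap) / 2.0
    if abs(b) < 1e-12:                       # 已是对角阵
        return [[math.log(a), 0.0], [0.0, math.log(d)]]
    v1, v2 = (l1 - d, b), (l2 - d, b)        # 各特征值对应的特征向量
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    v1, v2 = (v1[0] / n1, v1[1] / n1), (v2[0] / n2, v2[1] / n2)
    L1, L2 = math.log(l1), math.log(l2)
    return [[L1 * v1[0] ** 2 + L2 * v2[0] ** 2,
             L1 * v1[0] * v1[1] + L2 * v2[0] * v2[1]],
            [L1 * v1[0] * v1[1] + L2 * v2[0] * v2[1],
             L1 * v1[1] ** 2 + L2 * v2[1] ** 2]]

log_m = logeig_2x2(2.0, 0.0, 3.0)   # 对角阵:结果为 [[log 2, 0], [0, log 3]]
```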
zh

[CV-43] A Survey on Data Curation for Visual Contrastive Learning: Why Crafting Effective Positive and Negative Pairs Matters

【速读】:该论文旨在解决视觉对比学习中正样本(Positive)和负样本(Negative)配对设计对表征质量、训练效率和计算成本的影响问题。关键在于提出了一种分类方法,详细描述了现有技术如何优化正负样本配对的选择,从而提升表征质量和加速收敛速度。

链接: https://arxiv.org/abs/2502.08134
作者: Shasvat Desai,Debasmita Ghose,Deep Chakraborty
机构: Yale University; University of Massachusetts Amherst
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:Visual contrastive learning aims to learn representations by contrasting similar (positive) and dissimilar (negative) pairs of data samples. The design of these pairs significantly impacts representation quality, training efficiency, and computational cost. A well-curated set of pairs leads to stronger representations and faster convergence. As contrastive pre-training sees wider adoption for solving downstream tasks, data curation becomes essential for optimizing its effectiveness. In this survey, we attempt to create a taxonomy of existing techniques for positive and negative pair curation in contrastive learning, and describe them in detail.
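综述所讨论的"正负样本构造"最朴素的形式,可以用"自身增强视图作正样本、同批其余样本作负样本"来示意。以下代码仅为说明这一最简策略(增强函数为假设的标量扰动),具体方法以综述的分类为准:

```python
import random

def curate_pairs(batch, augment):
    """最朴素的配对构造:每个样本以自身的增强视图为正样本,
    以同一批次内的其余样本为负样本(即最简单的 in-batch negatives)。"""
    pairs = []
    for i, x in enumerate(batch):
        positive = augment(x)
        negatives = [y for j, y in enumerate(batch) if j != i]
        pairs.append((x, positive, negatives))
    return pairs

jitter = lambda x: x + random.uniform(-0.01, 0.01)   # 假设的标量"增强"
pairs = curate_pairs([0.0, 1.0, 2.0], jitter)
```

更精细的策略(困难负样本挖掘、去假负样本等)都可以看作对这里 `positive` 与 `negatives` 两处选择的替换。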
zh

[CV-44] PoGDiff: Product-of-Gaussians Diffusion Models for Imbalanced Text-to-Image Generation

【速读】:该论文旨在解决扩散模型在训练或微调于类别不平衡数据集时性能下降的问题。这一问题主要源于图像-文本对中多数类与少数类数据的不均衡表示。论文的关键解决方案是提出了一种新的微调方法PoGDiff,该方法通过将真实标签分布替换为由原始真实目标与条件于邻近文本嵌入的预测分布相乘得到的高斯乘积(Product of Gaussians, PoG),从而替代直接最小化预测分布与真实分布之间的KL散度,有效解决了扩散模型中的类别不平衡问题,提升了生成的准确性和质量。

链接: https://arxiv.org/abs/2502.08106
作者: Ziyan Wang,Sizhe Wei,Xiaoming Huo,Hao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Diffusion models have made significant advancements in recent years. However, their performance often deteriorates when trained or fine-tuned on imbalanced datasets. This degradation is largely due to the disproportionate representation of majority and minority data in image-text pairs. In this paper, we propose a general fine-tuning approach, dubbed PoGDiff, to address this challenge. Rather than directly minimizing the KL divergence between the predicted and ground-truth distributions, PoGDiff replaces the ground-truth distribution with a Product of Gaussians (PoG), which is constructed by combining the original ground-truth targets with the predicted distribution conditioned on a neighboring text embedding. Experiments on real-world datasets demonstrate that our method effectively addresses the imbalance problem in diffusion models, improving both generation accuracy and quality.
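PoGDiff 所依赖的"高斯乘积"有简单闭式:两个高斯密度相乘(归一化后)仍是高斯,精度相加,均值为精度加权平均。以下是一维情形的示意(记号为演示而设):

```python
def product_of_gaussians(mu1, var1, mu2, var2):
    """两个一维高斯密度的乘积(归一化后)仍是高斯:
    精度相加 1/var = 1/var1 + 1/var2,均值为精度加权平均。
    PoGDiff 即以这类构造把真实目标与邻域文本条件下的预测分布融合。"""
    p1, p2 = 1.0 / var1, 1.0 / var2
    var = 1.0 / (p1 + p2)
    mu = var * (p1 * mu1 + p2 * mu2)
    return mu, var

mu, var = product_of_gaussians(0.0, 1.0, 2.0, 1.0)   # 等方差:均值取中点,方差减半
```

直观上,方差更小(更"自信")的一方在融合后的均值中占更大权重。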
zh

[CV-45] ID-Cloak: Crafting Identity-Specific Cloaks Against Personalized Text-to-Image Generation

【速读】:该论文旨在解决个性化文本到图像模型引发的公民隐私问题,特别是当这些模型基于少量参考照片生成新概念图像时。由于网络上广泛共享的个人图像,现有的针对特定图像的反个性化技术在实际应用中受到限制。论文的关键解决方案是首次提出身份特定隐身衣(ID-Cloak)的概念,该方法能够保护属于特定身份的所有图像。具体而言,通过建模身份子空间来保留个人共性并学习多样化的上下文以捕捉待保护的图像分布,进而设计出引导模型远离正常输出的特定身份隐身衣,从而实现有效保护。

链接: https://arxiv.org/abs/2502.08097
作者: Qianrui Teng,Xing Cui,Xuannan Liu,Peipei Li,Zekun Li,Huaibo Huang,Ran He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Personalized text-to-image models allow users to generate images of new concepts from several reference photos, thereby leading to critical concerns regarding civil privacy. Although several anti-personalization techniques have been developed, these methods typically assume that defenders can afford to design a privacy cloak corresponding to each specific image. However, due to the extensive personal images shared online, image-specific methods are limited in real-world practical applications. To address this issue, we are the first to investigate the creation of identity-specific cloaks (ID-Cloak) that safeguard all images belonging to a specific identity. Specifically, we first model an identity subspace that preserves personal commonalities and learns diverse contexts to capture the image distribution to be protected. Then, we craft identity-specific cloaks with the proposed novel objective that encourages the cloak to guide the model away from its normal output within the subspace. Extensive experiments show that the generated universal cloak can effectively protect the images. We believe our method, along with the proposed identity-specific cloak setting, marks a notable advance in realistic privacy protection.
zh

[CV-46] MAA: Meticulous Adversarial Attack against Vision-Language Pre-trained Models

【速读】:该论文旨在解决现有对抗攻击方法在评估多模态任务视觉-语言预训练(Vision-Language Pre-trained, VLP)模型鲁棒性时存在的有限迁移性问题。具体而言,针对特定模型设计的对抗攻击往往难以有效泛化到其他不同模型,从而限制了其广泛评估模型鲁棒性的能力。这一局限主要归因于过度依赖模型特定特征和区域,尤其是在图像模态方面。为了解决这个问题,论文提出了一种名为精细对抗攻击(Meticulous Adversarial Attack, MAA)的方法。MAA的关键在于通过开发新颖的重缩放与滑动裁剪(resizing and sliding crop, RScrop)技术,以及引入多粒度相似性干扰(multi-granularity similarity disruption, MGSD)策略,实现对抗图像的细粒度优化,从而充分挖掘单个样本的模型无关特性和漏洞,增强对抗攻击的有效性和迁移性。

链接: https://arxiv.org/abs/2502.08079
作者: Peng-Fei Zhang,Guangdong Bai,Zi Huang
机构: The University of Queensland
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current adversarial attacks for evaluating the robustness of vision-language pre-trained (VLP) models in multi-modal tasks suffer from limited transferability, where attacks crafted for a specific model often struggle to generalize effectively across different models, limiting their utility in assessing robustness more broadly. This is mainly attributed to the over-reliance on model-specific features and regions, particularly in the image modality. In this paper, we propose an elegant yet highly effective method termed Meticulous Adversarial Attack (MAA) to fully exploit model-independent characteristics and vulnerabilities of individual samples, achieving enhanced generalizability and reduced model dependence. MAA emphasizes fine-grained optimization of adversarial images by developing a novel resizing and sliding crop (RScrop) technique, incorporating a multi-granularity similarity disruption (MGSD) strategy. Extensive experiments across diverse VLP models, multiple benchmark datasets, and a variety of downstream tasks demonstrate that MAA significantly enhances the effectiveness and transferability of adversarial attacks. A large cohort of performance studies is conducted to generate insights into the effectiveness of various model configurations, guiding future advancements in this domain.
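RScrop 中"滑动裁剪"的基本形态可以如下示意:按固定步长枚举图像的全部局部窗口,从而在细粒度区域上构造多样的视图(论文还结合了重缩放与多粒度相似性干扰,此处从略;代码仅为假设性说明,并非论文实现):

```python
def sliding_crops(image, crop, stride):
    """滑动裁剪:按给定步长枚举图像的全部 crop x crop 局部窗口,
    获得细粒度的区域视图。"""
    h, w = len(image), len(image[0])
    views = []
    for top in range(0, h - crop + 1, stride):
        for left in range(0, w - crop + 1, stride):
            views.append([row[left:left + crop] for row in image[top:top + crop]])
    return views

img = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4x4 玩具"图像"
views = sliding_crops(img, crop=2, stride=2)              # 4 个互不重叠的 2x2 窗口
```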
zh

[CV-47] Knowledge Swapping via Learning and Unlearning

【速读】:该论文旨在解决通过任务选择性地调节预训练模型的知识,使其能够遗忘用户指定的信息,保留关键知识,并同时获取新知识的问题。解决方案的关键在于“Learning Before Forgetting”策略,该策略基于增量学习通常从低级表示到高级语义逐步进行,而遗忘则相反,从高级语义向低级特征逐层退化这一观察结果。实验验证了该策略在图像分类、目标检测和语义分割等任务中的有效性。

链接: https://arxiv.org/abs/2502.08075
作者: Mingyu Xing,Lechao Cheng,Shenggeng Tang,Yaxiong Wang,Zhun Zhong,Meng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:We introduce Knowledge Swapping, a novel task designed to selectively regulate knowledge of a pretrained model by enabling the forgetting of user-specified information, retaining essential knowledge, and acquiring new knowledge simultaneously. By delving into the analysis of the knock-on feature hierarchy, we find that incremental learning typically progresses from low-level representations to higher-level semantics, whereas forgetting tends to occur in the opposite direction, starting from high-level semantics and moving down to low-level features. Building upon this, we propose to benchmark the knowledge swapping task with the strategy of Learning Before Forgetting. Comprehensive experiments on various tasks like image classification, object detection, and semantic segmentation validate the effectiveness of the proposed strategy. The source code is available at this https URL.
zh

[CV-48] From Brainwaves to Brain Scans: A Robust Neural Network for EEG-to-fMRI Synthesis

【速读】:该论文旨在解决功能性磁共振成像(fMRI)高成本和基础设施需求与脑电图(EEG)空间分辨率不足之间的差距。解决方案的关键在于引入E2fNet模型,这是一种能够从低成本EEG数据中合成高精度fMRI图像的深度学习方法,通过捕捉和转换EEG数据中的有意义特征来实现精确的神经定位。

链接: https://arxiv.org/abs/2502.08025
作者: Kristofer Grover Roos,Quan Huu Cap,Atsushi Fukuda
机构: Aillis, Inc.(艾丽莉斯公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While functional magnetic resonance imaging (fMRI) offers rich spatial resolution, it is limited by high operational costs and significant infrastructural demands. In contrast, electroencephalography (EEG) provides millisecond-level precision in capturing electrical activity but lacks the spatial resolution necessary for precise neural localization. To bridge these gaps, we introduce E2fNet, a simple yet effective deep learning model for synthesizing fMRI images from low-cost EEG data. E2fNet is specifically designed to capture and translate meaningful features from EEG across electrode channels into accurate fMRI representations. Extensive evaluations across three datasets demonstrate that E2fNet consistently outperforms existing methods, achieving state-of-the-art results in terms of the structural similarity index measure (SSIM). Our findings suggest that E2fNet is a promising, cost-effective solution for enhancing neuroimaging capabilities. The code is available at this https URL.
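论文以结构相似性指标 SSIM 作为主要评价标准。单窗口(全局)形式的 SSIM 为 ((2 mu_x mu_y + C1)(2 cov_xy + C2)) / ((mu_x^2 + mu_y^2 + C1)(var_x + var_y + C2));实际图像评测通常在滑动窗口上计算后取平均,下面的纯 Python 版本仅示意公式本身(常数 C1、C2 为假设取值):

```python
def ssim(x, y, c1=1e-4, c2=9e-4):
    """单窗口(全局)形式的 SSIM。实际评测通常按滑动窗口计算后取平均;
    常数 c1、c2 为演示用的假设取值。"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx * mx + my * my + c1) * (vx + vy + c2))

identical = ssim([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])   # 完全相同的信号,SSIM = 1
```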
zh

[CV-49] Towards Training One-Step Diffusion Models Without Distillation

【速读】:该论文旨在探究是否可以不经由传统的两阶段过程(先训练教师扩散模型,再将其蒸馏至单步学生模型)直接训练单步生成式模型 (One-step Generative Models)。研究的关键在于两个方面:首先,论文表明教师模型的评分函数 (score function) 并非必要,提出了一系列无需评分估计即可获得竞争性结果的蒸馏方法;其次,论文揭示了从教师模型权重初始化对于成功训练的重要性,并发现这一优势主要归因于所学习的特征表示 (feature representations),而非“输入-输出”映射的改进。这些发现增进了我们对初始权重在单步模型训练中的作用及其对蒸馏质量影响的理解。

链接: https://arxiv.org/abs/2502.08005
作者: Mingtian Zhang,Jiajun He,Wenlin Chen,Zijing Ou,José Miguel Hernández-Lobato,Bernhard Schölkopf,David Barber
机构: University College London(伦敦大学学院); University of Cambridge(剑桥大学); Imperial College London(帝国理工学院); MPI for Intelligent Systems, Tübingen(马克斯·普朗克智能系统研究所,蒂宾根)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, Technical Report

点击查看摘要

Abstract:Recent advances in one-step generative models typically follow a two-stage process: first training a teacher diffusion model and then distilling it into a one-step student model. This distillation process traditionally relies on both the teacher model's score function to compute the distillation loss and its weights for student initialization. In this paper, we explore whether one-step generative models can be trained directly without this distillation process. First, we show that the teacher's score function is not essential and propose a family of distillation methods that achieve competitive results without relying on score estimation. Next, we demonstrate that initialization from teacher weights is indispensable in successful training. Surprisingly, we find that this benefit is not due to improved "input-output" mapping but rather the learned feature representations, which dominate distillation quality. Our findings provide a better understanding of the role of initialization in one-step model training and its impact on distillation quality.
zh

[CV-50] Joint Modelling Histology and Molecular Markers for Cancer Classification

【速读】:该论文旨在解决癌症分类中分子标记与组织学特征联合预测及交互建模的问题。为应对这一挑战,关键在于引入了一种新型数字病理方法,通过多尺度解耦模块提取从高倍率(细胞水平)到低倍率(组织水平)全切片图像的多尺度特征,并基于这些特征采用基于注意力的分层多任务多实例学习框架同时预测组织学和分子标记。此外,还提出了一种基于共现概率的标签相关性图网络来建模分子标记的共现关系,并设计了一个跨模态交互模块,结合动态置信度约束损失和跨模态梯度调制策略,以模型组织学和分子标记之间的交互作用。

链接: https://arxiv.org/abs/2502.07979
作者: Xiaofei Wang,Hanyu Liu,Yupei Zhang,Boyang Zhao,Hao Duan,Wanming Hu,Yonggao Mou,Stephen Price,Chao Li
机构: Department of Clinical Neurosciences, University of Cambridge, UK(剑桥大学临床神经科学系); School of Science and Engineering, University of Dundee, UK(邓迪大学理工学院); Department of Neurosurgery, State Key Laboratory of Oncology in South China, Guangdong Provincial Clinical Research Center for Cancer, Sun Yat-sen University Cancer Center, China(中山大学肿瘤防治中心神经外科,南中国国家肿瘤学重点实验室,广东省级癌症临床研究中心); Department of Pathology, State Key Laboratory of Oncology in South China, Guangdong Provincial Clinical Research Center for Cancer, Sun Yat-sen University Cancer Center, China(中山大学肿瘤防治中心病理科,南中国国家肿瘤学重点实验室,广东省级癌症临床研究中心); Department of Applied Mathematics and Theoretical Physics, University of Cambridge, UK(剑桥大学应用数学与理论物理系); School of Medicine, University of Dundee, UK(邓迪大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by Medical Image Analysis

点击查看摘要

Abstract:Cancers are characterized by remarkable heterogeneity and diverse prognosis. Accurate cancer classification is essential for patient stratification and clinical decision-making. Although digital pathology has been advancing cancer diagnosis and prognosis, the paradigm in cancer pathology has shifted from purely relying on histology features to incorporating molecular markers. There is an urgent need for digital pathology methods to meet the needs of the new paradigm. We introduce a novel digital pathology approach to jointly predict molecular markers and histology features and model their interactions for cancer classification. Firstly, to mitigate the challenge of cross-magnification information propagation, we propose a multi-scale disentangling module, enabling the extraction of multi-scale features from high-magnification (cellular-level) to low-magnification (tissue-level) whole slide images. Further, based on the multi-scale features, we propose an attention-based hierarchical multi-task multi-instance learning framework to simultaneously predict histology and molecular markers. Moreover, we propose a co-occurrence probability-based label correlation graph network to model the co-occurrence of molecular markers. Lastly, we design a cross-modal interaction module with a dynamic confidence constraint loss and a cross-modal gradient modulation strategy, to model the interactions of histology and molecular markers. Our experiments demonstrate that our method outperforms other state-of-the-art methods in classifying glioma, histology features and molecular markers. Our method promises to promote precise oncology with the potential to advance biomedical research and clinical applications. The code is available at this https URL
zh

[CV-51] Federated Self-supervised Domain Generalization for Label-efficient Polyp Segmentation MICCAI2024

【速读】:该论文旨在解决在构建基于深度学习的结肠息肉分割模型时,处理未标记数据集所面临的隐私保护与数据共享难题。解决方案的关键在于提出了一种名为LFDG(标签高效联邦领域泛化)的方法,结合联邦学习(Federated Learning, FL)与自监督学习(Self-Supervised Learning, SSL),通过引入对抗学习的数据增强方法(SSADA)来提升数据多样性,并利用基于源重建和增强掩蔽(Source-reconstruction and Augmentation-masking, SRAM)的松弛模块来保持特征学习的稳定性。实验验证表明,LFDG方法在六个医疗中心的息肉图像上取得了比基线和其他近期FL及SSL方法更好的性能,分别提高了3.80%和3.92%。

链接: https://arxiv.org/abs/2502.07951
作者: Xinyi Tan,Jiacheng Wang,Liansheng Wang
机构: Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: Accepted at ADSMI @ MICCAI 2024

点击查看摘要

Abstract:Employing self-supervised learning (SSL) methodologies assumes paramount significance in handling unlabeled polyp datasets when building deep learning-based automatic polyp segmentation models. However, the intricate privacy dynamics surrounding medical data often preclude seamless data sharing among disparate medical centers. Federated learning (FL) emerges as a formidable solution to this privacy conundrum, yet within the realm of FL, optimizing model generalization stands as a pressing imperative. Robust generalization capabilities are imperative to ensure the model's efficacy across diverse geographical domains post-training on localized client datasets. In this paper, a Federated self-supervised Domain Generalization method is proposed to enhance the generalization capacity of federated and Label-efficient intestinal polyp segmentation, named LFDG. Based on a classical SSL method, DropPos, LFDG proposes an adversarial learning-based data augmentation method (SSADA) to enhance the data diversity. LFDG further proposes a relaxation module based on Source-reconstruction and Augmentation-masking (SRAM) to maintain stability in feature learning. We have validated LFDG on polyp images from six medical centers. Our method performs 3.80% and 3.92% better than the baseline and other recent FL and SSL methods, respectively.
zh

[CV-52] SurGrID: Controllable Surgical Simulation via Scene Graph to Image Diffusion

【速读】:该论文旨在解决现有手术模拟工具缺乏照片级真实感及固定行为模式的问题。论文的关键解决方案是引入SurGrID,这是一种将场景图(Scene Graph)转化为图像扩散模型的框架,能够通过场景图精确控制和交互生成手术场景。SurGrID利用新颖的预训练步骤将场景图中的局部和全局信息转换为中间表示,从而提高了生成图像的真实性和与输入图的连贯性。这一方法不仅提升了生成图像的保真度,还增强了模拟的真实性与可控制性。

链接: https://arxiv.org/abs/2502.07945
作者: Yannik Frisch,Ssharvien Kumar Sivakumar,Çağhan Köksal,Elsa Böhm,Felix Wagner,Adrian Gericke,Ghazal Ghazaei,Anirban Mukhopadhyay
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Surgical simulation offers a promising addition to conventional surgical training. However, available simulation tools lack photorealism and rely on hardcoded behaviour. Denoising Diffusion Models are a promising alternative for high-fidelity image synthesis, but existing state-of-the-art conditioning methods fall short in providing precise control or interactivity over the generated scenes. We introduce SurGrID, a Scene Graph to Image Diffusion Model, allowing for controllable surgical scene synthesis by leveraging Scene Graphs. These graphs encode a surgical scene's components' spatial and semantic information, which are then translated into an intermediate representation using our novel pre-training step that explicitly captures local and global information. Our proposed method improves the fidelity of generated images and their coherence with the graph input over the state-of-the-art. Further, we demonstrate the simulation's realism and controllability in a user assessment study involving clinical experts. Scene Graphs can be effectively used for precise and interactive conditioning of Denoising Diffusion Models for simulating surgical scenes, enabling high fidelity and interactive control over the generated content.
zh

[CV-53] DeepSeek on a Trip: Inducing Targeted Visual Hallucinations via Representation Vulnerabilities

【速读】:该论文旨在解决多模态大型语言模型(Multimodal Large Language Models, MLLMs)中的视觉嵌入操纵攻击问题。研究通过在DeepSeek Janus模型上实施适应性嵌入操作攻击,实现了高达98.0%的诱导目标视觉幻觉率,并保持了高视觉保真度(结构相似性SSIM为0.88)。关键解决方案在于通过系统优化图像嵌入来实现嵌入级别的安全措施,以及引入基于Llama-3.1 8B Instruct的新型多提示幻觉检测框架,以增强对这类攻击的鲁棒性评估。

链接: https://arxiv.org/abs/2502.07905
作者: Chashi Mahiul Islam,Samuel Jacob Chacko,Preston Horne,Xiuwen Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages, 4 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) represent the cutting edge of AI technology, with DeepSeek models emerging as a leading open-source alternative offering competitive performance to closed-source systems. While these models demonstrate remarkable capabilities, their vision-language integration mechanisms introduce specific vulnerabilities. We implement an adapted embedding manipulation attack on DeepSeek Janus that induces targeted visual hallucinations through systematic optimization of image embeddings. Through extensive experimentation across COCO, DALL-E 3, and SVIT datasets, we achieve hallucination rates of up to 98.0% while maintaining high visual fidelity (SSIM 0.88) of the manipulated images on open-ended questions. Our analysis demonstrates that both 1B and 7B variants of DeepSeek Janus are susceptible to these attacks, with closed-form evaluation showing consistently higher hallucination rates compared to open-ended questioning. We introduce a novel multi-prompt hallucination detection framework using LLaMA-3.1 8B Instruct for robust evaluation. The implications of these findings are particularly concerning given DeepSeek’s open-source nature and widespread deployment potential. This research emphasizes the critical need for embedding-level security measures in MLLM deployment pipelines and contributes to the broader discussion of responsible AI implementation.
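嵌入层操纵攻击的一般套路是:在扰动预算内,用梯度迭代把图像嵌入推向某个目标嵌入。以下是一个玩具级示意,其中目标函数取 0.5*||x - target||^2(梯度即 x - target),步长与 L-infinity 预算均为假设值;真实攻击作用于 MLLM 的视觉嵌入空间并依赖白盒梯度,远比此复杂:

```python
def embedding_attack(embed, target, steps=100, lr=0.1, eps=0.5):
    """玩具级的嵌入操纵:在 L-infinity 预算 eps 内,
    用梯度迭代把嵌入推向目标嵌入(仅为原理示意)。"""
    x = list(embed)
    for _ in range(steps):
        # 对 0.5*||x - target||^2 做梯度下降,梯度为 (x - target)
        x = [xi - lr * (xi - ti) for xi, ti in zip(x, target)]
        # 投影回原始嵌入周围的 eps 球,保持扰动不可察觉
        x = [min(max(xi, ei - eps), ei + eps) for xi, ei in zip(x, embed)]
    return x

adv = embedding_attack([0.0, 0.0], [1.0, -2.0])   # 每一维都被夹在 ±0.5 预算处
```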
zh

[CV-54] xtAtlas5M: A Large-scale Dataset for Dense Text Image Generation

【速读】:该论文旨在解决长文本在图像生成中的渲染挑战,现有数据集主要关注较短和简单的文本,无法充分评估大规模生成模型在长文本图像生成方面的性能。解决方案的关键在于引入TextAtlas5M数据集,该数据集包含500万张跨越多种数据类型的长文本生成和收集的图像,从而实现对大规模生成模型在长文本图像生成方面的全面评估。此外,还构建了TextAtlasEval测试集,进一步增强了基准测试的广泛性和难度。

链接: https://arxiv.org/abs/2502.07870
作者: Alex Jinpeng Wang,Dongxing Mao,Jiawei Zhang,Weiming Han,Zhuobai Dong,Linjie Li,Yiqi Lin,Zhengyuan Yang,Libo Qin,Fuwei Zhang,Lijuan Wang,Min Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 15 figures. Dataset Website: this https URL

点击查看摘要

Abstract:Text-conditioned image generation has gained significant attention in recent years, and models are processing increasingly long and comprehensive text prompts. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and visuals is essential for conveying complex information. However, despite these advances, the generation of images containing long-form text remains a persistent challenge, largely due to the limitations of existing datasets, which often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering in text-conditioned image generation. Our dataset consists of 5 million generated and collected long-text images across diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate TextAtlasEval, a 3,000-sample human-improved test set across 3 data domains, establishing one of the most extensive benchmarks for text-conditioned generation. Evaluations suggest that the TextAtlasEval benchmarks present significant challenges even for the most advanced proprietary models (e.g. GPT4o with DallE-3), while their open-source counterparts show an even larger performance gap. This evidence positions TextAtlas5M as a valuable dataset for training and evaluating future-generation text-conditioned image generation models.
zh

[CV-55] EventEgo3D: 3D Human Motion Capture from a Head-Mounted Event Camera

【速读】:该论文旨在解决单目头戴设备在低光照和快速运动条件下的人体三维动作捕捉难题,现有依赖于RGB相机的方法在这些条件下往往失效。论文的关键解决方案是引入EventEgo3D++,这是一种利用配备鱼眼镜头的单目事件相机进行人体三维动作捕捉的新方法。事件相机因其高时间分辨率,在高速场景和变化照明条件下表现优异,能够提供可靠的精确三维重建线索。EventEgo3D++通过利用事件流的LNES表示来实现精准的三维重建。此外,该研究开发了一种移动头戴设备原型,并采集了一个包含真实事件观测数据的综合数据集,涵盖受控工作室环境和野外条件,以及一个合成数据集。为了提供更全面的数据集,还包括了allocentric RGB流及其对应的SMPL身体模型,以提供不同视角。实验表明,EventEgo3D++在挑战性条件下实现了优于现有方案的三维精度和鲁棒性,并支持实时三维姿态更新,速率达到140Hz。

链接: https://arxiv.org/abs/2502.07869
作者: Christen Millerdurai,Hiroyasu Akada,Jian Wang,Diogo Luvizon,Alain Pagani,Didier Stricker,Christian Theobalt,Vladislav Golyanik
机构: DFKI(德国人工智能研究中心); MPI-INF(马克斯普朗克信息学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 20 figures, 9 tables. arXiv admin note: text overlap with arXiv:2404.08640

点击查看摘要

Abstract:Monocular egocentric 3D human motion capture remains a significant challenge, particularly under conditions of low lighting and fast movements, which are common in head-mounted device applications. Existing methods that rely on RGB cameras often fail under these conditions. To address these limitations, we introduce EventEgo3D++, the first approach that leverages a monocular event camera with a fisheye lens for 3D human motion capture. Event cameras excel in high-speed scenarios and varying illumination due to their high temporal resolution, providing reliable cues for accurate 3D human motion capture. EventEgo3D++ leverages the LNES representation of event streams to enable precise 3D reconstructions. We have also developed a mobile head-mounted device (HMD) prototype equipped with an event camera, capturing a comprehensive dataset that includes real event observations from both controlled studio environments and in-the-wild settings, in addition to a synthetic dataset. Additionally, to provide a more holistic dataset, we include allocentric RGB streams that offer different perspectives of the HMD wearer, along with their corresponding SMPL body model. Our experiments demonstrate that EventEgo3D++ achieves superior 3D accuracy and robustness compared to existing solutions, even in challenging conditions. Moreover, our method supports real-time 3D pose updates at a rate of 140Hz. This work is an extension of the EventEgo3D approach (CVPR 2024) and further advances the state of the art in egocentric 3D human motion capture. For more details, visit the project page at this https URL.
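把事件流转成 CNN 可用的稠密张量,常见做法是时间表面(time surface)类表示:每个像素、每个极性通道保留窗口内最近事件的归一化时间戳。以下仅为这类表示的通用示意;论文所用 LNES 的具体构造可能不同,此处只说明"事件流到张量"的基本思路:

```python
def event_surface(events, width, height, t0, t1):
    """把事件流 (x, y, t, polarity) 栅格化为 2 通道稠密张量:
    每个像素、每个极性通道保留窗口 [t0, t1] 内最近事件的归一化时间戳。
    这是时间表面类表示的通用示意,并非论文中 LNES 的精确定义。"""
    surface = [[[0.0] * width for _ in range(height)] for _ in range(2)]
    for x, y, t, polarity in events:
        if t0 <= t <= t1:
            ch = 1 if polarity > 0 else 0
            surface[ch][y][x] = max(surface[ch][y][x], (t - t0) / (t1 - t0))
    return surface

events = [(0, 0, 0.2, +1), (0, 0, 0.8, +1), (1, 0, 0.5, -1)]
s = event_surface(events, width=2, height=1, t0=0.0, t1=1.0)
```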
zh

[CV-56] ADMN: A Layer-Wise Adaptive Multimodal Network for Dynamic Input Noise and Compute Resources

【速读】:该论文旨在解决多模态深度学习系统在动态场景中面临的两个主要挑战:计算资源可用性的变化(如多租户和设备异质性)以及输入质量的波动(如传感器数据损坏和环境噪声)。为应对这些挑战,论文提出了一种名为ADMN(自适应深度多模态网络)的解决方案。ADMN的关键在于其能够根据计算资源约束调整所有模态的活动层总数,并根据输入模态的质量重新分配层。ADMN展示了其能够在保持与最先进网络相当准确度的同时,将浮点运算减少高达75%。

链接: https://arxiv.org/abs/2502.07862
作者: Jason Wu,Kang Yang,Lance Kaplan,Mani Srivastava
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal deep learning systems are deployed in dynamic scenarios due to the robustness afforded by multiple sensing modalities. Nevertheless, they struggle with varying compute resource availability (due to multi-tenancy, device heterogeneity, etc.) and fluctuating quality of inputs (from sensor feed corruption, environmental noise, etc.). Current multimodal systems employ static resource provisioning and cannot easily adapt when compute resources change over time. Additionally, their reliance on processing sensor data with fixed feature extractors is ill-equipped to handle variations in modality quality. Consequently, uninformative modalities, such as those with high noise, needlessly consume resources better allocated towards other modalities. We propose ADMN, a layer-wise Adaptive Depth Multimodal Network capable of tackling both challenges - it adjusts the total number of active layers across all modalities to meet compute resource constraints, and continually reallocates layers across input modalities according to their modality quality. Our evaluations showcase ADMN can match the accuracy of state-of-the-art networks while reducing up to 75% of their floating-point operations.
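ADMN"按模态质量重新分配层数"的思想,可以用一个比例分配的玩具规则来示意:噪声越大(质量得分越低)的模态分到越浅的分支。注意 ADMN 实际是端到端学习该分配策略的,下面的比例规则只是说明性的假设:

```python
def allocate_layers(qualities, total_layers):
    """按模态质量得分比例分配固定的层数预算:
    噪声大的模态分到更浅的分支。仅为说明性的比例规则。"""
    total_q = sum(qualities)
    alloc = [max(1, round(total_layers * q / total_q)) for q in qualities]
    while sum(alloc) > total_layers:        # 四舍五入后微调到恰好等于预算
        alloc[alloc.index(max(alloc))] -= 1
    while sum(alloc) < total_layers:
        alloc[alloc.index(min(alloc))] += 1
    return alloc

layers = allocate_layers([0.9, 0.1], total_layers=10)   # 干净模态 vs 噪声模态
```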
zh

[CV-57] MRS: A Fast Sampler for Mean Reverting Diffusion based on ODE and SDE Solvers ICLR2025

【速读】:该论文旨在解决可控生成在扩散模型应用中的挑战,特别是针对Mean Reverting (MR) Diffusion方法需要数百次函数评估(NFEs)才能获得高质量样本的问题。论文的关键解决方案是提出了一种新的算法MRS (MR Sampler),通过解析地求解与MR Diffusion相关的反向时间随机微分方程(SDE)和概率流常微分方程(PF-ODE),并推导出半解析解。这些解由一个解析函数和一个由神经网络参数化的积分组成,从而能够在较少步骤内生成高质量样本,显著加速采样过程,提高其在可控生成任务中的实用性。

链接: https://arxiv.org/abs/2502.07856
作者: Ao Li,Wei Fang,Hongbo Zhao,Le Lu,Ge Yang,Minfeng Xu
机构: School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院), Beijing, China; Institute of Automation, Chinese Academy of Sciences (CASIA) (自动化研究所,中国科学院); DAMO Academy, Alibaba Group (阿里巴巴达摩院); Hupan Laboratory (湖畔实验室), Hangzhou, China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by ICLR 2025

点击查看摘要

Abstract:In applications of diffusion models, controllable generation is of practical significance, but is also challenging. Current methods for controllable generation primarily focus on modifying the score function of diffusion models, while Mean Reverting (MR) Diffusion directly modifies the structure of the stochastic differential equation (SDE), making the incorporation of image conditions simpler and more natural. However, current training-free fast samplers are not directly applicable to MR Diffusion. And thus MR Diffusion requires hundreds of NFEs (number of function evaluations) to obtain high-quality samples. In this paper, we propose a new algorithm named MRS (MR Sampler) to reduce the sampling NFEs of MR Diffusion. We solve the reverse-time SDE and the probability flow ordinary differential equation (PF-ODE) associated with MR Diffusion, and derive semi-analytical solutions. The solutions consist of an analytical function and an integral parameterized by a neural network. Based on this solution, we can generate high-quality samples in fewer steps. Our approach does not require training and supports all mainstream parameterizations, including noise prediction, data prediction and velocity prediction. Extensive experiments demonstrate that MR Sampler maintains high sampling quality with a speedup of 10 to 20 times across ten different image restoration tasks. Our algorithm accelerates the sampling procedure of MR Diffusion, making it more practical in controllable generation.
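MR Diffusion 所基于的均值回复 SDE,其最简单的前向离散是对 Ornstein-Uhlenbeck 过程 dx = theta*(mu - x) dt + sigma dW 做 Euler-Maruyama 迭代;MRS 的贡献在于对反向动力学做半解析积分,从而大幅减少步数。以下仅示意前向过程,参数为假设值:

```python
import math, random

def mean_reverting_step(x, mu, theta, sigma, dt):
    """均值回复(Ornstein-Uhlenbeck)SDE 的一步 Euler-Maruyama 离散:
    dx = theta * (mu - x) dt + sigma dW。"""
    return x + theta * (mu - x) * dt + sigma * math.sqrt(dt) * random.gauss(0.0, 1.0)

random.seed(0)
x = 5.0
for _ in range(2000):
    x = mean_reverting_step(x, mu=0.0, theta=2.0, sigma=0.1, dt=0.01)
# 经过足够多步后,x 回复到均值 mu = 0 附近(平稳标准差为 sigma / sqrt(2*theta) = 0.05)
```

逐步模拟需要成百上千次迭代,这正对应论文中"数百次 NFE"的代价;半解析解则可一步跨越大段时间区间。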
zh

[CV-58] Advancing Heat Demand Forecasting with Attention Mechanisms: Opportunities and Challenges

【速读】:该论文旨在解决 district heating systems (DHS) 中的热需求准确预测问题。面对人口增长及可再生能源在供暖领域中的核心地位日益提升,精准的热需求预测变得尤为重要。论文的关键解决方案在于构建了一个基于深度学习(Deep Learning, DL)的模型,该模型通过时间-频率空间表示输入特征,并采用注意力机制进行多步预测,从而实现更精确的热需求预报。评估结果显示,所提出的注意力机制模型在不同供应区域内的表现优于LSTM和CNN基线模型,其平均绝对误差(MAE)为0.105 kW h,平均绝对百分比误差(MAPE)为5.4%,显著提升了预测精度。

链接: https://arxiv.org/abs/2502.07854
作者: Adithya Ramachandran,Thorkil Flensmark B. Neergaard,Andreas Maier,Siming Bayer
机构: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希亚历山大大学模式识别实验室); Brønderslev Forsyning (布伦德斯莱夫供水公司)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Global leaders and policymakers are unified in their unequivocal commitment to decarbonization efforts in support of Net-Zero agreements. District Heating Systems (DHS), while contributing to carbon emissions due to the continued reliance on fossil fuels for heat production, are embracing more sustainable practices, albeit with some sense of vulnerability, as this could constrain their ability to adapt to dynamic demand and production scenarios. As demographic demands grow and renewables become the central strategy in decarbonizing the heating sector, the need for accurate demand forecasting has intensified. Advances in digitization have paved the way for Machine Learning (ML) based solutions to become the industry standard for modeling complex time series patterns. In this paper, we focus on building a Deep Learning (DL) model that uses deconstructed components of independent and dependent variables that affect heat demand as features to perform multi-step ahead forecasting of heat demand. The model represents the input features in a time-frequency space and uses an attention mechanism to generate accurate forecasts. The proposed method is evaluated on a real-world dataset and the forecasting performance is assessed against LSTM- and CNN-based forecasting models. Across different supply zones, the attention-based model outperforms the baselines both quantitatively and qualitatively, with a Mean Absolute Error (MAE) of 0.105 with a standard deviation of 0.06 kWh and a Mean Absolute Percentage Error (MAPE) of 5.4% with a standard deviation of 2.8%, compared to the second-best model with an MAE of 0.10 with a standard deviation of 0.06 kWh and a MAPE of 5.6% with a standard deviation of 3%.
zh
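
上文报告的 MAE 与 MAPE 两个评估指标,其定义可用如下示意代码说明(仅为展示指标计算方式的草图,与论文实现无关):

```python
def mae(y_true, y_pred):
    # Mean Absolute Error, in the same units as the target (here kWh)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error; assumes no zero-valued targets
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)
```

注意 MAPE 以百分比报告,且在真实需求接近零时会被放大,这也是供热负荷预测中常同时报告 MAE 的原因之一。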

[CV-59] Technical note on calibrating vision-language models under covariate shift

【速读】:该论文旨在解决低样本视觉分类任务中视觉-语言基础模型在目标数据分布上的泛化能力受限的问题。论文指出,由于样本不足导致的数据变化敏感性,以及协变量偏移(covariate shift)和置信度错配(confidence misalignment)是主要挑战。论文的关键解决方案是提出了一种统一框架——置信校准协变量偏移校正(Confidence-Calibrated Covariate Shift Correction, C3SC),它利用Fisher信息惩罚来校正协变量偏移,并采用置信度错配惩罚(CMP)降低错误分类示例的置信度。实验结果表明,C3SC显著提升了校准性能(ECE最多提升5.82%),并在具有挑战性的协变量偏移数据集上提高了3.5%的准确性,从而为实际应用中的可靠视觉-语言低样本任务提供了有前景的解决方案。

链接: https://arxiv.org/abs/2502.07847
作者: Behraj Khan,Rizwan Qureshi,Tahir Syed
机构: School of Mathematics and Computer Science, Institute of Business Administration Karachi (卡拉奇商学院数学与计算机科学学院), Pakistan (巴基斯坦); Center for Research in Computer Vision, University of Central Florida (中佛罗里达大学计算机视觉研究中心), USA (美国)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite being a successful example of emerging capability, vision-language foundation models for low-shot vision classification have a limited ability to sufficiently generalize to the target data distribution due to sample poverty, leading to sensitivity to variations in the data. A popular mitigation strategy is finetuning over multiple datasets, but domain generalization is expensive when practiced in this manner. This work examines both covariate shift between pre-training data and the underspecified target data, and confidence misalignment, where the model's prediction confidence is amplified by the limited data availability. We propose Confidence-Calibrated Covariate Shift Correction (C3SC), a unified framework to mitigate both covariate shift and confidence misalignment. C3SC leverages a Fisher information penalty for covariate shift correction and a confidence misalignment penalty (CMP) to lower confidence on misclassified examples. Experimental results across various vision and covariate shift datasets demonstrate that C3SC improves calibration (ECE) by up to 5.82%. C3SC also shows better robustness, with a 3.5% improvement in accuracy on challenging covariate shift datasets, making C3SC a promising solution for reliable real-world vision-language low-shot applications under distribution shift.
zh
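
上文用于衡量校准效果的 ECE(Expected Calibration Error)按标准定义可以这样计算:将预测按置信度分桶,对每桶求"准确率与平均置信度之差"的加权平均(以下为最小示意实现,非论文代码):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # Bin predictions by confidence, then average |accuracy - confidence|
    # per bin, weighted by bin size (the standard ECE definition).
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece
```

完美校准(置信度恰好等于准确率)时 ECE 为 0;"自信但全错"时 ECE 趋近 1。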

[CV-60] Spread them Apart: Towards Robust Watermarking of Generated Content

【速读】:该论文旨在解决生成式模型所产生内容的版权归属问题及潜在的滥用风险。论文提出的关键解决方案是在生成内容过程中嵌入不可篡改的水印,以确保未来能够检测到生成内容并识别其创建者。此方法在推理阶段嵌入水印,无需重新训练模型,并且证明对于有界幅度的添加扰动具有鲁棒性。论文展示了该方法应用于扩散模型时,在面对多种合成水印移除攻击时,其鲁棒性可媲美当前最先进的水印方案。

链接: https://arxiv.org/abs/2502.07845
作者: Mikhail Pautov,Danil Ivanov,Andrey V. Galichin,Oleg Rogov,Ivan Oseledets
机构: AIRI; Skolkovo Institute of Science and Technology (斯科尔科沃科学技术研究院); Moscow Technical University of Communications and Informatics (莫斯科通讯与信息技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative models that can produce realistic images have improved significantly in recent years. The quality of the generated content has increased drastically, so it is sometimes very difficult to distinguish between real images and generated ones. Such an improvement comes at the price of ethical concerns about the usage of generative models: users of generative models can improperly claim ownership of generated content protected by a license. In this paper, we propose an approach to embed watermarks into the generated content to allow future detection of the generated content and identification of the user who generated it. The watermark is embedded during the inference of the model, so the proposed approach does not require retraining of the latter. We prove that the embedded watermarks are guaranteed to be robust against additive perturbations of a bounded magnitude. We apply our method to watermark diffusion models and show that it matches state-of-the-art watermarking schemes in terms of robustness to different types of synthetic watermark removal attacks.
zh
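
论文所述"推理阶段嵌入、对有界加性扰动鲁棒"的思路,可以用最朴素的扩频式加性水印来体会:嵌入一个由密钥决定的 ±1 伪随机模式,检测时计算与该模式的相关性。以下仅为概念草图(密钥、强度与阈值均为假设参数,并非论文方法本身):

```python
import random

def embed_watermark(signal, key, strength=0.5):
    # Add a key-derived pseudo-random +/-1 pattern, scaled by `strength`.
    rng = random.Random(key)
    pattern = [rng.choice((-1.0, 1.0)) for _ in signal]
    return [s + strength * p for s, p in zip(signal, pattern)], pattern

def detect_watermark(signal, key, threshold=0.25):
    # Correlate the signal with the key's pattern; a high mean correlation
    # indicates the watermark is present, even under small additive noise.
    rng = random.Random(key)
    pattern = [rng.choice((-1.0, 1.0)) for _ in signal]
    corr = sum(s * p for s, p in zip(signal, pattern)) / len(signal)
    return corr > threshold
```

只要扰动幅度(此处 0.1)小于嵌入强度与阈值之差,相关性仍高于阈值,这正是"有界扰动下鲁棒"这一性质在线性水印里的直观来源。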

[CV-61] TranSplat: Surface Embedding-guided 3D Gaussian Splatting for Transparent Object Manipulation

【速读】:该论文旨在解决透明物体操作中的显著挑战,即难以获得精确且密集的深度测量。现有深度传感器和深度补全方法在处理透明物体时存在不足,通常会产生不完整或错误的深度数据,并且无法保持帧间一致性。论文的关键解决方案是提出一种名为TranSplat的方法,即面向透明物体的表面嵌入引导3D Gaussian Splatting方法。TranSplat利用潜在扩散模型生成表面嵌入,提供一致且连续的表示形式,并将其与输入RGB图像结合,增强3D高斯的splatting效果,从而改善深度补全。

链接: https://arxiv.org/abs/2502.07840
作者: Jeongyun Kim,Jeongho Noh,Dong-Guw Lee,Ayoung Kim
机构: SNU(首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 7 pages, 6 figures

点击查看摘要

Abstract:Transparent object manipulation remains a significant challenge in robotics due to the difficulty of acquiring accurate and dense depth measurements. Conventional depth sensors often fail with transparent objects, resulting in incomplete or erroneous depth data. Existing depth completion methods struggle with interframe consistency and incorrectly model transparent objects as Lambertian surfaces, leading to poor depth reconstruction. To address these challenges, we propose TranSplat, a surface embedding-guided 3D Gaussian Splatting method tailored for transparent objects. TranSplat uses a latent diffusion model to generate surface embeddings that provide consistent and continuous representations, making it robust to changes in viewpoint and lighting. By integrating these surface embeddings with input RGB images, TranSplat effectively captures the complexities of transparent surfaces, enhancing the splatting of 3D Gaussians and improving depth completion. Evaluations on synthetic and real-world transparent object benchmarks, as well as robot grasping tasks, show that TranSplat achieves accurate and dense depth completion, demonstrating its effectiveness in practical applications. We open-source our synthetic dataset and model: https://github.com/jeongyun0609/TranSplat
zh

[CV-62] NanoVLMs: How small can we go and still make coherent Vision Language Models?

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在生成连贯且一致文本方面的局限性,特别是在模型规模较小的情况下。论文的关键解决方案在于引入两个新型数据集:ShortDesc(包含简洁的图像描述)和LongDesc(包含更详细的图像描述)。这些数据集中的文本使用了简化词汇和语法,类似于幼儿使用的语言,并通过一个缩小规模的模型GPT-4o进行生成。通过这些数据集,研究展示了可以训练出比现有小型VLM小多达10倍的模型,同时保持架构的简单性。评估方法采用GPT-4o对生成的文本进行评分,从创意、意义性和一致性三个方面打分,从而克服标准基准的局限性,提供多维度的模型能力评估。

链接: https://arxiv.org/abs/2502.07838
作者: Mukund Agarwalla,Himanshu Kumar,Raj Dandekar,Rajat Dandekar,Sreedath Panat
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 8 figures, 3 tables

点击查看摘要

Abstract:Vision-Language Models (VLMs), such as GPT-4V and Llama 3.2 vision, have garnered significant research attention for their ability to leverage Large Language Models (LLMs) in multimodal tasks. However, their potential is constrained by inherent challenges, including proprietary restrictions, substantial computational demands, and limited accessibility. Smaller models, such as GIT and BLIP, exhibit marked limitations, often failing to generate coherent and consistent text beyond a few tokens, even with extensive training. This underscores a pivotal inquiry: how small can a VLM be and still produce fluent and consistent text? Drawing inspiration from the exceptional learning process of 3-4 year old children, who rely heavily on visual cues for understanding and communication, we introduce two novel datasets: ShortDesc (featuring concise image descriptions) and LongDesc (containing more detailed image descriptions). These datasets consist of image-text pairs where the text is restricted to the simple vocabulary and syntax typically used by young children, generated with a scaled-down model, GPT-4o. Using these datasets, we demonstrate that it is possible to train VLMs that are significantly smaller, up to 10 times smaller than state-of-the-art (SOTA) small VLMs, while maintaining architectural simplicity. To evaluate the outputs, we leverage GPT-4o to grade the text, as if stories written by students, on creativity, meaningfulness, and consistency, assigning scores out of 10. This method addresses limitations of standard benchmarks by accommodating unstructured outputs and providing a multidimensional evaluation of the model's capabilities. Our findings contribute to the development of lightweight, accessible multimodal models for resource-constrained environments.
zh

[CV-63] Captured by Captions: On Memorization and its Mitigation in CLIP Models ICLR2025

【速读】:该论文旨在探究多模态模型(如CLIP)在训练过程中如何利用数据,特别是关注记忆机制(memorization)的作用。论文的关键在于提出了一种针对CLIP的记忆机制定义(CLIPMem),并量化分析了其记忆行为,发现CLIP的记忆行为介于有监督学习和自监督学习之间。论文指出文本编码器比图像编码器对记忆的影响更大,并据此提出了减少记忆效应同时保持模型效用的策略。这些策略在传统学习范式中通常是相互矛盾的,即减少记忆效应会导致模型效用降低。

链接: https://arxiv.org/abs/2502.07830
作者: Wenhao Wang,Adam Dziedzic,Grace C. Kim,Michael Backes,Franziska Boenisch
机构: CISPA(亥姆霍兹信息安全中心); Georgia Institute of Technology(乔治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ICLR 2025

点击查看摘要

Abstract:Multi-modal models, such as CLIP, have demonstrated strong performance in aligning visual and textual representations, excelling in tasks like image retrieval and zero-shot classification. Despite this success, the mechanisms by which these models utilize training data, particularly the role of memorization, remain unclear. In uni-modal models, both supervised and self-supervised, memorization has been shown to be essential for generalization. However, it is not well understood how these findings would apply to CLIP, which incorporates elements from both supervised learning via captions that provide a supervisory signal similar to labels, and from self-supervised learning via the contrastive objective. To bridge this gap in understanding, we propose a formal definition of memorization in CLIP (CLIPMem) and use it to quantify memorization in CLIP models. Our results indicate that CLIP’s memorization behavior falls between the supervised and self-supervised paradigms, with “mis-captioned” samples exhibiting highest levels of memorization. Additionally, we find that the text encoder contributes more to memorization than the image encoder, suggesting that mitigation strategies should focus on the text domain. Building on these insights, we propose multiple strategies to reduce memorization while at the same time improving utility–something that had not been shown before for traditional learning paradigms where reducing memorization typically results in utility decrease.
zh

[CV-64] Preference Alignment on Diffusion Model: A Comprehensive Survey for Image Generation and Editing

【速读】:该论文旨在解决在图像生成与编辑领域中,将偏好对齐(Preference Alignment)策略与扩散模型(Diffusion Models, DMs)相结合所面临的挑战。论文的关键在于系统性地回顾和探讨了优化技术,如基于人类反馈的强化学习(Reinforcement Learning with Human Feedback, RLHF)和直接偏好优化(Direct Preference Optimization, DPO),这些技术在将用户偏好与扩散模型有效结合方面发挥了核心作用。通过这些方法,论文深入分析了偏好对齐在自动驾驶、医学影像、机器人等领域的应用,并全面讨论了相关挑战。

链接: https://arxiv.org/abs/2502.07829
作者: Sihao Wu,Xiaonan Si,Chi Xing,Jianhong Wang,Gaojie Jin,Guangliang Cheng,Lijun Zhang,Xiaowei Huang
机构: University of Liverpool(利物浦大学); Institute of Software Chinese Academy of Sciences(中国科学院软件研究所); University of Edinburgh(爱丁堡大学); University of Bristol(布里斯托尔大学); University of Exeter(埃克塞特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The integration of preference alignment with diffusion models (DMs) has emerged as a transformative approach to enhance image generation and editing capabilities. Although integrating diffusion models with preference alignment strategies poses significant challenges for novices at this intersection, comprehensive and systematic reviews of this subject are still notably lacking. To bridge this gap, this paper extensively surveys preference alignment with diffusion models in image generation and editing. First, we systematically review cutting-edge optimization techniques such as reinforcement learning with human feedback (RLHF), direct preference optimization (DPO), and others, highlighting their pivotal role in aligning preferences with DMs. Then, we thoroughly explore the applications of aligning preferences with DMs in autonomous driving, medical imaging, robotics, and more. Finally, we comprehensively discuss the challenges of preference alignment with DMs. To our knowledge, this is the first survey centered on preference alignment with DMs, providing insights to drive future innovation in this dynamic area.
zh
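
综述中反复出现的 DPO(Direct Preference Optimization)目标,对单个偏好对(优选样本 w、劣选样本 l)可写成如下形式(示意实现,β 取 0.1 仅为常见默认值):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO loss for one (winner, loser) preference pair:
    #   -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))
    # where ref_logp_* come from a frozen reference policy.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

当策略与参考模型一致时(margin 为 0)损失为 log 2;策略越偏向优选样本,损失越小——这正是 DPO 不借助显式奖励模型即可对齐偏好的原因。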

[CV-65] Deep Learning in Automated Power Line Inspection: A Review

【速读】:该论文旨在解决电力线检测过程中数据分析方法的改进问题。论文的关键在于系统地总结和分析现有的深度学习技术在电力线组件检测和故障诊断中的应用,并探索其基本原理与实际应用。通过详细阐述这些方法,论文提出了未来研究方向,强调了如边缘云计算和多模态分析等技术的重要性。

链接: https://arxiv.org/abs/2502.07826
作者: Md. Ahasan Atick Faisal,Imene Mecheter,Yazan Qiblawey,Javier Hernandez Fernandez,Muhammad E. H. Chowdhury,Serkan Kiranyaz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 40 pages, 12 figures

点击查看摘要

Abstract:In recent years, power line maintenance has seen a paradigm shift by moving towards computer vision-powered automated inspection. The utilization of an extensive collection of videos and images has become essential for maintaining the reliability, safety, and sustainability of electricity transmission. A significant focus on applying deep learning techniques for enhancing power line inspection processes has been observed in recent research. A comprehensive review of existing studies has been conducted in this paper, to aid researchers and industries in developing improved deep learning-based systems for analyzing power line data. The conventional steps of data analysis in power line inspections have been examined, and the body of current research has been systematically categorized into two main areas: the detection of components and the diagnosis of faults. A detailed summary of the diverse methods and techniques employed in these areas has been encapsulated, providing insights into their functionality and use cases. Special attention has been given to the exploration of deep learning-based methodologies for the analysis of power line inspection data, with an exposition of their fundamental principles and practical applications. Moreover, a vision for future research directions has been outlined, highlighting the need for advancements such as edge-cloud collaboration, and multi-modal analysis among others. Thus, this paper serves as a comprehensive resource for researchers delving into deep learning for power line analysis, illuminating the extent of current knowledge and the potential areas for future investigation.
zh

[CV-66] Pre-Trained Video Generative Models as World Simulators

【速读】:该论文旨在解决预训练视频生成模型在处理交互性和动态场景时的局限性,即这些模型通常基于静态提示(如文本或图像)生成视频片段。论文的关键解决方案在于引入了一种名为Dynamic World Simulation (DWS) 的新方法,通过集成一个轻量级的、通用的动作条件模块,使预训练视频生成模型能够执行指定的动作轨迹,并实现动作与视觉变化之间的精确对齐。此外,通过引入一种增强动作可控性的运动强化损失,论文展示了通过一致的动态过渡建模来构建强大的世界模拟器的重要性。这一方法不仅适用于扩散模型,也适用于自回归变换模型,在游戏和机器人领域中生成具有高动作可控性和动态一致性视频方面取得了显著改进。

链接: https://arxiv.org/abs/2502.07825
作者: Haoran He,Yang Zhang,Liang Lin,Zhongwen Xu,Ling Pan
机构: Hong Kong University of Science and Technology; Tsinghua University; Sun Yat-sen University; Tencent AI Lab (腾讯人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages

点击查看摘要

Abstract:Video generative models pre-trained on large-scale internet datasets have achieved remarkable success, excelling at producing realistic synthetic videos. However, they often generate clips based on static prompts (e.g., text or images), limiting their ability to model interactive and dynamic scenarios. In this paper, we propose Dynamic World Simulation (DWS), a novel approach to transform pre-trained video generative models into controllable world simulators capable of executing specified action trajectories. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module that seamlessly integrates into any existing model. Instead of focusing on complex visual details, we demonstrate that consistent dynamic transition modeling is the key to building powerful world simulators. Building upon this insight, we further introduce a motion-reinforced loss that enhances action controllability by compelling the model to capture dynamic changes more effectively. Experiments demonstrate that DWS can be versatilely applied to both diffusion and autoregressive transformer models, achieving significant improvements in generating action-controllable, dynamically consistent videos across games and robotics domains. Moreover, to facilitate the applications of the learned world simulator in downstream tasks such as model-based reinforcement learning, we propose prioritized imagination to improve sample efficiency, demonstrating competitive performance compared with state-of-the-art methods.
zh
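
摘要只说明 DWS 使用"轻量级、通用的动作条件模块",未给出具体结构;一种常见的轻量条件化做法是 FiLM 式逐通道调制,以下草图仅作为可能形态的示意(线性映射权重均为假设,并非论文实现):

```python
def film_condition(features, action_vec, gamma_w, beta_w):
    # FiLM-style conditioning sketch: an action embedding is mapped linearly
    # to a per-channel scale (gamma) and shift (beta) applied to visual features.
    gamma = [sum(a * w for a, w in zip(action_vec, row)) for row in gamma_w]
    beta = [sum(a * w for a, w in zip(action_vec, row)) for row in beta_w]
    return [(1.0 + g) * f + b for f, g, b in zip(features, gamma, beta)]
```

权重为零时模块退化为恒等映射,这类设计便于把条件模块"无缝插入"到已预训练的生成模型中而不破坏其原有行为。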

[CV-67] PDM-SSD: Single-Stage Three-Dimensional Object Detector With Point Dilation

【速读】:该论文旨在解决基于点的检测器仅限于从提供的点中学习,导致感受野有限及全局学习能力不足的问题。解决方案的关键在于引入了一种新颖的点膨胀机制(Point Dilation Mechanism, PDM),通过扩展特征空间,采用点膨胀和特征填充等步骤来增强模型的全局感知能力。PDM不仅提高了检测精度,还保持了较快的推理速度。

链接: https://arxiv.org/abs/2502.07822
作者: Ao Liang,Haiyang Hua,Jian Fang,Wenyu Chen,Huaici Zhao
机构: Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences(中国科学院光电信息处理重点实验室), 110016, Shenyang; Shenyang Institute of Automation, Chinese Academy of Sciences(中国科学院沈阳自动化研究所), 110016, Shenyang; University of Chinese Academy of Sciences(中国科学院大学), 100049, Beijing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current point-based detectors can only learn from the provided points, with limited receptive fields and insufficient global learning capabilities for such targets. In this paper, we present a novel Point Dilation Mechanism for single-stage 3D detection (PDM-SSD) that takes advantage of both point-based and grid-based representations. Specifically, we first use a PointNet-style 3D backbone for efficient feature encoding. Then, a neck with the Point Dilation Mechanism (PDM) is used to expand the feature space, which involves two key steps: point dilation and feature filling. The former expands points to a certain-size grid centered around the sampled points in Euclidean space. The latter fills the unoccupied grid cells with features for backpropagation, using spherical harmonic coefficients and a Gaussian density function in terms of direction and scale. Next, we associate multiple dilation centers and fuse coefficients to obtain sparse grid features through height compression. Finally, we design a hybrid detection head for joint learning, where, on one hand, a scene heatmap is predicted to complement the voting point set for improved detection accuracy, and on the other hand, the target probability of detected boxes is calibrated through feature fusion. On the challenging Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset, PDM-SSD achieves state-of-the-art results for multi-class detection among single-modal methods with an inference speed of 68 frames per second. We also demonstrate the advantages of PDM-SSD in detecting sparse and incomplete objects through numerous object-level instances. Additionally, PDM can serve as an auxiliary network to establish a connection between sampling points and object centers, thereby improving the accuracy of the model without sacrificing inference speed. Our code will be available at this https URL.
zh
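
PDM 的"点膨胀 + 特征填充"两步可以用一个极简占位实现来体会:把每个采样点扩张为以其为中心的立方体网格,并按与中心距离的高斯密度填充权重(论文中的球谐系数部分此处省略,网格尺寸与 σ 均为示意假设):

```python
import math

def dilate_points(points, radius=1, sigma=1.0):
    # Expand each 3D point (integer coords) into a cube of grid cells around it,
    # filling each cell with a Gaussian weight of its squared distance to the
    # center; overlapping dilation centers keep the maximum weight.
    grid = {}
    for (x, y, z) in points:
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                for dz in range(-radius, radius + 1):
                    d2 = dx * dx + dy * dy + dz * dz
                    w = math.exp(-d2 / (2.0 * sigma * sigma))
                    cell = (x + dx, y + dy, z + dz)
                    grid[cell] = max(grid.get(cell, 0.0), w)
    return grid
```

这样每个点的感受野从单点扩展到 (2r+1)³ 个网格单元,也就是摘要所说"扩展特征空间、增强全局感知"的最直观形式。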

[CV-68] Amnesia as a Catalyst for Enhancing Black Box Pixel Attacks in Image Classification and Object Detection NEURIPS2024

【速读】:该论文旨在解决现有查询式像素攻击(Query-based Pixel Attacks)在图像分类中的局限性,特别是依赖随机性和补丁(patch dependency)的问题。同时,该研究领域尚未探索查询式像素攻击在目标检测中的应用。论文的关键解决方案是提出了一种基于像素的黑盒攻击方法——利用强化学习的“记忆与遗忘像素攻击”(Remember and Forget Pixel Attack using Reinforcement Learning, RFPAR)。RFPAR通过引入“记忆”和“遗忘”过程,并利用单步强化学习算法生成的奖励来扰动像素,从而减少了随机性和补丁依赖性。此外,RFPAR还扩展到目标检测任务中,以减少检测对象的置信度分数,避免被检测。实验结果表明,RFPAR在ImageNet-1K数据集上的分类任务中优于现有的查询式像素攻击,在MSCOCO数据集上的目标检测任务中也表现出色,且所需的查询次数更少。

链接: https://arxiv.org/abs/2502.07821
作者: Dongsu Song,Daehwa Ko,Jay Hoon Jung
机构: Korea Aerospace University (韩国航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted as a poster at NeurIPS 2024

点击查看摘要

Abstract:It is well known that query-based attacks tend to have relatively higher success rates in adversarial black-box attacks. While research on black-box attacks is actively being conducted, relatively few studies have focused on pixel attacks that target only a limited number of pixels. In image classification, query-based pixel attacks often rely on patches, which heavily depend on randomness and neglect the fact that scattered pixels are more suitable for adversarial attacks. Moreover, to the best of our knowledge, query-based pixel attacks have not been explored in the field of object detection. To address these issues, we propose a novel pixel-based black-box attack called Remember and Forget Pixel Attack using Reinforcement Learning (RFPAR), consisting of two main components: the Remember and Forget processes. RFPAR mitigates randomness and avoids patch dependency by leveraging rewards generated through a one-step RL algorithm to perturb pixels. RFPAR effectively creates perturbed images that minimize the confidence scores while adhering to limited pixel constraints. Furthermore, we advance our proposed attack beyond image classification to object detection, where RFPAR reduces the confidence scores of detected objects to avoid detection. Experiments on the ImageNet-1K dataset for classification show that RFPAR outperformed state-of-the-art query-based pixel attacks. For object detection, using the MSCOCO dataset with YOLOv8 and DDQ, RFPAR demonstrates comparable mAP reduction to state-of-the-art query-based attacks while requiring fewer queries. Further experiments on the Argoverse dataset using YOLOv8 confirm that RFPAR effectively removed objects on a larger-scale dataset. Our code is available at this https URL.
zh
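
RFPAR 的"记忆/遗忘"机制可以用一个最朴素的黑盒逐像素贪心搜索来体会:能降低模型置信度的扰动被"记住",否则被"遗忘"(回滚)。以下仅为概念草图,并非论文的单步强化学习实现,score_fn 代表任意黑盒置信度查询:

```python
import random

def pixel_attack(image, score_fn, n_pixels=5, iters=200, seed=0):
    # Greedy black-box pixel attack sketch: propose a random pixel flip,
    # "remember" it if the model's confidence score drops, otherwise
    # "forget" it by reverting the change. At most n_pixels are perturbed.
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    adv = [row[:] for row in image]
    best = score_fn(adv)
    changed = set()
    for _ in range(iters):
        if len(changed) >= n_pixels:
            break
        y, x = rng.randrange(h), rng.randrange(w)
        old = adv[y][x]
        adv[y][x] = 1.0 - old  # flip intensity of one scattered pixel
        s = score_fn(adv)      # one black-box query
        if s < best:
            best = s
            changed.add((y, x))
        else:
            adv[y][x] = old    # forget: revert unhelpful perturbation
    return adv, best
```

与补丁攻击不同,这里的扰动像素是分散采样的,每次只消耗一次查询,这正是摘要强调"更少查询、避免补丁依赖"的朴素版本。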

[CV-69] Unpaired Image Dehazing via Kolmogorov-Arnold Transformation of Latent Features

【速读】:该论文旨在解决无监督图像去雾(Unsupervised Image Dehazing)这一复杂且病态的视觉任务。关键在于提出了一种基于Kolmogorov-Arnold变换的创新框架——UID-KAT,利用Kolmogorov-Arnold网络(KANs)结合对抗训练(Adversarial Training)和对比学习(Contrastive Learning),以更高效地逼近复杂函数,并在不使用配对数据的情况下实现高质量的图像去雾效果。

链接: https://arxiv.org/abs/2502.07812
作者: Le-Anh Tran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures

点击查看摘要

Abstract:This paper proposes an innovative framework for Unsupervised Image Dehazing via Kolmogorov-Arnold Transformation, termed UID-KAT. Image dehazing is recognized as a challenging and ill-posed vision task that requires complex transformations and interpretations in the feature space. Recent advancements have introduced Kolmogorov-Arnold Networks (KANs), inspired by the Kolmogorov-Arnold representation theorem, as promising alternatives to Multi-Layer Perceptrons (MLPs) since KANs can leverage their polynomial foundation to more efficiently approximate complex functions while requiring fewer layers than MLPs. Motivated by this potential, this paper explores the use of KANs combined with adversarial training and contrastive learning to model the intricate relationship between hazy and clear images. Adversarial training is employed due to its capacity in producing high-fidelity images, and contrastive learning promotes the model’s emphasis on significant features while suppressing the influence of irrelevant information. The proposed UID-KAT framework is trained in an unsupervised setting to take advantage of the abundance of real-world data and address the challenge of preparing paired hazy/clean images. Experimental results show that UID-KAT achieves state-of-the-art dehazing performance across multiple datasets and scenarios, outperforming existing unpaired methods while reducing model complexity. The source code for this work is publicly available at this https URL.
zh
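
KAN 的核心是把 MLP 的"固定激活函数 + 线性权重"换成每条边上一个可学习的一维函数;用多项式基可以写出一个玩具版本(系数均为假设示例,真实 KAN 常用 B 样条基):

```python
def kan_edge(x, coeffs):
    # One KAN edge: a learnable univariate polynomial applied to a scalar input,
    # replacing the fixed activation of an MLP edge.
    return sum(c * x ** i for i, c in enumerate(coeffs))

def kan_layer(inputs, edge_coeffs):
    # Each output unit sums its own learnable univariate function of every
    # input: out_j = sum_i phi_{j,i}(x_i), per the Kolmogorov-Arnold form.
    return [sum(kan_edge(x, edge_coeffs[j][i]) for i, x in enumerate(inputs))
            for j in range(len(edge_coeffs))]
```

正因为非线性位于每条边上而非节点上,KAN 在逼近复杂函数时往往需要比 MLP 更少的层数,这是论文选择它建模"有雾/清晰"映射的动机。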

[CV-70] CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders

【速读】:该论文旨在解决现有视频掩码自编码器(Masked Autoencoders, MAEs)在学习时空表示时侧重于一般的空间-时间模式,而忽视了特定交互或序列等细微语义属性的问题。这些细微语义属性对于理解某些具有丰富上下文和连续性的动作至关重要,但现有方法通常未能充分捕捉。为了解决这一问题,论文提出了一种端到端的跨模态对比学习MAE——CrossVideoMAE。该方法的关键在于整合视频中的互信息与采样帧中的空间信息,并在特征不变的空间内鼓励增强视频域内的变换不变性。通过联合嵌入可见标记的特征并结合跨模态的特征对应关系,实现从视频和帧图像模态中无标签引导信号的自监督获取。实验结果表明,该方法超越了先前的最先进方法。

链接: https://arxiv.org/abs/2502.07811
作者: Shihab Aaqil Ahamed,Malitha Gunawardhana,Liel David,Michael Sidorov,Daniel Harari,Muhammad Haris Khan
机构: Mohamed Bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); University of Auckland (奥克兰大学); Weizmann Institute of Science (魏茨曼科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current video-based Masked Autoencoders (MAEs) primarily focus on learning effective spatiotemporal representations from a visual perspective, which may lead the model to prioritize general spatial-temporal patterns but often overlook nuanced semantic attributes like specific interactions or sequences that define actions - such as action-specific features that align more closely with human cognition for space-time correspondence. This can limit the model's ability to capture the essence of certain actions that are contextually rich and continuous. Humans are capable of mapping visual concepts, object view invariance, and semantic attributes available in static instances to comprehend natural dynamic scenes or videos. Existing MAEs for videos and static images rely on separate datasets for videos and images, which may lack the rich semantic attributes necessary for fully understanding the learned concepts, especially when compared to using videos and corresponding sampled frame images together. To this end, we propose CrossVideoMAE, an end-to-end self-supervised cross-modal contrastive learning MAE that effectively learns both video-level and frame-level rich spatiotemporal representations and semantic attributes. Our method integrates mutual spatiotemporal information from videos with spatial information from sampled frames within a feature-invariant space, while encouraging invariance to augmentations within the video domain. This objective is achieved through jointly embedding features of visible tokens and combining feature correspondence within and across modalities, which is critical for acquiring rich, label-free guiding signals from both video and frame image modalities in a self-supervised manner. Extensive experiments demonstrate that our approach surpasses previous state-of-the-art methods, and ablation studies validate the effectiveness of our approach.
zh
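
这类跨模态对比目标(视频片段与其自身采样帧互为正样本对)通常采用 InfoNCE 形式:相似度矩阵的对角线为正对,其余为负对。以下为示意实现(温度系数 0.07 仅为常见默认值,非论文设定):

```python
import math

def info_nce(sim, temperature=0.07):
    # InfoNCE over a similarity matrix whose diagonal holds positive pairs
    # (e.g. a video clip vs. its own sampled frame); off-diagonals are negatives.
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # -log softmax at the positive pair
    return total / n
```

当正对相似度明显高于负对时损失趋近 0,正负对无法区分时损失为 log N,这给了对比预训练一个直观的收敛标尺。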

[CV-71] CP-Guard: A New Paradigm for Malicious Agent Detection and Defense in Collaborative Perception

【速读】:该论文旨在解决在协作感知(Collaborative Perception, CP)系统中,由于其开放性导致的易受恶意攻击的问题,这些攻击可能注入虚假信息,误导单车辆的感知,从而危及安全驾驶。论文的关键解决方案在于提出了一种新的恶意代理检测范式,该范式能够在特征层面上有效识别恶意代理,而无需验证最终的感知结果,从而显著降低了计算开销。在此基础上,论文引入了CP-GuardBench数据集,并开发了一种名为CP-Guard+的鲁棒防御方法,通过精心设计的双中心对比损失(Dual-Centered Contrastive Loss, DCCLoss)增强良性与恶意特征表示之间的间隔。

链接: https://arxiv.org/abs/2502.07807
作者: Senkang Hu,Yihang Tao,Zihan Fang,Guowen Xu,Yiqin Deng,Sam Kwong,Yuguang Fang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Collaborative perception (CP) is a promising method for safe connected and autonomous driving, which enables multiple vehicles to share sensing information to enhance perception performance. However, compared with single-vehicle perception, the openness of a CP system makes it more vulnerable to malicious attacks that can inject malicious information to mislead the perception of an ego vehicle, resulting in severe risks for safe driving. To mitigate such vulnerability, we first propose a new paradigm for malicious agent detection that effectively identifies malicious agents at the feature level without requiring verification of final perception results, significantly reducing computational overhead. Building on this paradigm, we introduce CP-GuardBench, the first comprehensive dataset provided to train and evaluate various malicious agent detection methods for CP systems. Furthermore, we develop a robust defense method called CP-Guard+, which enhances the margin between the representations of benign and malicious features through a carefully designed Dual-Centered Contrastive Loss (DCCLoss). Finally, we conduct extensive experiments on both CP-GuardBench and V2X-Sim, and demonstrate the superiority of CP-Guard+.
zh
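
DCCLoss 的"双中心"思想——良性特征应比靠近恶意中心更靠近良性中心,且至少相差一个间隔——可以用铰链形式来体会。论文摘要未给出具体损失形式,以下纯属假设性草图:

```python
def dual_centered_loss(feat, benign_center, malicious_center, margin=1.0):
    # Hinge-style sketch of a dual-centered objective: penalize a benign
    # feature unless it is closer to the benign center than to the malicious
    # center by at least `margin` (Euclidean distances).
    d_b = sum((f - c) ** 2 for f, c in zip(feat, benign_center)) ** 0.5
    d_m = sum((f - c) ** 2 for f, c in zip(feat, malicious_center)) ** 0.5
    return max(0.0, d_b - d_m + margin)
```

这类间隔项正对应摘要所说"增强良性与恶意特征表示之间的 margin",使特征层面的恶意代理检测有更清晰的决策边界。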

[CV-72] Movie Weaver: Tuning-Free Multi-Concept Video Personalization with Anchored Prompts

【速读】:该论文旨在解决视频个性化生成中多概念融合的问题,现有方法通常只能实现单个概念的个性化,且在扩展到多个概念时容易导致身份融合。论文的关键解决方案是引入了锚点提示(anchored prompts),将图像锚点作为独特标记嵌入文本提示中,以准确引导生成过程中的参考。此外,论文还提出了概念嵌入(concept embeddings)来编码参考图像的顺序。这种方法称为Movie Weaver,能够无缝地将包括人脸、身体及动物在内的多种概念融合进一个视频中,从而实现在单一模型中的灵活组合。

链接: https://arxiv.org/abs/2502.07802
作者: Feng Liang,Haoyu Ma,Zecheng He,Tingbo Hou,Ji Hou,Kunpeng Li,Xiaoliang Dai,Felix Juefei-Xu,Samaneh Azadi,Animesh Sinha,Peizhao Zhang,Peter Vajda,Diana Marculescu
机构: Meta GenAI; The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Video personalization, which generates customized videos using reference images, has gained significant attention. However, prior methods typically focus on single-concept personalization, limiting broader applications that require multi-concept integration. Attempts to extend these models to multiple concepts often lead to identity blending, which results in composite characters with fused attributes from multiple sources. This challenge arises due to the lack of a mechanism to link each concept with its specific reference image. We address this with anchored prompts, which embed image anchors as unique tokens within text prompts, guiding accurate referencing during generation. Additionally, we introduce concept embeddings to encode the order of reference images. Our approach, Movie Weaver, seamlessly weaves multiple concepts-including face, body, and animal images-into one video, allowing flexible combinations in a single model. The evaluation shows that Movie Weaver outperforms existing methods for multi-concept video personalization in identity preservation and overall quality.
zh
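
The anchored-prompt idea can be sketched in a few lines: bind each concept description to a unique anchor token so generation can tell which text refers to which reference image. The token format and helper function below are invented for illustration; the paper's actual tokenizer and model interface are not described in this summary.

```python
# Hypothetical sketch of "anchored prompts": each reference image k is bound to
# a unique anchor token <img_k> inside the text prompt, so every description is
# unambiguously linked to one reference image. The anchor order also fixes the
# order in which reference-image ("concept") embeddings would be concatenated.

def build_anchored_prompt(descriptions):
    anchors = [f"<img_{k}>" for k in range(len(descriptions))]
    parts = [f"{a} {d}" for a, d in zip(anchors, descriptions)]
    return anchors, ", ".join(parts)

anchors, prompt = build_anchored_prompt(
    ["a man with short hair", "a golden retriever"]
)
print(prompt)  # → <img_0> a man with short hair, <img_1> a golden retriever
```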

[CV-73] CTR-Driven Advertising Image Generation with Multimodal Large Language Models WWW2025

【速读】: This paper addresses the limitation that existing methods for generating product advertising images focus mainly on aesthetic quality and fail to optimize Click-Through Rate (CTR). The key is to generate advertising images with Multimodal Large Language Models (MLLMs), using CTR as the primary optimization objective. First, targeted pre-training tasks built on a large-scale e-commerce multimodal dataset give the MLLMs an initial capability for advertising-image generation. A novel reward model then fine-tunes the pre-trained MLLMs via Reinforcement Learning (RL), jointly exploiting multimodal features to accurately reflect user click preferences. A product-centric preference optimization strategy further ensures that the generated background content stays consistent with the product's characteristics, improving the overall relevance and effectiveness of the advertising images.

链接: https://arxiv.org/abs/2502.06823
作者: Xingye Chen,Wei Feng,Zhenbang Du,Weizhen Wang,Yanyin Chen,Haohan Wang,Linkai Liu,Yaoyu Li,Jinyuan Zhao,Yu Li,Zheng Zhang,Jingjing Lv,Junjie Shen,Zhangang Lin,Jingping Shao,Yuanjie Shao,Xinge You,Changxin Gao,Nong Sang
机构: Huazhong University of Science and Technology(Wuhan, 中国); JD.COM(Beijing, 中国)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Information Retrieval (cs.IR)
备注: Accepted to WWW 2025

点击查看摘要

Abstract:In web data, advertising images are crucial for capturing user attention and improving advertising effectiveness. Most existing methods generate background for products primarily focus on the aesthetic quality, which may fail to achieve satisfactory online performance. To address this limitation, we explore the use of Multimodal Large Language Models (MLLMs) for generating advertising images by optimizing for Click-Through Rate (CTR) as the primary objective. Firstly, we build targeted pre-training tasks, and leverage a large-scale e-commerce multimodal dataset to equip MLLMs with initial capabilities for advertising image generation tasks. To further improve the CTR of generated images, we propose a novel reward model to fine-tune pre-trained MLLMs through Reinforcement Learning (RL), which can jointly utilize multimodal features and accurately reflect user click preferences. Meanwhile, a product-centric preference optimization strategy is developed to ensure that the generated background content aligns with the product characteristics after fine-tuning, enhancing the overall relevance and effectiveness of the advertising images. Extensive experiments have demonstrated that our method achieves state-of-the-art performance in both online and offline metrics. Our code and pre-trained models are publicly available at: this https URL.
zh

[CV-74] AdjointDEIS: Efficient Gradients for Diffusion Models NEURIPS2024

【速读】: This paper addresses the challenging problem of optimizing the latents and parameters of diffusion models. Sampling from a diffusion model means solving the probability flow ODE or the diffusion SDE, in which a neural network approximates the score function so that numerical ODE/SDE solvers can be used. Naive backpropagation, however, is memory-intensive, as it must store every intermediate state, and faces extra complexity in handling the noise injected by the diffusion term. The key contribution is AdjointDEIS, a novel family of bespoke solvers for the continuous adjoint equations that exploits the structure of diffusion SDEs and uses exponential integrators to simplify the adjoint formulation, with convergence-order guarantees. The paper further shows that the continuous adjoint equations for diffusion SDEs actually simplify to a simple ODE, and demonstrates AdjointDEIS for guided generation via an adversarial attack on the face morphing problem.

链接: https://arxiv.org/abs/2405.15020
作者: Zander W. Blasingame,Chen Liu
机构: Clarkson University (克拉克森大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)
备注: NeurIPS 2024 conference paper

点击查看摘要

Abstract:The optimization of the latents and parameters of diffusion models with respect to some differentiable metric defined on the output of the model is a challenging and complex problem. The sampling for diffusion models is done by solving either the probability flow ODE or diffusion SDE wherein a neural network approximates the score function allowing a numerical ODE/SDE solver to be used. However, naive backpropagation techniques are memory intensive, requiring the storage of all intermediate states, and face additional complexity in handling the injected noise from the diffusion term of the diffusion SDE. We propose a novel family of bespoke ODE solvers to the continuous adjoint equations for diffusion models, which we call AdjointDEIS. We exploit the unique construction of diffusion SDEs to further simplify the formulation of the continuous adjoint equations using exponential integrators. Moreover, we provide convergence order guarantees for our bespoke solvers. Significantly, we show that continuous adjoint equations for diffusion SDEs actually simplify to a simple ODE. Lastly, we demonstrate the effectiveness of AdjointDEIS for guided generation with an adversarial attack in the form of the face morphing problem. Our code will be released at https://github.com/zblasingame/AdjointDEIS.
zh
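
The continuous-adjoint machinery can be illustrated on a toy problem. The sketch below is not the paper's solver: it uses plain Euler stepping on a linear ODE dx/dt = -a*x with made-up constants. It computes the gradient of a terminal loss with respect to the initial state by integrating the adjoint ODE dλ/dt = -λ ∂f/∂x backwards in time, then checks the result against finite differences:

```python
# Toy continuous-adjoint computation for dx/dt = f(x) = -a*x.
# Loss L = 0.5 * x(T)^2; the adjoint lam(t) = dL/dx(t) obeys
# dlam/dt = -lam * df/dx = a*lam, integrated backwards from lam(T) = x(T).

def solve_forward(x0, a, T, n):
    """Forward Euler for dx/dt = -a*x."""
    h = T / n
    x = x0
    for _ in range(n):
        x = x + h * (-a * x)
    return x

def adjoint_gradient(x0, a, T, n):
    """dL/dx0 obtained by stepping the adjoint ODE backwards in time."""
    h = T / n
    lam = solve_forward(x0, a, T, n)   # lam(T) = dL/dx(T) = x(T)
    for _ in range(n):
        lam = lam - h * (a * lam)      # backwards step: dt = -h
    return lam

x0, a, T, n = 2.0, 0.7, 1.0, 2000
g_adj = adjoint_gradient(x0, a, T, n)

# Finite-difference check on the same discretized loss:
eps = 1e-5
loss = lambda x: 0.5 * solve_forward(x, a, T, n) ** 2
g_fd = (loss(x0 + eps) - loss(x0 - eps)) / (2 * eps)
assert abs(g_adj - g_fd) < 1e-6   # continuous answer: x0 * exp(-2*a*T)
```

AdjointDEIS replaces the naive Euler steps here with bespoke exponential-integrator solvers tailored to diffusion models, but the backward adjoint pass is the same basic mechanism.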

[CV-75] Rapid Whole Brain Mesoscale In-vivo MR Imaging using Multi-scale Implicit Neural Representation

【速读】: This paper develops and validates a new image reconstruction technique based on implicit neural representations (INR) for multi-view thick-slice acquisitions, reducing scan time while maintaining a high signal-to-noise ratio (SNR). The key is ROVER-MRI (rotating-view super-resolution MRI), an unsupervised neural-network algorithm that reconstructs MRI data from multi-view thick slices, cutting scan time in half while preserving fine anatomical detail.

链接: https://arxiv.org/abs/2502.08634
作者: Jun Lyu,Lipeng Ning,William Consagra,Qiang Liu,Richard J. Rushmore,Berkin Bilgic,Yogesh Rathi
机构: Mass General Brigham (麻省总布里格姆), Harvard Medical School (哈佛医学院); University of South Carolina (南卡罗来纳大学); Boston University Chobanian & Avedisian School of Medicine (波士顿大学乔博尼亚诺和阿维迪桑学校医学院); Athinoula A. Martinos Center for Biomedical Imaging (阿索利亚·马丁诺斯生物医学成像中心), Massachusetts General Hospital (马萨诸塞州总医院); Harvard Medical School (哈佛医学院); Harvard/MIT Health Sciences and Technology (哈佛/麻省理工健康科学与技术)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Purpose: To develop and validate a novel image reconstruction technique using implicit neural representations (INR) for multi-view thick-slice acquisitions while reducing the scan time but maintaining high signal-to-noise ratio (SNR). Methods: We propose Rotating-view super-resolution (ROVER)-MRI, an unsupervised neural network-based algorithm designed to reconstruct MRI data from multi-view thick slices, effectively reducing scan time by 2-fold while maintaining fine anatomical details. We compare our method to both bicubic interpolation and the current state-of-the-art regularized least-squares super-resolution reconstruction (LS-SRR) technique. Validation is performed using ground-truth ex-vivo monkey brain data, and we demonstrate superior reconstruction quality across several in-vivo human datasets. Notably, we achieve the reconstruction of a whole human brain in-vivo T2-weighted image with an unprecedented 180 μm isotropic spatial resolution, accomplished in just 17 minutes of scan time on a 7T MRI scanner. Results: ROVER-MRI outperformed the LS-SRR method in terms of reconstruction quality with 22.4% lower relative error (RE) and 7.5% lower full-width half maximum (FWHM), indicating better preservation of fine structural details in nearly half the scan time. Conclusion: ROVER-MRI offers an efficient and robust approach for mesoscale MR imaging, enabling rapid, high-resolution whole-brain scans. Its versatility holds great promise for research applications requiring anatomical details and time-efficient imaging.
zh

[CV-76] BCDDM: Branch-Corrected Denoising Diffusion Model for Black Hole Image Generation

【速读】: This paper addresses the high computational cost of relating black hole images fitted to Event Horizon Telescope (EHT) data to the physical parameters of the radiatively inefficient accretion flow (RIAF) model. The key is the Branch-Corrected Denoising Diffusion Model (BCDDM), which uses a branch correction mechanism and a weighted mixed loss function to improve the accuracy of black hole images generated from the seven physical parameters of the RIAF model. By augmenting the general relativistic ray tracing (GRRT) dataset with BCDDM-generated images and performing parameter regression with ResNet50, the approach markedly improves parameter prediction, lowering computational cost and offering a faster, more efficient route to dataset expansion, parameter estimation, and model fitting.

链接: https://arxiv.org/abs/2502.08528
作者: Ao liu,Zelin Zhang,Songbai Chen,Cuihong Wen
机构: College of information science and engineering, Hunan Normal University, Changsha,410081, People’s Republic of China(Hunan Normal University,信息科学与工程学院); Department of Physics, Institute of Interdisciplinary Studies, Key Laboratory of Low Dimensional Quantum Structures and Quantum Control of Ministry of Education, Synergetic Innovation Center for Quantum Effects and Applications, Hunan Normal University, Changsha, Hunan 410081, People’s Republic of China(Hunan Normal University,物理系,交叉学科研究所,低维量子结构与量子控制教育部重点实验室,量子效应及其应用协同创新中心); Center for Gravitation and Cosmology, College of Physical Science and Technology, Yangzhou University, Yangzhou 225009, People’s Republic of China(Yangzhou University,引力和宇宙学中心,物理科学与技术学院)
类目: Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The properties of black holes and accretion flows can be inferred by fitting Event Horizon Telescope (EHT) data to simulated images generated through general relativistic ray tracing (GRRT). However, due to the computationally intensive nature of GRRT, the efficiency of generating specific radiation flux images needs to be improved. This paper introduces the Branch Correction Denoising Diffusion Model (BCDDM), which uses a branch correction mechanism and a weighted mixed loss function to improve the accuracy of generated black hole images based on seven physical parameters of the radiatively inefficient accretion flow (RIAF) model. Our experiments show a strong correlation between the generated images and their physical parameters. By enhancing the GRRT dataset with BCDDM-generated images and using ResNet50 for parameter regression, we achieve significant improvements in parameter prediction performance. This approach reduces computational costs and provides a faster, more efficient method for dataset expansion, parameter estimation, and model fitting.
zh

[CV-77] CRISP: A Framework for Cryo-EM Image Segmentation and Processing with Conditional Random Field

【速读】: This paper addresses the problem of separating signal from background in cryo-electron microscopy (cryo-EM) micrographs, which is hampered by low signal-to-noise ratio (SNR), contaminants, and densely packed particles of varying sizes. The key is a modular framework that automatically generates high-quality segmentation maps from cryo-EM data to serve as ground-truth labels and allows different segmentation models and loss functions to be selected. It further integrates Conditional Random Fields (CRFs) with various solvers and feature sets to refine coarse predictions into fine-grained segmentations, a flexibility that enables configurations tailored to cryo-EM datasets.

链接: https://arxiv.org/abs/2502.08287
作者: Szu-Chi Chung,Po-Cheng Chou
机构: National Sun Yat-sen University (中山大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 28 Figures

点击查看摘要

Abstract:Differentiating signals from the background in micrographs is a critical initial step for cryogenic electron microscopy (cryo-EM), yet it remains laborious due to low signal-to-noise ratio (SNR), the presence of contaminants and densely packed particles of varying sizes. Although image segmentation has recently been introduced to distinguish particles at the pixel level, the low SNR complicates the automated generation of accurate annotations for training supervised models. Moreover, platforms for systematically comparing different design choices in pipeline construction are lacking. Thus, a modular framework is essential to understand the advantages and limitations of this approach and drive further development. To address these challenges, we present a pipeline that automatically generates high-quality segmentation maps from cryo-EM data to serve as ground truth labels. Our modular framework enables the selection of various segmentation models and loss functions. We also integrate Conditional Random Fields (CRFs) with different solvers and feature sets to refine coarse predictions, thereby producing fine-grained segmentation. This flexibility facilitates optimal configurations tailored to cryo-EM datasets. When trained on a limited set of micrographs, our approach achieves over 90% accuracy, recall, precision, Intersection over Union (IoU), and F1-score on synthetic data. Furthermore, to demonstrate our framework’s efficacy in downstream analyses, we show that the particles extracted by our pipeline produce 3D density maps with higher resolution than those generated by existing particle pickers on real experimental datasets, while achieving performance comparable to that of manually curated datasets from experts.
zh
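
The evaluation metrics quoted above (accuracy, recall, precision, IoU, F1-score) are standard and easy to state precisely. A minimal pure-Python sketch on flat binary masks (1 = particle, 0 = background); the example masks are made up:

```python
# Standard segmentation metrics computed from pixelwise true/false
# positives and negatives on flattened binary masks.

def segmentation_metrics(pred, truth):
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, truth))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, truth))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(pred),
        "precision": precision,
        "recall": recall,
        "iou": tp / (tp + fp + fn),
        "f1": 2 * precision * recall / (precision + recall),
    }

pred  = [1, 1, 0, 0, 1, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1, 1, 0]
m = segmentation_metrics(pred, truth)
print(m)
```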

[CV-78] Automatic Prostate Volume Estimation in Transabdominal Ultrasound Images

【速读】: This paper addresses the need for accurate, non-invasive estimation of prostate volume (PV) for early prostate cancer detection. The key is a deep-learning-based framework that automatically estimates PV from transabdominal ultrasound (TAUS) videos by segmenting the prostate in axial and sagittal views, automatically estimating the prostate diameters, and computing the volume, achieving a mean volumetric error of -5.5 mL, corresponding to a relative error between 5% and 15%.

链接: https://arxiv.org/abs/2502.07859
作者: Tiziano Natali,Liza M. Kurucz,Matteo Fusaglia,Laura S. Mertens,Theo J.M. Ruers,Pim J. van Leeuwen,Behdad Dashtbozorg
机构: Netherlands Cancer Institute (荷兰癌症研究所); University of Twente (特温特大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Prostate cancer is a leading health concern among men, requiring accurate and accessible methods for early detection and risk stratification. Prostate volume (PV) is a key parameter in multivariate risk stratification for early prostate cancer detection, commonly estimated using transrectal ultrasound (TRUS). While TRUS provides precise prostate volume measurements, its invasive nature often compromises patient comfort. Transabdominal ultrasound (TAUS) provides a non-invasive alternative but faces challenges such as lower image quality, complex interpretation, and reliance on operator expertise. This study introduces a new deep-learning-based framework for automatic PV estimation using TAUS, emphasizing its potential to enable accurate and non-invasive prostate cancer risk stratification. A dataset of TAUS videos from 100 individual patients was curated, with manually delineated prostate boundaries and calculated diameters by an expert clinician as ground truth. The introduced framework integrates deep-learning models for prostate segmentation in both axial and sagittal planes, automatic prostate diameter estimation, and PV calculation. Segmentation performance was evaluated using Dice correlation coefficient (%) and Hausdorff distance (mm). Framework’s volume estimation capabilities were evaluated on volumetric error (mL). The framework demonstrates that it can estimate PV from TAUS videos with a mean volumetric error of -5.5 mL, which results in an average relative error between 5 and 15%. The introduced framework for automatic PV estimation from TAUS images, utilizing deep learning models for prostate segmentation, shows promising results. It effectively segments the prostate and estimates its volume, offering potential for reliable, non-invasive risk stratification for early prostate detection.
zh
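
The abstract does not state how PV is computed from the estimated diameters; clinically, prostate volume is commonly approximated with the prolate-ellipsoid formula PV = (π/6)·L·W·H. A hedged sketch with invented diameters (in cm; 1 cm³ = 1 mL):

```python
import math

def ellipsoid_volume_ml(length_cm, width_cm, height_cm):
    """Prolate-ellipsoid approximation of prostate volume (mL)."""
    return math.pi / 6.0 * length_cm * width_cm * height_cm

def relative_error_pct(estimate, truth):
    """Relative error in percent, as reported in the abstract (5-15%)."""
    return abs(estimate - truth) / truth * 100.0

vol = ellipsoid_volume_ml(4.0, 4.5, 3.5)  # invented diameters, ~33 mL
print(round(vol, 2))                      # → 32.99
rel = relative_error_pct(vol, 30.0)       # vs. an invented ground truth
```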

[CV-79] The establishment of static digital humans and the integration with spinal models

【速读】: This paper aims to overcome the limitations of conventional imaging techniques (X-ray, CT, and MRI) in capturing dynamic changes of the spine and its interaction with whole-body motion. The key is to construct an accurate static digital human model that integrates the spine, enabling high-precision simulation and laying the groundwork for subsequent dynamic digital human research on adolescent idiopathic scoliosis (AIS). Concretely, human point-cloud data are generated from the patient's multi-view images by combining a 3D Gaussian method with the Skinned Multi-Person Linear (SMPL) model; a standard skeletal model is then fitted, and a real spine model reconstructed from CT images is aligned to it. Validation shows that Cobb angles measured on the resulting personalized spine model are within 1 degree of actual measurements.

链接: https://arxiv.org/abs/2502.07844
作者: Fujiao Ju,Yuxuan Wang,Shuo Wang,Chengyin Wang,Yinbo Chen,Jianfeng Li,Mingjie Dong,Bin Fang,Qianyu Zhuang
机构: College of Computer Science, Beijing University of Technology (北京工业大学计算机科学学院); Department of Engineering Physics, Tsinghua University (清华大学工程物理系); Beijing Key Laboratory of Advanced Manufacturing Technology, College of Mechanical & Energy Engineering, Beijing University of Technology (北京工业大学机械与能源工程学院先进制造技术北京市重点实验室); Department of Artificial Intelligence, Beijing University of Posts and Telecommunications (北京邮电大学人工智能学院); Department of Orthopedics, Peking Union Medical College Hospital (北京协和医院骨科)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adolescent idiopathic scoliosis (AIS), a prevalent spinal deformity, significantly affects individuals' health and quality of life. Conventional imaging techniques, such as X-rays, computed tomography (CT), and magnetic resonance imaging (MRI), offer static views of the spine. However, they are restricted in capturing the dynamic changes of the spine and its interactions with overall body motion. Therefore, developing new techniques to address these limitations has become extremely important. Dynamic digital human modeling represents a major breakthrough in digital medicine. It enables a three-dimensional (3D) view of the spine as it changes during daily activities, assisting clinicians in detecting deformities that might be missed in static imaging. Although dynamic modeling holds great potential, constructing an accurate static digital human model is a crucial initial step for high-precision simulations. In this study, our focus is on constructing an accurate static digital human model integrating the spine, which is vital for subsequent dynamic digital human research on AIS. First, we generate human point-cloud data by combining the 3D Gaussian method with the Skinned Multi-Person Linear (SMPL) model from the patient's multi-view images. Then, we fit a standard skeletal model to the generated human model. Next, we align the real spine model reconstructed from CT images with the standard skeletal model. We validated the resulting personalized spine model using X-ray data from six AIS patients, with Cobb angles (used to measure the severity of scoliosis) as evaluation metrics. The results indicate that the model's error was within 1 degree of the actual measurements. This study presents an important method for constructing digital humans.
zh
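
Since validation uses Cobb angles, it may help to recall how one is computed: the Cobb angle is the angle between the most-tilted superior and inferior vertebral endplates. A geometric sketch with invented 2D endplate direction vectors:

```python
import math

def cobb_angle_deg(upper_dir, lower_dir):
    """Angle in degrees between two endplate direction vectors (as lines)."""
    dot = upper_dir[0] * lower_dir[0] + upper_dir[1] * lower_dir[1]
    norm = math.hypot(*upper_dir) * math.hypot(*lower_dir)
    ang = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
    return min(ang, 180.0 - ang)   # endplates are lines, not rays

# Invented endplate slopes: upper tilted +0.2, lower tilted -0.35.
angle = cobb_angle_deg((1.0, 0.2), (1.0, -0.35))
print(round(angle, 1))  # → 30.6
```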

人工智能

[AI-0] Rhythmic sharing: A bio-inspired paradigm for zero-shot adaptation and learning in neural networks

链接: https://arxiv.org/abs/2502.08644
作者: Hoony Kang,Wolfgang Losert
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Adaptation and Self-Organizing Systems (nlin.AO); Biological Physics (physics.bio-ph)
*备注: 13 pages, 3 figures

点击查看摘要

Abstract:The brain can rapidly adapt to new contexts and learn from limited data, a coveted characteristic that artificial intelligence algorithms have struggled to mimic. Inspired by oscillatory rhythms of the mechanical structures of neural cells, we developed a learning paradigm that is based on oscillations in link strengths and associates learning with the coordination of these oscillations. We find that this paradigm yields rapid adaptation and learning in artificial neural networks. Link oscillations can rapidly change coordination, endowing the network with the ability to sense subtle context changes in an unsupervised manner. In other words, the network generates the missing contextual tokens required to perform as a generalist AI architecture capable of predicting dynamics in multiple contexts. Oscillations also allow the network to extrapolate dynamics to never-seen-before contexts. These capabilities make our learning paradigm a powerful starting point for novel models of learning and cognition. Furthermore, learning through link coordination is agnostic to the specifics of the neural network architecture, hence our study opens the door for introducing rapid adaptation and learning capabilities into leading AI models.

[AI-1] Ensemble based approach to quantifying uncertainty of LLM based classifications

链接: https://arxiv.org/abs/2502.08631
作者: Srijith Rajamohan,Ahmed Salhin,Josh Frazier,Rohit Kumar,Yu-Cheng Tsai,Todd Cook
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The output of Large Language Models (LLMs) are a function of the internal model’s parameters and the input provided into the context window. The hypothesis presented here is that under a greedy sampling strategy the variance in the LLM’s output is a function of the conceptual certainty embedded in the model’s parametric knowledge, as well as the lexical variance in the input. Finetuning the model results in reducing the sensitivity of the model output to the lexical input variations. This is then applied to a classification problem and a probabilistic method is proposed for estimating the certainties of the predicted classes.
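
The ensemble idea in the title can be sketched simply: treat repeated classifications of the same input (e.g. under lexical paraphrases of the prompt) as votes, and derive class probabilities plus an entropy-based uncertainty score from the vote distribution. The labels below are invented; this illustrates the general recipe, not the paper's exact estimator.

```python
import math
from collections import Counter

def ensemble_certainty(labels):
    """Vote frequencies, top class, and entropy over repeated classifications."""
    counts = Counter(labels)
    n = len(labels)
    probs = {c: k / n for c, k in counts.items()}
    entropy = -sum(p * math.log2(p) for p in probs.values())
    top = max(probs, key=probs.get)
    return top, probs[top], entropy

# Ten classifications of the same input under paraphrased prompts (made up):
votes = ["spam", "spam", "spam", "ham", "spam",
         "spam", "spam", "spam", "ham", "spam"]
label, p, h = ensemble_certainty(votes)
print(label, p, round(h, 3))
```

Low entropy signals high conceptual certainty; after fine-tuning, the paper's hypothesis predicts the vote distribution sharpens and this entropy drops.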

[AI-2] Quantifying Security Vulnerabilities: A Metric-Driven Security Analysis of Gaps in Current AI Standards

链接: https://arxiv.org/abs/2502.08610
作者: Keerthana Madhavan,Abbas Yazdinejad,Fattane Zarrinkalam,Ali Dehghantanha
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As AI systems integrate into critical infrastructure, security gaps in AI compliance frameworks demand urgent attention. This paper audits and quantifies security risks in three major AI governance standards: NIST AI RMF 1.0, UK's AI and Data Protection Risk Toolkit, and the EU's ALTAI. Using a novel risk assessment methodology, we develop four key metrics: Risk Severity Index (RSI), Attack Potential Index (AVPI), Compliance-Security Gap Percentage (CSGP), and Root Cause Vulnerability Score (RCVS). Our analysis identifies 136 concerns across the frameworks, exposing significant gaps. NIST fails to address 69.23 percent of identified risks, ALTAI has the highest attack vector vulnerability (AVPI = 0.51), and the ICO Toolkit has the largest compliance-security gap, with 80.00 percent of high-risk concerns remaining unresolved. Root cause analysis highlights under-defined processes (ALTAI RCVS = 0.33) and weak implementation guidance (NIST and ICO RCVS = 0.25) as critical weaknesses. These findings emphasize the need for stronger, enforceable security controls in AI compliance. We offer targeted recommendations to enhance security posture and bridge the gap between compliance and real-world AI risks.
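
The abstract names four metrics (RSI, AVPI, CSGP, RCVS) but gives no closed forms. As a hedged illustration only, CSGP is sketched below as the percentage of high-risk concerns a framework leaves unresolved; the formula is a plausible reading, not the paper's definition, and the counts are invented so the result matches the 80.00 percent figure quoted for the ICO Toolkit.

```python
def csgp(unresolved_high_risk, total_high_risk):
    """Compliance-Security Gap Percentage (assumed definition)."""
    return 100.0 * unresolved_high_risk / total_high_risk

print(csgp(8, 10))  # → 80.0
```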

[AI-3] CurvGAD: Leveraging Curvature for Enhanced Graph Anomaly Detection

链接: https://arxiv.org/abs/2502.08605
作者: Karish Grover,Geoffrey J. Gordon,Christos Faloutsos
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Does the intrinsic curvature of complex networks hold the key to unveiling graph anomalies that conventional approaches overlook? Reconstruction-based graph anomaly detection (GAD) methods overlook such geometric outliers, focusing only on structural and attribute-level anomalies. To this end, we propose CurvGAD - a mixed-curvature graph autoencoder that introduces the notion of curvature-based geometric anomalies. CurvGAD introduces two parallel pipelines for enhanced anomaly interpretability: (1) Curvature-equivariant geometry reconstruction, which focuses exclusively on reconstructing the edge curvatures using a mixed-curvature, Riemannian encoder and Gaussian kernel-based decoder; and (2) Curvature-invariant structure and attribute reconstruction, which decouples structural and attribute anomalies from geometric irregularities by regularizing graph curvature under discrete Ollivier-Ricci flow, thereby isolating the non-geometric anomalies. By leveraging curvature, CurvGAD refines the existing anomaly classifications and identifies new curvature-driven anomalies. Extensive experimentation over 10 real-world datasets (both homophilic and heterophilic) demonstrates an improvement of up to 6.5% over state-of-the-art GAD methods.
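
CurvGAD works with Ollivier-Ricci curvature, which requires solving optimal-transport problems. As a lightweight stand-in that conveys what "edge curvature" measures, the sketch below computes the augmented Forman curvature F(u,v) = 4 - deg(u) - deg(v) + 3*#triangles(u,v) on a toy unweighted graph; bridge-like edges come out most negative, the kind of geometric outlier a curvature-aware detector can flag. This is an illustrative substitute, not the paper's method.

```python
from collections import defaultdict

def augmented_forman(edges):
    """Augmented Forman curvature for each edge of an unweighted graph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    curv = {}
    for u, v in edges:
        triangles = len(adj[u] & adj[v])   # common neighbours of u and v
        curv[(u, v)] = 4 - len(adj[u]) - len(adj[v]) + 3 * triangles
    return curv

# A 4-clique (nodes 0-3) with a bridge to a short tail (3-4-5):
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (3, 4), (4, 5)]
curv = augmented_forman(edges)
print(min(curv, key=curv.get))  # → (3, 4): the bridge is most negatively curved
```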

[AI-4] Learning in Markets with Heterogeneous Agents: Dynamics and Survival of Bayesian vs. No-Regret Learners

链接: https://arxiv.org/abs/2502.08597
作者: David Easley,Yoav Kolumbus,Eva Tardos
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH)
*备注: Learning in Markets, Heterogeneous Agents, Regret and Survival

点击查看摘要

Abstract:We analyze the performance of heterogeneous learning agents in asset markets with stochastic payoffs. Our agents aim to maximize the expected growth rate of their wealth but have different theories on how to learn this best. We focus on comparing Bayesian and no-regret learners in market dynamics. Bayesian learners with a prior over a finite set of models that assign positive prior probability to the correct model have posterior probabilities that converge exponentially to the correct model. Consequently, they survive even in the presence of agents who invest according to the correct model of the stochastic process. Bayesians with a continuum prior converge to the correct model at a rate of O((\log T)/T) . Online learning theory provides no-regret algorithms for maximizing the log of wealth in this setting, achieving a worst-case regret bound of O(\log T) without assuming a steady underlying stochastic process but comparing to the best fixed investment rule. This regret, as we observe, is of the same order of magnitude as that of a Bayesian learner with a continuum prior. However, we show that even such low regret may not be sufficient for survival in asset markets: an agent can have regret as low as O(\log T) , but still vanish in market dynamics when competing against agents who invest according to the correct model or even against a perfect Bayesian with a finite prior. On the other hand, we show that Bayesian learning is fragile, while no-regret learning requires less knowledge of the environment and is therefore more robust. Any no-regret learner will drive out of the market an imperfect Bayesian whose finite prior or update rule has even small errors. We formally establish the relationship between notions of survival, vanishing, and market domination studied in economics and the framework of regret minimization, thus bridging these theories.
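
To make the no-regret side concrete, here is a sketch of the classic Hedge (multiplicative-weights) algorithm on synthetic losses in [0, 1], measuring regret against the best fixed expert in hindsight. This is a generic illustration with made-up data, not the paper's log-wealth market setting, and the O(√T) Hedge guarantee checked below is weaker than the O(log T) bound the abstract cites for log-wealth maximization.

```python
import math
import random

def hedge_regret(loss_rows, eta):
    """Run Hedge on a loss matrix; return regret vs. the best fixed expert."""
    k = len(loss_rows[0])
    w = [1.0] * k
    total = 0.0
    for losses in loss_rows:
        s = sum(w)
        probs = [wi / s for wi in w]                 # current mixture
        total += sum(p * l for p, l in zip(probs, losses))
        w = [wi * math.exp(-eta * l) for wi, l in zip(w, losses)]
    best = min(sum(row[i] for row in loss_rows) for i in range(k))
    return total - best

random.seed(0)
T = 2000
# Two "investment rules"; expert 0 is better on average.
loss_rows = [[0.8 * random.random(), random.random()] for _ in range(T)]
regret = hedge_regret(loss_rows, eta=math.sqrt(8 * math.log(2) / T))
assert regret <= math.sqrt(T * math.log(2) / 2)  # standard Hedge guarantee
print(round(regret, 2))
```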

[AI-5] Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks

链接: https://arxiv.org/abs/2502.08586
作者: Ang Li,Yin Zhou,Vethavikashini Chithrra Raghuram,Tom Goldstein,Micah Goldblum
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs). These attacks may extract private information or coerce the model into producing harmful outputs. In real-world deployments, LLMs are often part of a larger agentic pipeline including memory systems, retrieval, web access, and API calling. Such additional components introduce vulnerabilities that make these LLM-powered agents much easier to attack than isolated LLMs, yet relatively little work focuses on the security of LLM agents. In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents. We first provide a taxonomy of attacks categorized by threat actors, objectives, entry points, attacker observability, attack strategies, and inherent vulnerabilities of agent pipelines. We then conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities. Notably, our attacks are trivial to implement and require no understanding of machine learning.

[AI-6] FBFL: A Field-Based Coordination Approach for Data Heterogeneity in Federated Learning

链接: https://arxiv.org/abs/2502.08577
作者: Davide Domini,Gianluca Aguzzi,Lukas Esterle,Mirko Viroli
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, Federated learning (FL) has become a popular solution to train machine learning models in domains with high privacy concerns. However, FL scalability and performance face significant challenges in real-world deployments where data across devices are non-independently and identically distributed (non-IID). The heterogeneity in data distribution frequently arises from spatial distribution of devices, leading to degraded model performance in the absence of proper handling. Additionally, FL's typical reliance on centralized architectures introduces bottlenecks and single-point-of-failure risks, particularly problematic at scale or in dynamic environments. To close this gap, we propose Field-Based Federated Learning (FBFL), a novel approach leveraging macroprogramming and field coordination to address these limitations through: (i) distributed spatial-based leader election for personalization to mitigate non-IID data challenges; and (ii) construction of a self-organizing, hierarchical architecture using advanced macroprogramming patterns. Moreover, FBFL not only overcomes the aforementioned limitations, but also enables the development of more specialized models tailored to the specific data distribution in each subregion. This paper formalizes FBFL and evaluates it extensively using MNIST, FashionMNIST, and Extended MNIST datasets. We demonstrate that, when operating under IID data conditions, FBFL performs comparably to the widely-used FedAvg algorithm. Furthermore, in challenging non-IID scenarios, FBFL not only outperforms FedAvg but also surpasses other state-of-the-art methods, namely FedProx and Scaffold, which have been specifically designed to address non-IID data distributions. Additionally, we showcase the resilience of FBFL's self-organizing hierarchical architecture against server failures.
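
For context on the baseline FBFL is compared against, here is a minimal sketch of the FedAvg aggregation step: the server averages client parameter vectors weighted by local dataset size. All numbers are toy values.

```python
def fedavg(client_weights, client_sizes):
    """Weighted average of client parameter vectors (lists of floats)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    agg = [0.0] * dim
    for w, n in zip(client_weights, client_sizes):
        for i in range(dim):
            agg[i] += (n / total) * w[i]
    return agg

clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [10, 30, 60]
result = fedavg(clients, sizes)
print(result)  # ≈ [4.0, 5.0]
```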

[AI-7] Mapping the Landscape of Generative AI in Network Monitoring and Management

链接: https://arxiv.org/abs/2502.08576
作者: Giampaolo Bovenzi,Francesco Cerasuolo,Domenico Ciuonzo,Davide Di Monda,Idio Guarino,Antonio Montieri,Valerio Persico,Antonio Pescapè
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 32 pages, 9 figures, 10 tables

点击查看摘要

Abstract:Generative Artificial Intelligence (GenAI) models such as LLMs, GPTs, and Diffusion Models have recently gained widespread attention from both the research and the industrial communities. This survey explores their application in network monitoring and management, focusing on prominent use cases, as well as challenges and opportunities. We discuss how network traffic generation and classification, network intrusion detection, networked system log analysis, and network digital assistance can benefit from the use of GenAI models. Additionally, we provide an overview of the available GenAI models, datasets for large-scale training phases, and platforms for the development of such models. Finally, we discuss research directions that potentially mitigate the roadblocks to the adoption of GenAI for network monitoring and management. Our investigation aims to map the current landscape and pave the way for future research in leveraging GenAI for network monitoring and management.

[AI-8] COAST: Intelligent Time-Adaptive Neural Operators

链接: https://arxiv.org/abs/2502.08574
作者: Zhikai Wu,Shiyang Zhang,Sizhuang He,Sifan Wang,Min Zhu,Anran Jiao,Lu Lu,David van Dijk
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce Causal Operator with Adaptive Solver Transformer (COAST), a novel neural operator learning method that leverages a causal language model (CLM) framework to dynamically adapt time steps. Our method predicts both the evolution of a system and its optimal time step, intelligently balancing computational efficiency and accuracy. We find that COAST generates variable step sizes that correlate with the underlying system intrinsicities, both within and across dynamical systems. Within a single trajectory, smaller steps are taken in regions of high complexity, while larger steps are employed in simpler regions. Across different systems, more complex dynamics receive more granular time steps. Benchmarked on diverse systems with varied dynamics, COAST consistently outperforms state-of-the-art methods, achieving superior performance in both efficiency and accuracy. This work underscores the potential of CLM-based intelligent adaptive solvers for scalable operator learning of dynamical systems.
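
A classical numerical analogue of COAST's learned time-adaptivity is error-controlled step selection: take larger steps where the dynamics are simple and smaller ones where they are complex. A sketch using Euler step-doubling on dx/dt = -x (the tolerance, growth factor, and test problem are all illustrative, not from the paper):

```python
def adaptive_euler(f, x0, t0, t1, h0=0.1, tol=1e-3):
    """Euler integration with step-doubling error control."""
    t, x, h, steps = t0, x0, h0, []
    while t < t1:
        h = min(h, t1 - t)
        full = x + h * f(x)                    # one step of size h
        half = x + (h / 2) * f(x)
        two_half = half + (h / 2) * f(half)    # two steps of size h/2
        err = abs(full - two_half)             # local error estimate
        if err < tol:                          # accept; try a larger step next
            t, x = t + h, two_half
            steps.append(h)
            h *= 1.5
        else:                                  # reject; retry with smaller step
            h /= 2
    return x, steps

x_end, steps = adaptive_euler(lambda x: -x, 1.0, 0.0, 5.0)
print(round(x_end, 4), len(steps))
# Steps grow as the solution decays: effort concentrates where dynamics are fast.
```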

[AI-9] Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies

链接: https://arxiv.org/abs/2502.08554
作者: Sunnie S. Y. Kim,Jennifer Wortman Vaughan,Q. Vera Liao,Tania Lombrozo,Olga Russakovsky
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: CHI 2025. This version includes the appendix

点击查看摘要

Abstract:Large language models (LLMs) can produce erroneous responses that sound fluent and convincing, raising the risk that users will rely on these responses as if they were correct. Mitigating such overreliance is a key challenge. Through a think-aloud study in which participants use an LLM-infused application to answer objective questions, we identify several features of LLM responses that shape users’ reliance: explanations (supporting details for answers), inconsistencies in explanations, and sources. Through a large-scale, pre-registered, controlled experiment (N=308), we isolate and study the effects of these features on users’ reliance, accuracy, and other measures. We find that the presence of explanations increases reliance on both correct and incorrect responses. However, we observe less reliance on incorrect responses when sources are provided or when explanations exhibit inconsistencies. We discuss the implications of these findings for fostering appropriate reliance on LLMs.

[AI-10] Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data

链接: https://arxiv.org/abs/2502.08547
作者: Doudou Zhou,Han Tong,Linshanshan Wang,Suqi Liu,Xin Xiong,Ziming Gan,Romain Griffier,Boris Hejblum,Yun-Chung Liu,Chuan Hong,Clara-Lea Bonzel,Tianrun Cai,Kevin Pan,Yuk-Lam Ho,Lauren Costa,Vidul A. Panickan,J. Michael Gaziano,Kenneth Mandl,Vianney Jouhet,Rodolphe Thiebaut,Zongqi Xia,Kelly Cho,Katherine Liao,Tianxi Cai
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The adoption of EHRs has expanded opportunities to leverage data-driven algorithms in clinical care and research. A major bottleneck in effectively conducting multi-institutional EHR studies is the data heterogeneity across systems with numerous codes that either do not exist or represent different clinical concepts across institutions. The need for data privacy further limits the feasibility of including multi-institutional patient-level data required to study similarities and differences across patient subgroups. To address these challenges, we developed the GAME algorithm. Tested and validated across 7 institutions and 2 languages, GAME integrates data in several levels: (1) at the institutional level with knowledge graphs to establish relationships between codes and existing knowledge sources, providing the medical context for standard codes and their relationship to each other; (2) between institutions, leveraging language models to determine the relationships between institution-specific codes with established standard codes; and (3) quantifying the strength of the relationships between codes using a graph attention network. Jointly trained embeddings are created using transfer and federated learning to preserve data privacy. In this study, we demonstrate the applicability of GAME in selecting relevant features as inputs for AI-driven algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis. We then highlight the application of GAME harmonized multi-institutional EHR data in a study of Alzheimer’s disease outcomes and suicide risk among patients with mental health disorders, without sharing patient-level data outside individual institutions.

[AI-11] Input convex neural networks: universal approximation theorem and implementation for isotropic polyconvex hyperelastic energies

链接: https://arxiv.org/abs/2502.08534
作者: Gian-Luca Geuken,Patrick Kurzeja,David Wiedemann,Jörn Mosler
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents a novel framework of neural networks for isotropic hyperelasticity that enforces necessary physical and mathematical constraints while simultaneously satisfying the universal approximation theorem. The two key ingredients are an input convex network architecture and a formulation in the elementary polynomials of the signed singular values of the deformation gradient. In line with previously published networks, it can rigorously capture frame-indifference and polyconvexity - as well as further constraints like balance of angular momentum and growth conditions. However and in contrast to previous networks, a universal approximation theorem for the proposed approach is proven. To be more explicit, the proposed network can approximate any frame-indifferent, isotropic polyconvex energy (provided the network is large enough). This is possible by working with a sufficient and necessary criterion for frame-indifferent, isotropic polyconvex functions. Comparative studies with existing approaches identify the advantages of the proposed method, particularly in approximating non-polyconvex energies as well as computing polyconvex hulls.

[AI-12] FedMHO: Heterogeneous One-Shot Federated Learning Towards Resource-Constrained Edge Devices

链接: https://arxiv.org/abs/2502.08518
作者: Dezhong Yao,Yuexin Shi,Tongtong Liu,Zhiqiang Xu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is increasingly adopted in edge computing scenarios, where a large number of heterogeneous clients operate under constrained or sufficient resources. The iterative training process in conventional FL introduces significant computation and communication overhead, which is unfriendly for resource-constrained edge devices. One-shot FL has emerged as a promising approach to mitigate communication overhead, and model-heterogeneous FL solves the problem of diverse computing resources across clients. However, existing methods face challenges in effectively managing model-heterogeneous one-shot FL, often leading to unsatisfactory global model performance or reliance on auxiliary datasets. To address these challenges, we propose a novel FL framework named FedMHO, which leverages deep classification models on resource-sufficient clients and lightweight generative models on resource-constrained devices. On the server side, FedMHO involves a two-stage process that includes data generation and knowledge fusion. Furthermore, we introduce FedMHO-MD and FedMHO-SD to mitigate the knowledge-forgetting problem during the knowledge fusion stage, and an unsupervised data optimization solution to improve the quality of synthetic samples. Comprehensive experiments demonstrate the effectiveness of our methods, as they outperform state-of-the-art baselines in various experimental setups.

[AI-13] Revisiting 3D LLM Benchmarks: Are We Really Testing 3D Capabilities?

链接: https://arxiv.org/abs/2502.08503
作者: Jiahe Jin,Yanheng He,Mingyan Yang
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we identify the “2D-Cheating” problem in 3D LLM evaluation, where these tasks might be easily solved by VLMs with rendered images of point clouds, exposing ineffective evaluation of 3D LLMs’ unique 3D capabilities. We test VLM performance across multiple 3D LLM benchmarks and, using this as a reference, propose principles for better assessing genuine 3D understanding. We also advocate explicitly separating 3D abilities from 1D or 2D aspects when evaluating 3D LLMs.

[AI-14] Proceedings 40th International Conference on Logic Programming

链接: https://arxiv.org/abs/2502.08453
作者: Pedro Cabalar(University of Coruña),Francesco Fabiano(New Mexico State University),Martin Gebser(University of Klagenfurt),Gopal Gupta(University of Texas at Dallas),Theresa Swift(Johns Hopkins Applied Physics Laboratory)
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Since the first conference in Marseille in 1982, the International Conference on Logic Programming (ICLP) has been the premier international event for presenting research in logic programming. These proceedings include technical communications about, and abstracts for presentations given at the 40th ICLP held October 14-17 in Dallas, Texas, USA. The papers and abstracts in this volume include the following areas and topics. Formal and operational semantics: including non-monotonic reasoning, probabilistic reasoning, argumentation, and semantic issues of combining logic with neural models. Language design and programming methodologies such as answer set programming, inductive logic programming, and probabilistic programming. Program analysis and logic-based validation of generated programs. Implementation methodologies including constraint implementation, tabling, logic-based prompt engineering, and the interaction of logic programming with LLMs.

[AI-15] CordViP: Correspondence-based Visuomotor Policy for Dexterous Manipulation in Real-World

链接: https://arxiv.org/abs/2502.08449
作者: Yankai Fu,Qiuxuan Feng,Ning Chen,Zichen Zhou,Mengzhen Liu,Mingdong Wu,Tianxing Chen,Shanyu Rong,Jiaming Liu,Hao Dong,Shanghang Zhang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Achieving human-level dexterity in robots is a key objective in the field of robotic manipulation. Recent advancements in 3D-based imitation learning have shown promising results, providing an effective pathway to achieve this goal. However, obtaining high-quality 3D representations presents two key problems: (1) the quality of point clouds captured by a single-view camera is significantly affected by factors such as camera resolution, positioning, and occlusions caused by the dexterous hand; (2) the global point clouds lack crucial contact information and spatial correspondences, which are necessary for fine-grained dexterous manipulation tasks. To eliminate these limitations, we propose CordViP, a novel framework that constructs and learns correspondences by leveraging the robust 6D pose estimation of objects and robot proprioception. Specifically, we first introduce the interaction-aware point clouds, which establish correspondences between the object and the hand. These point clouds are then used for our pre-training policy, where we also incorporate object-centric contact maps and hand-arm coordination information, effectively capturing both spatial and temporal dynamics. Our method demonstrates exceptional dexterous manipulation capabilities with an average success rate of 90% in four real-world tasks, surpassing other baselines by a large margin. Experimental results also highlight the superior generalization and robustness of CordViP to different objects, viewpoints, and scenarios. Code and videos are available on this https URL.

[AI-16] Learning Humanoid Standing-up Control across Diverse Postures

链接: https://arxiv.org/abs/2502.08378
作者: Tao Huang,Junli Ren,Huayi Wang,Zirui Wang,Qingwei Ben,Muning Wen,Xiao Chen,Jianan Li,Jiangmiao Pang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Humanoid Standing-up Control, 12 pages

点击查看摘要

Abstract:Standing-up control is crucial for humanoid robots, with the potential for integration into current locomotion and loco-manipulation systems, such as fall recovery. Existing approaches are either limited to simulations that overlook hardware constraints or rely on predefined ground-specific motion trajectories, failing to enable standing up across postures in real-world scenes. To bridge this gap, we present HoST (Humanoid Standing-up Control), a reinforcement learning framework that learns standing-up control from scratch, enabling robust sim-to-real transfer across diverse postures. HoST effectively learns posture-adaptive motions by leveraging a multi-critic architecture and curriculum-based training on diverse simulated terrains. To ensure successful real-world deployment, we constrain the motion with smoothness regularization and implicit motion speed bound to alleviate oscillatory and violent motions on physical hardware, respectively. After simulation-based training, the learned control policies are directly deployed on the Unitree G1 humanoid robot. Our experimental results demonstrate that the controllers achieve smooth, stable, and robust standing-up motions across a wide range of laboratory and outdoor environments. Videos are available at this https URL.

[AI-17] Towards Principled Multi-Agent Task-Agnostic Exploration

链接: https://arxiv.org/abs/2502.08365
作者: Riccardo Zamboni,Mirco Mutti,Marcello Restelli
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In reinforcement learning, we typically refer to task-agnostic exploration when we aim to explore the environment without access to the task specification a priori. In a single-agent setting the problem has been extensively studied and mostly understood. A popular approach casts the task-agnostic objective as maximizing the entropy of the state distribution induced by the agent’s policy, from which principles and methods follow. In contrast, little is known about task-agnostic exploration in multi-agent settings, which are ubiquitous in the real world. How should different agents explore in the presence of others? In this paper, we address this question through a generalization to multiple agents of the problem of maximizing the state distribution entropy. First, we investigate alternative formulations, highlighting respective positives and negatives. Then, we present a scalable, decentralized, trust-region policy search algorithm to address the problem in practical settings. Finally, we provide proof of concept experiments to both corroborate the theoretical findings and pave the way for task-agnostic exploration in challenging multi-agent settings.

[AI-18] Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy IJCAI2025

链接: https://arxiv.org/abs/2502.08353
作者: Ruizhan Xue,Huimin Deng,Fang He,Maojun Wang,Zeyu Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted to IJCAI 2025

点击查看摘要

Abstract:With the extensive application of Graph Neural Networks (GNNs) across various domains, their trustworthiness has emerged as a focal point of research. Some existing studies have shown that the integration of large language models (LLMs) can improve the semantic understanding and generation capabilities of GNNs, which in turn improves the trustworthiness of GNNs from various aspects. Our review introduces a taxonomy that offers researchers a clear framework for comprehending the principles and applications of different methods and helps clarify the connections and differences among various approaches. Then we systematically survey representative approaches along the four categories of our taxonomy. Through our taxonomy, researchers can understand the applicable scenarios, potential advantages, and limitations of each approach for the trusted integration of GNNs with LLMs. Finally, we present some promising directions of work and future trends for the integration of LLMs and GNNs to improve model trustworthiness.

[AI-19] Graph Foundation Models for Recommendation: A Comprehensive Survey

链接: https://arxiv.org/abs/2502.08346
作者: Bin Wu,Yihang Wang,Yuanhao Zeng,Jiawei Liu,Jiashu Zhao,Cheng Yang,Yawen Li,Long Xia,Dawei Yin,Chuan Shi
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recommender systems (RS) serve as a fundamental tool for navigating the vast expanse of online information, with deep learning advancements playing an increasingly important role in improving ranking accuracy. Among these, graph neural networks (GNNs) excel at extracting higher-order structural information, while large language models (LLMs) are designed to process and comprehend natural language, making both approaches highly effective and widely adopted. Recent research has focused on graph foundation models (GFMs), which integrate the strengths of GNNs and LLMs to model complex RS problems more efficiently by leveraging the graph-based structure of user-item relationships alongside textual understanding. In this survey, we provide a comprehensive overview of GFM-based RS technologies by introducing a clear taxonomy of current approaches, diving into methodological details, and highlighting key challenges and future directions. By synthesizing recent advancements, we aim to offer valuable insights into the evolving landscape of GFM-based recommender systems.

[AI-20] Hierarchical Learning-based Graph Partition for Large-scale Vehicle Routing Problems AAMAS2025

链接: https://arxiv.org/abs/2502.08340
作者: Yuxin Pan,Ruohong Liu,Yize Chen,Zhiguang Cao,Fangzhen Lin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted as a Full Paper at AAMAS 2025 (24th International Conference on Autonomous Agents and Multiagent Systems)

点击查看摘要

Abstract:Neural solvers based on the divide-and-conquer approach for Vehicle Routing Problems (VRPs) in general, and capacitated VRP (CVRP) in particular, integrates the global partition of an instance with local constructions for each subproblem to enhance generalization. However, during the global partition phase, misclusterings within subgraphs have a tendency to progressively compound throughout the multi-step decoding process of the learning-based partition policy. This suboptimal behavior in the global partition phase, in turn, may lead to a dramatic deterioration in the performance of the overall decomposition-based system, despite using optimal local constructions. To address these challenges, we propose a versatile Hierarchical Learning-based Graph Partition (HLGP) framework, which is tailored to benefit the partition of CVRP instances by synergistically integrating global and local partition policies. Specifically, the global partition policy is tasked with creating the coarse multi-way partition to generate the sequence of simpler two-way partition subtasks. These subtasks mark the initiation of the subsequent K local partition levels. At each local partition level, subtasks exclusive for this level are assigned to the local partition policy which benefits from the insensitive local topological features to incrementally alleviate the compounded errors. This framework is versatile in the sense that it optimizes the involved partition policies towards a unified objective harmoniously compatible with both reinforcement learning (RL) and supervised learning (SL). (*Due to the notification of arXiv “The Abstract field cannot be longer than 1,920 characters”, the appeared Abstract is shortened. For the full Abstract, please download the Article.)

[AI-21] Hierarchical Multi-Agent Framework for Carbon-Efficient Liquid-Cooled Data Center Clusters

链接: https://arxiv.org/abs/2502.08337
作者: Soumyendu Sarkar,Avisek Naug,Antonio Guillen,Vineet Gundecha,Ricardo Luna Gutierrez,Sahand Ghorbanpour,Sajad Mousavi,Ashwin Ramesh Babu,Desik Rengarajan,Cullen Bash
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Reducing the environmental impact of cloud computing requires efficient workload distribution across geographically dispersed Data Center Clusters (DCCs) and simultaneously optimizing liquid and air (HVAC) cooling with time shift of workloads within individual data centers (DC). This paper introduces Green-DCC, which proposes a Reinforcement Learning (RL) based hierarchical controller to optimize both workload and liquid cooling dynamically in a DCC. By incorporating factors such as weather, carbon intensity, and resource availability, Green-DCC addresses realistic constraints and interdependencies. We demonstrate how the system optimizes multiple data centers synchronously, enabling the scope of digital twins, and compare the performance of various RL approaches based on carbon emissions and sustainability metrics while also offering a framework and benchmark simulation for broader ML research in sustainability.

[AI-22] Salience-Invariant Consistent Policy Learning for Generalization in Visual Reinforcement Learning

链接: https://arxiv.org/abs/2502.08336
作者: Sun Jingbo,Tu Songjun,Zhang Qichao,Chen Ke,Zhao Dongbin
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generalizing policies to unseen scenarios remains a critical challenge in visual reinforcement learning, where agents often overfit to the specific visual observations of the training environment. In unseen environments, distracting pixels may lead agents to extract representations containing task-irrelevant information. As a result, agents may deviate from the optimal behaviors learned during training, thereby hindering visual generalization. To address this issue, we propose the Salience-Invariant Consistent Policy Learning (SCPL) algorithm, an efficient framework for zero-shot generalization. Our approach introduces a novel value consistency module alongside a dynamics module to effectively capture task-relevant representations. The value consistency module, guided by saliency, ensures the agent focuses on task-relevant pixels in both original and perturbed observations, while the dynamics module uses augmented data to help the encoder capture dynamic- and reward-relevant representations. Additionally, our theoretical analysis highlights the importance of policy consistency for generalization. To strengthen this, we introduce a policy consistency module with a KL divergence constraint to maintain consistent policies across original and perturbed observations. Extensive experiments on the DMC-GB, Robotic Manipulation, and CARLA benchmarks demonstrate that SCPL significantly outperforms state-of-the-art methods in terms of generalization. Notably, SCPL achieves average performance improvements of 14%, 39%, and 69% in the challenging DMC video hard setting, the Robotic hard setting, and the CARLA benchmark, respectively. Project Page: this https URL.

[AI-23] Modification and Generated-Text Detection: Achieving Dual Detection Capabilities for the Outputs of LLM by Watermark

链接: https://arxiv.org/abs/2502.08332
作者: Yuhang Cai,Yaofei Wang,Donghui Hu,Gu Chen
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The development of large language models (LLMs) has raised concerns about potential misuse. One practical solution is to embed a watermark in the text, allowing ownership verification through watermark extraction. Existing methods primarily focus on defending against modification attacks, often neglecting other spoofing attacks. For example, attackers can alter the watermarked text to produce harmful content without compromising the presence of the watermark, which could lead to false attribution of this malicious content to the LLM. This situation poses a serious threat to LLM service providers and highlights the significance of achieving modification detection and generated-text detection simultaneously. Therefore, we propose a technique to detect modifications in text for unbiased watermarks that is sensitive to modification. We introduce a new metric called "discarded tokens", which measures the number of tokens not included in watermark detection. When a modification occurs, this metric changes and can serve as evidence of the modification. Additionally, we improve the watermark detection process and introduce a novel method for unbiased watermark. Our experiments demonstrate that we can achieve effective dual detection capabilities: modification detection and generated-text detection by watermark.
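论文中 "discarded tokens" 的精确定义以原文为准。下面用 Python 写一个玩具级示意:一个按前一 token 哈希划分"绿名单"的检测器,在检测时跳过已出现过的 bigram(被跳过的即"被丢弃的 token"),篡改文本会改变该计数。其中的哈希规则与去重策略都是本文为演示而假设的,并非论文算法:

```python
import hashlib

def _h(s: str) -> int:
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

def is_green(prev: str, tok: str) -> bool:
    # Toy green-list rule keyed on the previous token (assumption).
    return (_h(prev) + _h(tok)) % 2 == 0

def detect(tokens):
    """Toy watermark detector that also reports 'discarded tokens':
    tokens skipped because their (prev, tok) bigram was already scored.
    An illustrative stand-in for the paper's metric, not its algorithm."""
    seen, green, scored, discarded = set(), 0, 0, 0
    for prev, tok in zip(tokens, tokens[1:]):
        if (prev, tok) in seen:
            discarded += 1          # not included in watermark detection
            continue
        seen.add((prev, tok))
        scored += 1
        green += is_green(prev, tok)
    return green / max(scored, 1), discarded

# The duplicated bigram ('the', 'cat') is discarded from scoring.
rate, discarded = detect(["the", "cat", "sat", "on", "the", "cat"])
```

例如在水印文本中复制粘贴一段已有内容,discarded 计数会随之变化,可作为篡改线索。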

[AI-24] HDT: Hierarchical Discrete Transformer for Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2502.08302
作者: Shibo Feng,Peilin Zhao,Liu Liu,Pengcheng Wu,Zhiqi Shen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative models have gained significant attention in multivariate time series forecasting (MTS), particularly due to their ability to generate high-fidelity samples. Forecasting the probability distribution of multivariate time series is a challenging yet practical task. Although some recent attempts have been made to handle this task, two major challenges persist: 1) some existing generative methods underperform in high-dimensional multivariate time series forecasting, which is hard to scale to higher dimensions; 2) the inherent high-dimensional multivariate attributes constrain the forecasting lengths of existing generative models. In this paper, we point out that discrete token representations can model high-dimensional MTS with faster inference time, and forecasting the target with long-term trends of itself can extend the forecasting length with high accuracy. Motivated by this, we propose a vector quantized framework called Hierarchical Discrete Transformer (HDT) that models time series into discrete token representations with an l2-normalization-enhanced vector quantization strategy, in which we transform the MTS forecasting into discrete tokens generation. To address the limitations of generative models in long-term forecasting, we propose a hierarchical discrete Transformer. This model captures the discrete long-term trend of the target at the low level and leverages this trend as a condition to generate the discrete representation of the target at the high level that introduces the features of the target itself to extend the forecasting length in high-dimensional MTS. Extensive experiments on five popular MTS datasets verify the effectiveness of our proposed method.
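HDT 的向量量化带有 l2 归一化:归一化后欧氏距离与余弦相似度单调等价(||a-b||^2 = 2 - 2cos)。下面的纯 Python 草图演示这一查找步骤;玩具码本与二维向量均为演示假设,论文中的码本是学习得到的时序 latent:

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def quantize(z, codebook):
    """Map a latent vector to its nearest codebook entry after l2
    normalization, so nearest-neighbor search reduces to maximizing
    cosine similarity (toy sketch of the vector-quantization step)."""
    zn = l2_normalize(z)
    best, best_sim = None, -2.0
    for idx, c in enumerate(codebook):
        cn = l2_normalize(c)
        sim = sum(a * b for a, b in zip(zn, cn))  # cosine similarity
        if sim > best_sim:
            best, best_sim = idx, sim
    return best

codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
quantize([5.0, 0.4], codebook)  # nearest by angle is code 0
```

归一化使量化只关心向量方向而非幅值,这正是 l2 归一化增强量化稳定性的直观来源。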

[AI-25] Individualised Treatment Effects Estimation with Composite Treatments and Composite Outcomes

链接: https://arxiv.org/abs/2502.08282
作者: Vinod Kumar Chauhan,Lei Clifton,Gaurav Nigam,David A. Clifton
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 6 pages (double column), 4 figures

点击查看摘要

Abstract:Estimating individualised treatment effect (ITE) – that is the causal effect of a set of variables (also called exposures, treatments, actions, policies, or interventions), referred to as *composite treatments*, on a set of outcome variables of interest, referred to as *composite outcomes*, for a unit from observational data – remains a fundamental problem in causal inference with applications across disciplines, such as healthcare, economics, education, social science, marketing, and computer science. Previous work in causal machine learning for ITE estimation is limited to simple settings, like single treatments and single outcomes. This hinders their use in complex real-world scenarios; for example, consider studying the effect of different ICU interventions, such as beta-blockers and statins for a patient admitted for heart surgery, on different outcomes of interest such as atrial fibrillation and in-hospital mortality. The limited research into composite treatments and outcomes is primarily due to data scarcity for all treatments and outcomes. To address the above challenges, we propose a novel and innovative hypernetwork-based approach, called *H-Learner*, to solve ITE estimation under composite treatments and composite outcomes, which tackles the data scarcity issue by dynamically sharing information across treatments and outcomes. Our empirical analysis with binary and arbitrary composite treatments and outcomes demonstrates the effectiveness of the proposed approach compared to existing methods.

[AI-26] Balancing optimism and pessimism in offline-to-online learning

链接: https://arxiv.org/abs/2502.08259
作者: Sentenac Flore,Lee Albin,Szepesvari Csaba
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We consider what we call the offline-to-online learning setting, focusing on stochastic finite-armed bandit problems. In offline-to-online learning, a learner starts with offline data collected from interactions with an unknown environment in a way that is not under the learner’s control. Given this data, the learner begins interacting with the environment, gradually improving its initial strategy as it collects more data to maximize its total reward. The learner in this setting faces a fundamental dilemma: if the policy is deployed for only a short period, a suitable strategy (in a number of senses) is the Lower Confidence Bound (LCB) algorithm, which is based on pessimism. LCB can effectively compete with any policy that is sufficiently “covered” by the offline data. However, for longer time horizons, a preferred strategy is the Upper Confidence Bound (UCB) algorithm, which is based on optimism. Over time, UCB converges to the performance of the optimal policy at a rate that is nearly the best possible among all online algorithms. In offline-to-online learning, however, UCB initially explores excessively, leading to worse short-term performance compared to LCB. This suggests that a learner not in control of how long its policy will be in use should start with LCB for short horizons and gradually transition to a UCB-like strategy as more rounds are played. This article explores how and why this transition should occur. Our main result shows that our new algorithm performs nearly as well as the better of LCB and UCB at any point in time. The core idea behind our algorithm is broadly applicable, and we anticipate that our results will extend beyond the multi-armed bandit setting.
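上文摘要的核心对比是悲观的 LCB 与乐观的 UCB:两者同为"均值 ± 置信半径",仅符号不同。下面用 Python 给出一个极简示意,按"剩余轮数少则用 LCB、剩余轮数多则用 UCB"切换;其中的切换阈值只是占位假设,并非论文提出的算法:

```python
import math

def index(mean, count, t, optimism):
    """Confidence-bound index: UCB adds the radius, LCB subtracts it."""
    radius = math.sqrt(2 * math.log(max(t, 2)) / max(count, 1))
    return mean + radius if optimism else mean - radius

def choose_arm(means, counts, t, horizon_left):
    """Toy offline-to-online rule: be pessimistic (LCB) when few rounds
    remain, optimistic (UCB) otherwise. The 100-round threshold is a
    made-up placeholder, not the paper's algorithm."""
    optimism = horizon_left > 100
    scores = [index(m, c, t, optimism) for m, c in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

# Arm 0 is well covered by offline data (100 pulls), arm 1 barely (2 pulls):
# a short horizon sticks with the covered arm, a long horizon explores arm 1.
short = choose_arm([0.6, 0.5], [100, 2], t=102, horizon_left=10)    # -> 0
long_ = choose_arm([0.6, 0.5], [100, 2], t=102, horizon_left=1000)  # -> 1
```

可以看到,对离线数据覆盖良好的臂,LCB 的置信下界更高,短期更稳妥;而 UCB 会优先试探高不确定性的臂,对应摘要中长短期的取舍。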

[AI-27] The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks

链接: https://arxiv.org/abs/2502.08235
作者: Alejandro Cuadron,Dacheng Li,Wenjie Ma,Xingyao Wang,Yichuan Wang,Siyuan Zhuang,Shu Liu,Luis Gaspar Schroeder,Tian Xia,Huanzhi Mao,Nicholas Thumiger,Aditya Desai,Ion Stoica,Ana Klimovic,Graham Neubig,Joseph E. Gonzalez
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) represent a breakthrough in AI problem-solving capabilities, but their effectiveness in interactive environments can be limited. This paper introduces and analyzes overthinking in LRMs, a phenomenon where models favor extended internal reasoning chains over environmental interaction. Through experiments on software engineering tasks using SWE Bench Verified, we observe three recurring patterns: Analysis Paralysis, Rogue Actions, and Premature Disengagement. We propose a framework to study these behaviors, which correlates with human expert assessments, and analyze 4018 trajectories. We observe that higher overthinking scores correlate with decreased performance, with reasoning models exhibiting stronger tendencies toward overthinking compared to non-reasoning models. Our analysis reveals that simple efforts to mitigate overthinking in agentic environments, such as selecting the solution with the lower overthinking score, can improve model performance by almost 30% while reducing computational costs by 43%. These results suggest that mitigating overthinking has strong practical implications. We suggest that by leveraging native function-calling capabilities and selective reinforcement learning, overthinking tendencies could be mitigated. We also open-source our evaluation framework and dataset to facilitate research in this direction at this https URL.

[AI-28] Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation

链接: https://arxiv.org/abs/2502.08211
作者: Jinda Xu,Yuhao Song,Daming Wang,Weiwei Zhao,Minghua Chen,Kangliang Chen,Qinya Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In an era overwhelmed by vast amounts of data, the effective curation of web-crawl datasets is essential for optimizing model performance. This paper tackles the challenges associated with the unstructured and heterogeneous nature of such datasets. Traditional heuristic curation methods often inadequately capture complex features, resulting in biases and the exclusion of relevant data. We introduce an advanced, learning-driven approach, Ensemble Curation Of DAta ThroUgh Multimodal Operators (EcoDatum), incorporating a novel quality-guided deduplication method to ensure balanced feature distributions. EcoDatum strategically integrates various unimodal and multimodal data curation operators within a weak supervision ensemble framework, utilizing automated optimization to score each data point effectively. EcoDatum, which significantly improves the data curation quality and efficiency, outperforms existing state-of-the-art (SOTA) techniques, ranked 1st on the DataComp leaderboard, with an average performance score of 0.182 across 38 diverse evaluation datasets. This represents a 28% improvement over the DataComp baseline method, demonstrating its effectiveness in improving dataset curation and model training efficiency.

[AI-29] Equivariant Masked Position Prediction for Efficient Molecular Representation

链接: https://arxiv.org/abs/2502.08209
作者: Junyi An,Chao Qu,Yun-Fei Shi,XinHao Liu,Qianwei Tang,Fenglei Cao,Yuan Qi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 24 pages, 6 figures

点击查看摘要

Abstract:Graph neural networks (GNNs) have shown considerable promise in computational chemistry. However, the limited availability of molecular data raises concerns regarding GNNs’ ability to effectively capture the fundamental principles of physics and chemistry, which constrains their generalization capabilities. To address this challenge, we introduce a novel self-supervised approach termed Equivariant Masked Position Prediction (EMPP), grounded in intramolecular potential and force theory. Unlike conventional attribute masking techniques, EMPP formulates a nuanced position prediction task that is more well-defined and enhances the learning of quantum mechanical features. EMPP also bypasses the approximation of the Gaussian mixture distribution commonly used in denoising methods, allowing for more accurate acquisition of physical properties. Experimental results indicate that EMPP significantly enhances performance of advanced molecular architectures, surpassing state-of-the-art self-supervised approaches. Our code is released in this https URL.

[AI-30] SycEval: Evaluating LLM Sycophancy

Link: https://arxiv.org/abs/2502.08177
Authors: Aaron Fanous, Jacob Goldberg (1), Ank A. Agarwal (1), Joanna Lin (1), Anson Zhou (1), Roxana Daneshjou (1), Sanmi Koyejo (1) ((1) Stanford University)
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:Large language models (LLMs) are increasingly applied in educational, clinical, and professional settings, but their tendency for sycophancy – prioritizing user agreement over independent reasoning – poses risks to reliability. This study introduces a framework to evaluate sycophantic behavior in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and MedQuad (medical advice) datasets. Sycophantic behavior was observed in 58.19% of cases, with Gemini exhibiting the highest rate (62.47%) and ChatGPT the lowest (56.71%). Progressive sycophancy, leading to correct answers, occurred in 43.52% of cases, while regressive sycophancy, leading to incorrect answers, was observed in 14.66%. Preemptive rebuttals demonstrated significantly higher sycophancy rates than in-context rebuttals (61.75% vs. 56.52%, Z = 5.87, p < 0.001), particularly in computational tasks, where regressive sycophancy increased significantly (preemptive: 8.13%, in-context: 3.54%, p < 0.001). Simple rebuttals maximized progressive sycophancy (Z = 6.59, p < 0.001), while citation-based rebuttals exhibited the highest regressive rates (Z = 6.59, p < 0.001). Sycophantic behavior showed high persistence (78.5%, 95% CI: [77.2%, 79.8%]) regardless of context or model. These findings emphasize the risks and opportunities of deploying LLMs in structured and dynamic domains, offering insights into prompt programming and model optimization for safer AI applications.
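
The paper's progressive/regressive taxonomy can be illustrated with a short labeling routine. This is a hypothetical sketch of the classification logic, not the authors' evaluation code; the function name and flags are illustrative assumptions:

```python
def classify_sycophancy(initially_correct, finally_correct, changed_to_agree):
    """Label one (initial answer, post-rebuttal answer) pair.

    Sycophancy here means the model changed its answer to agree with the
    user's rebuttal; the progressive/regressive split depends on whether
    that change fixed or broke the answer."""
    if not changed_to_agree:
        return "none"
    if finally_correct and not initially_correct:
        return "progressive"  # agreement happened to fix the answer
    if initially_correct and not finally_correct:
        return "regressive"   # agreement broke a correct answer
    return "sycophantic"      # changed to agree, correctness unchanged

print(classify_sycophancy(False, True, True))   # progressive case
print(classify_sycophancy(True, False, True))   # regressive case
```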

[AI-31] MixDec Sampling: A Soft Link-based Sampling Method of Graph Neural Network for Recommendation

Link: https://arxiv.org/abs/2502.08161
Authors: Xiangjin Xie, Yuxin Chen, Ruipeng Wang, Kai Ouyang, Zihan Zhang, Hai-Tao Zheng, Buyue Qian, Hansen Zheng, Bo Hu, Chengxiang Zhuo, Zang Li
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures

Abstract:Graph neural networks have been widely used in recent recommender systems, where negative sampling plays an important role. Existing negative sampling methods restrict the relationship between nodes as either hard positive pairs or hard negative pairs. This leads to the loss of structural information, and lacks the mechanism to generate positive pairs for nodes with few neighbors. To overcome these limitations, we propose a novel soft link-based sampling method, namely MixDec Sampling, which consists of a Mixup Sampling module and a Decay Sampling module. The Mixup Sampling augments node features by synthesizing new nodes and soft links, which provides a sufficient number of samples for nodes with few neighbors. The Decay Sampling strengthens the digestion of graph structure information by generating soft links for node embedding learning. To the best of our knowledge, we are the first to model sampling relationships between nodes by soft links in GNN-based recommender systems. Extensive experiments demonstrate that the proposed MixDec Sampling can significantly and consistently improve the recommendation performance of several representative GNN-based models on various recommendation benchmarks.
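
The Mixup Sampling idea, synthesizing a new node as a convex combination of two existing nodes and linking it softly to both parents, can be sketched as follows. The function and the choice of the mixing coefficient as the soft-link weight are illustrative assumptions, not the paper's exact formulation:

```python
import random

def mixup_node(x_i, x_j, alpha=1.0):
    """Synthesize a node feature vector by convexly mixing two parent nodes;
    the mixing coefficient doubles as the soft-link weight to each parent."""
    lam = random.betavariate(alpha, alpha)           # lam in (0, 1)
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    soft_links = {"parent_i": lam, "parent_j": 1 - lam}
    return x_mix, soft_links

random.seed(7)
x_mix, links = mixup_node([1.0, 0.0], [0.0, 1.0])
# A convex combination of one-hot vectors stays on the simplex:
print(abs(sum(x_mix) - 1.0) < 1e-9)
```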

[AI-32] Vertical Federated Learning in Practice: The Good the Bad and the Ugly

Link: https://arxiv.org/abs/2502.08160
Authors: Zhaomin Wu, Zhen Qin, Junyi Hou, Haodong Zhao, Qinbin Li, Bingsheng He, Lixin Fan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Vertical Federated Learning (VFL) is a privacy-preserving collaborative learning paradigm that enables multiple parties with distinct feature sets to jointly train machine learning models without sharing their raw data. Despite its potential to facilitate cross-organizational collaborations, the deployment of VFL systems in real-world applications remains limited. To investigate the gap between existing VFL research and practical deployment, this survey analyzes the real-world data distributions in potential VFL applications and identifies four key findings that highlight this gap. We propose a novel data-oriented taxonomy of VFL algorithms based on real VFL data distributions. Our comprehensive review of existing VFL algorithms reveals that some common practical VFL scenarios have few or no viable solutions. Based on these observations, we outline key research directions aimed at bridging the gap between current VFL research and real-world applications.

[AI-33] DGSense: A Domain Generalization Framework for Wireless Sensing

Link: https://arxiv.org/abs/2502.08155
Authors: Rui Zhou, Yu Cheng, Songlin Li, Hongwang Zhang, Chenxu Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages

Abstract:Wireless sensing is of great benefits to our daily lives. However, wireless signals are sensitive to the surroundings. Various factors, e.g. environments, locations, and individuals, may induce extra impact on wireless propagation. Such a change can be regarded as a domain, in which the data distribution shifts. A vast majority of the sensing schemes are learning-based. They are dependent on the training domains, resulting in performance degradation in unseen domains. Researchers have proposed various solutions to address this issue. But these solutions leverage either semi-supervised or unsupervised domain adaptation techniques. They still require some data in the target domains and do not perform well in unseen domains. In this paper, we propose a domain generalization framework DGSense, to eliminate the domain dependence problem in wireless sensing. The framework is a general solution working across diverse sensing tasks and wireless technologies. Once the sensing model is built, it can generalize to unseen domains without any data from the target domain. To achieve the goal, we first increase the diversity of the training set by a virtual data generator, and then extract the domain independent features via episodic training between the main feature extractor and the domain feature extractors. The feature extractors employ a pre-trained Residual Network (ResNet) with an attention mechanism for spatial features, and a 1D Convolutional Neural Network (1DCNN) for temporal features. To demonstrate the effectiveness and generality of DGSense, we evaluated on WiFi gesture recognition, Millimeter Wave (mmWave) activity recognition, and acoustic fall detection. All the systems exhibited high generalization capability to unseen domains, including new users, locations, and environments, free of new data and retraining.

[AI-34] ACCESS: A Benchmark for Abstract Causal Event Discovery and Reasoning

Link: https://arxiv.org/abs/2502.08148
Authors: Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, Gholamreza Haffari
Subjects: Artificial Intelligence (cs.AI)

Abstract:Identifying cause-and-effect relationships is critical to understanding real-world dynamics and ultimately causal reasoning. Existing methods for identifying event causality in NLP, including those based on Large Language Models (LLMs), exhibit difficulties in out-of-distribution settings due to the limited scale and heavy reliance on lexical cues within available benchmarks. Modern benchmarks, inspired by probabilistic causal inference, have attempted to construct causal graphs of events as a robust representation of causal knowledge, where CRAB (Romanou et al., 2023) is one such recent benchmark along this line. In this paper, we introduce ACCESS, a benchmark designed for discovery and reasoning over abstract causal events. Unlike existing resources, ACCESS focuses on causality of everyday life events on the abstraction level. We propose a pipeline for identifying abstractions for event generalizations from GLUCOSE (Mostafazadeh et al., 2020), a large-scale dataset of implicit commonsense causal knowledge, from which we subsequently extract 1.4K causal pairs. Our experiments highlight the ongoing challenges of using statistical methods and/or LLMs for automatic abstraction identification and causal discovery in NLP. Nonetheless, we demonstrate that the abstract causal knowledge provided in ACCESS can be leveraged for enhancing QA reasoning performance in LLMs.

[AI-35] Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers

Link: https://arxiv.org/abs/2502.08145
Authors: Siddharth Singh, Prajwal Singhania, Aditya Ranjan, John Kirchenbauer, Jonas Geiping, Yuxin Wen, Neel Jain, Abhimanyu Hans, Manli Shu, Aditya Tomar, Tom Goldstein, Abhinav Bhatele
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Training and fine-tuning large language models (LLMs) with hundreds of billions to trillions of parameters requires tens of thousands of GPUs, and a highly scalable software stack. In this work, we present a novel four-dimensional hybrid parallel algorithm implemented in a highly scalable, portable, open-source framework called AxoNN. We describe several performance optimizations in AxoNN to improve matrix multiply kernel performance, overlap non-blocking collectives with computation, and performance modeling to choose performance optimal configurations. These have resulted in unprecedented scaling and peak flop/s (bf16) for training of GPT-style transformer models on Perlmutter (620.1 Petaflop/s), Frontier (1.381 Exaflop/s) and Alps (1.423 Exaflop/s). While the abilities of LLMs improve with the number of trainable parameters, so do privacy and copyright risks caused by memorization of training data, which can cause disclosure of sensitive or private information at inference time. We highlight this side effect of scale through experiments that explore “catastrophic memorization”, where models are sufficiently large to memorize training data in a single pass, and present an approach to prevent it. As part of this study, we demonstrate fine-tuning of a 405-billion parameter LLM using AxoNN on Frontier.

[AI-36] Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences

Link: https://arxiv.org/abs/2502.08142
Authors: Shanshan Han, Salman Avestimehr, Chaoyang He
Subjects: Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2406.10847

Abstract:We present Wildflare GuardRail, a guardrail pipeline designed to enhance the safety and reliability of Large Language Model (LLM) inferences by systematically addressing risks across the entire processing workflow. Wildflare GuardRail integrates several core functional modules, including Safety Detector that identifies unsafe inputs and detects hallucinations in model outputs while generating root-cause explanations, Grounding that contextualizes user queries with information retrieved from vector databases, Customizer that adjusts outputs in real time using lightweight, rule-based wrappers, and Repairer that corrects erroneous LLM outputs using hallucination explanations provided by Safety Detector. Results show that our unsafe content detection model in Safety Detector achieves comparable performance with OpenAI API, though trained on a small dataset constructed with several public datasets. Meanwhile, the lightweight wrappers can address malicious URLs in model outputs in 1.06s per query with 100% accuracy without costly model calls. Moreover, the hallucination fixing model demonstrates effectiveness in reducing hallucinations with an accuracy of 80.7%.
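
The abstract's "lightweight, rule-based wrappers" that fix malicious URLs without model calls can be illustrated with a small output filter. The regex, blocklist, and replacement token below are illustrative assumptions in the spirit of the Customizer module, not Wildflare GuardRail's actual implementation:

```python
import re

BLOCKLIST = {"evil.example"}  # hypothetical blocklist of malicious hosts

def scrub_urls(text, blocklist=BLOCKLIST):
    """Rewrite URLs whose host is blocklisted, leaving other URLs intact.
    Runs as a cheap post-processing wrapper with no extra LLM call."""
    def repl(match):
        url = match.group(0)
        host = re.sub(r"^https?://", "", url).split("/")[0]
        return "[link removed]" if host in blocklist else url
    return re.sub(r"https?://\S+", repl, text)

print(scrub_urls("See https://evil.example/x and https://ok.example/y"))
```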

[AI-37] Hookpad Aria: A Copilot for Songwriters

Link: https://arxiv.org/abs/2502.08122
Authors: Chris Donahue, Shih-Lun Wu, Yewon Kim, Dave Carlton, Ryan Miyakawa, John Thickstun
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Extended abstract presented in the Late-Breaking Demo Session at ISMIR 2024 (ISMIR LBD 2024)

Abstract:We present Hookpad Aria, a generative AI system designed to assist musicians in writing Western pop songs. Our system is seamlessly integrated into Hookpad, a web-based editor designed for the composition of lead sheets: symbolic music scores that describe melody and harmony. Hookpad Aria has numerous generation capabilities designed to assist users in non-sequential composition workflows, including: (1) generating left-to-right continuations of existing material, (2) filling in missing spans in the middle of existing material, and (3) generating harmony from melody and vice versa. Hookpad Aria is also a scalable data flywheel for music co-creation – since its release in March 2024, Aria has generated 318k suggestions for 3k users who have accepted 74k into their songs. More information about Hookpad Aria is available at this https URL.

[AI-38] Generative AI-Enhanced Cooperative MEC of UAVs and Ground Stations for Unmanned Surface Vehicles

Link: https://arxiv.org/abs/2502.08119
Authors: Jiahao You, Ziye Jia, Chao Dong, Qihui Wu, Zhu Han
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:The increasing deployment of unmanned surface vehicles (USVs) requires computational support and coverage in applications such as maritime search and rescue. Unmanned aerial vehicles (UAVs) can offer low-cost, flexible aerial services, and ground stations (GSs) can provide powerful supports, which can cooperate to help the USVs in complex scenarios. However, the collaboration between UAVs and GSs for USVs faces challenges of task uncertainties, USVs trajectory uncertainties, heterogeneities, and limited computational resources. To address these issues, we propose a cooperative UAV and GS based robust multi-access edge computing framework to assist USVs in completing computational tasks. Specifically, we formulate the optimization problem of joint task offloading and UAV trajectory to minimize the total execution time, which is in the form of mixed integer nonlinear programming and NP-hard to tackle. Therefore, we propose the algorithm of generative artificial intelligence-enhanced heterogeneous agent proximal policy optimization (GAI-HAPPO). The proposed algorithm integrates GAI models to enhance the actor network ability to model complex environments and extract high-level features, thereby allowing the algorithm to predict uncertainties and adapt to dynamic conditions. Additionally, GAI stabilizes the critic network, addressing the instability of multi-agent reinforcement learning approaches. Finally, extensive simulations demonstrate that the proposed algorithm outperforms the existing benchmark methods, thus highlighting the potential in tackling intricate, cross-domain issues in the considered scenarios.

[AI-39] Generative AI and Empirical Software Engineering: A Paradigm Shift

Link: https://arxiv.org/abs/2502.08108
Authors: Christoph Treude, Margaret-Anne Storey
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:The widespread adoption of generative AI in software engineering marks a paradigm shift, offering new opportunities to design and utilize software engineering tools while influencing both developers and the artifacts they create. Traditional empirical methods in software engineering, including quantitative, qualitative, and mixed-method approaches, are well established. However, this paradigm shift introduces novel data types and redefines many concepts in the software engineering process. The roles of developers, users, agents, and researchers increasingly overlap, blurring the distinctions between these social and technical actors within the field. This paper examines how integrating AI into software engineering challenges traditional research paradigms. It focuses on the research phenomena that we investigate, the methods and theories that we employ, the data we analyze, and the threats to validity that emerge in this new context. Through this exploration, our goal is to understand how AI adoption disrupts established software development practices and creates new opportunities for empirical software engineering research.

[AI-40] Rethinking Tokenized Graph Transformers for Node Classification

Link: https://arxiv.org/abs/2502.08101
Authors: Jinsong Chen, Chenyang Li, GaiChao Li, John E. Hopcroft, Kun He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint version

Abstract:Node tokenized graph Transformers (GTs) have shown promising performance in node classification. The generation of token sequences is the key module in existing tokenized GTs which transforms the input graph into token sequences, facilitating the node representation learning via Transformer. In this paper, we observe that the generations of token sequences in existing GTs only focus on the first-order neighbors on the constructed similarity graphs, which leads to the limited usage of nodes to generate diverse token sequences, further restricting the potential of tokenized GTs for node classification. To this end, we propose a new method termed SwapGT. SwapGT first introduces a novel token swapping operation based on the characteristics of token sequences that fully leverages the semantic relevance of nodes to generate more informative token sequences. Then, SwapGT leverages a Transformer-based backbone to learn node representations from the generated token sequences. Moreover, SwapGT develops a center alignment loss to constrain the representation learning from multiple token sequences, further enhancing the model performance. Extensive empirical results on various datasets showcase the superiority of SwapGT for node classification.

[AI-41] Cognify: Supercharging Gen-AI Workflows With Hierarchical Autotuning

Link: https://arxiv.org/abs/2502.08056
Authors: Zijian He, Reyna Abhyankar, Vikranth Srivatsa, Yiying Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:Today’s gen-AI workflows that involve multiple ML model calls, tool/API calls, data retrieval, or generic code execution are often tuned manually in an ad-hoc way that is both time-consuming and error-prone. In this paper, we propose a systematic approach for automatically tuning gen-AI workflows. Our key insight is that gen-AI workflows can benefit from structure, operator, and prompt changes, but unique properties of gen-AI workflows require new optimization techniques. We propose AdaSeek, an adaptive hierarchical search algorithm for autotuning gen-AI workflows. AdaSeek organizes workflow tuning methods into different layers based on the user-specified total search budget and distributes the budget across different layers based on the complexity of each layer. During its hierarchical search, AdaSeek redistributes the search budget from less useful to more promising tuning configurations based on workflow-level evaluation results. We implement AdaSeek in a workflow autotuning framework called Cognify and evaluate Cognify using six types of workflows such as RAG-based QA and text-to-SQL transformation. Overall, Cognify improves these workflows’ generation quality by up to 2.8x, reduces execution monetary cost by up to 10x, and reduces end-to-end latency by 2.7x.
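
AdaSeek's top-level step, distributing a total search budget across tuning layers according to each layer's complexity, can be sketched as a proportional split. The layer names, complexity scores, and the proportional rule itself are illustrative assumptions; the paper's actual allocation and redistribution logic may differ:

```python
def allocate_budget(total, layer_complexity):
    """Split a total search budget across tuning layers in proportion to a
    per-layer complexity score (integer division, remainder to the most
    complex layer)."""
    denom = sum(layer_complexity.values())
    alloc = {name: total * c // denom for name, c in layer_complexity.items()}
    top = max(layer_complexity, key=layer_complexity.get)
    alloc[top] += total - sum(alloc.values())  # hand out any rounding remainder
    return alloc

# Hypothetical layers: workflow structure, operators, prompts.
print(allocate_budget(100, {"structure": 1, "operator": 2, "prompt": 2}))
```

During the hierarchical search, the same routine could be re-run with updated complexity scores to shift budget from unpromising to promising configurations.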

[AI-42] WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

Link: https://arxiv.org/abs/2502.08047
Authors: Henry Hengyuan Zhao, Difei Gao, Mike Zheng Shou
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 19 pages, 18 figures

Abstract:Current GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to sensitivity to the initial state of the environment. Specifically, slight differences in the initial state-such as the target software not being open or the interface not being in its default state-often lead to planning errors. This issue is widespread in real user scenarios, but existing benchmarks fail to evaluate it. In this paper, we present WorldGUI, a novel GUI benchmark that designs GUI tasks with various initial states to simulate real computer-user interactions. The benchmark spans a wide range of tasks across 10 popular software applications, including PowerPoint, VSCode, and Adobe Acrobat. In addition, to address the challenges of dynamic GUI automation tasks, we propose GUI-Thinker, a holistic framework, leveraging a critique mechanism, that effectively manages the unpredictability and complexity of GUI interactions. Experimental results demonstrate that GUI-Thinker significantly outperforms Claude-3.5 (Computer Use) by 14.9% in success rate on WorldGUI tasks. This improvement underscores the effectiveness of our critical-thinking-based framework in enhancing GUI automation.

[AI-43] Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol

Link: https://arxiv.org/abs/2502.08021
Authors: Pai Liu, Lingfeng Zhao, Shivangi Agarwal, Jinghan Liu, Audrey Huang, Philip Amortila, Nan Jiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Holdout validation and hyperparameter tuning from data is a long-standing problem in offline reinforcement learning (RL). A standard framework is to use off-policy evaluation (OPE) methods to evaluate and select the policies, but OPE either incurs exponential variance (e.g., importance sampling) or has hyperparameters on their own (e.g., FQE and model-based). In this work we focus on hyperparameter tuning for OPE itself, which is even more under-investigated. Concretely, we select among candidate value functions (“model-free”) or dynamics (“model-based”) to best assess the performance of a target policy. Our contributions are two fold. We develop: (1) new model-free and model-based selectors with theoretical guarantees, and (2) a new experimental protocol for empirically evaluating them. Compared to the model-free protocol in prior works, our new protocol allows for more stable generation of candidate value functions, better control of misspecification, and evaluation of model-free and model-based methods alike. We exemplify the protocol on a Gym environment, and find that our new model-free selector, LSTD-Tournament, demonstrates promising empirical performance.

[AI-44] Training-Free Safe Denoisers for Safe Use of Diffusion Models

Link: https://arxiv.org/abs/2502.08011
Authors: Mingyu Kim, Dongjun Kim, Amman Yusuf, Stefano Ermon, Mi Jung Park
Subjects: Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:There is growing concern over the safety of powerful diffusion models (DMs), as they are often misused to produce inappropriate, not-safe-for-work (NSFW) content or generate copyrighted material or data of individuals who wish to be forgotten. Many existing methods tackle these issues by heavily relying on text-based negative prompts or extensively retraining DMs to eliminate certain features or samples. In this paper, we take a radically different approach, directly modifying the sampling trajectory by leveraging a negation set (e.g., unsafe images, copyrighted data, or datapoints needed to be excluded) to avoid specific regions of data distribution, without needing to retrain or fine-tune DMs. We formally derive the relationship between the expected denoised samples that are safe and those that are not safe, leading to our \textitsafe denoiser which ensures its final samples are away from the area to be negated. Inspired by the derivation, we develop a practical algorithm that successfully produces high-quality samples while avoiding negation areas of the data distribution in text-conditional, class-conditional, and unconditional image generation scenarios. These results hint at the great potential of our training-free safe denoiser for using DMs more safely.
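
The core move, steering the denoised estimate away from a negation set without retraining, can be caricatured in one dimension. The hard ball-projection below is an illustrative stand-in for the paper's derivation over expected denoised samples, not its actual safe denoiser:

```python
def safe_denoise(x0_hat, negation_set, radius=1.0):
    """1-D toy: if the denoised estimate falls inside a ball around any
    negated datapoint, project it just outside that ball. The real method
    modifies the sampling trajectory smoothly rather than projecting."""
    for z in negation_set:
        if abs(x0_hat - z) < radius:
            direction = 1.0 if x0_hat >= z else -1.0
            x0_hat = z + direction * radius
    return x0_hat

print(safe_denoise(0.2, [0.0]))   # inside the unsafe region, pushed out
print(safe_denoise(5.0, [0.0]))   # already safe, unchanged
```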

[AI-45] Greed is Good: Guided Generation from a Greedy Perspective

Link: https://arxiv.org/abs/2502.08006
Authors: Zander W. Blasingame, Chen Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Initial preprint

Abstract:Training-free guided generation is a widely used and powerful technique that allows the end user to exert further control over the generative process of diffusion models. In this work, we explore the guided generation from the perspective of optimizing the solution trajectory of a neural differential equation in a greedy manner. We present such a strategy as a unifying view on training-free guidance by showing that the greedy strategy is a first-order discretization of end-to-end optimization techniques. We show that a greedy guidance strategy makes good decisions and compare it to a guidance strategy using the ideal gradients found via the continuous adjoint equations. We then show how other popular training-free guidance strategies can be viewed in a unified manner from this perspective.
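
The greedy view, denoise first, then take a single first-order step on the guidance loss, admits a tiny 1-D sketch. The `denoise` and `loss_grad` callables are hypothetical stand-ins for the diffusion model's one-step denoiser and a user-chosen guidance gradient, and the iteration below omits the re-noising a real sampler would perform:

```python
def greedy_guided_step(x_t, denoise, loss_grad, eta=0.5):
    """One greedy guidance step: estimate the clean sample, then nudge it
    downhill on the guidance loss (a first-order discretization of
    end-to-end trajectory optimization)."""
    x0_hat = denoise(x_t)                    # greedy estimate of the clean sample
    return x0_hat - eta * loss_grad(x0_hat)  # first-order correction

# 1-D toy: the "denoiser" halves its input; guidance pulls toward target 1.0.
denoise = lambda x: 0.5 * x
loss_grad = lambda x: 2.0 * (x - 1.0)        # gradient of (x - 1)^2
x = 4.0
for _ in range(5):
    x = greedy_guided_step(x, denoise, loss_grad)
print(x)  # converges to the guidance target
```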

[AI-46] Universal Adversarial Attack on Aligned Multimodal LLMs

Link: https://arxiv.org/abs/2502.07987
Authors: Temurbek Rahmatullaev, Polina Druzhinina, Matvey Mikhalchuk, Andrey Kuznetsov, Anton Razzhigaev
Subjects: Artificial Intelligence (cs.AI)

Abstract:We propose a universal adversarial attack on multimodal Large Language Models (LLMs) that leverages a single optimized image to override alignment safeguards across diverse queries and even multiple models. By backpropagating through the vision encoder and language head, we craft a synthetic image that forces the model to respond with a targeted phrase (e.g., "Sure, here it is") or otherwise unsafe content, even for harmful prompts. In experiments on the SafeBench benchmark, our method achieves significantly higher attack success rates than existing baselines, including text-only universal prompts (e.g., up to 93% on certain models). We further demonstrate cross-model transferability by training on several multimodal LLMs simultaneously and testing on unseen architectures. Additionally, a multi-answer variant of our approach produces more natural-sounding (yet still malicious) responses. These findings underscore critical vulnerabilities in current multimodal alignment and call for more robust adversarial defenses. We will release code and datasets under the Apache-2.0 license. Warning: some content generated by Multimodal LLMs in this paper may be offensive to some readers.

[AI-47] Deep Semantic Graph Learning via LLM-based Node Enhancement

Link: https://arxiv.org/abs/2502.07982
Authors: Chuanqi Shi, Yiyi Tao, Hang Zhang, Lun Wang, Shaoshuai Du, Yixian Shen, Yanxin Shen
Subjects: Artificial Intelligence (cs.AI)

Abstract:Graph learning has attracted significant attention due to its widespread real-world applications. Current mainstream approaches rely on text node features and obtain initial node embeddings through shallow embedding learning using GNNs, which shows limitations in capturing deep textual semantics. Recent advances in Large Language Models (LLMs) have demonstrated superior capabilities in understanding text semantics, transforming traditional text feature processing. This paper proposes a novel framework that combines Graph Transformer architecture with LLM-enhanced node features. Specifically, we leverage LLMs to generate rich semantic representations of text nodes, which are then processed by a multi-head self-attention mechanism in the Graph Transformer to capture both local and global graph structural information. Our model utilizes the Transformer’s attention mechanism to dynamically aggregate neighborhood information while preserving the semantic richness provided by LLM embeddings. Experimental results demonstrate that the LLM-enhanced node features significantly improve the performance of graph learning models on node classification tasks. This approach shows promising results across multiple graph learning tasks, offering a practical direction for combining graph networks with language models.

[AI-48] CIRCUIT: A Benchmark for Circuit Interpretation and Reasoning Capabilities of LLMs

Link: https://arxiv.org/abs/2502.07980
Authors: Lejla Skelic, Yan Xu, Matthew Cox, Wenjie Lu, Tao Yu, Ruonan Han
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The role of Large Language Models (LLMs) has not been extensively explored in analog circuit design, which could benefit from a reasoning-based approach that transcends traditional optimization techniques. In particular, despite their growing relevance, there are no benchmarks to assess LLMs’ reasoning capability about circuits. Therefore, we created the CIRCUIT dataset consisting of 510 question-answer pairs spanning various levels of analog-circuit-related subjects. The best-performing model on our dataset, GPT-4o, achieves 48.04% accuracy when evaluated on the final numerical answer. To evaluate the robustness of LLMs on our dataset, we introduced a unique feature that enables unit-test-like evaluation by grouping questions into unit tests. In this case, GPT-4o can only pass 27.45% of the unit tests, highlighting that the most advanced LLMs still struggle with understanding circuits, which requires multi-level reasoning, particularly when involving circuit topologies. This circuit-specific benchmark highlights LLMs’ limitations, offering valuable insights for advancing their application in analog integrated circuit design.

[AI-49] From Hazard Identification to Controller Design: Proactive and LLM-Supported Safety Engineering for ML-Powered Systems

Link: https://arxiv.org/abs/2502.07974
Authors: Yining Hong, Christopher S. Timperley, Christian Kästner
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication at the International Conference on AI Engineering (CAIN) 2025

Abstract:Machine learning (ML) components are increasingly integrated into software products, yet their complexity and inherent uncertainty often lead to unintended and hazardous consequences, both for individuals and society at large. Despite these risks, practitioners seldom adopt proactive approaches to anticipate and mitigate hazards before they occur. Traditional safety engineering approaches, such as Failure Mode and Effects Analysis (FMEA) and System Theoretic Process Analysis (STPA), offer systematic frameworks for early risk identification but are rarely adopted. This position paper advocates for integrating hazard analysis into the development of any ML-powered software product and calls for greater support to make this process accessible to developers. By using large language models (LLMs) to partially automate a modified STPA process with human oversight at critical steps, we expect to address two key challenges: the heavy dependency on highly experienced safety engineering experts, and the time-consuming, labor-intensive nature of traditional hazard analysis, which often impedes its integration into real-world development workflows. We illustrate our approach with a running example, demonstrating that many seemingly unanticipated issues can, in fact, be anticipated.

[AI-50] ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval

链接: https://arxiv.org/abs/2502.07971
作者: Shubham Gupta,Zichao Li,Tianyi Chen,Cem Subakan,Siva Reddy,Perouz Taslakian,Valentina Zantedeschi
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Document retrieval is a core component of question-answering systems, as it enables conditioning answer generation on new and large-scale corpora. While effective, the standard practice of encoding documents into high-dimensional embeddings for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. In this paper, we propose a tree-based method for organizing and representing reference documents at various granular levels, which offers the flexibility to balance cost and utility, and eases the inspection of the corpus content and retrieval operations. Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches, hence directly optimizing for retrieval performance. Our evaluations show that ReTreever generally preserves full representation accuracy. Its hierarchical structure further provides strong coarse representations and enhances transparency by indirectly learning meaningful semantic groupings. Among hierarchical retrieval methods, ReTreever achieves the best retrieval accuracy at the lowest latency, proving that this family of techniques can be viable in practical applications.

[AI-51] Generative Risk Minimization for Out-of-Distribution Generalization on Graphs

链接: https://arxiv.org/abs/2502.07968
作者: Song Wang,Zhen Tan,Yaochen Zhu,Chuxu Zhang,Jundong Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: TMLR 02/2025

点击查看摘要

Abstract:Out-of-distribution (OOD) generalization on graphs aims at dealing with scenarios where the test graph distribution differs from the training graph distributions. Compared to i.i.d. data like images, the OOD generalization problem on graph-structured data remains challenging due to the non-i.i.d. property and complex structural information on graphs. Recently, several works on graph OOD generalization have explored extracting invariant subgraphs that share crucial classification information across different distributions. Nevertheless, such a strategy could be suboptimal for entirely capturing the invariant information, as the extraction of discrete structures could potentially lead to the loss of invariant information or the involvement of spurious information. In this paper, we propose an innovative framework, named Generative Risk Minimization (GRM), designed to generate an invariant subgraph for each input graph to be classified, instead of extraction. To address the challenge of optimization in the absence of optimal invariant subgraphs (i.e., ground truths), we derive a tractable form of the proposed GRM objective by introducing a latent causal variable, and its effectiveness is validated by our theoretical analysis. We further conduct extensive experiments across a variety of real-world graph datasets for both node-level and graph-level OOD generalization, and the results demonstrate the superiority of our framework GRM.

[AI-52] Intrinsic Bias is Predicted by Pretraining Data and Correlates with Downstream Performance in Vision-Language Encoders NAACL

链接: https://arxiv.org/abs/2502.07957
作者: Kshitish Ghate,Isaac Slaughter,Kyra Wilson,Mona Diab,Aylin Caliskan
类目: Artificial Intelligence (cs.AI)
*备注: Accepted to NAACL Main, 2025

点击查看摘要

Abstract:While recent work has found that vision-language models trained under the Contrastive Language Image Pre-training (CLIP) framework contain intrinsic social biases, the extent to which different upstream pre-training features of the framework relate to these biases, and hence how intrinsic bias and downstream performance are connected has been unclear. In this work, we present the largest comprehensive analysis to-date of how the upstream pre-training factors and downstream performance of CLIP models relate to their intrinsic biases. Studying 131 unique CLIP models, trained on 26 datasets, using 55 architectures, and in a variety of sizes, we evaluate bias in each model using 26 well-established unimodal and cross-modal principled Embedding Association Tests. We find that the choice of pre-training dataset is the most significant upstream predictor of bias, whereas architectural variations have minimal impact. Additionally, datasets curated using sophisticated filtering techniques aimed at enhancing downstream model performance tend to be associated with higher levels of intrinsic bias. Finally, we observe that intrinsic bias is often significantly correlated with downstream performance ( 0.3 \leq r \leq 0.8 ), suggesting that models optimized for performance inadvertently learn to amplify representational biases. Comparisons between unimodal and cross-modal association tests reveal that social group bias depends heavily on the modality. Our findings imply that more sophisticated strategies are needed to address intrinsic model bias for vision-language models across the entire model development pipeline.

[AI-53] VSC-RL: Advancing Autonomous Vision-Language Agents with Variational Subgoal-Conditioned Reinforcement Learning

链接: https://arxiv.org/abs/2502.07949
作者: Qingyuan Wu,Jianheng Liu,Jianye Hao,Jun Wang,Kun Shao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:State-of-the-art (SOTA) reinforcement learning (RL) methods enable the vision-language agents to learn from interactions with the environment without human supervision. However, they struggle with learning inefficiencies in tackling real-world complex sequential decision-making tasks, especially with sparse reward signals and long-horizon dependencies. To effectively address the issue, we introduce Variational Subgoal-Conditioned RL (VSC-RL), which reformulates the vision-language sequential decision-making task as a variational goal-conditioned RL problem, allowing us to leverage advanced optimization methods to enhance learning efficiency. Specifically, VSC-RL optimizes the SubGoal Evidence Lower BOund (SGC-ELBO), which consists of (a) maximizing the subgoal-conditioned return via RL and (b) minimizing the subgoal-conditioned difference with the reference policy. We theoretically demonstrate that SGC-ELBO is equivalent to the original optimization objective, ensuring improved learning efficiency without sacrificing performance guarantees. Additionally, for real-world complex decision-making tasks, VSC-RL leverages the vision-language model to autonomously decompose the goal into feasible subgoals, enabling efficient learning. Across various benchmarks, including challenging real-world mobile device control tasks, VSC-RL significantly outperforms the SOTA vision-language agents, achieving superior performance and remarkable improvement in learning efficiency.

[AI-54] SHACL-SKOS Based Knowledge Representation of Material Safety Data Sheet (SDS) for the Pharmaceutical Industry

链接: https://arxiv.org/abs/2502.07944
作者: Brian Lu,Dennis Pham,Ti-Chiun Chang,Michael Lovette,Terri Bui,Stephen Ma
类目: Artificial Intelligence (cs.AI)
*备注: 8 pages, 10 figures, IEEE ICSC

点击查看摘要

Abstract:We report the development of a knowledge representation and reasoning (KRR) system built on hybrid SHACL-SKOS ontologies for globally harmonized system (GHS) material Safety Data Sheets (SDS) to enhance chemical safety communication and regulatory compliance. SDS are comprehensive documents containing safety and handling information for chemical substances. Thus, they are an essential part of workplace safety and risk management. However, the vast number of Safety Data Sheets from multiple organizations, manufacturers, and suppliers that produce and distribute chemicals makes it challenging to centralize and access SDS documents through a single repository. To accomplish the underlying issues of data exchange related to chemical shipping and handling, we construct SDS related controlled vocabulary and conditions validated by SHACL, and knowledge systems of similar domains linked via SKOS. The resulting hybrid ontologies aim to provide standardized yet adaptable representations of SDS information, facilitating better data sharing, retrieval, and integration across various platforms. This paper outlines our SHACL-SKOS system architectural design and showcases our implementation for an industrial application streamlining the generation of a composite shipping cover sheet.

[AI-55] CREDAL: Close Reading of Data Models

链接: https://arxiv.org/abs/2502.07943
作者: George Fletcher,Olha Nahurna,Matvii Prytula,Julia Stoyanovich
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Data models are necessary for the birth of data and of any data-driven system. Indeed, every algorithm, every machine learning model, every statistical model, and every database has an underlying data model without which the system would not be usable. Hence, data models are excellent sites for interrogating the (material, social, political, …) conditions giving rise to a data system. Towards this, drawing inspiration from literary criticism, we propose to closely read data models in the same spirit as we closely read literary artifacts. Close readings of data models reconnect us with, among other things, the materiality, the genealogies, the techne, the closed nature, and the design of technical systems. While recognizing from literary theory that there is no one correct way to read, it is nonetheless critical to have systematic guidance for those unfamiliar with close readings. This is especially true for those trained in the computing and data sciences, who too often are enculturated to set aside the socio-political aspects of data work. A systematic methodology for reading data models currently does not exist. To fill this gap, we present the CREDAL methodology for close readings of data models. We detail our iterative development process and present results of a qualitative evaluation of CREDAL demonstrating its usability, usefulness, and effectiveness in the critical study of data.

[AI-56] Educating a Responsible AI Workforce: Piloting a Curricular Module on AI Policy in a Graduate Machine Learning Course

链接: https://arxiv.org/abs/2502.07931
作者: James Weichert,Hoda Eldardiry
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: Accepted at 2025 ASEE Annual Conference Exposition

点击查看摘要

Abstract:As artificial intelligence (AI) technologies begin to permeate diverse fields-from healthcare to education-consumers, researchers and policymakers are increasingly raising concerns about whether and how AI is regulated. It is therefore reasonable to anticipate that alignment with principles of ‘ethical’ or ‘responsible’ AI, as well as compliance with law and policy, will form an increasingly important part of AI development. Yet, for the most part, the conventional computer science curriculum is ill-equipped to prepare students for these challenges. To this end, we seek to explore how new educational content related to AI ethics and AI policy can be integrated into both ethics- and technical-focused courses. This paper describes a two-lecture ‘AI policy module’ that was piloted in a graduate-level introductory machine learning course in 2024. The module, which includes an in-class active learning game, is evaluated using data from student surveys before and after the lectures, and pedagogical motivations and considerations are discussed. We find that the module is successful in engaging otherwise technically-oriented students on the topic of AI policy, increasing student awareness of the social impacts of a variety of AI technologies and developing student interest in the field of AI regulation.

[AI-57] TransMLA: Multi-head Latent Attention Is All You Need

链接: https://arxiv.org/abs/2502.07864
作者: Fanxu Meng,Zengwei Yao,Muhan Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: this https URL

点击查看摘要

Abstract:Modern large language models (LLMs) often encounter communication bottlenecks on current hardware, rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, thereby allowing compressed latent KV states to be cached. This approach significantly reduces the KV cache size relative to traditional multi-head attention, leading to faster inference. Moreover, MLA employs an up-projection matrix to increase expressiveness, trading additional computation for reduced communication overhead. Although MLA has demonstrated efficiency and effectiveness in Deepseek V2/V3/R1, many major model providers still rely on Group Query Attention (GQA) and have not announced any plans to adopt MLA. In this paper, we show that GQA can always be represented by MLA while maintaining the same KV cache overhead, but the converse does not hold. To encourage broader use of MLA, we introduce TransMLA, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo additional training to boost expressiveness without increasing the KV cache size. Furthermore, we plan to develop MLA-specific inference acceleration techniques to preserve low latency in transformed models, thus enabling more efficient distillation of Deepseek R1.
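The KV-cache saving that motivates MLA can be illustrated with simple arithmetic. This is a back-of-the-envelope sketch with illustrative (hypothetical) model dimensions, not the configuration of any model named in the abstract; the latent dimension in particular is an assumption.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Standard attention caches full keys and values (factor of 2) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def latent_cache_bytes_per_token(n_layers, latent_dim, bytes_per_elem=2):
    """MLA-style attention caches one compressed latent vector per layer."""
    return n_layers * latent_dim * bytes_per_elem

mha = kv_cache_bytes_per_token(32, 32, 128)   # full multi-head attention
gqa = kv_cache_bytes_per_token(32, 8, 128)    # grouped-query attention, 8 KV heads
mla = latent_cache_bytes_per_token(32, 512)   # compressed latent, illustrative size
print(mha, gqa, mla)  # 524288 131072 32768 bytes per token
```

Under these assumed sizes, GQA shrinks the cache 4x relative to full attention and the latent cache shrinks it a further 4x, which is the kind of communication saving the paper targets.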

[AI-58] BalanceKV: KV Cache Compression through Discrepancy Theory

链接: https://arxiv.org/abs/2502.07861
作者: Insu Han,Michael Kapralov,Ekaterina Kochetkova,Kshiteej Sheth,Amir Zandieh
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved impressive success, but their high memory requirements present challenges for long-context token generation. The memory complexity of long-context LLMs is primarily due to the need to store Key-Value (KV) embeddings in their KV cache. We present BalanceKV, a KV cache compression method based on geometric sampling process stemming from Banaszczyk’s vector balancing theory, which introduces dependencies informed by the geometry of keys and value tokens, and improves precision. BalanceKV offers both theoretically proven and empirically validated performance improvements over existing methods.

[AI-59] Mathematical reasoning and the computer DATE

链接: https://arxiv.org/abs/2502.07850
作者: Kevin Buzzard
类目: Artificial Intelligence (cs.AI)
*备注: This article was written in 2023 and is thus now rather out of date. Apologies for taking so long to upload to ArXiv

点击查看摘要

Abstract:Computers have already changed the way that humans do mathematics: they enable us to compute efficiently. But will they soon be helping us to reason? And will they one day start reasoning themselves? We give an overview of recent developments in neural networks, computer theorem provers and large language models.

[AI-60] Understanding Classifier-Free Guidance: High-Dimensional Theory and Non-Linear Generalizations

链接: https://arxiv.org/abs/2502.07849
作者: Krunoslav Lehman Pavasovic,Jakob Verbeek,Giulio Biroli,Marc Mezard
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent studies have raised concerns about the effectiveness of Classifier-Free Guidance (CFG), indicating that in low-dimensional settings, it can lead to overshooting the target distribution and reducing sample diversity. In this work, we demonstrate that in infinite and sufficiently high-dimensional contexts CFG effectively reproduces the target distribution, revealing a blessing-of-dimensionality result. Additionally, we explore finite-dimensional effects, precisely characterizing overshoot and variance reduction. Based on our analysis, we introduce non-linear generalizations of CFG. Through numerical simulations on Gaussian mixtures and experiments on class-conditional and text-to-image diffusion models, we validate our analysis and show that our non-linear CFG offers improved flexibility and generation quality without additional computation cost.
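The standard (linear) CFG update that the paper generalizes combines unconditional and conditional score estimates with a guidance weight. A minimal sketch, using toy vectors in place of real diffusion-model outputs:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Linear classifier-free guidance: extrapolate from the unconditional
    estimate toward (and, for w > 1, past) the conditional estimate."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.1, -0.2])  # toy unconditional score estimate
eps_c = np.array([0.3, 0.0])   # toy conditional score estimate
print(cfg_combine(eps_u, eps_c, w=2.0))  # -> [0.5 0.2]
```

With w = 1 this recovers the conditional estimate exactly; w > 1 is the extrapolation regime where the paper characterizes overshoot and variance reduction, and its non-linear generalizations replace this linear combination.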

[AI-61] Column-wise Quantization of Weights and Partial Sums for Accurate and Efficient Compute-In-Memory Accelerators

链接: https://arxiv.org/abs/2502.07842
作者: Jiyoon Kim,Kang Eun Jeon,Yulhwa Kim,Jong Hwan Ko
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Compute-in-memory (CIM) is an efficient method for implementing deep neural networks (DNNs) but suffers from substantial overhead from analog-to-digital converters (ADCs), especially as ADC precision increases. Low-precision ADCs can reduce this overhead but introduce partial-sum quantization errors degrading accuracy. Additionally, low-bit weight constraints, imposed by cell limitations and the need for multiple cells for higher-bit weights, present further challenges. While fine-grained partial-sum quantization has been studied to lower ADC resolution effectively, weight granularity, which limits overall partial-sum quantized accuracy, remains underexplored. This work addresses these challenges by aligning weight and partial-sum quantization granularities at the column-wise level. Our method improves accuracy while maintaining dequantization overhead, simplifies training by removing two-stage processes, and ensures robustness to memory cell variations via independent column-wise scale factors. We also propose an open-source CIM-oriented convolution framework to handle fine-grained weights and partial-sums efficiently, incorporating a novel tiling method and group convolution. Experimental results on ResNet-20 (CIFAR-10, CIFAR-100) and ResNet-18 (ImageNet) show accuracy improvements of 0.99%, 2.69%, and 1.01%, respectively, compared to the best-performing related works. Additionally, variation analysis reveals the robustness of our method against memory cell variations. These findings highlight the effectiveness of our quantization scheme in enhancing accuracy and robustness while maintaining hardware efficiency in CIM-based DNN implementations. Our code is available at this https URL.

[AI-62] Bridging LLM -Generated Code and Requirements: Reverse Generation technique and SBC Metric for Developer Insights

链接: https://arxiv.org/abs/2502.07835
作者: Ahilan Ayyachamy Nadar Ponnusamy
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rise of Large Language Models (LLMs) in software engineering, particularly in code generation, has garnered significant attention. However, assessing the quality of AI-generated code remains a challenge due to the inherent complexity of programming tasks and the lack of robust evaluation metrics that align well with human judgment. Traditional token-based metrics such as BLEU and ROUGE, while commonly used in natural language processing, exhibit weak correlations with human assessments in code intelligence and verification tasks. Furthermore, these metrics are primarily research focused and are not designed for seamless integration into the software development lifecycle, limiting their practical utility for developers seeking to improve code quality and security. AI-assisted coding has been shown to be more beneficial for senior developers, as they possess the expertise to critically evaluate the generated code for correctness, completeness, and compliance. In contrast, junior developers may struggle to identify hallucinations, missing functionality, or incorrect logic in AI-generated code. To bridge this gap, This paper introduces a novel scoring mechanism called the SBC score, which is based on a reverse generation technique that leverages the natural language generation capabilities of LLMs. Unlike direct code analysis, our approach reconstructs system requirements from AI-generated code and compares them with the original specifications to quantify accuracy. The SBC score combines semantic similarity, BLEU, and completeness analysis, providing actionable insights to developers by highlighting missing features and hallucinations. Our code and datasets are available on GitHub.
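The abstract says the SBC score combines semantic similarity, BLEU, and completeness, but does not give the weighting. A hedged sketch of one plausible combination, assuming equal weights (the function name and weights are illustrative, not the paper's actual formula):

```python
def sbc_score(semantic_sim, bleu, completeness, weights=(1/3, 1/3, 1/3)):
    """Illustrative weighted combination of the three SBC components,
    each assumed to be normalized to [0, 1]. Actual weights are not
    specified in the abstract."""
    ws, wb, wc = weights
    return ws * semantic_sim + wb * bleu + wc * completeness

# Requirements reconstructed from generated code match the spec well
# semantically, less well token-for-token, and cover most features.
print(round(sbc_score(0.9, 0.6, 0.75), 4))  # -> 0.75
```

The point of the combination is that a high semantic score cannot mask missing features: a low completeness component pulls the overall score down even when the reconstructed requirements read well.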

[AI-63] MEMHD: Memory-Efficient Multi-Centroid Hyperdimensional Computing for Fully-Utilized In-Memory Computing Architectures DATE2025

链接: https://arxiv.org/abs/2502.07834
作者: Do Yeong Kang,Yeong Hwan Oh,Chanwook Hwang,Jinhee Kim,Kang Eun Jeon,Jong Hwan Ko
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to appear at DATE 2025

点击查看摘要

Abstract:The implementation of Hyperdimensional Computing (HDC) on In-Memory Computing (IMC) architectures faces significant challenges due to the mismatch between highdimensional vectors and IMC array sizes, leading to inefficient memory utilization and increased computation cycles. This paper presents MEMHD, a Memory-Efficient Multi-centroid HDC framework designed to address these challenges. MEMHD introduces a clustering-based initialization method and quantization aware iterative learning for multi-centroid associative memory. Through these approaches and its overall architecture, MEMHD achieves a significant reduction in memory requirements while maintaining or improving classification accuracy. Our approach achieves full utilization of IMC arrays and enables one-shot (or few-shot) associative search. Experimental results demonstrate that MEMHD outperforms state-of-the-art binary HDC models, achieving up to 13.69% higher accuracy with the same memory usage, or 13.25x more memory efficiency at the same accuracy level. Moreover, MEMHD reduces computation cycles by up to 80x and array usage by up to 71x compared to baseline IMC mapping methods when mapped to 128x128 IMC arrays, while significantly improving energy and computation cycle efficiency.

[AI-64] SHARP: Accelerating Language Model Inference by SHaring Adjacent layers with Recovery Parameters

链接: https://arxiv.org/abs/2502.07832
作者: Yiping Wang,Hanxian Huang,Yifang Chen,Jishen Zhao,Simon Shaolei Du,Yuandong Tian
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 24 pages

点击查看摘要

Abstract:While Large language models (LLMs) have advanced natural language processing tasks, their growing computational and memory demands make deployment on resource-constrained devices like mobile phones increasingly challenging. In this paper, we propose SHARP (SHaring Adjacent Layers with Recovery Parameters), a novel approach to accelerate LLM inference by sharing parameters across adjacent layers, thus reducing memory load overhead, while introducing low-rank recovery parameters to maintain performance. Inspired by observations that consecutive layers have similar outputs, SHARP employs a two-stage recovery process: Single Layer Warmup (SLW), and Supervised Fine-Tuning (SFT). The SLW stage aligns the outputs of the shared layers using L_2 loss, providing a good initialization for the following SFT stage to further restore the model performance. Extensive experiments demonstrate that SHARP can recover the model’s perplexity on various in-distribution tasks using no more than 50k fine-tuning data while reducing the number of stored MLP parameters by 38% to 65%. We also conduct several ablation studies of SHARP and show that replacing layers towards the later parts of the model yields better performance retention, and that different recovery parameterizations perform similarly when parameter counts are matched. Furthermore, SHARP saves 42.8% in model storage and reduces the total inference time by 42.2% compared to the original Llama2-7b model on mobile devices. Our results highlight SHARP as an efficient solution for reducing inference costs in deploying LLMs without the need for pretraining-scale resources.
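The core SHARP idea, sharing an adjacent layer's weights and recovering the difference with low-rank parameters, can be sketched numerically. This toy uses an SVD truncation of the weight difference as the recovery parameters; the matrix sizes and the 0.05 similarity scale are assumptions for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W1 = rng.standard_normal((d, d))
W2 = W1 + 0.05 * rng.standard_normal((d, d))  # adjacent layer with similar weights

# Share W1 for both layers; recover layer 2 with a rank-r correction A @ B.
r = 8
U, s, Vt = np.linalg.svd(W2 - W1)
A = U[:, :r] * s[:r]        # d x r recovery parameters
B = Vt[:r, :]               # r x d recovery parameters
W2_approx = W1 + A @ B      # shared weights + low-rank recovery

rel_err = np.linalg.norm(W2 - W2_approx) / np.linalg.norm(W2)
print(rel_err)  # small relative error at a fraction of the extra parameters
```

Storing A and B costs 2*d*r values instead of d*d for a second full matrix, which is the memory saving the recovery-parameter design trades against a small approximation error (here refined further by the paper's SLW and SFT stages).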

[AI-65] Implicit Language Models are RNNs: Balancing Parallelization and Expressivity

链接: https://arxiv.org/abs/2502.07827
作者: Mark Schöne,Babak Rahmani,Heiner Kremer,Fabian Falck,Hitesh Ballani,Jannes Gladrow
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:State-space models (SSMs) and transformers dominate the language modeling landscape. However, they are constrained to a lower computational complexity than classical recurrent neural networks (RNNs), limiting their expressivity. In contrast, RNNs lack parallelization during training, raising fundamental questions about the trade off between parallelization and expressivity. We propose implicit SSMs, which iterate a transformation until convergence to a fixed point. Theoretically, we show that implicit SSMs implement the non-linear state-transitions of RNNs. Empirically, we find that only approximate fixed-point convergence suffices, enabling the design of a scalable training curriculum that largely retains parallelization, with full convergence required only for a small subset of tokens. Our approach demonstrates superior state-tracking capabilities on regular languages, surpassing transformers and SSMs. We further scale implicit SSMs to natural language reasoning tasks and pretraining of large-scale language models up to 1.3B parameters on 207B tokens - representing, to our knowledge, the largest implicit model trained to date. Notably, our implicit models outperform their explicit counterparts on standard benchmarks.
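The fixed-point iteration at the heart of implicit SSMs can be sketched in a few lines. This is a toy, assuming a contractive tanh recurrence (weights scaled small so the iteration converges); the real models iterate a learned SSM transformation, not this hand-built one.

```python
import numpy as np

def implicit_step(x, W, U, max_iters=200, tol=1e-10):
    """Iterate h <- tanh(W h + U x) until convergence to a fixed point.
    Converges when the map is contractive (e.g., spectral norm of W < 1)."""
    h = np.zeros(W.shape[0])
    for _ in range(max_iters):
        h_next = np.tanh(W @ h + U @ x)
        if np.linalg.norm(h_next - h) < tol:
            break
        h = h_next
    return h_next

rng = np.random.default_rng(1)
W = 0.1 * rng.standard_normal((4, 4))  # small scale keeps the map contractive
U = rng.standard_normal((4, 2))
x = rng.standard_normal(2)
h_star = implicit_step(x, W, U)
print(h_star)  # satisfies h_star == tanh(W h_star + U x) up to tolerance
```

The paper's empirical observation is that the iteration can be truncated early for most tokens (approximate convergence suffices), which is what preserves training parallelization.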

[AI-66] Runtime Tunable Tsetlin Machines for Edge Inference on eFPGAs

链接: https://arxiv.org/abs/2502.07823
作者: Tousif Rahman,Gang Mao,Bob Pattison,Sidharth Maheshwari,Marcos Sartori,Adrian Wheeldon,Rishad Shafik,Alex Yakovlev
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted as a full paper by the 2025 EDGE AI FOUNDATION Austin

点击查看摘要

Abstract:Embedded Field-Programmable Gate Arrays (eFPGAs) allow for the design of hardware accelerators of edge Machine Learning (ML) applications at a lower power budget compared with traditional FPGA platforms. However, the limited eFPGA logic and memory significantly constrain compute capabilities and model size. As such, ML application deployment on eFPGAs is in direct contrast with the most recent FPGA approaches developing architecture-specific implementations and maximizing throughput over resource frugality. This paper focuses on the opposite side of this trade-off: the proposed eFPGA accelerator focuses on minimizing resource usage and allowing flexibility for on-field recalibration over throughput. This allows for runtime changes in model size, architecture, and input data dimensionality without offline resynthesis. This is made possible through the use of a bitwise compressed inference architecture of the Tsetlin Machine (TM) algorithm. TM compute does not require any multiplication operations, being limited to only bitwise AND, OR, NOT, summations and additions. Additionally, TM model compression allows the entire model to fit within the on-chip block RAM of the eFPGA. The paper uses this accelerator to propose a strategy for runtime model tuning in the field. The proposed approach uses 2.5x fewer Look-up-Tables (LUTs) and 3.38x fewer registers than the current most resource-frugal design and achieves up to 129x energy reduction compared with low-power microcontrollers running the same ML application.

[AI-67] Low-Rank Compression for IMC Arrays DATE’25

链接: https://arxiv.org/abs/2502.07820
作者: Kang Eun Jeon,Johnny Rhe,Jong Hwan Ko
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: Accepted to appear at DATE’25 (Lyon, France)

点击查看摘要

Abstract:In this study, we address the challenge of low-rank model compression in the context of in-memory computing (IMC) architectures. Traditional pruning approaches, while effective in model size reduction, necessitate additional peripheral circuitry to manage complex dataflows and mitigate dislocation issues, leading to increased area and energy overheads. To circumvent these drawbacks, we propose leveraging low-rank compression techniques, which, unlike pruning, streamline the dataflow and seamlessly integrate with IMC architectures. However, low-rank compression presents its own set of challenges, namely i) suboptimal IMC array utilization and ii) compromised accuracy. To address these issues, we introduce a novel approach i) employing shift and duplicate kernel (SDK) mapping technique, which exploits idle IMC columns for parallel processing, and ii) group low-rank convolution, which mitigates the information imbalance in the decomposed matrices. Our experimental results demonstrate that our proposed method achieves up to 2.5x speedup or +20.9% accuracy boost over existing pruning techniques.

[AI-68] Enhancing kidney transplantation through multi-agent kidney exchange programs: A comprehensive review and optimization models

链接: https://arxiv.org/abs/2502.07819
作者: Shayan Sharifi
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:This paper presents a comprehensive review of the last two decades of research on Kidney Exchange Programs (KEPs), systematically categorizing and classifying key contributions to provide readers with a structured understanding of advancements in the field. The review highlights the evolution of KEP methodologies and lays the foundation for our contribution. We propose three mathematical models aimed at improving both the quantity and quality of kidney transplants. Model 1 maximizes the number of transplants by focusing on compatibility based on blood type and PRA, without additional constraints. Model 2 introduces a minimum Human Leukocyte Antigen (HLA) compatibility threshold to enhance transplant quality, though this leads to fewer matches. Model 3 extends the problem to a Multi-Agent Kidney Exchange Program (MKEP), pooling incompatible donor-recipient pairs across multiple agents, resulting in a higher number of successful transplants while ensuring fairness across agents. Sensitivity analyses demonstrate trade-offs between transplant quantity and quality, with Model 3 striking the optimal balance by leveraging multi-agent collaboration to improve both the number and quality of transplants. These findings underscore the potential benefits of more integrated kidney exchange systems.

[AI-69] Temporal Model On Quantum Logic

链接: https://arxiv.org/abs/2502.07817
作者: Francesco D’Agostino
类目: Artificial Intelligence (cs.AI); Logic (math.LO); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:This paper introduces a unified theoretical framework for modeling temporal memory dynamics, combining concepts from temporal logic, memory decay models, and hierarchical contexts. The framework formalizes the evolution of propositions over time using linear and branching temporal models, incorporating exponential decay (Ebbinghaus forgetting curve) and reactivation mechanisms via Bayesian updating. The hierarchical organization of memory is represented using directed acyclic graphs to model recall dependencies and interference. Novel insights include feedback dynamics, recursive influences in memory chains, and the integration of entropy-based recall efficiency. This approach provides a foundation for understanding memory processes across cognitive and computational domains.
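The Ebbinghaus-style decay with reactivation mentioned above can be sketched numerically. This toy uses exponential retention R(t) = exp(-t / S) and models reactivation as a simple multiplicative boost to memory strength S, a simplification of the paper's Bayesian-updating mechanism; the consolidation factor is an assumption.

```python
import math

def retention(t, strength):
    """Ebbinghaus-style exponential forgetting curve: R(t) = exp(-t / S)."""
    return math.exp(-t / strength)

S = 2.0
r_before = retention(1.0, S)

# Reactivation (rehearsal) modeled as consolidating memory strength.
S_after_review = S * 2.0  # assumed consolidation factor, for illustration
r_after = retention(1.0, S_after_review)
print(r_before, r_after)  # recall decays more slowly after reactivation
```

At t = 0 retention is 1 regardless of strength; after reactivation the same elapsed time yields a higher recall probability, which is the qualitative behavior the framework formalizes.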

[AI-70] Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)

链接: https://arxiv.org/abs/2502.07815
作者: Lokesh Koli,Shubham Kalra,Karanpreet Singh
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Detecting sensitive data such as Personally Identifiable Information (PII) and Protected Health Information (PHI) is critical for data security platforms. This study evaluates regex-based pattern matching algorithms and exact-match search techniques to optimize detection speed, accuracy, and scalability. Our benchmarking results indicate that Google RE2 provides the best balance of speed (10-15 ms/MB), memory efficiency (8-16 MB), and accuracy (99.5%) among regex engines, outperforming PCRE while maintaining broader hardware compatibility than Hyperscan. For exact matching, Aho-Corasick demonstrated superior performance (8 ms/MB) and scalability for large datasets. Performance analysis revealed that regex processing time scales linearly with dataset size and pattern complexity. A hybrid AI + Regex approach achieved the highest F1 score (91.6%) by improving recall and minimizing false positives. Device benchmarking confirmed that our solution maintains efficient CPU and memory usage on both high-performance and mid-range systems. Despite its effectiveness, challenges remain, such as limited multilingual support and the need for regular pattern updates. Future work should focus on expanding language coverage, integrating data security and privacy management (DSPM) with data loss prevention (DLP) tools, and enhancing regulatory compliance for broader global adoption.
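The regex-plus-exact-match pipeline benchmarked above can be sketched compactly. This uses Python's built-in `re` module rather than the RE2/Hyperscan engines the paper evaluates, and the PII patterns and dictionary terms are simplified placeholders, not production rules:

```python
import re

# Hypothetical PII patterns; real deployments use vetted, locale-aware rules.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
EXACT_TERMS = {"patient id", "medical record"}  # exact-match dictionary

def scan(text: str) -> list[tuple[str, str]]:
    """Return (label, match) pairs from a regex pass and an exact-match pass."""
    hits = [(label, m.group()) for label, rx in PATTERNS.items()
            for m in rx.finditer(text)]
    low = text.lower()
    hits += [("term", t) for t in EXACT_TERMS if t in low]
    return hits
```

In the paper's setting the exact-match pass would be an Aho-Corasick automaton over a large term dictionary; the simple substring loop here stands in for it.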

[AI-71] Satellite Observations Guided Diffusion Model for Accurate Meteorological States at Arbitrary Resolution

链接: https://arxiv.org/abs/2502.07814
作者: Siwei Tu,Ben Fei,Weidong Yang,Fenghua Ling,Hao Chen,Zili Liu,Kun Chen,Hang Fan,Wanli Ouyang,Lei Bai
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Accurate acquisition of surface meteorological conditions at arbitrary locations holds significant importance for weather forecasting and climate simulation. Because meteorological states derived from satellite observations are often provided as low-resolution grid fields, directly applying spatial interpolation to obtain meteorological states at specific locations often results in significant discrepancies when compared to actual observations. Existing downscaling methods for acquiring meteorological state information at higher resolutions commonly overlook the correlation with satellite observations. To bridge this gap, we propose the Satellite-observations Guided Diffusion Model (SGD), a conditional diffusion model pre-trained on ERA5 reanalysis data with satellite observations (GridSat) as conditions, which is employed for sampling downscaled meteorological states through a zero-shot guided sampling strategy and patch-based methods. During training, we fuse information from GridSat satellite observations into ERA5 maps via an attention mechanism, enabling SGD to generate atmospheric states that align more accurately with actual conditions. During sampling, we employ optimizable convolutional kernels to simulate the upscaling process, thereby generating high-resolution ERA5 maps using low-resolution ERA5 maps as well as observations from weather stations as guidance. Moreover, our patch-based method enables SGD to generate meteorological states at arbitrary resolutions. Experiments demonstrate that SGD achieves accurate downscaling of meteorological states to a resolution of 6.25 km.

[AI-72] CryptoX: Compositional Reasoning Evaluation of Large Language Models

链接: https://arxiv.org/abs/2502.07813
作者: Jiajun Shi,Chaoren Wei,Liqun Yang,Zekun Moore Wang,Chenghao Yang,Ge Zhang,Stephen Huang,Tao Peng,Jian Yang,Zhoufutu Wen
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The compositional reasoning capacity has long been regarded as critical to the generalization and intelligence emergence of large language models (LLMs). However, despite numerous reasoning-related benchmarks, the compositional reasoning capacity of LLMs is rarely studied or quantified in the existing benchmarks. In this paper, we introduce CryptoX, an evaluation framework that, for the first time, combines existing benchmarks and cryptographic principles to quantify the compositional reasoning capacity of LLMs. Building upon CryptoX, we construct CryptoBench, which integrates these principles into several benchmarks for systematic evaluation. We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a huge gap between open-source and closed-source LLMs. We further conduct thorough mechanistic interpretability experiments to reveal the inner mechanism of LLMs’ compositional reasoning, involving subproblem decomposition, subproblem inference, and summarizing subproblem conclusions. Through analysis based on CryptoBench, we highlight the value of independently studying compositional reasoning and emphasize the need to enhance the compositional reasoning capabilities of LLMs.

[AI-73] Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment

链接: https://arxiv.org/abs/2502.07803
作者: Cheryl Li,Tianyuan Xu,Yiwen Guo
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs) by generating natural language (NL) rationales that lead to the final answer. However, it struggles with numerical computation, which has led to the development of program-aided techniques. Despite their potential, a persistent challenge remains: inconsistencies between LLM-reported reasoning steps and the logic in generated programs, which we term “reasoning hallucinations”. This stems from the inherent ambiguities of NL and the statistical nature of LLMs, which often lack rigorous logical coherence. To address this challenge, we propose a novel test-time scaling framework, Reasoning-as-Logic-Units (RaLU), which constructs a more reliable reasoning path by aligning logical units between the generated program and their corresponding NL descriptions. By decomposing the initially generated program into discrete units using static analysis, RaLU engages in an iterative dialogue with the LLM to judge, refine, and explain each unit. A rewind-and-correct mechanism ensures alignment between code statements and task requirements in each unit, ultimately forming a cohesive reasoning path under the program’s logic, from which the model reaches a final solution. Our experiments demonstrate that RaLU significantly outperforms existing baselines in mathematical reasoning (GSM8K, MATH) and algorithmic reasoning (HumanEval+, MBPP+), underscoring its potential to advance LLM reasoning and programming by offering enhanced accuracy and interpretability.

[AI-74] Regulatory Science Innovation for Generative AI and Large Language Models in Health and Medicine: A Global Call for Action

链接: https://arxiv.org/abs/2502.07794
作者: Jasmine Chiat Ling Ong,Yilin Ning,Mingxuan Liu,Yian Ma,Zhao Liang,Kuldev Singh,Robert T Chang,Silke Vogel,John CW Lim,Iris Siu Kwan Tan,Oscar Freyer,Stephen Gilbert,Danielle S Bitterman,Xiaoxuan Liu,Alastair K Denniston,Nan Liu
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The integration of generative AI (GenAI) and large language models (LLMs) in healthcare presents both unprecedented opportunities and challenges, necessitating innovative regulatory approaches. GenAI and LLMs offer broad applications, from automating clinical workflows to personalizing diagnostics. However, the non-deterministic outputs, broad functionalities and complex integration of GenAI and LLMs challenge existing medical device regulatory frameworks, including the total product life cycle (TPLC) approach. Here we discuss the constraints of the TPLC approach to GenAI and LLM-based medical device regulation, and advocate for global collaboration in regulatory science research. This serves as the foundation for developing innovative approaches including adaptive policies and regulatory sandboxes, to test and refine governance in real-world settings. International harmonization, as seen with the International Medical Device Regulators Forum, is essential to manage implications of LLM on global health, including risks of widening health inequities driven by inherent model biases. By engaging multidisciplinary expertise, prioritizing iterative, data-driven approaches, and focusing on the needs of diverse populations, global regulatory science research enables the responsible and equitable advancement of LLM innovations in healthcare.

[AI-75] Can Generative AI be Egalitarian?

链接: https://arxiv.org/abs/2502.07790
作者: Philip Feldman,James R. Foulds,Shimei Pan
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:The recent explosion of “foundation” generative AI models has been built upon the extensive extraction of value from online sources, often without corresponding reciprocation. This pattern mirrors and intensifies the extractive practices of surveillance capitalism, while the potential for enormous profit has challenged technology organizations’ commitments to responsible AI practices, raising significant ethical and societal concerns. However, a promising alternative is emerging: the development of models that rely on content willingly and collaboratively provided by users. This article explores this “egalitarian” approach to generative AI, taking inspiration from the successful model of Wikipedia. We explore the potential implications of this approach for the design, development, and constraints of future foundation models. We argue that such an approach is not only ethically sound but may also lead to models that are more responsive to user needs, more diverse in their training data, and ultimately more aligned with societal values. Furthermore, we explore potential challenges and limitations of this approach, including issues of scalability, quality control, and potential biases inherent in volunteer-contributed content.

[AI-76] Do AI assistants help students write formal specifications? A study with ChatGPT and the B-Method

链接: https://arxiv.org/abs/2502.07789
作者: Alfredo Capozucca,Daniil Yampolskyi,Alexander Goldberg,Maximiliano Cristiá
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper investigates the role of AI assistants, specifically OpenAI’s ChatGPT, in teaching formal methods (FM) to undergraduate students, using the B-method as a formal specification technique. While existing studies demonstrate the effectiveness of AI in coding tasks, no study reports on its impact on formal specifications. We examine whether ChatGPT provides an advantage when writing B-specifications and analyse student trust in its outputs. Our findings indicate that the AI does not help students to enhance the correctness of their specifications, with low trust correlating to better outcomes. Additionally, we identify a behavioural pattern in how students interact with ChatGPT that may influence the correctness of their B-specifications.

[AI-77] Counterexample Guided Program Repair Using Zero-Shot Learning and MaxSAT-based Fault Localization AAAI2025

链接: https://arxiv.org/abs/2502.07786
作者: Pedro Orvalho,Mikoláš Janota,Vasco Manquinho
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Accepted at AAAI 2025. 11 pages, 4 listings, 2 figures and 5 tables

点击查看摘要

Abstract:Automated Program Repair (APR) for introductory programming assignments (IPAs) is motivated by the large number of student enrollments in programming courses each year. Since providing feedback on IPAs requires substantial time and effort from faculty, personalized feedback often involves suggesting fixes to students’ programs. Formal Methods (FM)-based semantic repair approaches, which check a program’s execution against a test suite or reference solution, are effective but limited. These tools excel at identifying buggy parts but can only fix programs if the correct implementation and the faulty one share the same control flow graph. Conversely, Large Language Models (LLMs) are used for APR but often make extensive rather than minimal rewrites. This leads to more invasive fixes, making it harder for students to learn from their mistakes. In summary, LLMs excel at completing strings, while FM-based fault localization excels at identifying buggy parts of a program. In this paper, we propose a novel approach that combines the strengths of both FM-based fault localization and LLMs, via zero-shot learning, to enhance APR for IPAs. Our method uses MaxSAT-based fault localization to identify buggy parts of a program, then presents the LLM with a program sketch devoid of these buggy statements. This hybrid approach follows a CEGIS loop to iteratively refine the program. We ask the LLM to synthesize the missing parts, which are then checked against a test suite. If the suggested program is incorrect, a counterexample from the test suite is fed back to the LLM. Our experiments show that our counterexample-guided approach, using MaxSAT-based bug-free program sketches, significantly improves the repair capabilities of all six evaluated LLMs. This method allows LLMs to repair more programs with smaller fixes, outperforming other configurations and state-of-the-art symbolic program repair tools.
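The CEGIS loop described above has a simple skeleton that can be sketched abstractly. This toy version enumerates candidate completions (standing in for LLM-synthesized sketch fillings) and records the first failing test as the counterexample that would be fed back to the model; it illustrates only the loop structure, not the paper's system:

```python
def cegis_repair(candidates, tests):
    """Counterexample-guided loop: try each candidate completion of a
    program sketch; when a candidate fails, record the failing test
    (the counterexample a real system would feed back to the LLM)."""
    counterexamples = []
    for candidate in candidates:
        failing = [t for t in tests if not t(candidate)]
        if not failing:
            return candidate, counterexamples   # all tests pass: repaired
        counterexamples.append((candidate, failing[0]))
    return None, counterexamples                # no candidate worked
```

For example, repairing a broken `abs`: the first candidate (`lambda x: x`) fails the negative-input test and produces a counterexample, while the second passes the whole suite and is returned.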

[AI-78] NDAI Agreements

链接: https://arxiv.org/abs/2502.07924
作者: Matthew Stephenson,Andrew Miller,Xyn Sun,Bhargav Annem,Rohan Parikh
类目: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI)
*备注: 21 pages, 1 figure

点击查看摘要

Abstract:We study a fundamental challenge in the economics of innovation: an inventor must reveal details of a new idea to secure compensation or funding, yet such disclosure risks expropriation. We present a model in which a seller (inventor) and buyer (investor) bargain over an information good under the threat of hold-up. In the classical setting, the seller withholds disclosure to avoid misappropriation, leading to inefficiency. We show that trusted execution environments (TEEs) combined with AI agents can mitigate and even fully eliminate this hold-up problem. By delegating the disclosure and payment decisions to tamper-proof programs, the seller can safely reveal the invention without risking expropriation, achieving full disclosure and an efficient ex post transfer. Moreover, even if the invention’s value exceeds a threshold that TEEs can fully secure, partial disclosure still improves outcomes compared to no disclosure. Recognizing that real AI agents are imperfect, we model “agent errors” in payments or disclosures and demonstrate that budget caps and acceptance thresholds suffice to preserve most of the efficiency gains. Our results imply that cryptographic or hardware-based solutions can function as an “ironclad NDA,” substantially mitigating the fundamental disclosure-appropriation paradox first identified by Arrow (1962) and Nelson (1959). This has far-reaching policy implications for fostering R&D, technology transfer, and collaboration.

[AI-79] SNAP: Sequential Non-Ancestor Pruning for Targeted Causal Effect Estimation With an Unknown Graph AISTATS2025

链接: https://arxiv.org/abs/2502.07857
作者: Mátyás Schubert,Tom Claassen,Sara Magliacane
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at AISTATS 2025

点击查看摘要

Abstract:Causal discovery can be computationally demanding for large numbers of variables. If we only wish to estimate the causal effects on a small subset of target variables, we might not need to learn the causal graph for all variables, but only a small subgraph that includes the targets and their adjustment sets. In this paper, we focus on identifying causal effects between target variables in a computationally and statistically efficient way. This task combines causal discovery and effect estimation, aligning the discovery objective with the effects to be estimated. We show that definite non-ancestors of the targets are unnecessary to learn causal relations between the targets and to identify efficient adjustments sets. We sequentially identify and prune these definite non-ancestors with our Sequential Non-Ancestor Pruning (SNAP) framework, which can be used either as a preprocessing step to standard causal discovery methods, or as a standalone sound and complete causal discovery algorithm. Our results on synthetic and real data show that both approaches substantially reduce the number of independence tests and the computation time without compromising the quality of causal effect estimations.
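In the idealized case where the graph is fully known, SNAP's core pruning idea reduces to keeping only the nodes with a directed path into a target set; everything else is a definite non-ancestor. A minimal sketch on an explicit DAG (the actual algorithm identifies non-ancestors from conditional independence tests, not a known graph):

```python
def ancestors(graph: dict[str, list[str]], targets: list[str]) -> set[str]:
    """Keep the targets plus every node with a directed path into a target;
    everything else is a definite non-ancestor and can be pruned."""
    parents: dict[str, set[str]] = {v: set() for v in graph}
    for u, children in graph.items():
        for v in children:
            parents[v].add(u)
    keep, stack = set(targets), list(targets)
    while stack:                      # reverse reachability from the targets
        v = stack.pop()
        for p in parents[v]:
            if p not in keep:
                keep.add(p)
                stack.append(p)
    return keep
```

On the toy graph a → b → t with a disconnected branch x → y, only {a, b, t} survive, so any subsequent discovery or adjustment-set search runs on a much smaller subgraph.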

[AI-80] Some things to know about achieving artificial general intelligence

链接: https://arxiv.org/abs/2502.07828
作者: Herbert Roitblat
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Current and foreseeable GenAI models are not capable of achieving artificial general intelligence because they are burdened with anthropogenic debt. They depend heavily on human input to provide well-structured problems, architecture, and training data. They cast every problem as a language pattern learning problem and are thus not capable of the kind of autonomy needed to achieve artificial general intelligence. Current models succeed at their tasks because people solve most of the problems to which these models are directed, leaving only simple computations for the model to perform, such as gradient descent. Another barrier is the need to recognize that there are multiple kinds of problems, some of which cannot be solved by available computational methods (for example, “insight problems”). Current methods for evaluating models (benchmarks and tests) are not adequate to identify the generality of the solutions, because it is impossible to infer the means by which a problem was solved from the fact of its solution. A test could be passed, for example, by a test-specific or a test-general method. It is a logical fallacy (affirming the consequent) to infer a method of solution from the observation of success.

[AI-81] Quantum Powered Credit Risk Assessment: A Novel Approach using hybrid Quantum-Classical Deep Neural Network for Row-Type Dependent Predictive Analysis

链接: https://arxiv.org/abs/2502.07806
作者: Rath Minati,Date Hema
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The integration of Quantum Deep Learning (QDL) techniques into the landscape of financial risk analysis presents a promising avenue for innovation. This study introduces a framework for credit risk assessment in the banking sector, combining quantum deep learning techniques with adaptive modeling for Row-Type Dependent Predictive Analysis (RTDPA). By leveraging RTDPA, the proposed approach tailors predictive models to different loan categories, aiming to enhance the accuracy and efficiency of credit risk evaluation. While this work explores the potential of integrating quantum methods with classical deep learning for risk assessment, it focuses on the feasibility and performance of this hybrid framework rather than claiming transformative industry-wide impacts. The findings offer insights into how quantum techniques can complement traditional financial analysis, paving the way for further advancements in predictive modeling for credit risk.

[AI-82] Machine Learning and Quantum Intelligence for Health Data Scenarios

链接: https://arxiv.org/abs/2410.21339
作者: Sanjeev Naguleswaran
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Presented at Machine Learning and Machine Intelligence (MLMI) Conference, Osaka, Japan 2024

点击查看摘要

Abstract:The advent of quantum computing has opened new possibilities in data science, offering unique capabilities for addressing complex, data-intensive problems. Traditional machine learning algorithms often face challenges in high-dimensional or limited-quality datasets, which are common in healthcare. Quantum Machine Learning leverages quantum properties, such as superposition and entanglement, to enhance pattern recognition and classification, potentially surpassing classical approaches. This paper explores QML’s application in healthcare, focusing on quantum kernel methods and hybrid quantum-classical networks for heart disease prediction and COVID-19 detection, assessing their feasibility and performance.

机器学习

[LG-0] Necessary and Sufficient Oracles: Toward a Computational Taxonomy For Reinforcement Learning

链接: https://arxiv.org/abs/2502.08632
作者: Dhruv Rohatgi,Dylan J. Foster
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC)
*备注: 84 pages, 2 figures

点击查看摘要

Abstract:Algorithms for reinforcement learning (RL) in large state spaces crucially rely on supervised learning subroutines to estimate objects such as value functions or transition probabilities. Since only the simplest supervised learning problems can be solved provably and efficiently, practical performance of an RL algorithm depends on which of these supervised learning “oracles” it assumes access to (and how they are implemented). But which oracles are better or worse? Is there a minimal oracle? In this work, we clarify the impact of the choice of supervised learning oracle on the computational complexity of RL, as quantified by the oracle strength. First, for the task of reward-free exploration in Block MDPs in the standard episodic access model – a ubiquitous setting for RL with function approximation – we identify two-context regression as a minimal oracle, i.e. an oracle that is both necessary and sufficient (under a mild regularity assumption). Second, we identify one-context regression as a near-minimal oracle in the stronger reset access model, establishing a provable computational benefit of resets in the process. Third, we broaden our focus to Low-Rank MDPs, where we give cryptographic evidence that the analogous oracle from the Block MDP setting is insufficient.

[LG-1] Forecasting Drought Using Machine Learning in California

链接: https://arxiv.org/abs/2502.08622
作者: Nan K. Li,Angela Chang,David Sherman
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Drought is a frequent and costly natural disaster in California, with major negative impacts on agricultural production and water resource availability, particularly groundwater. This study investigated the performance of applying different machine learning approaches to predicting the U.S. Drought Monitor classification in California. Four approaches were used: a convolutional neural network (CNN), random forest, XGBoost, and a long short-term memory (LSTM) recurrent neural network, each compared to a baseline persistence model. We evaluated the models’ performance in predicting severe drought (USDM drought category D2 or higher) using a macro F1 binary classification metric. The LSTM model emerged as the top performer, followed by XGBoost, CNN, and random forest. Further evaluation of our results at the county level suggested that the LSTM model would perform best in counties with more consistent drought patterns and where severe drought was more common, and would perform worse where drought scores increased rapidly. Utilizing 30 weeks of historical data, the LSTM model successfully forecasted drought scores for a 12-week period with a Mean Absolute Error (MAE) of 0.33, equivalent to less than half a drought category on a scale of 0 to 5. Additionally, the LSTM achieved a macro F1 score of 0.9, indicating high accuracy in binary classification for severe drought conditions. Evaluation of different window and future horizon sizes suggested that at least 24 weeks of historical data yields the best results, with performance strongest for shorter horizons, particularly those under eight weeks.
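The macro F1 metric used for the severe-drought classification averages the per-class F1 scores, so the rare "severe" class counts as much as the common one. A minimal pure-Python sketch (illustrative, not the study's evaluation code):

```python
def macro_f1(y_true: list[int], y_pred: list[int]) -> float:
    """Macro F1 for binary labels: compute F1 treating each class in turn
    as the positive class, then average. (The zero-division edge case
    where a class has no true or predicted members is simplified to 0.)"""
    def f1_for(cls: int) -> float:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return (f1_for(0) + f1_for(1)) / 2
```

Predicting "severe" everywhere on a balanced sample scores only 1/3 under this metric, which is why it is a stricter target than plain accuracy for imbalanced drought data.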

[LG-2] Continuous Cardiac Arrest Prediction in ICU using PPG Foundation Model

链接: https://arxiv.org/abs/2502.08612
作者: Saurabh Kataria,Ran Xiao,Timothy Ruchti,Matthew Clark,Jiaying Lu,Randall J. Lee,Jocelyn Grunwell,Xiao Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Non-invasive patient monitoring for tracking and predicting adverse acute health events is an emerging area of research. We pursue in-hospital cardiac arrest (IHCA) prediction using only single-channel finger photoplethysmography (PPG) signals. Our proposed two-stage model, Feature Extractor-Aggregator Network (FEAN), leverages powerful representations from pre-trained PPG foundation models (PPG-GPT, up to 1 billion parameters) stacked with sequential classification models. We propose two FEAN variants (“1H”, “FH”) which use the latest one-hour and (at most) 24-hour history to make decisions, respectively. Our study is the first to present IHCA prediction results in ICU patients using only unimodal (continuous PPG signal) waveform deep representations. With our best model, we obtain an average AUROC of 0.79 over a 24-hour prediction window before cardiac arrest (CA) onset, with performance peaking at 0.82 one hour before CA. We also provide a comprehensive analysis of our model through architectural tuning and PaCMAP visualization of patient health trajectories in latent space.
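AUROC, the headline metric here, has a simple rank-based interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counting half. A minimal sketch (illustrative only, not the study's evaluation pipeline):

```python
def auroc(scores_pos: list[float], scores_neg: list[float]) -> float:
    """AUROC as the probability that a random positive outscores a random
    negative (ties count half); O(n*m) pairwise version, fine for a demo."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))
```

Under this reading, the reported 0.79 means that roughly 79% of the time the model scores a patient who will arrest above one who will not.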

[LG-3] Robustly Learning Monotone Generalized Linear Models via Data Augmentation

链接: https://arxiv.org/abs/2502.08611
作者: Nikos Zarifis,Puqian Wang,Ilias Diakonikolas,Jelena Diakonikolas
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study the task of learning Generalized Linear Models (GLMs) in the agnostic model under the Gaussian distribution. We give the first polynomial-time algorithm that achieves a constant-factor approximation for any monotone Lipschitz activation. Prior constant-factor GLM learners succeed for a substantially smaller class of activations. Our work resolves a well-known open problem by developing a robust counterpart to the classical GLMtron algorithm (Kakade et al., 2011). Our robust learner applies more generally, encompassing all monotone activations with bounded (2+ζ)-moments, for any fixed ζ > 0 – a condition that is essentially necessary. To obtain our results, we leverage a novel data augmentation technique with decreasing Gaussian noise injection and prove a number of structural results that may be useful in other settings.
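The phrase "data augmentation with decreasing Gaussian noise injection" can be illustrated with a toy routine. The geometric decay schedule and its parameters below are assumptions made for the example, not the paper's construction:

```python
import random

def augment(xs: list[float], rounds: int = 3, sigma0: float = 1.0,
            decay: float = 0.5, seed: int = 0) -> list[float]:
    """Toy decreasing-noise augmentation: append copies of the data
    perturbed by Gaussian noise whose scale shrinks geometrically each
    round (sigma0, sigma0*decay, sigma0*decay**2, ...)."""
    rng = random.Random(seed)            # seeded for reproducibility
    out, sigma = list(xs), sigma0
    for _ in range(rounds):
        out += [x + rng.gauss(0.0, sigma) for x in xs]
        sigma *= decay
    return out
```

Each round contributes a progressively less noisy view of the data, so later copies concentrate around the originals while early copies smooth the loss landscape.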

[LG-4] Scalable Thermodynamic Second-order Optimization

链接: https://arxiv.org/abs/2502.08603
作者: Kaelan Donatella,Samuel Duffield,Denis Melanson,Maxwell Aifer,Phoebe Klett,Rajath Salegame,Zach Belateche,Gavin Crooks,Antonio J. Martinez,Patrick J. Coles
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:Many hardware proposals have aimed to accelerate inference in AI workloads. Less attention has been paid to hardware acceleration of training, despite the enormous societal impact of rapid training of AI models. Physics-based computers, such as thermodynamic computers, offer an efficient means to solve key primitives in AI training algorithms. Optimizers that normally would be computationally out-of-reach (e.g., due to expensive matrix inversions) on digital hardware could be unlocked with physics-based hardware. In this work, we propose a scalable algorithm for employing thermodynamic computers to accelerate a popular second-order optimizer called Kronecker-factored approximate curvature (K-FAC). Our asymptotic complexity analysis predicts increasing advantage with our algorithm as n, the number of neurons per layer, increases. Numerical experiments show that even under significant quantization noise, the benefits of second-order optimization can be preserved. Finally, we predict substantial speedups for large-scale vision and graph problems based on realistic hardware characteristics.
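K-FAC's computational appeal comes from the Kronecker identity (A ⊗ G)⁻¹ = A⁻¹ ⊗ G⁻¹, which turns one large curvature-matrix inversion into two much smaller ones; those inversions are the expensive primitive the thermodynamic hardware would accelerate. A pure-Python check of the identity on hypothetical 2×2 factors:

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Hypothetical Kronecker factors (in K-FAC: per-layer activation and
# gradient second-moment matrices).
A = [[2.0, 1.0], [1.0, 3.0]]
G = [[4.0, 0.0], [0.0, 2.0]]

# The 4x4 inverse of A kron G is just the Kronecker product of two 2x2 inverses.
approx_inverse = kron(inv2(A), inv2(G))
ident = matmul(kron(A, G), approx_inverse)   # should be the 4x4 identity
```

For a layer with n neurons the full curvature block is n² × n², so inverting the two n × n factors instead is what makes the predicted advantage grow with n.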

[LG-5] Two-stage hybrid models for enhancing forecasting accuracy on heterogeneous time series

链接: https://arxiv.org/abs/2502.08600
作者: Junru Ren,Shaomin Wu
类目: Machine Learning (cs.LG)
*备注: 14 pages, 2 figures

点击查看摘要

Abstract:Compared to local models built in a series-by-series manner, global models leverage relevant information across time series, resulting in improved forecasting performance and generalization capacity. Constructing global models on a set of time series is becoming mainstream in the field of time series forecasting. However, the advantages of global models may not always be realized when dealing with heterogeneous data. While they can adapt to heterogeneous datasets by increasing the model complexity, the model cannot be infinitely complex due to the finite sample size, which poses challenges for the application of global models. Additionally, determining whether the time series data is homogeneous or heterogeneous can be ambiguous in practice. To address these research gaps, this paper argues that the heterogeneity of the data should be defined by the global model used, and for each series, the portion not modelled by the global model represents heterogeneity. It further proposes two-stage hybrid models, which include a second stage to identify and model heterogeneous patterns. In this second stage, we can estimate either all local models or sub-global models across different domains divided based on heterogeneity. Experiments on four open datasets reveal that the proposed methods significantly outperform five existing models, indicating that they help to fully unleash the potential of global models on heterogeneous datasets.
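The two-stage decomposition, a global model plus per-series modelling of whatever it leaves unexplained, can be caricatured with pooled and per-series means standing in for trained models. This illustrates only the decomposition, not the paper's estimators:

```python
def fit_two_stage(series: dict[str, list[float]]):
    """Stage 1: one 'global' predictor pooled over all series (a pooled
    mean, standing in for a trained global model). Stage 2: per-series
    corrections on the residuals, i.e. the heterogeneity stage 1 misses."""
    pooled = [x for ys in series.values() for x in ys]
    global_pred = sum(pooled) / len(pooled)
    local_offset = {name: sum(ys) / len(ys) - global_pred
                    for name, ys in series.items()}
    return global_pred, local_offset

def predict(name: str, global_pred: float,
            local_offset: dict[str, float]) -> float:
    # Unseen series fall back to the purely global forecast.
    return global_pred + local_offset.get(name, 0.0)
```

The fallback for unseen series shows the appeal of the hybrid: the global component generalizes across series, while the local stage absorbs each known series' idiosyncrasies.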

[LG-6] Enhancing Diffusion Models Efficiency by Disentangling Total-Variance and Signal-to-Noise Ratio

链接: https://arxiv.org/abs/2502.08598
作者: Khaled Kahouli,Winfried Ripken,Stefan Gugler,Oliver T. Unke,Klaus-Robert Müller,Shinichi Nakajima
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The long sampling time of diffusion models remains a significant bottleneck, which can be mitigated by reducing the number of diffusion time steps. However, the quality of samples with fewer steps is highly dependent on the noise schedule, i.e., the specific manner in which noise is introduced and the signal is reduced at each step. Although prior work has improved upon the original variance-preserving and variance-exploding schedules, these approaches passively adjust the total variance, without direct control over it. In this work, we propose a novel total-variance/signal-to-noise-ratio disentangled (TV/SNR) framework, where TV and SNR can be controlled independently. Our approach reveals that different existing schedules, where the TV explodes exponentially, can be improved by setting a constant TV schedule while preserving the same SNR schedule. Furthermore, generalizing the SNR schedule of the optimal transport flow matching significantly improves the performance in molecular structure generation, achieving few-step generation of stable molecules. A similar tendency is observed in image generation, where our approach with a uniform diffusion time grid performs comparably to the highly tailored EDM sampler.
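The disentanglement is easy to state concretely: for a forward process x_t = α·x₀ + σ·ε, the total variance is TV = α² + σ² and the signal-to-noise ratio is SNR = α²/σ², and any (α, σ) pair can be rescaled to a prescribed TV without changing its SNR. A small sketch, where the specific α, σ parameterization is an illustrative choice rather than the paper's schedule:

```python
import math

def vp_schedule(t: float) -> tuple[float, float]:
    """A simple variance-preserving schedule for x_t = alpha*x0 + sigma*eps
    (illustrative parameterization, t in (0, 1))."""
    return math.sqrt(1.0 - t), math.sqrt(t)

def tv_snr(alpha: float, sigma: float) -> tuple[float, float]:
    """Total variance and signal-to-noise ratio of one diffusion step."""
    return alpha**2 + sigma**2, alpha**2 / sigma**2

def constant_tv(alpha: float, sigma: float, target_tv: float = 1.0):
    """Rescale (alpha, sigma) to a prescribed total variance while leaving
    the SNR untouched: the independent control the framework describes."""
    scale = math.sqrt(target_tv / (alpha**2 + sigma**2))
    return alpha * scale, sigma * scale
```

Applying `constant_tv` to a variance-exploding pair such as (α, σ) = (1, 2) pins TV to 1 while SNR stays at 0.25, which is exactly the "constant TV, same SNR" repair the abstract proposes for exploding schedules.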

[LG-7] Toward Universal Laws of Outlier Propagation

链接: https://arxiv.org/abs/2502.08593
作者: Yuhao Wang,Aram Ebtekar,Dominik Janzing
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We argue that Algorithmic Information Theory (AIT) admits a principled way to quantify outliers in terms of so-called randomness deficiency. For the probability distribution generated by a causal Bayesian network, we show that the randomness deficiency of the joint state decomposes into randomness deficiencies of each causal mechanism, subject to the Independence of Mechanisms Principle. Accordingly, anomalous joint observations can be quantitatively attributed to their root causes, i.e., the mechanisms that behaved anomalously. As an extension of Levin’s law of randomness conservation, we show that weak outliers cannot cause strong ones when Independence of Mechanisms holds. We show how these information theoretic laws provide a better understanding of the behaviour of outliers defined with respect to existing scores.

[LG-8] Scalable Bilevel Loss Balancing for Multi-Task Learning

链接: https://arxiv.org/abs/2502.08585
作者: Peiyao Xiao,Chaosheng Dong,Shaofeng Zou,Kaiyi Ji
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-task learning (MTL) has been widely adopted for its ability to simultaneously learn multiple tasks. While existing gradient manipulation methods often yield more balanced solutions than simple scalarization-based approaches, they typically incur a significant computational overhead of O(K) in both time and memory, where K is the number of tasks. In this paper, we propose BiLB4MTL, a simple and scalable loss balancing approach for MTL, formulated from a novel bilevel optimization perspective. Our method incorporates three key components: (i) an initial loss normalization, (ii) a bilevel loss-balancing formulation, and (iii) a scalable first-order algorithm that requires only O(1) time and memory. Theoretically, we prove that BiLB4MTL guarantees convergence not only to a stationary point of the bilevel loss balancing problem but also to an ε-accurate Pareto stationary point for all K loss functions under mild conditions. Extensive experiments on diverse multi-task datasets demonstrate that BiLB4MTL achieves state-of-the-art performance in both accuracy and efficiency. Code is available at this https URL.

[LG-9] A method for classification of data with uncertainty using hypothesis testing

链接: https://arxiv.org/abs/2502.08582
作者: Shoma Yokura,Akihisa Ichiki
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Binary classification is a task that involves the classification of data into one of two distinct classes. It is widely utilized in various fields. However, conventional classifiers tend to make overconfident predictions for data that belong to overlapping regions of the two class distributions or for data outside the distributions (out-of-distribution data). Therefore, conventional classifiers should not be applied in high-risk fields where classification results can have significant consequences. In order to address this issue, it is necessary to quantify uncertainty and adopt decision-making approaches that take it into account. Many methods have been proposed for this purpose; however, implementing these methods often requires performing resampling, improving the structure or performance of models, and optimizing the thresholds of classifiers. We propose a new decision-making approach using two types of hypothesis testing. This method is capable of detecting ambiguous data that belong to the overlapping regions of two class distributions, as well as out-of-distribution data that are not included in the training data distribution. In addition, we quantify uncertainty using the empirical distribution of feature values derived from the training data obtained through the trained model. The classification threshold is determined by the α-quantile and (1-α)-quantile, where the significance level α is set according to each specific situation.
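
The quantile-threshold decision rule can be sketched directly: build a two-sided acceptance region per class from the empirical distribution of training scores, then test each new point against both regions. A toy sketch with synthetic 1-D scores standing in for the model's feature values; the four-way decision logic follows the abstract, the rest is illustrative.

```python
import numpy as np

def fit_thresholds(scores, alpha=0.05):
    """Two-sided acceptance region from the empirical distribution of a
    class's training scores: the alpha- and (1-alpha)-quantiles."""
    return np.quantile(scores, alpha), np.quantile(scores, 1 - alpha)

def decide(score, region0, region1):
    """Run one hypothesis test per class: accept class 0, class 1,
    both (ambiguous) or neither (out-of-distribution)."""
    in0 = region0[0] <= score <= region0[1]
    in1 = region1[0] <= score <= region1[1]
    if in0 and in1:
        return "ambiguous"
    if in0:
        return 0
    if in1:
        return 1
    return "out-of-distribution"

# Stand-in for feature values extracted from a trained model
rng = np.random.default_rng(0)
r0 = fit_thresholds(rng.normal(-1.0, 1.0, 1000))
r1 = fit_thresholds(rng.normal(+1.0, 1.0, 1000))

print(decide(-1.0, r0, r1))   # -> 0 (clearly class 0)
print(decide(0.0, r0, r1))    # -> 'ambiguous' (overlap of the two classes)
print(decide(8.0, r0, r1))    # -> 'out-of-distribution'
```

Unlike a plain argmax classifier, the rule can abstain in two distinct ways, which is the point of the paper's setup for high-risk applications.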

[LG-10] Beyond Predictions: A Participatory Framework for Multi-Stakeholder Decision-Making

链接: https://arxiv.org/abs/2502.08542
作者: Vittoria Vineis,Giuseppe Perelli,Gabriele Tolomei
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Conventional decision-support systems, primarily based on supervised learning, focus on outcome prediction models to recommend actions. However, they often fail to account for the complexities of multi-actor environments, where diverse and potentially conflicting stakeholder preferences must be balanced. In this paper, we propose a novel participatory framework that redefines decision-making as a multi-stakeholder optimization problem, capturing each actor’s preferences through context-dependent reward functions. Our framework leverages k-fold cross-validation to fine-tune user-provided outcome prediction models and evaluate decision strategies, including compromise functions mediating stakeholder trade-offs. We introduce a synthetic scoring mechanism that exploits user-defined preferences across multiple metrics to rank decision-making strategies and identify the optimal decision-maker. The selected decision-maker can then be used to generate actionable recommendations for new data. We validate our framework using two real-world use cases, demonstrating its ability to deliver recommendations that effectively balance multiple metrics, achieving results that are often beyond the scope of purely prediction-based methods. Ablation studies demonstrate that our framework, with its modular, model-agnostic, and inherently transparent design, integrates seamlessly with various predictive models, reward structures, evaluation metrics, and sample sizes, making it particularly suited for complex, high-stakes decision-making contexts.

[LG-11] Matrix Completion with Graph Information: A Provable Nonconvex Optimization Approach

链接: https://arxiv.org/abs/2502.08536
作者: Yao Wang,Yiyang Yang,Kaidong Wang,Shanxing Gao,Xiuwu Liao
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 41 pages, 6 figures

点击查看摘要

Abstract:We consider the problem of matrix completion with graphs as side information depicting the interrelations between variables. The key challenge lies in leveraging the similarity structure of the graph to enhance matrix recovery. Existing approaches, primarily based on graph Laplacian regularization, suffer from several limitations: (1) they focus only on the similarity between neighboring variables, while overlooking long-range correlations; (2) they are highly sensitive to false edges in the graphs and (3) they lack theoretical guarantees regarding statistical and computational complexities. To address these issues, we propose in this paper a novel graph regularized matrix completion algorithm called GSGD, based on preconditioned projected gradient descent approach. We demonstrate that GSGD effectively captures the higher-order correlation information behind the graphs, and achieves superior robustness and stability against the false edges. Theoretically, we prove that GSGD achieves linear convergence to the global optimum with near-optimal sample complexity, providing the first theoretical guarantees for both recovery accuracy and efficacy in the perspective of nonconvex optimization. Our numerical experiments on both synthetic and real-world data further validate that GSGD achieves superior recovery accuracy and scalability compared with several popular alternatives.
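
The flavor of the algorithm can be sketched as preconditioned ("scaled") gradient descent on low-rank factors with a graph-smoothness penalty on the row factor. This is only an illustrative sketch under assumptions of my own choosing (spectral initialization, penalty tr(UᵀLU), step sizes), not the exact GSGD update or its theory.

```python
import numpy as np

def graph_scaled_gd(M, mask, L, rank=2, reg=0.01, lr=0.25, iters=300):
    """Sketch of graph-regularized, preconditioned gradient descent for
    matrix completion. The graph enters via a smoothness penalty
    reg * tr(U^T L U) on the row factors. Illustrative only."""
    # Spectral initialization from the zero-filled observed matrix
    Uf, s, Vt = np.linalg.svd(mask * M, full_matrices=False)
    U = Uf[:, :rank] * np.sqrt(s[:rank])
    V = Vt[:rank].T * np.sqrt(s[:rank])
    eye = np.eye(rank)
    for _ in range(iters):
        R = mask * (U @ V.T - M)            # residual on observed entries
        gU = R @ V + reg * (L @ U)          # data term + graph term
        gV = R.T @ U
        # Preconditioning by (V^T V)^-1 / (U^T U)^-1 makes progress
        # insensitive to how ill-conditioned the factors are.
        U = U - lr * gU @ np.linalg.inv(V.T @ V + 1e-8 * eye)
        V = V - lr * gV @ np.linalg.inv(U.T @ U + 1e-8 * eye)
    return U @ V.T

rng = np.random.default_rng(1)
m, n = 20, 15
M = rng.normal(size=(m, 2)) @ rng.normal(size=(n, 2)).T   # rank-2 ground truth
mask = (rng.random((m, n)) < 0.7).astype(float)           # 70% observed
L = np.diag(np.r_[1.0, 2.0 * np.ones(m - 2), 1.0]) \
    - np.eye(m, k=1) - np.eye(m, k=-1)                    # chain-graph Laplacian
M_hat = graph_scaled_gd(M, mask, L)
```

The preconditioners are what distinguish this from plain graph-Laplacian-regularized gradient descent; they are the "scaled" part that the paper's linear-convergence analysis builds on.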

[LG-12] On Different Notions of Redundancy in Conditional-Independence-Based Discovery of Graphical Models

链接: https://arxiv.org/abs/2502.08531
作者: Philipp M. Faller,Dominik Janzing
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The goal of conditional-independence-based discovery of graphical models is to find a graph that represents the independence structure of variables in a given dataset. To learn such a representation, conditional-independence-based approaches conduct a set of statistical tests that suffices to identify the graphical representation under some assumptions on the underlying distribution of the data. In this work, we highlight that due to the conciseness of the graphical representation, there are often many tests that are not used in the construction of the graph. These redundant tests have the potential to detect or sometimes correct errors in the learned model. We show that not all tests contain this additional information and that such redundant tests have to be applied with care. Precisely, we argue that particularly those conditional (in)dependence statements are interesting that follow only from graphical assumptions but do not hold for every probability distribution.

[LG-13] The Paradox of Stochasticity: Limited Creativity and Computational Decoupling in Temperature-Varied LLM Outputs of Structured Fictional Data

链接: https://arxiv.org/abs/2502.08515
作者: Evgenii Evstafev
类目: Machine Learning (cs.LG)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:This study examines how temperature settings and model architectures affect the generation of structured fictional data (names, birthdates) across three large language models (LLMs): llama3.1:8b, deepseek-r1:8b, and mistral:latest. By systematically testing temperature values from 0.0 to 1.0 in increments of 0.1, we conducted 330 trials yielding 889 structured entities, validated for syntactic consistency. Key findings reveal that model architecture significantly influences computational efficiency, with mistral:latest and llama3.1:8b processing data 8x faster than deepseek-r1:8b. Contrary to expectations, temperature showed no correlation with processing time, challenging assumptions about stochastic sampling costs. Output diversity remained limited, as models consistently defaulted to common name archetypes (e.g., ‘John Doe’ and ‘Jane Smith’) across all temperatures, though rare names clustered at intermediate values (0.3-0.7). These results demonstrate that architectural optimizations, rather than temperature adjustments, dominate performance in structured generation tasks. The findings emphasize prioritizing model selection over hyperparameter tuning for efficiency and suggest explicit diversity constraints are necessary to mitigate default output biases in synthetic data pipelines.

[LG-14] Bridging Domain Adaptation and Graph Neural Networks: A Tensor-Based Framework for Effective Label Propagation

链接: https://arxiv.org/abs/2502.08505
作者: Tao Wen,Elynn Chen,Yuzhou Chen,Qi Lei
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have recently become the predominant tools for studying graph data. Despite state-of-the-art performance on graph classification tasks, GNNs are overwhelmingly trained in a single domain under supervision, thus necessitating a prohibitively high demand for labels and resulting in poorly transferable representations. To address this challenge, we propose the Label-Propagation Tensor Graph Neural Network (LP-TGNN) framework to bridge the gap between graph data and traditional domain adaptation methods. It extracts graph topological information holistically with a tensor architecture and then reduces domain discrepancy through label propagation. It is readily compatible with general GNNs and domain adaptation techniques with minimal adjustment through pseudo-labeling. Experiments on various real-world benchmarks show that our LP-TGNN outperforms baselines by a notable margin. We also validate and analyze each component of the proposed framework in the ablation study.
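
The label-propagation half of the framework is the classical algorithm: diffuse label distributions over the graph while clamping the nodes whose labels (or pseudo-labels) are known. A standard sketch for intuition; LP-TGNN's tensor GNN component and domain-adaptation coupling are not shown.

```python
import numpy as np

def propagate_labels(A, Y, mask, iters=20, alpha=0.9):
    """Classical label propagation: diffuse label distributions over the
    graph, clamping nodes with known (or pseudo-) labels.
    A: adjacency matrix; Y: one-hot labels (zero rows for unlabeled
    nodes); mask: True where the label is known."""
    d = A.sum(axis=1, keepdims=True)
    P = A / np.maximum(d, 1e-12)        # row-stochastic transition matrix
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * (P @ F) + (1 - alpha) * Y
        F[mask] = Y[mask]               # clamp known labels
    return F.argmax(axis=1)

# Path graph 0-1-2-3; nodes 0 and 3 labeled with classes 0 and 1.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = np.zeros((4, 2)); Y[0, 0] = 1; Y[3, 1] = 1
mask = np.array([True, False, False, True])
print(propagate_labels(A, Y, mask))   # node 1 -> class 0, node 2 -> class 1
```

In the paper's setting, the "known" nodes would be source-domain nodes plus confidently pseudo-labeled target nodes, and propagation reduces the domain discrepancy.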

[LG-15] Fine-Tuning Topics through Weighting Aspect Keywords

链接: https://arxiv.org/abs/2502.08496
作者: Ali Nazari,Michael Weiss
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 17 pages, 8 figures, 3 tables

点击查看摘要

Abstract:Topic modeling often requires examining topics from multiple perspectives to uncover hidden patterns, especially in less explored areas. This paper presents an approach to address this need, utilizing weighted keywords from various aspects derived from a domain knowledge. The research method starts with standard topic modeling. Then, it adds a process consisting of four key steps. First, it defines keywords for each aspect. Second, it gives weights to these keywords based on their relevance. Third, it calculates relevance scores for aspect-weighted keywords and topic keywords to create aspect-topic models. Fourth, it uses these scores to tune relevant new documents. Finally, the generated topic models are interpreted and validated. The findings show that top-scoring documents are more likely to be about the same aspect of a topic, highlighting the model’s effectiveness in finding documents related to each aspect.
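
The core scoring step (weighted aspect keywords against a document) can be sketched as a weighted term count. The aspect name, keywords, and weights below are hypothetical placeholders for the domain knowledge the paper assumes; the exact scoring formula in the paper may differ.

```python
from collections import Counter

def aspect_score(doc_tokens, aspect_keywords):
    """Relevance of a document to one aspect: weighted sum of aspect
    keyword occurrences, normalized by document length. The keyword
    weights encode domain knowledge about the aspect."""
    counts = Counter(doc_tokens)
    total = sum(w * counts[kw] for kw, w in aspect_keywords.items())
    return total / max(len(doc_tokens), 1)

# Hypothetical "privacy" aspect of some topic, with hand-set weights
privacy_aspect = {"privacy": 2.0, "gdpr": 1.5, "consent": 1.0}

doc_a = "users give consent under gdpr privacy rules".split()
doc_b = "the model improves accuracy on benchmarks".split()

assert aspect_score(doc_a, privacy_aspect) > aspect_score(doc_b, privacy_aspect)
```

Ranking new documents by such scores per aspect is what lets the top-scoring documents cluster around a single aspect of a topic, as the abstract reports.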

[LG-16] One-Shot Federated Learning with Classifier-Free Diffusion Models

链接: https://arxiv.org/abs/2502.08488
作者: Obaidullah Zaland,Shutong Jin,Florian T. Pokorny,Monowar Bhuyan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) enables collaborative learning without data centralization but introduces significant communication costs due to multiple communication rounds between clients and the server. One-shot federated learning (OSFL) addresses this by forming a global model with a single communication round, often relying on the server’s model distillation or auxiliary dataset generation - often through pre-trained diffusion models (DMs). Existing DM-assisted OSFL methods, however, typically employ classifier-guided DMs, which require training auxiliary classifier models at each client, introducing additional computation overhead. This work introduces OSCAR (One-Shot Federated Learning with Classifier-Free Diffusion Models), a novel OSFL approach that eliminates the need for auxiliary models. OSCAR uses foundation models to devise category-specific data representations at each client, seamlessly integrated into a classifier-free diffusion model pipeline for server-side data generation. OSCAR is a simple yet cost-effective OSFL approach that outperforms the state-of-the-art on four benchmarking datasets while reducing the communication load by at least 99%.

[LG-17] Numerical Schemes for Signature Kernels

链接: https://arxiv.org/abs/2502.08470
作者: Thomas Cass,Francesco Piatti,Jeffrey Pei
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注:

点击查看摘要

Abstract:Signature kernels have emerged as a powerful tool within kernel methods for sequential data. In the paper “The Signature Kernel is the solution of a Goursat PDE”, the authors identify a kernel trick that demonstrates that, for continuously differentiable paths, the signature kernel satisfies a Goursat problem for a hyperbolic partial differential equation (PDE) in two independent time variables. While finite difference methods have been explored for this PDE, they face limitations in accuracy and stability when handling highly oscillatory inputs. In this work, we introduce two advanced numerical schemes that leverage polynomial representations of boundary conditions through either approximation or interpolation techniques, and rigorously establish the theoretical convergence of the polynomial approximation scheme. Experimental evaluations reveal that our approaches yield improvements of several orders of magnitude in mean absolute percentage error (MAPE) compared to traditional finite difference schemes, without increasing computational complexity. Furthermore, like finite difference methods, our algorithms can be GPU-parallelized to reduce computational complexity from quadratic to linear in the length of the input sequences, thereby improving scalability for high-frequency data. We have implemented these algorithms in a dedicated Python library, which is publicly available at: this https URL.
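
For orientation, the baseline these schemes improve on is easy to state: the Goursat PDE ∂²k/∂s∂t = ⟨ẋ(s), ẏ(t)⟩ k with k(0,·) = k(·,0) = 1 admits a simple explicit finite-difference recursion over the grid of path increments. This sketch is that first-order baseline scheme, not the paper's polynomial approximation/interpolation schemes.

```python
import numpy as np

def sig_kernel_fd(x, y):
    """Baseline finite-difference solver for the signature-kernel Goursat
    PDE  d^2 k / (ds dt) = <dx/ds, dy/dt> * k,  k(0,.) = k(.,0) = 1.
    x, y: arrays of shape (length, dim) sampling the two paths."""
    dx = np.diff(x, axis=0)          # path increments
    dy = np.diff(y, axis=0)
    inc = dx @ dy.T                  # <dx_i, dy_j> for all grid cells
    n, m = inc.shape
    k = np.ones((n + 1, m + 1))      # boundary conditions k = 1
    for i in range(n):
        for j in range(m):
            k[i + 1, j + 1] = k[i + 1, j] + k[i, j + 1] + (inc[i, j] - 1.0) * k[i, j]
    return k[-1, -1]

# For the straight-line paths x(t) = y(t) = t on [0, 1], the PDE has the
# closed form k(1, 1) = sum_n 1/(n!)^2 = I_0(2) ~= 2.2796.
t = np.linspace(0.0, 1.0, 501).reshape(-1, 1)
print(sig_kernel_fd(t, t))   # -> ~2.28, converging to 2.2796 as the grid refines
```

The double loop is what GPU parallelization reorganizes into anti-diagonal sweeps, reducing the effective cost from quadratic to linear in sequence length.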

[LG-18] Learning Theory for Kernel Bilevel Optimization

链接: https://arxiv.org/abs/2502.08457
作者: Fares El Khoury,Edouard Pauwels,Samuel Vaiter,Michael Arbel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bilevel optimization has emerged as a technique for addressing a wide range of machine learning problems that involve an outer objective implicitly determined by the minimizer of an inner problem. In this paper, we investigate the generalization properties for kernel bilevel optimization problems where the inner objective is optimized over a Reproducing Kernel Hilbert Space. This setting enables rich function approximation while providing a foundation for rigorous theoretical analysis. In this context, we establish novel generalization error bounds for the bilevel problem under finite-sample approximation. Our approach adopts a functional perspective, inspired by (Petrulionyte et al., 2024), and leverages tools from empirical process theory and maximal inequalities for degenerate U-processes to derive uniform error bounds. These generalization error estimates allow to characterize the statistical accuracy of gradient-based methods applied to the empirical discretization of the bilevel problem.

[LG-19] Monge SAM: Robust Reparameterization-Invariant Sharpness-Aware Minimization Based on Loss Geometry

链接: https://arxiv.org/abs/2502.08448
作者: Albert Kjøller Jacobsen,Georgios Arvanitidis
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent studies on deep neural networks show that flat minima of the loss landscape correlate with improved generalization. Sharpness-aware minimization (SAM) efficiently finds flat regions by updating the parameters according to the gradient at an adversarial perturbation. The perturbation depends on the Euclidean metric, making SAM non-invariant under reparametrizations, which blurs sharpness and generalization. We propose Monge SAM (M-SAM), a reparametrization invariant version of SAM by considering a Riemannian metric in the parameter space induced naturally by the loss surface. Compared to previous approaches, M-SAM works under any modeling choice, relies only on mild assumptions while being as computationally efficient as SAM. We theoretically argue that M-SAM varies between SAM and gradient descent (GD), which increases robustness to hyperparameter selection and reduces attraction to suboptimal equilibria like saddle points. We demonstrate this behavior both theoretically and empirically on a multi-modal representation alignment task.

[LG-20] LucidAtlas: Learning Uncertainty-Aware Covariate-Disentangled Individualized Atlas Representations

链接: https://arxiv.org/abs/2502.08445
作者: Yining Jiao,Sreekalyani Bhamidi,Huaizhi Qu,Carlton Zdanski,Julia Kimbell,Andrew Prince,Cameron Worden,Samuel Kirse,Christopher Rutter,Benjamin Shields,William Dunn,Jisan Mahmud,Tianlong Chen,Marc Niethammer
类目: Machine Learning (cs.LG)
*备注: 28 pages

点击查看摘要

Abstract:The goal of this work is to develop principled techniques to extract information from high dimensional data sets with complex dependencies in areas such as medicine that can provide insight into individual as well as population level variation. We develop LucidAtlas, an approach that can represent spatially varying information, and can capture the influence of covariates as well as population uncertainty. As a versatile atlas representation, LucidAtlas offers robust capabilities for covariate interpretation, individualized prediction, population trend analysis, and uncertainty estimation, with the flexibility to incorporate prior knowledge. Additionally, we discuss the trustworthiness and potential risks of neural additive models for analyzing dependent covariates and then introduce a marginalization approach to explain the dependence of an individual predictor on the model’s response (the atlas). To validate our method, we demonstrate its generalizability on two medical datasets. Our findings underscore the critical role of by-construction interpretable models in advancing scientific discovery. Our code will be publicly available upon acceptance.

[LG-21] Closer through commonality: Enhancing hypergraph contrastive learning with shared groups

链接: https://arxiv.org/abs/2502.08432
作者: Daeyoung Roh,Donghee Han,Daehee Kim,Keejun Han,Mun Yi
类目: Machine Learning (cs.LG)
*备注: 11page, 5 figures, 6 tables, 2024 IEEE International Conference on Big Data

点击查看摘要

Abstract:Hypergraphs provide a superior modeling framework for representing complex multidimensional relationships in the context of real-world interactions that often occur in groups, overcoming the limitations of traditional homogeneous graphs. However, there have been few studies on hypergraph-based contrastive learning, and existing graph-based contrastive learning methods have not been able to fully exploit the high-order correlation information in hypergraphs. Here, we propose a Hypergraph Fine-grained contrastive learning (HyFi) method designed to exploit the complex high-dimensional information inherent in hypergraphs. While avoiding traditional graph augmentation methods that corrupt the hypergraph topology, the proposed method provides a simple and efficient learning augmentation function by adding noise to node features. Furthermore, we expand beyond the traditional dichotomous relationship between positive and negative samples in contrastive learning by introducing a new relationship of weak positives, which demonstrates the importance of fine-grained treatment of positive samples in contrastive learning. As a result, HyFi is able to produce high-quality embeddings and outperforms both supervised and unsupervised baselines in average rank on node classification across 10 datasets. Our approach effectively exploits high-dimensional hypergraph information, shows significant improvement over existing graph-based contrastive learning methods, and is efficient in terms of training speed and GPU memory cost. The source code is available at this https URL.

[LG-22] Enhanced Load Forecasting with GAT-LSTM: Leveraging Grid and Temporal Features

链接: https://arxiv.org/abs/2502.08376
作者: Ugochukwu Orji,Çiçek Güven,Dan Stowell
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Accurate power load forecasting is essential for the efficient operation and planning of electrical grids, particularly given the increased variability and complexity introduced by renewable energy sources. This paper introduces GAT-LSTM, a hybrid model that combines Graph Attention Networks (GAT) and Long Short-Term Memory (LSTM) networks. A key innovation of the model is the incorporation of edge attributes, such as line capacities and efficiencies, into the attention mechanism, enabling it to dynamically capture spatial relationships grounded in grid-specific physical and operational constraints. Additionally, by employing an early fusion of spatial graph embeddings and temporal sequence features, the model effectively learns and predicts complex interactions between spatial dependencies and temporal patterns, providing a realistic representation of the dynamics of power grids. Experimental evaluations on the Brazilian Electricity System dataset demonstrate that the GAT-LSTM model significantly outperforms state-of-the-art models, achieving reductions of 21.8% in MAE, 15.9% in RMSE, and 20.2% in MAPE. These results underscore the robustness and adaptability of the GAT-LSTM model, establishing it as a powerful tool for applications in grid management and energy planning.

[LG-23] A Survey on Pre-Trained Diffusion Model Distillations

链接: https://arxiv.org/abs/2502.08364
作者: Xuhui Fan,Zhangkai Wu,Hongyu Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion Models (DMs) have emerged as the dominant approach in Generative Artificial Intelligence (GenAI), owing to their remarkable performance in tasks such as text-to-image synthesis. However, practical DMs, such as Stable Diffusion, are typically trained on massive datasets and thus usually require large storage. At the same time, many steps may be required, i.e., recursively evaluating the trained neural network, to generate a high-quality image, which results in significant computational costs during sample generation. As a result, distillation methods on pre-trained DMs have become widely adopted practices to develop smaller, more efficient models capable of rapid, few-step generation in low-resource environments. As these distillation methods have been developed from different perspectives, there is an urgent need for a systematic survey, particularly from a methodological perspective. In this survey, we review distillation methods through three aspects: output loss distillation, trajectory distillation and adversarial distillation. We also discuss current challenges and outline future research directions in the conclusion.

[LG-24] Loss Landscape Analysis for Reliable Quantized ML Models for Scientific Sensing

链接: https://arxiv.org/abs/2502.08355
作者: Tommaso Baldi,Javier Campos,Olivia Weng,Caleb Geniesse,Nhan Tran,Ryan Kastner,Alessandro Biondi
类目: Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:In this paper, we propose a method to perform empirical analysis of the loss landscape of machine learning (ML) models. The method is applied to two ML models for scientific sensing, which necessitates quantization to be deployed and are subject to noise and perturbations due to experimental conditions. Our method allows assessing the robustness of ML models to such effects as a function of quantization precision and under different regularization techniques – two crucial concerns that remained underexplored so far. By investigating the interplay between performance, efficiency, and robustness by means of loss landscape analysis, we both established a strong correlation between gently-shaped landscapes and robustness to input and weight perturbations and observed other intriguing and non-obvious phenomena. Our method allows a systematic exploration of such trade-offs a priori, i.e., without training and testing multiple models, leading to more efficient development workflows. This work also highlights the importance of incorporating robustness into the Pareto optimization of ML models, enabling more reliable and adaptive scientific sensing systems.
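
A standard way to probe a loss landscape empirically is to evaluate the loss along random, norm-matched directions around the trained weights; gently shaped curves near the origin suggest robustness to weight perturbations such as quantization noise. A generic sketch of that probe (not the paper's specific analysis pipeline).

```python
import numpy as np

def loss_surface_1d(loss_fn, params, n_points=21, radius=1.0, seed=0):
    """Probe the loss landscape along a random, scale-matched direction:
    returns L(theta + a * d) for a in [-radius, radius]. Flat curves
    around a = 0 indicate robustness to weight perturbations."""
    rng = np.random.default_rng(seed)
    d = rng.normal(size=params.shape)
    d *= np.linalg.norm(params) / (np.linalg.norm(d) + 1e-12)  # match the weight scale
    alphas = np.linspace(-radius, radius, n_points)
    return alphas, np.array([loss_fn(params + a * d) for a in alphas])

# Toy model: quadratic "loss" with a known minimizer, probed at the minimum
w_star = np.array([1.0, -2.0, 0.5])
loss = lambda w: float(np.sum((w - w_star) ** 2))

alphas, curve = loss_surface_1d(loss, w_star)
assert curve[len(curve) // 2] == min(curve)   # minimum sits at the unperturbed weights
```

Repeating this over two directions gives the familiar 2-D loss-surface plots; comparing curves across quantization precisions or regularizers is the kind of trade-off study the paper systematizes.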

[LG-25] Model-Free Counterfactual Subset Selection at Scale

链接: https://arxiv.org/abs/2502.08326
作者: Minh Hieu Nguyen,Viet Hung Doan,Anh Tuan Nguyen,Jun Jo,Quoc Viet Hung Nguyen
类目: Machine Learning (cs.LG); Databases (cs.DB); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Ensuring transparency in AI decision-making requires interpretable explanations, particularly at the instance level. Counterfactual explanations are a powerful tool for this purpose, but existing techniques frequently depend on synthetic examples, introducing biases from unrealistic assumptions, flawed models, or skewed data. Many methods also assume full dataset availability, an impractical constraint in real-time environments where data flows continuously. In contrast, streaming explanations offer adaptive, real-time insights without requiring persistent storage of the entire dataset. This work introduces a scalable, model-free approach to selecting diverse and relevant counterfactual examples directly from observed data. Our algorithm operates efficiently in streaming settings, maintaining O(log k) update complexity per item while ensuring high-quality counterfactual selection. Empirical evaluations on both real-world and synthetic datasets demonstrate superior performance over baseline methods, with robust behavior even under adversarial conditions.
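
The O(log k)-per-item flavor is easy to see with a heap: keep the k observed instances with the opposite label that are closest to the query, paying one heap operation per stream item. A simplified sketch in one dimension; the paper's method additionally enforces diversity among the selected counterfactuals, which this omits.

```python
import heapq

def stream_counterfactuals(query, stream, k=3):
    """Keep the k observed instances with a different label that are
    closest to the query. A max-heap keyed on distance gives O(log k)
    work per stream item. Sketch only: no diversity constraint."""
    q_x, q_label = query
    heap = []   # max-heap via negated distances: (-dist, x)
    for x, label in stream:
        if label == q_label:
            continue                      # counterfactuals must flip the label
        d = abs(x - q_x)                  # 1-D distance for illustration
        if len(heap) < k:
            heapq.heappush(heap, (-d, x))
        elif -heap[0][0] > d:             # closer than the current worst kept
            heapq.heapreplace(heap, (-d, x))
    return sorted(x for _, x in heap)

stream = [(0.1, 0), (0.9, 1), (0.4, 1), (0.2, 0), (0.55, 1), (2.0, 1)]
print(stream_counterfactuals((0.5, 0), stream, k=2))   # -> [0.4, 0.55]
```

Because only the heap is retained, memory stays O(k) regardless of how long the stream runs, which is what makes the approach viable without persistent storage.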

[LG-26] Data Pricing for Graph Neural Networks without Pre-purchased Inspection AAMAS-2025

链接: https://arxiv.org/abs/2502.08284
作者: Yiping Liu,Mengxiao Zhang,Jiamou Liu,Song Yang
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: Accepted by AAMAS-2025

点击查看摘要

Abstract:Machine learning (ML) models have become essential tools in various scenarios. Their effectiveness, however, hinges on a substantial volume of data for satisfactory performance. Model marketplaces have thus emerged as crucial platforms bridging model consumers seeking ML solutions and data owners possessing valuable data. These marketplaces leverage model trading mechanisms to properly incentivize data owners to contribute their data, and return a well-performing ML model to the model consumers. However, existing model trading mechanisms often assume the data owners are willing to share their data before being paid, which is not reasonable in the real world. Given that, we propose a novel mechanism, named Structural Importance based Model Trading (SIMT) mechanism, that assesses the data importance and compensates data owners accordingly without disclosing the data. Specifically, SIMT procures feature and label data from data owners according to their structural importance, and then trains a graph neural network for model consumers. Theoretically, SIMT is proven to be incentive compatible, individually rational and budget feasible. The experiments on five popular datasets validate that SIMT consistently outperforms vanilla baselines by up to 40% in both MacroF1 and MicroF1.

[LG-27] GenIAS: Generator for Instantiating Anomalies in Time Series

链接: https://arxiv.org/abs/2502.08262
作者: Zahra Zamanzadeh Darban,Qizhou Wang,Geoffrey I. Webb,Shirui Pan,Charu C. Aggarwal,Mahsa Salehi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A recent and promising approach for building time series anomaly detection (TSAD) models is to inject synthetic samples of anomalies within real data sets. The existing injection mechanisms have significant limitations - most of them rely on ad hoc, hand-crafted strategies which fail to capture the natural diversity of anomalous patterns, or are restricted to univariate time series settings. To address these challenges, we design a generative model for TSAD using a variational autoencoder, which is referred to as a Generator for Instantiating Anomalies in Time Series (GenIAS). GenIAS is designed to produce diverse and realistic synthetic anomalies for TSAD tasks. By employing a novel learned perturbation mechanism in the latent space and injecting the perturbed patterns in different segments of time series, GenIAS can generate anomalies with greater diversity and varying scales. Further, guided by a new triplet loss function, which uses a min-max margin and a new variance-scaling approach to further enforce the learning of compact normal patterns, GenIAS ensures that anomalies are distinct from normal samples while remaining realistic. The approach is effective for both univariate and multivariate time series. We demonstrate the diversity and realism of the generated anomalies. Our extensive experiments demonstrate that GenIAS - when integrated into a TSAD task - consistently outperforms seventeen traditional and deep anomaly detection models, thereby highlighting the potential of generative models for time series anomaly generation.

[LG-28] Keep your distance: learning dispersed embeddings on $\mathbb{S}_d$

链接: https://arxiv.org/abs/2502.08231
作者: Evgeniia Tokarchuk,Hua Chang Bakker,Vlad Niculae
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning well-separated features in high-dimensional spaces, such as text or image embeddings, is crucial for many machine learning applications. Achieving such separation can be effectively accomplished through the dispersion of embeddings, where unrelated vectors are pushed apart as much as possible. By constraining features to be on a hypersphere, we can connect dispersion to well-studied problems in mathematics and physics, where optimal solutions are known for limited low-dimensional cases. However, in representation learning we typically deal with a large number of features in high-dimensional space, and moreover, dispersion is usually traded off with some other task-oriented training objective, making existing theoretical and numerical solutions inapplicable. Therefore, it is common to rely on gradient-based methods to encourage dispersion, usually by minimizing some function of the pairwise distances. In this work, we first give an overview of existing methods from disconnected literature, making new connections and highlighting similarities. Next, we introduce some new angles. We propose to reinterpret pairwise dispersion using a maximum mean discrepancy (MMD) motivation. We then propose an online variant of the celebrated Lloyd’s algorithm, of K-Means fame, as an effective alternative regularizer for dispersion on generic domains. Finally, we derive a novel dispersion method that directly exploits properties of the hypersphere. Our experiments show the importance of dispersion in image classification and natural language processing tasks, and how algorithms exhibit different trade-offs in different regimes.
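
A concrete instance of the pairwise, kernel-based ("MMD-style") dispersion objective: project embeddings onto the unit sphere and penalize the average Gaussian-kernel similarity over distinct pairs. A minimal sketch of one such regularizer, not the paper's specific proposals (the Lloyd's-algorithm variant and the sphere-specific method are not shown).

```python
import numpy as np

def pairwise_dispersion_loss(X, t=2.0):
    """Gaussian-kernel dispersion energy of unit vectors: the mean of
    exp(-t * ||x_i - x_j||^2) over distinct pairs. Lower values mean
    the embeddings are more spread out on the hypersphere."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # project to the sphere
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    off = ~np.eye(len(X), dtype=bool)                  # exclude self-pairs
    return float(np.exp(-t * sq[off]).mean())

rng = np.random.default_rng(0)
clumped = rng.normal(size=(32, 8)) * 0.05 + np.ones(8)   # all near one direction
spread = rng.normal(size=(32, 8))                        # roughly uniform directions

assert pairwise_dispersion_loss(spread) < pairwise_dispersion_loss(clumped)
```

In training, a term like this would be added (with a trade-off weight) to the task objective and minimized by gradient descent, pushing unrelated embeddings apart.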

[LG-29] Enhancing Sample Selection by Cutting Mislabeled Easy Examples

链接: https://arxiv.org/abs/2502.08227
作者: Suqin Yuan,Lei Feng,Bo Han,Tongliang Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sample selection is a prevalent approach in learning with noisy labels, aiming to identify confident samples for training. Although existing sample selection methods have achieved decent results by reducing the noise rate of the selected subset, they often overlook that not all mislabeled examples harm the model’s performance equally. In this paper, we demonstrate that mislabeled examples correctly predicted by the model early in the training process are particularly harmful to model performance. We refer to these examples as Mislabeled Easy Examples (MEEs). To address this, we propose Early Cutting, which introduces a recalibration step that employs the model’s later training state to re-select the confident subset identified early in training, thereby avoiding misleading confidence from early learning and effectively filtering out MEEs. Experiments on the CIFAR, WebVision, and full ImageNet-1k datasets demonstrate that our method effectively improves sample selection and model performance by reducing MEEs.
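The recalibration step can be sketched in a few lines: re-rank the subset flagged as confident early in training by the later model state's confidence and cut the tail, where Mislabeled Easy Examples (fit early, suspicious later) end up. This is an illustrative toy with made-up confidence arrays, not the paper's exact criterion.

```python
import numpy as np

def early_cutting(early_confident, late_confidence, keep_ratio=0.75):
    # Re-select the early confident subset using the *later* training
    # state: keep only the fraction the late model is most confident on.
    idx = np.flatnonzero(early_confident)
    ranked = idx[np.argsort(late_confidence[idx])[::-1]]
    n_keep = int(len(ranked) * keep_ratio)
    return np.sort(ranked[:n_keep])

# Samples 1 and 4 were fit early but look suspicious to the late model,
# the signature of Mislabeled Easy Examples.
early = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 1], dtype=bool)
late = np.array([.95, .20, .88, .91, .15, .40, .99, .85, .90, .93])
print(early_cutting(early, late))  # -> indices without 1 and 4
```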

[LG-30] Exploring Exploration in Bayesian Optimization

链接: https://arxiv.org/abs/2502.08208
作者: Leonard Papenmeier,Nuojin Cheng,Stephen Becker,Luigi Nardi
类目: Machine Learning (cs.LG)
*备注: 28 pages, 34 figures

点击查看摘要

Abstract:A well-balanced exploration-exploitation trade-off is crucial for successful acquisition functions in Bayesian optimization. However, there is a lack of quantitative measures for exploration, making it difficult to analyze and compare different acquisition functions. This work introduces two novel approaches - observation traveling salesman distance and observation entropy - to quantify the exploration characteristics of acquisition functions based on their selected observations. Using these measures, we examine the explorative nature of several well-known acquisition functions across a diverse set of black-box problems, uncover links between exploration and empirical performance, and reveal new relationships among existing acquisition functions. Beyond enabling a deeper understanding of acquisition functions, these measures also provide a foundation for guiding their design in a more principled and systematic manner.
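The two proposed measures are easy to sketch on 1-D observations: a tour length over the selected points (here a greedy nearest-neighbour approximation, since exact TSP is expensive) and the Shannon entropy of their histogram. A purely exploitative optimizer concentrates observations, so both quantities stay small; the exact formulations in the paper may differ.

```python
import numpy as np

def observation_tsp_distance(X):
    # Greedy nearest-neighbour tour over the selected observations;
    # longer tours indicate a more explorative acquisition function.
    remaining = list(range(1, len(X)))
    total, cur = 0.0, 0
    while remaining:
        d = [float(np.linalg.norm(X[cur] - X[j])) for j in remaining]
        k = int(np.argmin(d))
        total += d[k]
        cur = remaining.pop(k)
    return total

def observation_entropy(X, bins=4):
    # Shannon entropy (nats) of a histogram of 1-D observations.
    counts, _ = np.histogram(X[:, 0], bins=bins, range=(0.0, 1.0))
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

exploitative = 0.5 + 0.001 * np.arange(16).reshape(-1, 1)  # clustered
explorative = np.linspace(0.0, 1.0, 16).reshape(-1, 1)     # spread out
print(observation_entropy(exploitative), observation_entropy(explorative))
```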

[LG-31] Optimizing Asynchronous Federated Learning: A Delicate Trade-Off Between Model-Parameter Staleness and Update Frequency

链接: https://arxiv.org/abs/2502.08206
作者: Abdelkrim Alahyane(LAAS-SARA, LAAS-RISC, LAAS),Céline Comte(CNRS, LAAS-SARA, LAAS-RISC, LAAS),Matthieu Jonckheere(CNRS, LAAS-SARA, LAAS-RISC, LAAS),Éric Moulines(X)
类目: Machine Learning (cs.LG); Performance (cs.PF); Optimization and Control (math.OC); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Synchronous federated learning (FL) scales poorly with the number of clients due to the straggler effect. Algorithms like FedAsync and GeneralizedFedAsync address this limitation by enabling asynchronous communication between clients and the central server. In this work, we rely on stochastic modeling to better understand the impact of design choices in asynchronous FL algorithms, such as the concurrency level and routing probabilities, and we leverage this knowledge to optimize loss. We characterize in particular a fundamental trade-off for optimizing asynchronous FL: minimizing gradient estimation errors by avoiding model parameter staleness, while also speeding up the system by increasing the throughput of model updates. Our two main contributions can be summarized as follows. First, we prove a discrete variant of Little’s law to derive a closed-form expression for relative delay, a metric that quantifies staleness. This allows us to efficiently minimize the average loss per model update, which has been the gold standard in literature to date. Second, we observe that naively optimizing this metric leads us to slow down the system drastically by overemphasizing staleness to the detriment of throughput. This motivates us to introduce an alternative metric that also takes system speed into account, for which we derive a tractable upper-bound that can be minimized numerically. Extensive numerical results show that these optimizations enhance accuracy by 10% to 30%.

[LG-32] Privacy amplification by random allocation

链接: https://arxiv.org/abs/2502.08202
作者: Vitaly Feldman,Moshe Shenfeld
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the privacy guarantees of an algorithm in which a user’s data is used in k steps randomly and uniformly chosen from a sequence (or set) of t differentially private steps. We demonstrate that the privacy guarantees of this sampling scheme can be upper bounded by the privacy guarantees of the well-studied independent (or Poisson) subsampling in which each step uses the user’s data with probability (1+o(1))k/t. Further, we provide two additional analysis techniques that lead to numerical improvements in some parameter regimes. The case of k=1 has been previously studied in the context of DP-SGD in Balle et al. (2020) and very recently in Chua et al. (2024). Privacy analysis of Balle et al. (2020) relies on privacy amplification by shuffling which leads to overly conservative bounds. Privacy analysis of Chua et al. (2024a) relies on Monte Carlo simulations that are computationally prohibitive in many practical scenarios and have additional inherent limitations.
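The sampling scheme being analyzed is simple to simulate: a user's data is assigned to k of t steps chosen uniformly without replacement, and the paper compares this to Poisson subsampling with per-step rate about (1+o(1))k/t. A quick Monte Carlo check (illustration only, not part of the paper's analysis) confirms that the marginal inclusion probability of any fixed step is exactly k/t:

```python
import numpy as np

rng = np.random.default_rng(1)
t, k, trials = 20, 3, 50_000
# Each row is a uniformly random size-k subset of the t steps
# (take the first k positions of a random permutation).
chosen = rng.random((trials, t)).argsort(axis=1)[:, :k]
# Fraction of trials in which step 0 uses the user's data.
empirical = float((chosen == 0).any(axis=1).mean())
print(empirical, k / t)  # both close to 0.15
```

The subtlety the paper addresses is that, unlike Poisson subsampling, the k allocations here are negatively correlated across steps, which is why the equivalence needs a careful privacy argument rather than this marginal calculation.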

[LG-33] From Individual Experience to Collective Evidence: A Reporting-Based Framework for Identifying Systemic Harms

链接: https://arxiv.org/abs/2502.08166
作者: Jessica Dai,Paula Gradu,Inioluwa Deborah Raji,Benjamin Recht
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When an individual reports a negative interaction with some system, how can their personal experience be contextualized within broader patterns of system behavior? We study the incident database problem, where individual reports of adverse events arrive sequentially, and are aggregated over time. In this work, our goal is to identify whether there are subgroups–defined by any combination of relevant features–that are disproportionately likely to experience harmful interactions with the system. We formalize this problem as a sequential hypothesis test, and identify conditions on reporting behavior that are sufficient for making inferences about disparities in true rates of harm across subgroups. We show that algorithms for sequential hypothesis tests can be applied to this problem with a standard multiple testing correction. We then demonstrate our method on real-world datasets, including mortgage decisions and vaccine side effects; on each, our method (re-)identifies subgroups known to experience disproportionate harm using only a fraction of the data that was initially used to discover them.
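The core inference step can be sketched as a per-subgroup test against a baseline harm rate with a multiple-testing correction. The toy below uses a fixed-sample one-sided z-test with a Bonferroni correction; the paper instead uses anytime-valid sequential tests, and all counts here are hypothetical.

```python
from statistics import NormalDist

def flag_subgroups(counts, p0, alpha=0.05):
    # counts maps subgroup -> (reported harms, interactions).
    # One-sided z-test per subgroup against baseline harm rate p0,
    # Bonferroni-corrected over all monitored subgroups.
    z_crit = NormalDist().inv_cdf(1 - alpha / len(counts))
    flagged = []
    for group, (harms, n) in counts.items():
        if n == 0:
            continue
        z = (harms / n - p0) / ((p0 * (1 - p0) / n) ** 0.5)
        if z > z_crit:
            flagged.append(group)
    return flagged

counts = {"group_a": (40, 200), "group_b": (12, 200), "group_c": (9, 100)}
print(flag_subgroups(counts, p0=0.10))  # only group_a exceeds baseline
```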

[LG-34] Local Differential Privacy is Not Enough: A Sample Reconstruction Attack against Federated Learning with Local Differential Privacy

链接: https://arxiv.org/abs/2502.08151
作者: Zhichao You,Xuewen Dong,Shujun Li,Ximeng Liu,Siqi Ma,Yulong Shen
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reconstruction attacks against federated learning (FL) aim to reconstruct users’ samples through users’ uploaded gradients. Local differential privacy (LDP) is regarded as an effective defense against various attacks, including sample reconstruction in FL, where gradients are clipped and perturbed. Existing attacks are ineffective in FL with LDP since clipped and perturbed gradients obliterate most sample information for reconstruction. Besides, existing attacks embed additional sample information into gradients to improve the attack effect and cause gradient expansion, leading to a more severe gradient clipping in FL with LDP. In this paper, we propose a sample reconstruction attack against LDP-based FL with any target models to reconstruct victims’ sensitive samples to illustrate that FL with LDP is not flawless. Considering gradient expansion in reconstruction attacks and noise in LDP, the core of the proposed attack is gradient compression and reconstructed sample denoising. For gradient compression, an inference structure based on sample characteristics is presented to reduce redundant gradients against LDP. For reconstructed sample denoising, we artificially introduce zero gradients to observe noise distribution and scale confidence interval to filter the noise. Theoretical proof guarantees the effectiveness of the proposed attack. Evaluations show that the proposed attack is the only attack that reconstructs victims’ training samples in LDP-based FL and has little impact on the target model’s accuracy. We conclude that LDP-based FL needs further improvements to defend against sample reconstruction attacks effectively.

[LG-35] Knowledge-Guided Wasserstein Distributionally Robust Optimization

链接: https://arxiv.org/abs/2502.08146
作者: Zitao Wang,Ziyuan Wang,Molei Liu,Nian Si
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Transfer learning is a popular strategy to leverage external knowledge and improve statistical efficiency, particularly with a limited target sample. We propose a novel knowledge-guided Wasserstein Distributionally Robust Optimization (KG-WDRO) framework that adaptively incorporates multiple sources of external knowledge to overcome the conservativeness of vanilla WDRO, which often results in overly pessimistic shrinkage toward zero. Our method constructs smaller Wasserstein ambiguity sets by controlling the transportation along directions informed by the source knowledge. This strategy can alleviate perturbations on the predictive projection of the covariates and protect against information loss. Theoretically, we establish the equivalence between our WDRO formulation and the knowledge-guided shrinkage estimation based on collinear similarity, ensuring tractability and geometrizing the feasible set. This also reveals a novel and general interpretation for recent shrinkage-based transfer learning approaches from the perspective of distributional robustness. In addition, our framework can adjust for scaling differences in the regression models between the source and target and accommodates general types of regularization such as lasso and ridge. Extensive simulations demonstrate the superior performance and adaptivity of KG-WDRO in enhancing small-sample transfer learning.

[LG-36] Data-dependent Bounds with T-Optimal Best-of-Both-Worlds Guarantees in Multi-Armed Bandits using Stability-Penalty Matching

链接: https://arxiv.org/abs/2502.08143
作者: Quan Nguyen,Shinji Ito,Junpei Komiyama,Nishant A. Mehta
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandits problems have limited adaptivity as they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent, or have a sub-optimal O(\sqrt{T \ln T}) worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new method for obtaining regret bounds that are simultaneously data-dependent, best-of-both-worlds and T-optimal for multi-armed bandits problems. In particular, we show that real-time SPM obtains bounds with worst-case guarantees of order O(\sqrt{T}) in the adversarial regime and O(\ln T) in the stochastic regime while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses. Our results are obtained by extending the SPM technique for tuning the learning rates in the follow-the-regularized-leader (FTRL) framework, which further indicates that the combination of SPM and FTRL is a promising approach for proving new adaptive bounds in online learning problems.

[LG-37] In-Context Learning of Linear Dynamical Systems with Transformers: Error Bounds and Depth-Separation

链接: https://arxiv.org/abs/2502.08136
作者: Frank Cole,Yulong Lu,Tianhao Zhang,Yuxuan Zhao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper investigates approximation-theoretic aspects of the in-context learning capability of the transformers in representing a family of noisy linear dynamical systems. Our first theoretical result establishes an upper bound on the approximation error of multi-layer transformers with respect to an L^2-testing loss uniformly defined across tasks. This result demonstrates that transformers with logarithmic depth can achieve error bounds comparable with those of the least-squares estimator. In contrast, our second result establishes a non-diminishing lower bound on the approximation error for a class of single-layer linear transformers, which suggests a depth-separation phenomenon for transformers in the in-context learning of dynamical systems. Moreover, this second result uncovers a critical distinction in the approximation power of single-layer linear transformers when learning from IID versus non-IID data.

[LG-38] SS4Rec: Continuous-Time Sequential Recommendation with State Space Models

链接: https://arxiv.org/abs/2502.08132
作者: Wei Xiao,Huiying Wang,Qifeng Zhou,Qing Wang
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sequential recommendation is a key area in the field of recommendation systems aiming to model user interest based on historical interaction sequences with irregular intervals. While previous recurrent neural network-based and attention-based approaches have achieved significant results, they have limitations in capturing system continuity due to the discrete characteristics. In the context of continuous-time modeling, state space model (SSM) offers a potential solution, as it can effectively capture the dynamic evolution of user interest over time. However, existing SSM-based approaches ignore the impact of irregular time intervals within historical user interactions, making it difficult to model complexed user-item transitions in sequences. To address this issue, we propose a hybrid SSM-based model called SS4Rec for continuous-time sequential recommendation. SS4Rec integrates a time-aware SSM to handle irregular time intervals and a relation-aware SSM to model contextual dependencies, enabling it to infer user interest from both temporal and sequential perspectives. In the training process, the time-aware SSM and the relation-aware SSM are discretized by variable stepsizes according to user interaction time intervals and input data, respectively. This helps capture the continuous dependency from irregular time intervals and provides time-specific personalized recommendations. Experimental studies on five benchmark datasets demonstrate the superiority and effectiveness of SS4Rec.
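The time-aware discretization idea (a variable step size per interaction gap) can be sketched with a diagonal continuous-time state space model under zero-order hold: longer gaps between interactions decay the hidden state more. This is a minimal fixed-parameter sketch; SS4Rec learns its state matrices and step sizes, and adds a relation-aware SSM on top.

```python
import numpy as np

def ssm_step(x, u, a, b, dt):
    # Zero-order-hold discretization of the diagonal SSM dx/dt = a*x + b*u
    # over a gap of length dt (a < 0, a != 0):
    #   x(dt) = e^{a*dt} x(0) + ((e^{a*dt} - 1)/a) b u
    ad = np.exp(a * dt)
    bd = (ad - 1.0) / a * b
    return ad * x + bd * u

a, b = np.array([-1.0]), np.array([1.0])
x = np.zeros(1)
# (time gap, interaction signal): the 2.0 gap decays the state strongly.
for dt, u in [(0.1, 1.0), (2.0, 1.0), (0.1, 1.0)]:
    x = ssm_step(x, u, a, b, dt)
print(x)
```

The variable `dt` is exactly where irregular interaction intervals enter the model, rather than being flattened into a uniform sequence.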

[LG-39] Incremental Approximate Single-Source Shortest Paths with Predictions

链接: https://arxiv.org/abs/2502.08125
作者: Samuel McCauley,Benjamin Moseley,Aidin Niaparast,Helia Niaparast,Shikha Singh
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The algorithms-with-predictions framework has been used extensively to develop online algorithms with improved beyond-worst-case competitive ratios. Recently, there is growing interest in leveraging predictions for designing data structures with improved beyond-worst-case running times. In this paper, we study the fundamental data structure problem of maintaining approximate shortest paths in incremental graphs in the algorithms-with-predictions model. Given a sequence \sigma of edges that are inserted one at a time, the goal is to maintain approximate shortest paths from the source to each vertex in the graph at each time step. Before any edges arrive, the data structure is given a prediction of the online edge sequence \hat{\sigma} which is used to "warm start" its state. As our main result, we design a learned algorithm that maintains (1+\epsilon)-approximate single-source shortest paths, which runs in \tilde{O}(m \eta \log W/\epsilon) time, where W is the weight of the heaviest edge and \eta is the prediction error. We show these techniques immediately extend to the all-pairs shortest-path setting as well. Our algorithms are consistent (performing nearly as fast as the offline algorithm) when predictions are nearly perfect, have a smooth degradation in performance with respect to the prediction error and, in the worst case, match the best offline algorithm up to logarithmic factors. As a building block, we study the offline incremental approximate single-source shortest-paths problem. In this problem, the edge sequence \sigma is known a priori and the goal is to efficiently return the length of the shortest paths in the intermediate graph G_t consisting of the first t edges, for all t. Note that the offline incremental problem is defined in the worst-case setting (without predictions) and is of independent interest.
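The incremental SSSP maintenance underlying this problem can be sketched as: insert an edge, and if it improves a distance, propagate the improvement with a local Dijkstra from the improved endpoint. This is the unwarmed worst-case baseline; the paper's contribution is warm-starting this state from the predicted sequence, which the sketch omits.

```python
import heapq

def insert_edge(adj, dist, u, v, w):
    # Add directed edge (u, v, w); if it shortens dist[v], push the
    # improvement outward with a local Dijkstra from v.
    adj.setdefault(u, []).append((v, w))
    if dist.get(u, float("inf")) + w >= dist.get(v, float("inf")):
        return
    dist[v] = dist[u] + w
    pq = [(dist[v], v)]
    while pq:
        d, x = heapq.heappop(pq)
        if d > dist.get(x, float("inf")):
            continue  # stale queue entry
        for y, wy in adj.get(x, []):
            nd = d + wy
            if nd < dist.get(y, float("inf")):
                dist[y] = nd
                heapq.heappush(pq, (nd, y))

adj, dist = {}, {0: 0.0}  # source vertex 0
for u, v, w in [(0, 1, 5.0), (1, 2, 1.0), (0, 2, 7.0), (0, 1, 2.0)]:
    insert_edge(adj, dist, u, v, w)
print(dist)  # {0: 0.0, 1: 2.0, 2: 3.0}
```

Note the last insertion (0, 1, 2.0) improves vertex 1 and the improvement cascades to vertex 2, the behavior a good prediction lets the learned structure mostly avoid.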

[LG-40] Provably Robust Federated Reinforcement Learning

链接: https://arxiv.org/abs/2502.08123
作者: Minghong Fang,Xilong Wang,Neil Zhenqiang Gong
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: To appear in The Web Conference 2025

点击查看摘要

Abstract:Federated reinforcement learning (FRL) allows agents to jointly learn a global decision-making policy under the guidance of a central server. While FRL has advantages, its decentralized design makes it prone to poisoning attacks. To mitigate this, Byzantine-robust aggregation techniques tailored for FRL have been introduced. Yet, in our work, we reveal that these current Byzantine-robust techniques are not immune to our newly introduced Normalized attack. Distinct from previous attacks that targeted enlarging the distance of policy updates before and after an attack, our Normalized attack emphasizes on maximizing the angle of deviation between these updates. To counter these threats, we develop an ensemble FRL approach that is provably secure against both known and our newly proposed attacks. Our ensemble method involves training multiple global policies, where each is learnt by a group of agents using any foundational aggregation rule. These well-trained global policies then individually predict the action for a specific test state. The ultimate action is chosen based on a majority vote for discrete action systems or the geometric median for continuous ones. Our experimental results across different settings show that the Normalized attack can greatly disrupt non-ensemble Byzantine-robust methods, and our ensemble approach offers substantial resistance against poisoning attacks.
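The continuous-action aggregation step of the ensemble (geometric median of the actions proposed by the global policies) can be sketched with the standard Weiszfeld iteration. This isolates only the fusion rule, with made-up action vectors, not the full FRL training pipeline or the Normalized attack.

```python
import numpy as np

def geometric_median(points, iters=100, eps=1e-8):
    # Weiszfeld iteration: iteratively reweighted average with weights
    # inversely proportional to the distance from the current estimate.
    y = points.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(points - y, axis=1), eps)
        w = 1.0 / d
        y = (w[:, None] * points).sum(axis=0) / w.sum()
    return y

# Four honest policies agree on an action near (1, 0); one poisoned
# policy proposes an extreme action. The mean is dragged far away,
# the geometric median barely moves.
actions = np.array([[1.0, 0.0], [1.1, 0.1], [0.9, -0.1], [1.0, 0.1],
                    [50.0, 50.0]])
print(actions.mean(axis=0), geometric_median(actions))
```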

[LG-41] Out-of-Distribution Detection on Graphs: A Survey

链接: https://arxiv.org/abs/2502.08105
作者: Tingyi Cai,Yunliang Jiang,Yixin Liu,Ming Li,Changqin Huang,Shirui Pan
类目: Machine Learning (cs.LG)
*备注: 9 pages, 6 figures

点击查看摘要

Abstract:Graph machine learning has witnessed rapid growth, driving advancements across diverse domains. However, the in-distribution assumption, where training and testing data share the same distribution, often breaks in real-world scenarios, leading to degraded model performance under distribution shifts. This challenge has catalyzed interest in graph out-of-distribution (GOOD) detection, which focuses on identifying graph data that deviates from the distribution seen during training, thereby enhancing model robustness. In this paper, we provide a rigorous definition of GOOD detection and systematically categorize existing methods into four types: enhancement-based, reconstruction-based, information propagation-based, and classification-based approaches. We analyze the principles and mechanisms of each approach and clarify the distinctions between GOOD detection and related fields, such as graph anomaly detection, outlier detection, and GOOD generalization. Beyond methodology, we discuss practical applications and theoretical foundations, highlighting the unique challenges posed by graph data. Finally, we discuss the primary challenges and propose future directions to advance this emerging field. The repository of this survey is available at this https URL.

[LG-42] Unsupervised categorization of similarity measures

链接: https://arxiv.org/abs/2502.08098
作者: Yoshiyuki Ohmura,Wataru Shimaya,Yasuo Kuniyoshi
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: arXiv admin note: substantial text overlap with arXiv:2306.00239

点击查看摘要

Abstract:In general, objects can be distinguished on the basis of their features, such as color or shape. In particular, it is assumed that similarity judgments about such features can be processed independently in different metric spaces. However, the unsupervised categorization mechanism of metric spaces corresponding to object features remains unknown. Here, we show that the artificial neural network system can autonomously categorize metric spaces through representation learning to satisfy the algebraic independence between neural networks, and project sensory information onto multiple high-dimensional metric spaces to independently evaluate the differences and similarities between features. Conventional methods often constrain the axes of the latent space to be mutually independent or orthogonal. However, the independent axes are not suitable for categorizing metric spaces. High-dimensional metric spaces that are independent of each other are not uniquely determined by the mutually independent axes, because any combination of independent axes can form mutually independent spaces. In other words, the mutually independent axes cannot be used to naturally categorize different feature spaces, such as color space and shape space. Therefore, constraining the axes to be mutually independent makes it difficult to categorize high-dimensional metric spaces. To overcome this problem, we developed a method to constrain only the spaces to be mutually independent and not the composed axes to be independent. Our theory provides general conditions for the unsupervised categorization of independent metric spaces, thus advancing the mathematical theory of functional differentiation of neural networks.

[LG-43] Mixture of Decoupled Message Passing Experts with Entropy Constraint for General Node Classification

链接: https://arxiv.org/abs/2502.08083
作者: Xuanze Chen,Jiajun Zhou,Jinsong Chen,Shanqing Yu,Qi Xuan
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: arXiv admin note: text overlap with arXiv:2412.08193

点击查看摘要

Abstract:The varying degrees of homophily and heterophily in real-world graphs persistently constrain the universality of graph neural networks (GNNs) for node classification. Adopting a data-centric perspective, this work reveals an inherent preference of different graphs towards distinct message encoding schemes: homophilous graphs favor local propagation, while heterophilous graphs exhibit a preference for flexible combinations of propagation and transformation. To address this, we propose GNNMoE, a universal node classification framework based on the Mixture-of-Experts (MoE) mechanism. The framework first constructs diverse message-passing experts through recombination of fine-grained encoding operators, then designs soft and hard gating layers to allocate the most suitable expert networks for each node’s representation learning, thereby enhancing both model expressiveness and adaptability to diverse graphs. Furthermore, considering that soft gating might introduce encoding noise in homophilous scenarios, we introduce an entropy constraint to guide the sharpening of soft gates, achieving an organic integration of weighted combination and Top-K selection. Extensive experiments demonstrate that GNNMoE significantly outperforms mainstream GNNs, heterophilous GNNs, and graph transformers in both node classification performance and universality across diverse graph datasets.
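The entropy-constrained gate sharpening can be illustrated on a single soft gate: penalizing the entropy of the softmax gate distribution and descending on it pushes the gate toward a Top-K-like selection. This is a hypothetical minimal sketch of the mechanism, not GNNMoE's actual training loss; the logits and step size are made up.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    # Shannon entropy of a gate distribution (nats).
    return float(-(p * np.log(p + 1e-12)).sum())

logits = np.array([1.0, 0.8, 0.6, 0.4])  # soft gate over 4 experts
p = softmax(logits)
# Analytic gradient of H(softmax(logits)) w.r.t. the logits:
#   dH/dz_j = -p_j * (log p_j + H(p))
grad = -p * (np.log(p + 1e-12) + entropy(p))
sharpened = softmax(logits - 1.0 * grad)  # one descent step on entropy
print(entropy(p), entropy(sharpened))     # entropy decreases
```

A single step already widens the logit gaps, concentrating mass on the leading expert while keeping the gate differentiable, which is the "weighted combination meets Top-K" behavior the abstract describes.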

[LG-44] Cascading Bandits Robust to Adversarial Corruptions

链接: https://arxiv.org/abs/2502.08077
作者: Jize Xie,Cheng Chen,Zhiyong Wang,Shuai Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online learning to rank sequentially recommends a small list of items to users from a large candidate set and receives the users’ click feedback. In many real-world scenarios, users browse the recommended list in order and click the first attractive item without checking the rest. Such behaviors are usually formulated as the cascade model. Many recent works study algorithms for cascading bandits, an online learning to rank framework in the cascade model. However, the performance of existing methods may drop significantly if part of the user feedback is adversarially corrupted (e.g., click fraud). In this work, we study how to resist adversarial corruptions in cascading bandits. We first formulate the "Cascading Bandits with Adversarial Corruptions" (CBAC) problem, which assumes that there is an adaptive adversary that may manipulate the user feedback. Then we propose two robust algorithms for this problem, which assume the corruption level is known and agnostic, respectively. We show that both algorithms can achieve logarithmic regret when the algorithm is not under attack, and the regret increases linearly with the corruption level. The experimental results also verify the robustness of our methods.

[LG-45] Multi-Agent Performative Prediction Beyond the Insensitivity Assumption: A Case Study for Mortgage Competition

链接: https://arxiv.org/abs/2502.08063
作者: Guanghui Wang,Krishna Acharya,Lokranjan Lakshmikanthan,Vidya Muthukumar,Juba Ziani
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Performative prediction models account for feedback loops in decision-making processes where predictions influence future data distributions. While existing work largely assumes insensitivity of data distributions to small strategy changes, this assumption usually fails in real-world competitive (i.e. multi-agent) settings. For example, in Bertrand-type competitions, a small reduction in one firm’s price can lead that firm to capture the entire demand, while all others sharply lose all of their customers. We study a representative setting of multi-agent performative prediction in which insensitivity assumptions do not hold, and investigate the convergence of natural dynamics. To do so, we focus on a specific game that we call the "Bank Game", where two lenders compete over interest rates and credit score thresholds. Consumers act similarly to those in a Bertrand competition, with each consumer selecting the firm with the lowest interest rate that they are eligible for based on the firms’ credit thresholds. Our analysis characterizes the equilibria of this game and demonstrates that when both firms use a common and natural no-regret learning dynamic – exponential weights – with proper initialization, the dynamics always converge to stable outcomes despite the general-sum structure. Notably, our setting admits multiple stable equilibria, with convergence dependent on initial conditions. We also provide theoretical convergence results in the stochastic case when the utility matrix is not fully known, but each learner can observe sufficiently many samples of consumers at each time step to estimate it, showing robustness to slight mis-specifications. Finally, we provide experimental results that validate our theoretical findings.

[LG-46] General Coded Computing: Adversarial Settings

链接: https://arxiv.org/abs/2502.08058
作者: Parsa Moradi,Hanzaleh Akbarinodehi,Mohammad Ali Maddah-Ali
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 18 pages, 1 figure

点击查看摘要

Abstract:Conventional coded computing frameworks are predominantly tailored for structured computations, such as matrix multiplication and polynomial evaluation. Such tasks allow the reuse of tools and techniques from algebraic coding theory to improve the reliability of distributed systems in the presence of stragglers and adversarial servers. This paper lays the foundation for general coded computing, which extends the applicability of coded computing to handle a wide class of computations. In addition, it particularly addresses the challenging problem of managing adversarial servers. We demonstrate that, in the proposed scheme, for a system with N servers, where \mathcal{O}(N^a), a \in [0,1), are adversarial, the supremum of the average approximation error over all adversarial strategies decays at a rate of N^{\frac{6}{5}(a-1)}, under minimal assumptions on the computing tasks. Furthermore, we show that within a general framework, the proposed scheme achieves optimal adversarial robustness, in terms of the maximum number of adversarial servers it can tolerate. This marks a significant step toward practical and reliable general coded computing. Implementation results further validate the effectiveness of the proposed method in handling various computations, including inference in deep neural networks.

[LG-47] SLVR: Securely Leveraging Client Validation for Robust Federated Learning

链接: https://arxiv.org/abs/2502.08055
作者: Jihye Choi,Sai Rahul Rachuri,Ke Wang,Somesh Jha,Yizhen Wang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 29 pages

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training while keeping client data private. However, exposing individual client updates makes FL vulnerable to reconstruction attacks. Secure aggregation mitigates such privacy risks but prevents the server from verifying the validity of each client update, creating a privacy-robustness tradeoff. Recent efforts attempt to address this tradeoff by enforcing checks on client updates using zero-knowledge proofs, but they support limited predicates and often depend on public validation data. We propose SLVR, a general framework that securely leverages clients’ private data through secure multi-party computation. By utilizing clients’ data, SLVR not only eliminates the need for public validation data, but also enables a wider range of checks for robustness, including cross-client accuracy validation. It also adapts naturally to distribution shifts in client data as it can securely refresh its validation data up-to-date. Our empirical evaluations show that SLVR improves robustness against model poisoning attacks, particularly outperforming existing methods by up to 50% under adaptive attacks. Additionally, SLVR demonstrates effective adaptability and stable convergence under various distribution shift scenarios.

[LG-48] COMBO-Grasp: Learning Constraint-Based Manipulation for Bimanual Occluded Grasping

链接: https://arxiv.org/abs/2502.08054
作者: Jun Yamada,Alexander L. Mitchell,Jack Collins,Ingmar Posner
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 14 pages, 11 figures

点击查看摘要

Abstract:This paper addresses the challenge of occluded robot grasping, i.e. grasping in situations where the desired grasp poses are kinematically infeasible due to environmental constraints such as surface collisions. Traditional robot manipulation approaches struggle with the complexity of non-prehensile or bimanual strategies commonly used by humans in these circumstances. State-of-the-art reinforcement learning (RL) methods are unsuitable due to the inherent complexity of the task. In contrast, learning from demonstration requires collecting a significant number of expert demonstrations, which is often infeasible. Instead, inspired by human bimanual manipulation strategies, where two hands coordinate to stabilise and reorient objects, we focus on a bimanual robotic setup to tackle this challenge. In particular, we introduce Constraint-based Manipulation for Bimanual Occluded Grasping (COMBO-Grasp), a learning-based approach which leverages two coordinated policies: a constraint policy trained using self-supervised datasets to generate stabilising poses and a grasping policy trained using RL that reorients and grasps the target object. A key contribution lies in value function-guided policy coordination. Specifically, during RL training for the grasping policy, the constraint policy’s output is refined through gradients from a jointly trained value function, improving bimanual coordination and task performance. Lastly, COMBO-Grasp employs teacher-student policy distillation to effectively deploy point cloud-based policies in real-world environments. Empirical evaluations demonstrate that COMBO-Grasp significantly improves task success rates compared to competitive baseline approaches, with successful generalisation to unseen objects in both simulated and real-world environments.

[LG-49] The Art of Misclassification: Too Many Classes, Not Enough Points

链接: https://arxiv.org/abs/2502.08041
作者: Mario Franco,Gerardo Febres,Nelson Fernández,Carlos Gershenson
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Classification is a ubiquitous and fundamental problem in artificial intelligence and machine learning, with extensive efforts dedicated to developing more powerful classifiers and larger datasets. However, the classification task is ultimately constrained by the intrinsic properties of datasets, independently of computational power or model complexity. In this work, we introduce a formal entropy-based measure of classificability, which quantifies the inherent difficulty of a classification problem by assessing the uncertainty in class assignments given feature representations. This measure captures the degree of class overlap and aligns with human intuition, serving as an upper bound on classification performance for classification problems. Our results establish a theoretical limit beyond which no classifier can improve the classification accuracy, regardless of the architecture or amount of data, in a given problem. Our approach provides a principled framework for understanding when classification is inherently fallible and fundamentally ambiguous.
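As a rough illustration of the idea (a simplified sketch over discretized features; the function name, binning scheme, and exact quantities are illustrative, not the paper's measure): the conditional entropy H(Y|X) quantifies class overlap, and sum_x p(x) max_y p(y|x) is the Bayes-accuracy ceiling no classifier can exceed on that data.

```python
import numpy as np
from collections import Counter

def classifiability(features, labels):
    """Estimate H(Y|X) over discretized feature values, plus the
    Bayes-accuracy upper bound sum_x p(x) * max_y p(y|x)."""
    by_x = {}
    for x, y in zip(map(tuple, features), labels):
        by_x.setdefault(x, []).append(y)
    n = len(labels)
    h, bayes = 0.0, 0.0
    for ys in by_x.values():
        px = len(ys) / n
        p = np.array(list(Counter(ys).values()), dtype=float)
        p /= p.sum()
        h += px * float(-(p * np.log2(p)).sum())  # H(Y | X=x), weighted
        bayes += px * float(p.max())              # best possible accuracy at x
    return h, bayes

# Perfectly separable data: zero entropy, accuracy bound 1.0.
print(classifiability([[0], [0], [1], [1]], [0, 0, 1, 1]))
```

With fully overlapping classes (same features, mixed labels) the entropy rises to 1 bit and the accuracy bound drops to 0.5, matching the intuition that no model or dataset size can help.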

[LG-50] End-to-End Predictive Planner for Autonomous Driving with Consistency Models

链接: https://arxiv.org/abs/2502.08033
作者: Anjian Li,Sangjae Bae,David Isele,Ryne Beeson,Faizan M. Tariq
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Trajectory prediction and planning are fundamental components for autonomous vehicles to navigate safely and efficiently in dynamic environments. Traditionally, these components have often been treated as separate modules, limiting the ability to perform interactive planning and leading to computational inefficiency in multi-agent scenarios. In this paper, we present a novel unified and data-driven framework that integrates prediction and planning with a single consistency model. Trained on real-world human driving datasets, our consistency model generates samples from high-dimensional, multimodal joint trajectory distributions of the ego and multiple surrounding agents, enabling end-to-end predictive planning. It effectively produces interactive behaviors, such as proactive nudging and yielding to ensure both safe and efficient interactions with other road users. To incorporate additional planning constraints on the ego vehicle, we propose an alternating direction method for multi-objective guidance in online guided sampling. Compared to diffusion models, our consistency model achieves better performance with fewer sampling steps, making it more suitable for real-time deployment. Experimental results on Waymo Open Motion Dataset (WOMD) demonstrate our method’s superiority in trajectory quality, constraint satisfaction, and interactive behavior compared to various existing approaches.

[LG-51] Initialization Matters: Unraveling the Impact of Pre-Training on Federated Learning

链接: https://arxiv.org/abs/2502.08024
作者: Divyansh Jhunjhunwala,Pranay Sharma,Zheng Xu,Gauri Joshi
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Initializing with pre-trained models when learning on downstream tasks is becoming standard practice in machine learning. Several recent works explore the benefits of pre-trained initialization in a federated learning (FL) setting, where the downstream training is performed at the edge clients with heterogeneous data distribution. These works show that starting from a pre-trained model can substantially reduce the adverse impact of data heterogeneity on the test performance of a model trained in a federated setting, with no changes to the standard FedAvg training algorithm. In this work, we provide a deeper theoretical understanding of this phenomenon. To do so, we study the class of two-layer convolutional neural networks (CNNs) and provide bounds on the training error convergence and test error of such a network trained with FedAvg. We introduce the notion of aligned and misaligned filters at initialization and show that the data heterogeneity only affects learning on misaligned filters. Starting with a pre-trained model typically results in fewer misaligned filters at initialization, thus producing a lower test error even when the model is trained in a federated setting with data heterogeneity. Experiments in synthetic settings and practical FL training on CNNs verify our theoretical findings.
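For context, the FedAvg loop the paper analyzes can be sketched on a toy linear least-squares model (a minimal sketch; the paper's analysis concerns two-layer CNNs, and all names here are illustrative):

```python
import numpy as np

def fedavg_round(w, clients, lr=0.1, local_steps=5):
    """One FedAvg round: each client runs local gradient steps on its own
    data starting from the shared global weights; the server averages the
    resulting local models."""
    local_models = []
    for X, y in clients:
        wc = w.copy()
        for _ in range(local_steps):
            wc -= lr * X.T @ (X @ wc - y) / len(y)  # least-squares gradient
        local_models.append(wc)
    return np.mean(local_models, axis=0)
```

Starting `w` from a (hypothetical) pre-trained model near the solution, rather than from a random point, reaches a given error in far fewer rounds, which is the phenomenon the paper explains via aligned vs. misaligned filters.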

[LG-52] An Interactive Framework for Implementing Privacy-Preserving Federated Learning: Experiments on Large Language Models

链接: https://arxiv.org/abs/2502.08008
作者: Kasra Ahmadi,Rouzbeh Behnia,Reza Ebrahimi,Mehran Mozaffari Kermani,Jeremiah Birrell,Jason Pacheco,Attila A Yavuz
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Federated learning (FL) enhances privacy by keeping user data on local devices. However, emerging attacks have demonstrated that the updates shared by users during training can reveal significant information about their data. This has greatly thwarted the adoption of FL methods for training robust AI models in sensitive applications. Differential Privacy (DP) is considered the gold standard for safeguarding user data. However, DP guarantees are highly conservative, providing worst-case privacy guarantees. This can result in overestimating privacy needs, which may compromise the model’s accuracy. Additionally, interpretations of these privacy guarantees have proven to be challenging in different contexts. This is further exacerbated when other factors, such as the number of training iterations, data distribution, and specific application requirements, can add further complexity to this problem. In this work, we propose a framework that integrates a human entity as a privacy practitioner to determine an optimal trade-off between the model’s privacy and utility. Our framework is the first to address the variable memory requirement of existing DP methods in FL settings, where resource-limited devices (e.g., cell phones) can participate. To support such settings, we adopt a recent DP method with fixed memory usage to ensure scalable private FL. We evaluated our proposed framework by fine-tuning a BERT-based LLM model using the GLUE dataset (a common approach in literature), leveraging the new accountant, and employing diverse data partitioning strategies to mimic real-world conditions. As a result, we achieved stable memory usage, with an average accuracy reduction of 1.33% for \epsilon = 10 and 1.9% for \epsilon = 6 , when compared to the state-of-the-art DP accountant which does not support fixed memory usage.

[LG-53] The Role of Randomness in Stability

链接: https://arxiv.org/abs/2502.08007
作者: Max Hopkins,Shay Moran
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Stability is a central property in learning and statistics promising the output of an algorithm A does not change substantially when applied to similar datasets S and S'. It is an elementary fact that any sufficiently stable algorithm (e.g. one returning the same result with high probability, satisfying privacy guarantees, etc.) must be randomized. This raises a natural question: can we quantify how much randomness is needed for algorithmic stability? We study the randomness complexity of two influential notions of stability in learning: replicability, which promises A usually outputs the same result when run over samples from the same distribution (and shared random coins), and differential privacy, which promises the output distribution of A remains similar under neighboring datasets. The randomness complexity of these notions was studied recently in (Dixon et al. ICML 2024) and (Cannone et al. ITCS 2024) for basic d-dimensional tasks (e.g. estimating the bias of d coins), but little is known about the measures more generally or in complex settings like classification. Toward this end, we prove a 'weak-to-strong' boosting theorem for stability: the randomness complexity of a task M (either under replicability or DP) is tightly controlled by the best replication probability of any deterministic algorithm solving the task, a weak measure called 'global stability' that is universally capped at \frac{1}{2} (Chase et al. FOCS 2023). Using this, we characterize the randomness complexity of PAC Learning: a class has bounded randomness complexity iff it has finite Littlestone dimension, and moreover scales at worst logarithmically in the excess error of the learner. This resolves a question of (Chase et al. STOC 2024) who asked for such a characterization in the equivalent language of (error-dependent) 'list-replicability'.

[LG-54] Heterogeneous Multi-agent Multi-armed Bandits on Stochastic Block Models

链接: https://arxiv.org/abs/2502.08003
作者: Mengfan Xu,Liren Shan,Fatemeh Ghaffari,Xuchuang Wang,Xutong Liu,Mohammad Hajiesmaili
类目: Machine Learning (cs.LG)
*备注: 55 pages

点击查看摘要

Abstract:We study a novel heterogeneous multi-agent multi-armed bandit problem with a cluster structure induced by stochastic block models, influencing not only graph topology, but also reward heterogeneity. Specifically, agents are distributed on random graphs based on stochastic block models - a generalized Erdos-Renyi model with heterogeneous edge probabilities: agents are grouped into clusters (known or unknown); edge probabilities for agents within the same cluster differ from those across clusters. In addition, the cluster structure in stochastic block model also determines our heterogeneous rewards. Rewards distributions of the same arm vary across agents in different clusters but remain consistent within a cluster, unifying homogeneous and heterogeneous settings and varying degree of heterogeneity, and rewards are independent samples from these distributions. The objective is to minimize system-wide regret across all agents. To address this, we propose a novel algorithm applicable to both known and unknown cluster settings. The algorithm combines an averaging-based consensus approach with a newly introduced information aggregation and weighting technique, resulting in a UCB-type strategy. It accounts for graph randomness, leverages both intra-cluster (homogeneous) and inter-cluster (heterogeneous) information from rewards and graphs, and incorporates cluster detection for unknown cluster settings. We derive optimal instance-dependent regret upper bounds of order \log T under sub-Gaussian rewards. Importantly, our regret bounds capture the degree of heterogeneity in the system (an additional layer of complexity), exhibit smaller constants, scale better for large systems, and impose significantly relaxed assumptions on edge probabilities. In contrast, prior works have not accounted for this refined problem complexity, rely on more stringent assumptions, and exhibit limited scalability.
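The UCB-type strategy the abstract mentions builds on the classic UCB1 index, which can be sketched as follows (a single-agent sketch for orientation only; the paper's algorithm additionally averages and weights neighbours' statistics over the stochastic-block-model graph, which is omitted here):

```python
import numpy as np

def ucb_index(counts, means, t, c=2.0):
    """Pick the arm maximizing empirical mean + exploration bonus
    sqrt(c * log t / n_arm); unplayed arms are pulled first."""
    bonus = np.sqrt(c * np.log(max(t, 2)) / np.maximum(counts, 1))
    idx = means + bonus
    idx[counts == 0] = np.inf  # force at least one pull per arm
    return int(np.argmax(idx))
```

In the multi-agent setting, `counts` and `means` would be replaced by consensus-averaged statistics shared along the random graph's edges.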

[LG-55] Unveiling Client Privacy Leakage from Public Dataset Usage in Federated Distillation

链接: https://arxiv.org/abs/2502.08001
作者: Haonan Shi,Tu Ouyang,An Wang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 14 pages, 10 figures

点击查看摘要

Abstract:Federated Distillation (FD) has emerged as a popular federated training framework, enabling clients to collaboratively train models without sharing private data. Public Dataset-Assisted Federated Distillation (PDA-FD), which leverages public datasets for knowledge sharing, has become widely adopted. Although PDA-FD enhances privacy compared to traditional Federated Learning, we demonstrate that the use of public datasets still poses significant privacy risks to clients’ private training data. This paper presents the first comprehensive privacy analysis of PDA-FD in presence of an honest-but-curious server. We show that the server can exploit clients’ inference results on public datasets to extract two critical types of private information: label distributions and membership information of the private training dataset. To quantify these vulnerabilities, we introduce two novel attacks specifically designed for the PDA-FD setting: a label distribution inference attack and innovative membership inference methods based on Likelihood Ratio Attack (LiRA). Through extensive evaluation of three representative PDA-FD frameworks (FedMD, DS-FL, and Cronus), our attacks achieve state-of-the-art performance, with label distribution attacks reaching minimal KL-divergence and membership inference attacks maintaining high True Positive Rates under low False Positive Rate constraints. Our findings reveal significant privacy risks in current PDA-FD frameworks and emphasize the need for more robust privacy protection mechanisms in collaborative learning systems.

[LG-56] Adaptive kernel predictors from feature-learning infinite limits of neural networks

链接: https://arxiv.org/abs/2502.07998
作者: Clarissa Lauditi,Blake Bordelon,Cengiz Pehlevan
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Previous influential work showed that infinite width limits of neural networks in the lazy training regime are described by kernel machines. Here, we show that neural networks trained in the rich, feature learning infinite-width regime in two different settings are also described by kernel machines, but with data-dependent kernels. For both cases, we provide explicit expressions for the kernel predictors and prescriptions to numerically calculate them. To derive the first predictor, we study the large-width limit of feature-learning Bayesian networks, showing how feature learning leads to task-relevant adaptation of layer kernels and preactivation densities. The saddle point equations governing this limit result in a min-max optimization problem that defines the kernel predictor. To derive the second predictor, we study gradient flow training of randomly initialized networks trained with weight decay in the infinite-width limit using dynamical mean field theory (DMFT). The fixed point equations of the arising DMFT defines the task-adapted internal representations and the kernel predictor. We compare our kernel predictors to kernels derived from lazy regime and demonstrate that our adaptive kernels achieve lower test loss on benchmark datasets.

[LG-57] What is a Sketch-and-Precondition Derivation for Low-Rank Approximation? Inverse Power Error or Inverse Power Estimation?

链接: https://arxiv.org/abs/2502.07993
作者: Ruihan Xu,Yiping Lu
类目: Numerical Analysis (math.NA); Computational Complexity (cs.CC); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Randomized sketching accelerates large-scale numerical linear algebra by reducing computational complexity. While the traditional sketch-and-solve approach reduces the problem size directly through sketching, the sketch-and-precondition method leverages sketching to construct a computationally friendly preconditioner. This preconditioner improves the convergence speed of iterative solvers applied to the original problem, maintaining accuracy in the full space. Furthermore, the convergence rate of the solver improves at least linearly with the sketch size. Despite its potential, developing a sketch-and-precondition framework for randomized algorithms in low-rank matrix approximation remains an open challenge. We introduce the Error-Powered Sketched Inverse Iteration (EPSI) Method via run sketched Newton iteration for the Lagrange form as a sketch-and-precondition variant for randomized low-rank approximation. Our method achieves theoretical guarantees, including a convergence rate that improves at least linearly with the sketch size.
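The sketching primitive the paper builds on can be illustrated with the basic randomized range-finder (a sketch-and-solve style low-rank approximation; this is background, not the EPSI iteration itself, and the parameter names are illustrative):

```python
import numpy as np

def randomized_lowrank(A, k, oversample=10, seed=0):
    """Rank-k approximation via sketching: compress A with a Gaussian
    test matrix, orthonormalize the sketch, then SVD the small projection."""
    rng = np.random.default_rng(seed)
    Y = A @ rng.standard_normal((A.shape[1], k + oversample))  # sketch
    Q, _ = np.linalg.qr(Y)                                     # range basis
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)     # small SVD
    return (Q @ U)[:, :k], s[:k], Vt[:k]
```

Sketch-and-precondition methods, by contrast, use such a sketch only to build a preconditioner and then iterate on the original problem, keeping full-space accuracy.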

[LG-58] Learning Effective Dynamics across Spatio-Temporal Scales of Complex Flows

链接: https://arxiv.org/abs/2502.07990
作者: Han Gao,Sebastian Kaltenbach,Petros Koumoutsakos
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
*备注: Conference on Parsimony and Learning (CPAL)

点击查看摘要

Abstract:Modeling and simulation of complex fluid flows with dynamics that span multiple spatio-temporal scales is a fundamental challenge in many scientific and engineering domains. Full-scale resolving simulations for systems such as highly turbulent flows are not feasible in the foreseeable future, and reduced-order models must capture dynamics that involve interactions across scales. In the present work, we propose a novel framework, Graph-based Learning of Effective Dynamics (Graph-LED), that leverages graph neural networks (GNNs), as well as an attention-based autoregressive model, to extract the effective dynamics from a small amount of simulation data. GNNs represent flow fields on unstructured meshes as graphs and effectively handle complex geometries and non-uniform grids. The proposed method combines a GNN based, dimensionality reduction for variable-size unstructured meshes with an autoregressive temporal attention model that can learn temporal dependencies automatically. We evaluated the proposed approach on a suite of fluid dynamics problems, including flow past a cylinder and flow over a backward-facing step over a range of Reynolds numbers. The results demonstrate robust and effective forecasting of spatio-temporal physics; in the case of the flow past a cylinder, both small-scale effects that occur close to the cylinder as well as its wake are accurately captured.

[LG-59] A Survey of In-Context Reinforcement Learning

链接: https://arxiv.org/abs/2502.07978
作者: Amir Moeini,Jiuqi Wang,Jacob Beck,Ethan Blaser,Shimon Whiteson,Rohan Chandra,Shangtong Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) agents typically optimize their policies by performing expensive backward passes to update their network parameters. However, some agents can solve new tasks without updating any parameters by simply conditioning on additional context such as their action-observation histories. This paper surveys work on such behavior, known as in-context reinforcement learning.

[LG-60] RESIST: Resilient Decentralized Learning Using Consensus Gradient Descent

链接: https://arxiv.org/abs/2502.07977
作者: Cheng Fang,Rishabh Dixit,Waheed U. Bajwa,Mert Gurbuzbalaban
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: preprint of a journal paper; 100 pages and 17 figures

点击查看摘要

Abstract:Empirical risk minimization (ERM) is a cornerstone of modern machine learning (ML), supported by advances in optimization theory that ensure efficient solutions with provable algorithmic convergence rates, which measure the speed at which optimization algorithms approach a solution, and statistical learning rates, which characterize how well the solution generalizes to unseen data. Privacy, memory, computational, and communications constraints increasingly necessitate data collection, processing, and storage across network-connected devices. In many applications, these networks operate in decentralized settings where a central server cannot be assumed, requiring decentralized ML algorithms that are both efficient and resilient. Decentralized learning, however, faces significant challenges, including an increased attack surface for adversarial interference during decentralized learning processes. This paper focuses on the man-in-the-middle (MITM) attack, which can cause models to deviate significantly from their intended ERM solutions. To address this challenge, we propose RESIST (Resilient dEcentralized learning using conSensus gradIent deScenT), an optimization algorithm designed to be robust against adversarially compromised communication links. RESIST achieves algorithmic and statistical convergence for strongly convex, Polyak-Lojasiewicz, and nonconvex ERM problems. Experimental results demonstrate the robustness and scalability of RESIST for real-world decentralized learning in adversarial environments.

[LG-61] Sink equilibria and the attractors of learning in games

链接: https://arxiv.org/abs/2502.07975
作者: Oliver Biggar,Christos Papadimitriou
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Characterizing the limit behavior – that is, the attractors – of learning dynamics is one of the most fundamental open questions in game theory. In recent work in this front, it was conjectured that the attractors of the replicator dynamic are in one-to-one correspondence with the sink equilibria of the game – the sink strongly connected components of a game’s preference graph – , and it was established that they do stand in at least one-to-many correspondence with them. We make threefold progress on the problem of characterizing attractors. First, we show through a topological construction that the one-to-one conjecture is false. Second, we make progress on the attractor characterization problem for two-player games by establishing that the one-to-one conjecture is true in the absence of a local pattern called a weak local source – a pattern that is absent from zero-sum games. Finally, we look – for the first time in this context – at fictitious play, the longest-studied learning dynamic, and examine to what extent the conjecture generalizes there. We establish that under fictitious play, sink equilibria always contain attractors (sometimes strictly), and every attractor corresponds to a strongly connected set of nodes in the preference graph.
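The sink equilibria in the abstract are the sink strongly connected components of the preference graph, which can be computed directly (a minimal sketch using Kosaraju's algorithm on a `{node: [successors]}` adjacency dict; the representation is an assumption for illustration):

```python
def sink_sccs(graph):
    """Return the SCCs of a directed graph that have no edges leaving
    them -- the 'sink equilibria' of a game's preference graph."""
    visited, order = set(), []

    def dfs(u, g, out):
        stack = [(u, iter(g.get(u, ())))]
        visited.add(u)
        while stack:
            v, it = stack[-1]
            for w in it:
                if w not in visited:
                    visited.add(w)
                    stack.append((w, iter(g.get(w, ()))))
                    break
            else:               # v's successors exhausted: record postorder
                stack.pop()
                out.append(v)
        return out

    for u in graph:
        if u not in visited:
            dfs(u, graph, order)

    transpose = {u: [] for u in graph}
    for u, vs in graph.items():
        for v in vs:
            transpose.setdefault(v, []).append(u)

    visited.clear()
    comps = [dfs(u, transpose, []) for u in reversed(order) if u not in visited]
    comp_of = {u: i for i, c in enumerate(comps) for u in c}
    return [set(c) for i, c in enumerate(comps)
            if all(comp_of[v] == i for u in c for v in graph.get(u, ()))]
```

For example, in the graph 1→2, 2↔3, 4 (isolated), the sink SCCs are {2, 3} and {4}, while {1} is not a sink because it has an outgoing edge.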

[LG-62] New tools for comparing classical and neural ODE models for tumor growth

链接: https://arxiv.org/abs/2502.07964
作者: Anthony D. Blaom,Samuel Okon
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 9 pages, 2 figures. Related software is archived at this https URL

点击查看摘要

Abstract:A new computational tool this http URL for modeling tumor growth is introduced. The tool allows the comparison of standard textbook models, such as General Bertalanffy and Gompertz, with some newer models, including, for the first time, neural ODE models. As an application, we revisit a human meta-study of non-small cell lung cancer and bladder cancer lesions, in patients undergoing two different treatment options, to determine if previously reported performance differences are statistically significant, and if newer, more complex models perform any better. In a population of examples with at least four time-volume measurements available for calibration, and an average of about 6.3, our main conclusion is that the General Bertalanffy model has superior performance, on average. However, where more measurements are available, we argue that more complex models, capable of capturing rebound and relapse behavior, may be better choices.
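For reference, one common closed-form parameterization of the Gompertz model mentioned above (an illustrative sketch; the parameter names are not the tool's API):

```python
import numpy as np

def gompertz(t, v0, a, b):
    """Gompertz tumor-volume curve solving dV/dt = a * exp(-b*t) * V,
    i.e. V(t) = v0 * exp((a/b) * (1 - exp(-b*t)))."""
    return v0 * np.exp((a / b) * (1.0 - np.exp(-b * t)))
```

The curve starts at `v0` and saturates at `v0 * exp(a/b)` as t grows, which is the bounded-growth behavior that distinguishes it from pure exponential models.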

[LG-63] ESPFormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans

链接: https://arxiv.org/abs/2502.07962
作者: Ashkan Shahbazi,Elaheh Akbari,Darian Salehi,Xinran Liu,Navid Naderializadeh,Soheil Kolouri
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly-stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces double stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models. Experiments across multiple benchmark datasets, including image classification, point cloud classification, sentiment analysis, and neural machine translation, demonstrate that our enhanced attention regularization consistently improves performance across diverse applications.
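The iterative Sinkhorn baseline that ESPFormer is designed to avoid looks like this (a minimal numpy sketch of the normalization, not the paper's method):

```python
import numpy as np

def sinkhorn_attention(logits, n_iters=100):
    """Alternately normalize rows and columns of exp(logits) until the
    matrix is (approximately) doubly stochastic."""
    P = np.exp(logits - logits.max())  # positive, numerically stable
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)  # row normalization
        P /= P.sum(axis=0, keepdims=True)  # column normalization
    return P
```

Each call requires many sequential passes over the matrix; the paper's sliced-optimal-transport construction enforces the same doubly-stochastic constraint without this iteration.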

[LG-64] Symbiotic Cooperation for Web Agents: Harnessing Complementary Strengths of Large and Small LLMs

链接: https://arxiv.org/abs/2502.07942
作者: Ruichen Zhang,Mufan Qiu,Zhen Tan,Mohan Zhang,Vincent Lu,Jie Peng,Kaidi Xu,Leandro Z. Agudelo,Peter Qian,Tianlong Chen
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Web browsing agents powered by large language models (LLMs) have shown tremendous potential in automating complex web-based tasks. Existing approaches typically rely on large LLMs (e.g., GPT-4o) to explore web environments and generate trajectory data, which is then used either for demonstration retrieval (for large LLMs) or to distill small LLMs (e.g., Llama3) in a process that remains decoupled from the exploration. In this paper, we propose AgentSymbiotic, an iterative framework that couples data synthesis with task-performance, yielding a “symbiotic improvement” for both large and small LLMs. Our study uncovers a complementary dynamic between LLM types: while large LLMs excel at generating high-quality trajectories for distillation, the distilled small LLMs-owing to their distinct reasoning capabilities-often choose actions that diverge from those of their larger counterparts. This divergence drives the exploration of novel trajectories, thereby enriching the synthesized data. However, we also observe that the performance of small LLMs becomes a bottleneck in this iterative enhancement process. To address this, we propose two innovations in LLM distillation: a speculative data synthesis strategy that mitigates off-policy bias, and a multi-task learning approach designed to boost the reasoning capabilities of the student LLM. Furthermore, we introduce a Hybrid Mode for Privacy Preservation to address user privacy concerns. Evaluated on the WEBARENA benchmark, AgentSymbiotic achieves SOTA performance with both LLM types. Our best Large LLM agent reaches 52%, surpassing the previous best of 45%, while our 8B distilled model demonstrates a competitive 49%, exceeding the prior best of 28%. Code will be released upon acceptance.

[LG-65] Active Advantage-Aligned Online Reinforcement Learning with Offline Data

链接: https://arxiv.org/abs/2502.07937
作者: Xuefeng Liu,Hung T. C. Le,Siyu Chen,Rick Stevens,Zhuoran Yang,Matthew R. Walter,Yuxin Chen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Online reinforcement learning (RL) enhances policies through direct interactions with the environment, but faces challenges related to sample efficiency. In contrast, offline RL leverages extensive pre-collected data to learn policies, but often produces suboptimal results due to limited data coverage. Recent efforts have sought to integrate offline and online RL in order to harness the advantages of both approaches. However, effectively combining online and offline RL remains challenging due to issues that include catastrophic forgetting, lack of robustness and sample efficiency. In an effort to address these challenges, we introduce A3RL, a novel method that actively selects data from combined online and offline sources to optimize policy improvement. We provide a theoretical guarantee that validates the effectiveness of our active sampling strategy and conduct thorough empirical experiments showing that our method outperforms existing state-of-the-art online RL techniques that utilize offline data. Our code will be publicly available at: this https URL.

[LG-66] MAAT: Mamba Adaptive Anomaly Transformer with association discrepancy for time series

链接: https://arxiv.org/abs/2502.07858
作者: Abdellah Zakaria Sellam,Ilyes Benaissa,Abdelmalik Taleb-Ahmed,Luigi Patrono,Cosimo Distante
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anomaly detection in time series is essential for industrial monitoring and environmental sensing, yet distinguishing anomalies from complex patterns remains challenging. Existing methods like the Anomaly Transformer and DCdetector have progressed, but they face limitations such as sensitivity to short-term contexts and inefficiency in noisy, non-stationary environments. To overcome these issues, we introduce MAAT, an improved architecture that enhances association discrepancy modeling and reconstruction quality. MAAT features Sparse Attention, efficiently capturing long-range dependencies by focusing on relevant time steps, thereby reducing computational redundancy. Additionally, a Mamba-Selective State Space Model is incorporated into the reconstruction module, utilizing a skip connection and Gated Attention to improve anomaly localization and detection performance. Extensive experiments show that MAAT significantly outperforms previous methods, achieving better anomaly distinguishability and generalization across various time series applications, setting a new standard for unsupervised time series anomaly detection in real-world scenarios.

[LG-67] Memory Analysis on the Training Course of DeepSeek Models

链接: https://arxiv.org/abs/2502.07846
作者: Ping Zhang,Lei Su
类目: Performance (cs.PF); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a theoretical analysis of GPU memory consumption during the training of DeepSeek models such as DeepSeek-v2 and DeepSeek-v3. Our primary objective is to clarify the device-level memory requirements associated with various distributed training configurations. Specifically, we examine critical factors influencing memory usage, including micro-batch size, activation recomputation policies, 3D parallelism, and ZeRO optimizations. It is important to emphasize that the training policies discussed in this report are not representative of DeepSeek’s official configurations. Instead, they are explored to provide a deeper understanding of memory dynamics in training of large-scale mixture-of-experts model.

[LG-68] Emotional EEG Classification using Upscaled Connectivity Matrices

链接: https://arxiv.org/abs/2502.07843
作者: Chae-Won Lee,Jong-Seok Lee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent studies of emotional EEG classification, connectivity matrices have been successfully employed as input to convolutional neural networks (CNNs), which can effectively consider inter-regional interaction patterns in EEG. However, we find that such an approach has a limitation that important patterns in connectivity matrices may be lost during the convolutional operations in CNNs. To resolve this issue, we propose and validate an idea to upscale the connectivity matrices to strengthen the local patterns. Experimental results demonstrate that this simple idea can significantly enhance the classification performance.
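The upscaling idea can be sketched in a few lines (a minimal nearest-neighbour version using a Kronecker product; the factor and function name are illustrative, not the paper's exact setting):

```python
import numpy as np

def upscale_connectivity(C, factor=4):
    """Upscale a connectivity matrix by repeating each entry in a
    factor x factor block, so local patterns cover more pixels and
    survive the CNN's convolution and pooling operations."""
    return np.kron(C, np.ones((factor, factor)))
```

A 62x62 EEG connectivity matrix upscaled with `factor=4` becomes 248x248, giving each inter-regional interaction a larger footprint in the CNN's receptive fields.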

[LG-69] Optimal Actuator Attacks on Autonomous Vehicles Using Reinforcement Learning IROS

链接: https://arxiv.org/abs/2502.07839
作者: Pengyu Wang,Jialu Li,Ling Shi
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop

点击查看摘要

Abstract:With the increasing prevalence of autonomous vehicles (AVs), their vulnerability to various types of attacks has grown, presenting significant security challenges. In this paper, we propose a reinforcement learning (RL)-based approach for designing optimal stealthy integrity attacks on AV actuators. We also analyze the limitations of state-of-the-art RL-based secure controllers developed to counter such attacks. Through extensive simulation experiments, we demonstrate the effectiveness and efficiency of our proposed method.

[LG-70] RoboBERT: An End-to-end Multimodal Robotic Manipulation Model

链接: https://arxiv.org/abs/2502.07837
作者: Sicheng Wang,Jianhua Shan,Jianwei Zhang,Haozhang Gao,Hailiang Han,Yipeng Chen,Kang Wei,Chengkun Zhang,Kairos Wong,Jie Zhao,Lei Zhao,Bin Fang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Embodied intelligence integrates multiple modalities, enabling agents to understand images, language, and actions simultaneously. However, existing models always depend on additional datasets or extensive pre-training to maximize performance improvements, consuming substantial training time and incurring high hardware costs. To tackle this issue, we present RoboBERT, a novel end-to-end robotic manipulation model integrated with a unique training strategy. This model utilizes a CNN-based diffusion policy, enhancing and stabilizing the effectiveness of this model by separating the training processes for different modalities. It also underscores the importance of data augmentation, verifying various techniques to significantly boost performance. Unlike models that depend on extra data or large foundation models, RoboBERT achieves a highly competitive success rate while using only language-labeled expert demonstrations and maintaining a relatively smaller model size. Specifically, RoboBERT achieves an average length of 4.52 on the CALVIN benchmark for the (ABCD \rightarrow D) task, setting a new state-of-the-art (SOTA) record. Furthermore, when tested on a real robot, the model demonstrates superior performance, achieving a higher success rate than other methods trained with the same data. We propose that these concepts and methodologies of RoboBERT demonstrate extensive versatility and compatibility, contributing significantly to the development of lightweight multimodal robotic models. The code can be accessed on this https URL

[LG-71] Data-Driven Socio-Economic Deprivation Prediction via Dimensionality Reduction: The Power of Diffusion Maps

链接: https://arxiv.org/abs/2312.09830
作者: June Moh Goo
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: Accepted to 2024 IEEE International Conference on Big Data (IEEE Big Data 2024)

点击查看摘要

Abstract:This research proposes a model to predict the location of the most deprived areas in a city using data from the census. Census data is very high-dimensional and needs to be simplified. We use the diffusion map algorithm to reduce dimensionality and find patterns. Features are defined by eigenvectors of the Laplacian matrix that defines the diffusion map. The eigenvectors corresponding to the smallest eigenvalues indicate specific characteristics of the population. Previous work has found qualitatively that the second most important dimension for describing the census data in Bristol, UK is linked to deprivation. In this research, we analyse how good this dimension is as a model for predicting deprivation by comparing it with the recognised measures. The Pearson correlation coefficient was found to be greater than 0.7. The top 10 per cent of deprived areas in the UK, which are also located in Bristol, are extracted to test the accuracy of the model. There are 52 such most deprived areas, and 38 of them are correctly identified by the model. The influence of scores of IMD domains that do not correlate with the model, together with the Eigenvector 2 entries of non-deprived Output Areas, causes the model to miss the remaining 14 deprived areas. The model demonstrates strong performance in predicting future deprivation in the project areas, which is expected to greatly assist government resource allocation and funding. The codes can be accessed here: this https URL
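A minimal diffusion-map embedding can be sketched as follows. The paper works with Laplacian eigenvectors of census data; equivalently, one can take the leading nontrivial eigenvectors of the row-normalized Markov matrix built from Gaussian affinities. The kernel bandwidth, data, and function name here are illustrative assumptions:

```python
import numpy as np

def diffusion_map(X: np.ndarray, eps: float, n_coords: int = 2) -> np.ndarray:
    """Minimal diffusion-map sketch: Gaussian affinities are row-normalized
    into a Markov transition matrix whose leading nontrivial eigenvectors
    (equivalent to the smallest-eigenvalue Laplacian eigenvectors) give
    low-dimensional coordinates."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    K = np.exp(-d2 / eps)                                # Gaussian kernel
    P = K / K.sum(axis=1, keepdims=True)                 # Markov transition matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)                       # eigenvalues, descending
    # Skip the trivial constant eigenvector (eigenvalue 1).
    return vecs.real[:, order[1:1 + n_coords]]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))        # stand-in for high-dimensional census rows
coords = diffusion_map(X, eps=50.0)
print(coords.shape)  # (50, 2)
```

In the paper's setting, the second coordinate (Eigenvector 2) is the one compared against deprivation indices.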

[LG-72] Joint Transmit and Pinching Beamforming for PASS: Optimization-Based or Learning-Based?

链接: https://arxiv.org/abs/2502.08637
作者: Xiaoxia Xu,Xidong Mu,Yuanwei Liu,Arumugam Nallanathan
类目: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Submitted to IEEE

点击查看摘要

Abstract:A novel pinching antenna system (PASS)-enabled downlink multi-user multiple-input single-output (MISO) framework is proposed. PASS consists of multiple waveguides spanning over thousands of wavelengths, which equip numerous low-cost dielectric particles, named pinching antennas (PAs), to radiate signals into free space. The positions of PAs can be reconfigured to change both the large-scale path losses and phases of signals, thus facilitating the novel pinching beamforming design. A sum rate maximization problem is formulated, which jointly optimizes the transmit and pinching beamforming to adaptively achieve constructive signal enhancement and destructive interference mitigation. To solve this highly coupled and nonconvex problem, both optimization-based and learning-based methods are proposed. 1) For the optimization-based method, a majorization-minimization and penalty dual decomposition (MM-PDD) algorithm is developed, which handles the nonconvex complex exponential component using a Lipschitz surrogate function and then invokes PDD for problem decoupling. 2) For the learning-based method, a novel Karush-Kuhn-Tucker (KKT)-guided dual learning (KDL) approach is proposed, which enables KKT solutions to be reconstructed in a data-driven manner by learning dual variables. Following this idea, a KDL-Transformer algorithm is developed, which captures both inter-PA/inter-user dependencies and channel-state-information (CSI)-beamforming dependencies by attention mechanisms. Simulation results demonstrate that: i) The proposed PASS framework significantly outperforms the conventional massive multiple-input multiple-output (MIMO) system even with a few PAs. ii) The proposed KDL-Transformer improves system performance by over 30% compared with the MM-PDD algorithm, while achieving a millisecond-level response on modern GPUs.

[LG-73] Concentration Inequalities for the Stochastic Optimization of Unbounded Objectives with Application to Denoising Score Matching

链接: https://arxiv.org/abs/2502.08628
作者: Jeremiah Birrell
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 30 pages

点击查看摘要

Abstract:We derive novel concentration inequalities that bound the statistical error for a large class of stochastic optimization problems, focusing on the case of unbounded objective functions. Our derivations utilize the following tools: 1) A new form of McDiarmid’s inequality that is based on sample dependent one component difference bounds and which leads to a novel uniform law of large numbers result for unbounded functions. 2) A Rademacher complexity bound for families of functions that satisfy an appropriate local Lipschitz property. As an application of these results, we derive statistical error bounds for denoising score matching (DSM), an application that inherently requires one to consider unbounded objective functions, even when the data distribution has bounded support. In addition, our results establish the benefit of sample reuse in algorithms that employ easily sampled auxiliary random variables in addition to the training data, e.g., as in DSM, which uses auxiliary Gaussian random variables.
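The denoising score matching objective the paper analyzes can be sketched concretely; note how the regression target is unbounded even for bounded data, which is exactly why unbounded-objective bounds are needed. The toy data and score functions below are illustrative assumptions:

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Monte Carlo denoising score matching (DSM) loss: perturb the data
    with Gaussian noise (the easily sampled auxiliary variable) and
    regress the model score onto the score of the perturbation kernel,
    -(x_noisy - x) / sigma**2. The target is unbounded even when the
    data distribution has bounded support."""
    eps = rng.normal(size=x.shape)
    x_noisy = x + sigma * eps
    target = -(x_noisy - x) / sigma**2
    return np.mean((score_fn(x_noisy) - target) ** 2)

# For x ~ N(0, 1) noised with scale sigma, the optimal score of the
# noisy marginal N(0, 1 + sigma^2) is s(z) = -z / (1 + sigma^2).
rng = np.random.default_rng(0)
x = rng.normal(size=(10000, 1))
sigma = 1.0
loss_opt = dsm_loss(lambda z: -z / (1 + sigma**2), x, sigma, rng)
loss_zero = dsm_loss(lambda z: np.zeros_like(z), x, sigma, rng)
```

The optimal score attains a strictly lower empirical DSM loss than the zero score, as expected.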

[LG-74] Mathematical Data Science

链接: https://arxiv.org/abs/2502.08620
作者: Michael R. Douglas,Kyu-Hwan Lee
类目: History and Overview (math.HO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Can machine learning help discover new mathematical structures? In this article we discuss an approach to doing this which one can call “mathematical data science”. In this paradigm, one studies mathematical objects collectively rather than individually, by creating datasets and doing machine learning experiments and interpretations. After an overview, we present two case studies: murmurations in number theory and loadings of partitions related to Kronecker coefficients in representation theory and combinatorics.

[LG-75] A Machine Learning-Ready Data Processing Tool for Near Real-Time Forecasting

链接: https://arxiv.org/abs/2502.08555
作者: Maher A Dayeh,Michael J Starkey,Subhamoy Chatterjee,Heather Elliott,Samuel Hart,Kimberly Moreland
类目: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Space weather forecasting is critical for mitigating radiation risks in space exploration and protecting Earth-based technologies from geomagnetic disturbances. This paper presents the development of a Machine Learning (ML)- ready data processing tool for Near Real-Time (NRT) space weather forecasting. By merging data from diverse NRT sources such as solar imagery, magnetic field measurements, and energetic particle fluxes, the tool addresses key gaps in current space weather prediction capabilities. The tool processes and structures the data for machine learning models, focusing on time-series forecasting and event detection for extreme solar events. It provides users with a framework to download, process, and label data for ML applications, streamlining the workflow for improved NRT space weather forecasting and scientific research.

[LG-76] Semantic Learning for Molecular Communication in Internet of Bio-Nano Things

链接: https://arxiv.org/abs/2502.08426
作者: Hanlin Cai,Ozgur B. Akan
类目: Signal Processing (eess.SP); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 4 pages, 3 figures, 1 table

点击查看摘要

Abstract:Molecular communication (MC) provides a foundational framework for information transmission in the Internet of Bio-Nano Things (IoBNT), where efficiency and reliability are crucial. However, the inherent limitations of molecular channels, such as low transmission rates, noise, and inter-symbol interference (ISI), limit their ability to support complex data transmission. This paper proposes an end-to-end semantic learning framework designed to optimize task-oriented molecular communication, with a focus on biomedical diagnostic tasks under resource-constrained conditions. The proposed framework employs a deep encoder-decoder architecture to efficiently extract, quantize, and decode semantic features, prioritizing task-relevant semantic information to enhance diagnostic classification performance. Additionally, a probabilistic channel network is introduced to approximate molecular propagation dynamics, enabling gradient-based optimization for end-to-end learning. Experimental results demonstrate that the proposed semantic framework improves diagnostic accuracy by at least 25% compared to conventional JPEG compression with LDPC coding methods under resource-constrained communication scenarios.

[LG-77] Multifidelity Simulation-based Inference for Computationally Expensive Simulators

链接: https://arxiv.org/abs/2502.08416
作者: Anastasia N. Krouglova,Hayden R. Johnson,Basile Confavreux,Michael Deistler,Pedro J. Gonçalves
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Across many domains of science, stochastic models are an essential tool to understand the mechanisms underlying empirically observed data. Models can be of different levels of detail and accuracy, with models of high-fidelity (i.e., high accuracy) to the phenomena under study being often preferable. However, inferring parameters of high-fidelity models via simulation-based inference is challenging, especially when the simulator is computationally expensive. We introduce MF-NPE, a multifidelity approach to neural posterior estimation that leverages inexpensive low-fidelity simulations to infer parameters of high-fidelity simulators within a limited simulation budget. MF-NPE performs neural posterior estimation with limited high-fidelity resources by virtue of transfer learning, with the ability to prioritize individual observations using active learning. On one statistical task with analytical ground-truth and two real-world tasks, MF-NPE shows comparable performance to current approaches while requiring up to two orders of magnitude fewer high-fidelity simulations. Overall, MF-NPE opens new opportunities to perform efficient Bayesian inference on computationally expensive simulators.

[LG-78] Sparse Estimation of Inverse Covariance and Partial Correlation Matrices via Joint Partial Regression

链接: https://arxiv.org/abs/2502.08414
作者: Samuel Erickson,Tobias Rydén
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a new method for estimating high-dimensional sparse partial correlation and inverse covariance matrices, which exploits the connection between the inverse covariance matrix and linear regression. The method is a two-stage estimation method wherein each individual feature is regressed on all other features while positive semi-definiteness is enforced simultaneously. We provide statistical rates of convergence for the proposed method which match, and improve upon, the state-of-the-art for inverse covariance and partial correlation matrix estimation, respectively. We also propose an efficient proximal splitting algorithm for numerically computing the estimate. The effectiveness of the proposed method is demonstrated on both synthetic and real-world data.
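The regression route from node-wise regressions to partial correlations can be sketched as below. This is a simplified version of the idea: each feature is regressed on all others and residual variances give the precision-matrix diagonal; a plain symmetrization step stands in for the paper's joint, positive-semidefiniteness-enforcing formulation, and the ridge penalty and toy data are illustrative assumptions:

```python
import numpy as np

def partial_corr_regression(X, lam=1e-3):
    """Estimate partial correlations via per-feature regressions:
    Theta[j, j] = 1 / resid_var_j, Theta[j, k] = -beta_jk / resid_var_j,
    then rho[j, k] = -Theta[j, k] / sqrt(Theta[j, j] * Theta[k, k])."""
    n, p = X.shape
    Xc = X - X.mean(0)
    B = np.zeros((p, p))
    resid_var = np.zeros(p)
    for j in range(p):
        idx = [k for k in range(p) if k != j]
        Xo, y = Xc[:, idx], Xc[:, j]
        beta = np.linalg.solve(Xo.T @ Xo + lam * np.eye(p - 1), Xo.T @ y)
        B[j, idx] = beta
        resid_var[j] = np.mean((y - Xo @ beta) ** 2)
    Theta = -B / resid_var[:, None]          # off-diagonal precision entries
    np.fill_diagonal(Theta, 1.0 / resid_var)
    Theta = 0.5 * (Theta + Theta.T)          # naive symmetrization
    d = np.sqrt(np.diag(Theta))
    R = -Theta / np.outer(d, d)              # partial correlations
    np.fill_diagonal(R, 1.0)
    return R

# Three features: x2 depends on x0, x1 is independent of both.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = x0 + 0.3 * rng.normal(size=500)
R = partial_corr_regression(np.column_stack([x0, x1, x2]))
```

The recovered partial correlation between x0 and x2 is large, while the spurious one between x1 and x2 stays near zero.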

[LG-79] Strong bounds for large-scale Minimum Sum-of-Squares Clustering

链接: https://arxiv.org/abs/2502.08397
作者: Anna Livia Croella,Veronica Piccialli,Antonio M. Sudoso
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clustering is a fundamental technique in data analysis and machine learning, used to group similar data points together. Among various clustering methods, the Minimum Sum-of-Squares Clustering (MSSC) is one of the most widely used. MSSC aims to minimize the total squared Euclidean distance between data points and their corresponding cluster centroids. Due to the unsupervised nature of clustering, achieving global optimality is crucial, yet computationally challenging. The complexity of finding the global solution increases exponentially with the number of data points, making exact methods impractical for large-scale datasets. Even obtaining strong lower bounds on the optimal MSSC objective value is computationally prohibitive, making it difficult to assess the quality of heuristic solutions. We address this challenge by introducing a novel method to validate heuristic MSSC solutions through optimality gaps. Our approach employs a divide-and-conquer strategy, decomposing the problem into smaller instances that can be handled by an exact solver. The decomposition is guided by an auxiliary optimization problem, the “anticlustering problem”, for which we design an efficient heuristic. Computational experiments demonstrate the effectiveness of the method for large-scale instances, achieving optimality gaps below 3% in most cases while maintaining reasonable computational times. These results highlight the practicality of our approach in assessing feasible clustering solutions for large datasets, bridging a critical gap in MSSC evaluation.
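The MSSC objective and the kind of heuristic solution the paper's optimality gaps certify can be sketched as follows; this toy sketch does not attempt the paper's divide-and-conquer bounding, and the blob data and function names are illustrative:

```python
import numpy as np

def mssc_objective(X, labels):
    """MSSC objective: total squared Euclidean distance from each point
    to its cluster centroid."""
    return sum(((X[labels == c] - X[labels == c].mean(0)) ** 2).sum()
               for c in np.unique(labels))

def lloyd_kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd heuristic for MSSC; the paper's contribution is
    certifying how far such a feasible solution is from the global
    optimum, which this sketch does not do."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == c].mean(0) if (labels == c).any()
                            else centers[c] for c in range(k)])
    return labels

# Two well-separated blobs: the heuristic objective must beat the
# trivial one-cluster objective for any 2-way partition.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels = lloyd_kmeans(X, 2)
obj = mssc_objective(X, labels)
```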

[LG-80] Multi-View Oriented GPLVM: Expressiveness and Efficiency

链接: https://arxiv.org/abs/2502.08253
作者: Zi Yang,Ying Li,Zhidi Lin,Michael Minyi Zhang,Pablo M. Olmos
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:The multi-view Gaussian process latent variable model (MV-GPLVM) aims to learn a unified representation from multi-view data but is hindered by challenges such as limited kernel expressiveness and low computational efficiency. To overcome these issues, we first introduce a new duality between the spectral density and the kernel function. By modeling the spectral density with a bivariate Gaussian mixture, we then derive a generic and expressive kernel termed Next-Gen Spectral Mixture (NG-SM) for MV-GPLVMs. To address the inherent computational inefficiency of the NG-SM kernel, we propose a random Fourier feature approximation. Combined with a tailored reparameterization trick, this approximation enables scalable variational inference for both the model and the unified latent representations. Numerical evaluations across a diverse range of multi-view datasets demonstrate that our proposed method consistently outperforms state-of-the-art models in learning meaningful latent representations.
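The random Fourier feature machinery the abstract mentions can be sketched for the simplest case, an RBF kernel whose spectral density is a single Gaussian; the paper's NG-SM kernel instead models the spectral density with a bivariate Gaussian mixture, but the approximation idea is the same. Feature count and lengthscale below are illustrative:

```python
import numpy as np

def rff_features(X, n_features, lengthscale, rng):
    """Random Fourier features: sample frequencies from the kernel's
    spectral density (here a single Gaussian, giving an RBF kernel) so
    that phi(x) @ phi(y) approximates k(x, y)."""
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Phi = rff_features(X, n_features=10000, lengthscale=1.0, rng=rng)
K_approx = Phi @ Phi.T
K_exact = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)
max_err = np.abs(K_approx - K_exact).max()
```

With 10,000 features the Monte Carlo kernel approximation error is small, which is what makes scalable variational inference possible.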

[LG-81] Optimizing Likelihoods via Mutual Information: Bridging Simulation-Based Inference and Bayesian Optimal Experimental Design

链接: https://arxiv.org/abs/2502.08004
作者: Vincent D. Zaballa,Elliot E. Hui
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint. Under Review

点击查看摘要

Abstract:Simulation-based inference (SBI) is a method to perform inference on a variety of complex scientific models with challenging inference (inverse) problems. Bayesian Optimal Experimental Design (BOED) aims to efficiently use experimental resources to make better inferences. Various stochastic gradient-based BOED methods have been proposed as an alternative to Bayesian optimization and other experimental design heuristics to maximize information gain from an experiment. We demonstrate a link via mutual information bounds between SBI and stochastic gradient-based variational inference methods that permits BOED to be used in SBI applications as SBI-BOED. This link allows simultaneous optimization of experimental designs and optimization of amortized inference functions. We evaluate the pitfalls of naive design optimization using this method in a standard SBI task and demonstrate the utility of a well-chosen design distribution in BOED. We compare this approach on SBI-based models in real-world simulators in epidemiology and biology, showing notable improvements in inference.

[LG-82] Discrete Markov Probabilistic Models

链接: https://arxiv.org/abs/2502.07939
作者: Le-Tuyet-Nhi Pham,Dario Shariatian,Antonio Ocello,Giovanni Conforti,Alain Durmus
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces the Discrete Markov Probabilistic Model (DMPM), a novel algorithm for discrete data generation. The algorithm operates in the space of bits \{0,1\}^d , where the noising process is a continuous-time Markov chain that can be sampled exactly via a Poissonian clock that flips labels uniformly at random. The time-reversal process, like the forward noise process, is a jump process, with its intensity governed by a discrete analogue of the classical score function. Crucially, this intensity is proven to be the conditional expectation of a function of the forward process, strengthening its theoretical alignment with score-based generative models while ensuring robustness and efficiency. We further establish convergence bounds for the algorithm under minimal assumptions and demonstrate its effectiveness through experiments on low-dimensional Bernoulli-distributed datasets and high-dimensional binary MNIST data. The results highlight its strong performance in generating discrete structures. This work bridges theoretical foundations and practical applications, advancing the development of effective and theoretically grounded discrete generative modeling.
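The forward noising process can be sketched by exact simulation of the Poisson clock; the clock rate and bit-flip convention below are assumptions for illustration (the paper's exact parameterization is not given in the abstract):

```python
import numpy as np

def forward_noise(x, t, rate, rng):
    """Forward noising sketch for DMPM on {0,1}^d: a Poisson clock with
    total rate `rate` rings N ~ Poisson(rate * t) times by time t, and
    each ring flips one uniformly chosen bit. As t grows the state
    approaches the uniform distribution over {0,1}^d."""
    x = x.copy()
    for _ in range(rng.poisson(rate * t)):
        i = rng.integers(len(x))
        x[i] ^= 1                      # flip one bit
    return x

rng = np.random.default_rng(0)
d = 16
x0 = np.zeros(d, dtype=np.int64)
xt = forward_noise(x0, t=10.0, rate=float(d), rng=rng)  # long time: near-uniform bits
```

Generation would then run the learned time-reversal jump process from a uniform bit vector, which this sketch omits.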

[LG-83] Sign Operator for Coping with Heavy-Tailed Noise: High Probability Convergence Bounds with Extensions to Distributed Optimization and Comparison Oracle

链接: https://arxiv.org/abs/2502.07923
作者: Nikita Kornilov,Philip Zmushko,Andrei Semenov,Alexander Gasnikov,Alexander Beznosikov
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The growing popularity of AI optimization problems involving severely corrupted data has increased the demand for methods capable of handling heavy-tailed noise, i.e., noise with bounded \kappa -th moment, \kappa \in (1,2] . For the widely used clipping technique, effectiveness heavily depends on the careful tuning of clipping levels throughout training. In this paper, we demonstrate that using only the sign of the input, without introducing additional hyperparameters, is sufficient to cope with heavy-tailed noise effectively. For smooth non-convex functions, we prove that SignSGD achieves the optimal sample complexity \tilde{O}\left(\varepsilon^{-\frac{3\kappa - 2}{\kappa - 1}}\right) with high probability for attaining an average gradient norm accuracy of \varepsilon . Under the assumption of symmetric noise, we use SignSGD with Majority Voting to extend this bound to distributed optimization, or to reduce the sample complexity to \tilde{O}(\varepsilon^{-4}) in the case of a single worker with arbitrary parameters. Furthermore, we explore the application of the sign operator in zeroth-order optimization with an oracle that can only compare function values at two different points. We propose a novel method, MajorityVote-CompsSGD, and provide the first-known high-probability bound \tilde{O}(\varepsilon^{-6}) for the number of comparisons under the symmetric noise assumption. Our theoretical findings are supported by the superior performance of sign-based methods in training Large Language Models.
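The core mechanism is easy to demonstrate: taking only the sign of the stochastic gradient bounds every update, so even Cauchy-distributed (heavy-tailed, symmetric) noise cannot produce a blow-up. The quadratic objective, step count, and learning rate below are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

def signsgd_step(w, grad, lr):
    """One SignSGD step: only the sign of the stochastic gradient is
    used, so a single heavy-tailed noise sample cannot blow up the
    update, and there is no clipping level to tune."""
    return w - lr * np.sign(grad)

# Minimize f(w) = 0.5 * ||w||^2 under Cauchy gradient noise, for which
# plain SGD steps can be arbitrarily large.
rng = np.random.default_rng(0)
w = np.full(5, 10.0)
for _ in range(3000):
    noisy_grad = w + rng.standard_cauchy(w.shape)  # true grad + heavy tails
    w = signsgd_step(w, noisy_grad, lr=0.02)
final_norm = float(np.linalg.norm(w))
```

Because the noise is symmetric, the sign of the noisy gradient still points toward the minimum more often than not, and the iterate drifts into a small neighborhood of zero.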

[LG-84] The Observational Partial Order of Causal Structures with Latent Variables

链接: https://arxiv.org/abs/2502.07891
作者: Marina Maciel Ansanelli,Elie Wolfe,Robert W. Spekkens
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 48 pages, 30 figures

点击查看摘要

Abstract:For two causal structures with the same set of visible variables, one is said to observationally dominate the other if the set of distributions over the visible variables realizable by the first contains the set of distributions over the visible variables realizable by the second. Knowing such dominance relations is useful for adjudicating between these structures given observational data. We here consider the problem of determining the partial order of equivalence classes of causal structures with latent variables relative to observational dominance. We provide a complete characterization of the dominance order in the case of three visible variables, and a partial characterization in the case of four visible variables. Our techniques also help to identify which observational equivalence classes have a set of realizable distributions that is characterized by nontrivial inequality constraints, analogous to Bell inequalities and instrumental inequalities. We find evidence that as one increases the number of visible variables, the equivalence classes satisfying nontrivial inequality constraints become ubiquitous. (Because such classes are the ones for which there can be a difference in the distributions that are quantumly and classically realizable, this implies that the potential for quantum-classical gaps is also ubiquitous.) Furthermore, we find evidence that constraint-based causal discovery algorithms that rely solely on conditional independence constraints have a significantly weaker distinguishing power among observational equivalence classes than algorithms that go beyond these (i.e., algorithms that also leverage nested Markov constraints and inequality constraints).

[LG-85] A unifying account of warm start guarantees for patches of quantum landscapes

链接: https://arxiv.org/abs/2502.07889
作者: Hela Mhiri,Ricard Puig,Sacha Lerch,Manuel S. Rudolph,Thiparat Chotibut,Supanut Thanasilp,Zoë Holmes
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Barren plateaus are fundamentally a statement about quantum loss landscapes on average but there can, and generally will, exist patches of barren plateau landscapes with substantial gradients. Previous work has studied certain classes of parameterized quantum circuits and found example regions where gradients vanish at worst polynomially in system size. Here we present a general bound that unifies all these previous cases and that can tackle physically-motivated ansätze that could not be analyzed previously. Concretely, we analytically prove a lower-bound on the variance of the loss that can be used to show that in a non-exponentially narrow region around a point with curvature the loss variance cannot decay exponentially fast. This result is complemented by numerics and an upper-bound that suggest that any loss function with a barren plateau will have exponentially vanishing gradients in any constant radius subregion. Our work thus suggests that while there are hopes to be able to warm-start variational quantum algorithms, any initialization strategy that cannot get increasingly close to the region of attraction with increasing problem size is likely inadequate.

[LG-86] Advancing Precision Oncology Through Modeling of Longitudinal and Multimodal Data

链接: https://arxiv.org/abs/2502.07836
作者: Luoting Zhuang,Stephen H. Park,Steven J. Skates,Ashley E. Prosper,Denise R. Aberle,William Hsu
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE RBME for potential publication

点击查看摘要

Abstract:Cancer evolves continuously over time through a complex interplay of genetic, epigenetic, microenvironmental, and phenotypic changes. This dynamic behavior drives uncontrolled cell growth, metastasis, immune evasion, and therapy resistance, posing challenges for effective monitoring and treatment. However, today’s data-driven research in oncology has primarily focused on cross-sectional analysis using data from a single modality, limiting the ability to fully characterize and interpret the disease’s dynamic heterogeneity. Advances in multiscale data collection and computational methods now enable the discovery of longitudinal multimodal biomarkers for precision oncology. Longitudinal data reveal patterns of disease progression and treatment response that are not evident from single-timepoint data, enabling timely abnormality detection and dynamic treatment adaptation. Multimodal data integration offers complementary information from diverse sources for more precise risk assessment and targeting of cancer therapy. In this review, we survey methods of longitudinal and multimodal modeling, highlighting their synergy in providing multifaceted insights for personalized care tailored to the unique characteristics of a patient’s cancer. We summarize the current challenges and future directions of longitudinal multimodal analysis in advancing precision oncology.

[LG-87] neuro2voc: Decoding Vocalizations from Neural Activity

链接: https://arxiv.org/abs/2502.07800
作者: Fei Gao
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Master Thesis

点击查看摘要

Abstract:Accurate decoding of neural spike trains and relating them to motor output is a challenging task due to the inherent sparsity and length in neural spikes and the complexity of brain circuits. This master project investigates experimental methods for decoding zebra finch motor outputs (in both discrete syllables and continuous spectrograms), from invasive neural recordings obtained from Neuropixels. There are three major achievements: (1) XGBoost with SHAP analysis trained on spike rates revealed neuronal interaction patterns crucial for syllable classification. (2) Novel method (tokenizing neural data with GPT2) and architecture (Mamba2) demonstrated potential for decoding of syllables using spikes. (3) A combined contrastive learning-VAE framework successfully generated spectrograms from binned neural data. This work establishes a promising foundation for neural decoding of complex motor outputs and offers several novel methodological approaches for processing sparse neural data.

[LG-88] Predictive Coresets

链接: https://arxiv.org/abs/2502.05725
作者: Bernardo Flores
类目: Computation (stat.CO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern data analysis often involves massive datasets with hundreds of thousands of observations, making traditional inference algorithms computationally prohibitive. Coresets are selection methods designed to choose a smaller subset of observations while maintaining similar learning performance. Conventional coreset approaches determine these weights by minimizing the Kullback-Leibler (KL) divergence between the likelihood functions of the full and weighted datasets; as a result, this makes them ill-posed for nonparametric models, where the likelihood is often intractable. We propose an alternative variational method which employs randomized posteriors and finds weights to match the unknown posterior predictive distributions conditioned on the full and reduced datasets. Our approach provides a general algorithm based on predictive recursions suitable for nonparametric priors. We evaluate the performance of the proposed coreset construction on diverse problems, including random partitions and density estimation.

[LG-89] On the Sample Complexity of Quantum Boltzmann Machine Learning

链接: https://arxiv.org/abs/2306.14969
作者: Luuk Coopmans,Marcello Benedetti
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: Main text: 11 pages, 3 figures. Supplementary information: 16 pages, 2 figures. We correct a mistake that affected both lemmas 6 and 7. We thank Dhrumil Patel and Mark M. Wilde for identifying this mistake

点击查看摘要

Abstract:Quantum Boltzmann machines (QBMs) are machine-learning models for both classical and quantum data. We give an operational definition of QBM learning in terms of the difference in expectation values between the model and target, taking into account the polynomial size of the data set. By using the relative entropy as a loss function this problem can be solved without encountering barren plateaus. We prove that a solution can be obtained with stochastic gradient descent using at most a polynomial number of Gibbs states. We also prove that pre-training on a subset of the QBM parameters can only lower the sample complexity bounds. In particular, we give pre-training strategies based on mean-field, Gaussian Fermionic, and geometrically local Hamiltonians. We verify these models and our theoretical findings numerically on a quantum and a classical data set. Our results establish that QBMs are promising machine learning models.

信息检索

[IR-0] Unlocking Scaling Law in Industrial Recommendation Systems with a Three-step Paradigm based Large User Model

链接: https://arxiv.org/abs/2502.08309
作者: Bencheng Yan,Shilei Liu,Zhiyuan Zeng,Zihao Wang,Yizhen Zhang,Yujin Yuan,Langming Liu,Jiaqi Liu,Di Wang,Wenbo Su,Wang Pengjie,Jian Xu,Bo Zheng
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recent advancements in autoregressive Large Language Models (LLMs) have achieved significant milestones, largely attributed to their scalability, often referred to as the “scaling law”. Inspired by these achievements, there has been a growing interest in adapting LLMs for Recommendation Systems (RecSys) by reformulating RecSys tasks into generative problems. However, these End-to-End Generative Recommendation (E2E-GR) methods tend to prioritize idealized goals, often at the expense of the practical advantages offered by traditional Deep Learning based Recommendation Models (DLRMs) in terms of features, architecture, and practices. This disparity between idealized goals and practical needs introduces several challenges and limitations, locking the scaling law in industrial RecSys. In this paper, we introduce a large user model (LUM) that addresses these limitations through a three-step paradigm, designed to meet the stringent requirements of industrial settings while unlocking the potential for scalable recommendations. Our extensive experimental evaluations demonstrate that LUM outperforms both state-of-the-art DLRMs and E2E-GR approaches. Notably, LUM exhibits excellent scalability, with performance improvements observed as the model scales up to 7 billion parameters. Additionally, we have successfully deployed LUM in an industrial application, where it achieved significant gains in an A/B test, further validating its effectiveness and practicality.

[IR-1] ChorusCVR: Chorus Supervision for Entire Space Post-Click Conversion Rate Modeling

Link: https://arxiv.org/abs/2502.08277
Authors: Wei Cheng, Yucheng Lu, Boyang Xia, Jiangxia Cao, Kuan Xu, Mingxing Wen, Wei Jiang, Jiaming Zhang, Zhaojie Liu, Kun Gai, Guorui Zhou
Subjects: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: Work in progress

Abstract:Post-click conversion rate (CVR) estimation is a vital task in many recommender systems of revenue businesses, e.g., e-commerce and advertising. From a sample perspective, a typical CVR positive sample passes through a funnel from exposure to click to conversion. For lack of post-click labels on un-clicked samples, the CVR learning task commonly uses only clicked samples, rather than all exposed samples as in the click-through rate (CTR) learning task. However, during online inference, CVR and CTR are estimated on the same assumed exposure space, which leads to an inconsistency of sample space between training and inference, i.e., sample selection bias (SSB). To alleviate SSB, prior work proposes novel auxiliary tasks that enable CVR learning on un-clicked training samples, such as CTCVR and counterfactual CVR. Although these alleviate SSB to some extent, none of them attend to the discrimination between ambiguous negative samples (un-clicked) and factual negative samples (clicked but un-converted) during modelling, which leaves the CVR model lacking robustness. To fill this gap, we propose a novel ChorusCVR model to realize debiased CVR learning in the entire space.
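
The chorus-supervision mechanism itself is not detailed in the abstract; the following toy simulation (all rates invented) merely reproduces the SSB effect described above, with inverse propensity weighting shown as one classic correction rather than ChorusCVR's actual method:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# One binary feature drives both click propensity and true conversion rate,
# so training on clicked samples alone over-represents high-CTR traffic
x = rng.integers(0, 2, n)
p_click = np.where(x == 1, 0.30, 0.05)   # CTR depends on x
p_conv = np.where(x == 1, 0.20, 0.02)    # true CVR given a click
click = rng.random(n) < p_click
conv = click & (rng.random(n) < p_conv)

# Naive clicked-space CVR estimate (what standard CVR training sees)
cvr_clicked_space = conv[click].mean()

# CVR marginalised over the full exposure space (what inference assumes)
cvr_exposure_space = p_conv.mean()

# A classic inverse-propensity correction recovers the exposure-space value
cvr_ipw = (conv[click] / p_click[click]).sum() / n
```

The clicked-space average is pulled toward the high-CTR segment (≈0.17 here) while the exposure-space CVR is ≈0.11, which is exactly the train/inference mismatch the abstract calls SSB.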

[IR-2] MoLoRec: A Generalizable and Efficient Framework for LLM-Based Recommendation

Link: https://arxiv.org/abs/2502.08271
Authors: Min Hou, Chenxi Bai, Le Wu, Hao Liu, Kun Zhang, Kai Zhang, Richang Hong, Meng Wang
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Large Language Models (LLMs) have achieved remarkable success in recent years, owing to their impressive generalization capabilities and rich world knowledge. To capitalize on the potential of using LLMs as recommender systems, mainstream approaches typically focus on two paradigms. The first paradigm designs multi-domain or multi-task instruction data for generalizable recommendation, so as to align LLMs with general recommendation areas and deal with cold-start recommendation. The second paradigm enhances domain-specific recommendation tasks with parameter-efficient fine-tuning techniques, in order to improve models under warm recommendation scenarios. While most previous works treat these two paradigms separately, we argue that they have complementary advantages and that combining them is helpful. To that end, in this paper, we propose a generalizable and efficient LLM-based recommendation framework, MoLoRec. Our approach starts with parameter-efficient fine-tuning of a domain-general module on general recommendation instruction data, to align the LLM with recommendation knowledge. Then, given users’ behavior in a specific domain, we construct a domain-specific instruction dataset and apply efficient fine-tuning to the pre-trained LLM. After that, we provide approaches to integrate the above domain-general and domain-specific parts with a parameter mixture. Note that MoLoRec is efficient and plug-and-play: the domain-general module is trained only once, and any domain-specific plug-in can be merged efficiently after only domain-specific fine-tuning. Extensive experiments on multiple datasets under both warm and cold-start recommendation scenarios validate the effectiveness and generality of the proposed MoLoRec.
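
The abstract does not specify the parameter-mixture rule; the following is a minimal sketch of the general idea, merging a once-trained domain-general low-rank adapter with a domain-specific one via a convex combination (the rank, matrix sizes, and mixing rule are our assumptions, not MoLoRec's exact scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                        # hidden size and LoRA rank (toy values)
W0 = rng.normal(size=(d, d))       # frozen pretrained weight

# Two low-rank adapters: domain-general (trained once on general
# recommendation instructions) and domain-specific (fine-tuned per domain)
A_g, B_g = rng.normal(size=(r, d)), rng.normal(size=(d, r))
A_s, B_s = rng.normal(size=(r, d)), rng.normal(size=(d, r))

def merged_weight(alpha):
    # Parameter mixture: convex combination of the two low-rank updates
    return W0 + alpha * (B_g @ A_g) + (1.0 - alpha) * (B_s @ A_s)

W = merged_weight(0.6)
```

Because merging is just adding low-rank deltas, a new domain only requires training its own (A_s, B_s) pair; the domain-general adapter is reused unchanged, which is what makes the framework plug-and-play.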

[IR-3] Collaborative Filtering Meets Spectrum Shift: Connecting User-Item Interaction with Graph-Structured Side Information

Link: https://arxiv.org/abs/2502.08071
Authors: Yunhang He, Cong Xu, Jun Wang, Wei Zhang
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Graph Neural Networks (GNNs) have demonstrated their superiority in collaborative filtering, where the user-item (U-I) interaction bipartite graph serves as the fundamental data format. However, when graph-structured side information (e.g., multimodal similarity graphs or social networks) is integrated into the U-I bipartite graph, existing graph collaborative filtering methods fall short of achieving satisfactory performance. We quantitatively analyze this problem from a spectral perspective. Recall that a bipartite graph possesses a full spectrum within the range of [-1, 1], with the highest frequency exactly achievable at -1 and the lowest frequency at 1; however, we observe that as more side information is incorporated, the highest frequency of the augmented adjacency matrix progressively shifts rightward. This spectrum shift phenomenon causes previous approaches built for the full spectrum [-1, 1] to assign mismatched importance to different frequencies. To remedy this, we propose Spectrum Shift Correction (dubbed SSC), incorporating shifting and scaling factors to enable spectral GNNs to adapt to the shifted spectrum. Unlike previous paradigms of leveraging side information, which necessitate tailored designs for diverse data types, SSC directly connects traditional graph collaborative filtering with any graph-structured side information. Experiments on social and multimodal recommendation demonstrate the effectiveness of SSC, achieving relative improvements of up to 23% without incurring any additional computational overhead.
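
The spectrum-shift observation is easy to reproduce on a toy graph; note that the linear shift-and-scale map at the end is our own naive stand-in to illustrate the idea, not necessarily the paper's SSC factors:

```python
import numpy as np

def norm_adj_spectrum(A):
    # eigenvalues of the symmetrically normalised adjacency D^{-1/2} A D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.linalg.eigvalsh(A * np.outer(d_inv_sqrt, d_inv_sqrt))

# 3 users x 3 items wired as a 6-cycle: a connected bipartite U-I graph,
# whose normalised spectrum spans the full [-1, 1] (with -1 attained)
B = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]], dtype=float)
A_bip = np.block([[np.zeros((3, 3)), B],
                  [B.T, np.zeros((3, 3))]])

# One item-item side-information edge (items 0 and 1) creates an odd cycle,
# breaking bipartiteness and shifting the highest frequency right of -1
A_aug = A_bip.copy()
A_aug[3, 4] = A_aug[4, 3] = 1.0

spec_bip = norm_adj_spectrum(A_bip)
spec_aug = norm_adj_spectrum(A_aug)

# Naive shift-and-scale correction: map the observed [min, 1] back to [-1, 1]
lo = spec_aug.min()
corrected = (spec_aug - (1 + lo) / 2) / ((1 - lo) / 2)
```

A spectral filter designed for [-1, 1] would then see the corrected eigenvalues in its intended range, which is the spirit of adapting the GNN to the shifted spectrum.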

Attachment download

Click to download today's full paper list